Twitter has open sourced a system that aims to mitigate the tradeoffs between batch processing and stream processing by combining them into a hybrid system. In the case of Twitter, Hadoop handles batch processing, Storm handles stream processing, and the hybrid system is called Summingbird. It’s not a tool for every job, but it sounds pretty handy for those it’s designed to address.
Twitter’s blog post announcing Summingbird is pretty technical, but the problem is pretty easy to understand if you think about how Twitter works. Services like Trending Topics and search require real-time processing of data to be useful, but they eventually need to be accurate and probably analyzed a little more thoroughly. Storm is like a hospital’s triage unit, while Hadoop is like longer-term patient care.
This description of Summingbird from the project’s wiki does a pretty good job of explaining how it works at a high level. The implementation is a little more complex, of course:
The hybrid model allows most data to be processed by Hadoop and served out of a read-only store like Manhattan. Only data that Hadoop hasn’t yet been able to process, data that falls within the latency window, would be served out of a datastore populated in realtime by Storm. The error of the realtime layer is bounded, as Hadoop will eventually get around to processing the same data and smoothing out any error introduced.
Hybrid systems like this are actually becoming more common as companies realize they can’t survive in a real-time world with Hadoop alone. We’ve covered systems at numerous companies — Gravity, LinkedIn and Netflix among them — that aim to do something similar. Summingbird might be different in that it’s a hybrid system handling data from both Hadoop and Storm, as opposed to a pipeline of different systems, but web companies need some way to ensure they’re not trading off speed for accuracy, or vice versa.
We won’t have anyone from Twitter at Structure: Europe (Sept. 18 and 19 in London) to talk about Summingbird specifically, but our data lineup is pretty impressive and can probably speak in depth about why it’s important. They come from places like PayPal, MailChimp and LinkedIn, as well as entrepreneurs with previous experience at places like Yahoo and the NSA.
For a little more on Summingbird, which Twitter actually describes as “streaming MapReduce” because of its focus on aggregation jobs, check out this presentation that Twitter’s Sam Ritchie (who also wrote the blog post) gave in June. It also might be worth checking out Yahoo’s open source Storm-YARN project for actually running Storm within Hadoop clusters in order to give Storm access to Hadoop-based data stores.