How Twitter processes tons of mobile application data each day

It’s only been seven months since Twitter released its Answers tool, which was designed to provide users with mobile application analytics. In that time, the service has grown to roughly five billion daily sessions, in which “hundreds of millions of devices send millions of events every second to the Answers endpoint,” the company explained in a blog post on Tuesday. That’s clearly a lot of data to process, and in the same post Twitter detailed how it configured its architecture to handle the task.

The backbone of Answers was built to handle four tasks: receiving the mobile application data, archiving it, processing it in real time, and processing it in chunks (otherwise known as batch processing).

Each time an organization uses the Answers tool to learn more about how its mobile app is functioning, Twitter logs and compresses all that data (which gets sent in batches) in order to conserve the device’s battery power while also avoiding unnecessary strain on the network that routes the data from the device to Twitter’s servers.
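Twitter’s post doesn’t show client code, but the batching-and-compression idea might look roughly like this on the device (a minimal sketch; the JSON event format, batch size, and upload call are all assumptions, not Twitter’s implementation):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical client-side batcher: events accumulate locally and are
// gzipped into a single payload before upload, trading a little latency
// for battery and network savings.
public class EventBatcher {
    private static final int BATCH_SIZE = 100;   // illustrative threshold
    private final List<String> buffer = new ArrayList<>();

    public void log(String jsonEvent) throws IOException {
        buffer.add(jsonEvent);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private void flush() throws IOException {
        byte[] payload = gzip(String.join("\n", buffer));
        upload(payload);                          // e.g. one HTTPS POST
        buffer.clear();
    }

    private static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    private void upload(byte[] payload) {
        // placeholder: ship the compressed batch to the analytics endpoint
        System.out.printf("uploading %d compressed bytes%n", payload.length);
    }
}
```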

The information first flows into a Kafka queue, which Twitter said serves as temporary storage. The data then gets passed into Amazon Simple Storage Service (Amazon S3), where Twitter retains it in a more permanent location than Kafka. Twitter uses Storm both to process the data that flows into Kafka and to write the information stored in Kafka to [company]Amazon[/company] S3.
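A sketch of that Kafka-to-S3 leg as a Storm bolt, assuming Storm’s Java API and the AWS SDK; the bucket name, flush size, and tuple layout are invented for illustration:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

// Hypothetical archiver bolt: consumes raw events emitted by a Kafka spout
// and flushes them to S3 in large objects, giving the batch layer a durable
// copy of everything that passed through the queue.
public class S3ArchiverBolt extends BaseBasicBolt {
    private static final int FLUSH_SIZE = 10_000;
    private transient AmazonS3 s3;          // not serializable: built per worker
    private transient List<String> batch;

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (s3 == null) {                   // lazy init on the worker node
            s3 = AmazonS3ClientBuilder.defaultClient();
            batch = new ArrayList<>();
        }
        batch.add(tuple.getString(0));      // the raw event payload
        if (batch.size() >= FLUSH_SIZE) {
            String key = "events/" + UUID.randomUUID() + ".log";
            s3.putObject("answers-archive", key, String.join("\n", batch));
            batch.clear();
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt; nothing is emitted downstream
    }
}
```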

Data pipeline

With the data stored in Amazon S3, Twitter then uses Amazon Elastic MapReduce for batch processing.

From the blog post:
[blockquote person=”Twitter” attribution=”Twitter”]We write our MapReduce in Cascading and run them via Amazon EMR. Amazon EMR reads the data that we’ve archived in Amazon S3 as input and writes the results back out to Amazon S3 once processing is complete. We detect the jobs’ completion via a scheduler topology running in Storm and pump the output from Amazon S3 into a Cassandra cluster in order to make it available for sub-second API querying.[/blockquote]
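The last hop of that batch path, loading results into Cassandra for sub-second queries, might look roughly like this with the DataStax Java driver (a sketch; the keyspace, table, and host names are invented, and each row would in practice be read from the EMR output files in S3):

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Hypothetical loader for the batch layer's final step: take the
// aggregates EMR wrote to S3 and insert them into Cassandra so the
// API can serve them with sub-second reads.
public class BatchResultLoader {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("cassandra.internal")   // placeholder host
                .build();
             Session session = cluster.connect("answers")) {

            PreparedStatement insert = session.prepare(
                "INSERT INTO daily_metrics (app_id, day, metric, value) " +
                "VALUES (?, ?, ?, ?)");

            // One hard-coded row stands in for iterating over S3 output.
            session.execute(insert.bind("app-123", "2015-02-17",
                                        "sessions", 5_000_000_000L));
        }
    }
}
```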

At the same time as this batch processing is going on, Twitter is also processing data in real time because “some computations run hourly, while others require a full day’s of data as input,” it said. To address the computations that need to be performed more quickly and require less data than the bigger batch-processing jobs, Twitter uses an instance of Storm that processes the data sitting in Kafka; the results get funneled into an independent Cassandra cluster for real-time querying.

From the blog post:
[blockquote person=”Twitter” attribution=”Twitter”]To compensate for the fact that we have less time, and potentially fewer resources, in the speed layer than the batch, we use probabilistic algorithms like Bloom Filters and HyperLogLog (as well as a few home grown ones). These algorithms enable us to make order-of-magnitude gains in space and time complexity over their brute force alternatives, at the price of a negligible loss of accuracy.[/blockquote]
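To make that trade-off concrete, here is a minimal Bloom filter in Java (not Twitter’s implementation; the sizing is illustrative). It answers set-membership questions in a fixed, small number of bits per element, at the cost of occasional false positives but never false negatives:

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash functions set k bits per added item;
// a lookup that finds any of those bits clear proves absence.
public class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    public BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    public void add(String item) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(item, i));
        }
    }

    public boolean mightContain(String item) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(item, i))) {
                return false;   // definitely never added
            }
        }
        return true;            // probably added (may be a false positive)
    }

    // Derive k hash values from two base hashes (double hashing).
    private int index(String item, int i) {
        int h1 = item.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9E3779B9;
        return Math.floorMod(h1 + i * h2, size);
    }

    public static void main(String[] args) {
        BloomFilter seen = new BloomFilter(1 << 20, 5);  // ~128 KB of bits
        seen.add("device-42");
        System.out.println(seen.mightContain("device-42"));  // true
        System.out.println(seen.mightContain("device-43"));  // almost surely false
    }
}
```

HyperLogLog applies the same idea to counting: a few kilobytes of registers can estimate the number of distinct items in a stream of billions to within a few percent, rather than storing every identifier seen.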

The complete data-processing system looks like this, and it’s tethered together with Twitter’s APIs:

Twitter Answers architecture

Because of the way the system is architected, with the data that needs to be analyzed in real time kept separate from the historical data, Twitter said no data will be lost if something goes wrong during real-time processing: all of it is also stored where Twitter does its batch processing.

If there are problems affecting batch processing, Twitter said its APIs “will seamlessly query for more data from the speed layer” and it can essentially configure the system to take in “two or three days of data” instead of just one day; this should give Twitter engineers enough time to look at what went wrong while still providing users with the type of analytics derived from batch processing.
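In code terms, the fallback the APIs perform might reduce to something like this (a purely hypothetical sketch, assuming per-day views keyed by date; Twitter’s post doesn’t show its API internals):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical read path for a lambda-style API: serve each day's metric
// from the exact batch views when they exist, and fall back to the
// approximate speed-layer views for days batch processing hasn't covered.
public class MetricsApi {
    private final Map<String, Long> batchViews = new HashMap<>();  // from EMR
    private final Map<String, Long> speedViews = new HashMap<>();  // from Storm

    public long sessionsOn(String day) {
        Long batch = batchViews.get(day);
        if (batch != null) {
            return batch;           // exact result from the batch layer
        }
        // Batch job late or failed: widen the window served from the
        // real-time views until batch processing catches up.
        return speedViews.getOrDefault(day, 0L);
    }
}
```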

Cloudera tunes Google’s Dataflow to run on Spark

Hadoop software company Cloudera has worked with Google to make Google’s Dataflow programming model run on Apache Spark. Dataflow, which Google announced as a cloud service in June, lets programmers write the same code for both batch- and stream-processing tasks. Spark is becoming a popular environment for running both types of tasks, even in Hadoop environments.

Cloudera has open sourced the code as part of its Cloudera Labs program. [company]Google[/company] had previously open sourced the Dataflow SDK that Cloudera used to carry out this work.

Cloudera’s Josh Wills explains the promise of Dataflow like this in a Tuesday-morning blog post:

[T]he streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data. Crucially, the client API for the batch and the stream processing engines are identical, so that any operation that can be performed in one context can also be performed in the other, and moving a pipeline from batch mode to streaming mode should be as seamless as possible.

Essentially, Dataflow should make it easier to build reliable big data pipelines than previous architectural models did, since those often involved managing Hadoop MapReduce for batch processing and something like Storm for stream processing. Running Dataflow on Spark means there is a single set of APIs and a single processing engine, which happens to be significantly faster than MapReduce for most jobs.
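As an illustration of that single API, here is a minimal word count written against the Dataflow SDK that Google open sourced (a sketch: the file paths are placeholders, and pointing --runner at Cloudera’s Spark adapter is an assumption based on Dataflow’s usual runner-selection convention):

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

// The same transforms run unchanged in batch or streaming mode; only the
// runner configured in the pipeline options changes.
public class WordCount {
    public static void main(String[] args) {
        // e.g. pass --runner=SparkPipelineRunner to execute on Spark via
        // Cloudera's adapter, or use the default runner for local execution.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();

        Pipeline p = Pipeline.create(options);
        p.apply(TextIO.Read.from("input.txt"))
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 for (String word : c.element().split("\\W+")) {
                     if (!word.isEmpty()) {
                         c.output(word);      // one element per word
                     }
                 }
             }
         }))
         .apply(Count.<String>perElement())   // (word, occurrences) pairs
         .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 c.output(c.element().getKey() + ": " + c.element().getValue());
             }
         }))
         .apply(TextIO.Write.to("counts"));

        p.run();
    }
}
```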

While Dataflow on Spark won’t end problems of complexity, speed or scale, it’s another step along a path that has resulted in faster, easier big data technologies at every turn. Hadoop was too slow for a lot of applications and too complicated for a lot of users, but those barriers to use are falling away fast.

If you want to hear more about where the data infrastructure space is headed, come to our Structure Data conference March 18-19 in New York. Speakers include Eric Brewer, vice president of infrastructure at Google; Ion Stoica, co-creator of Apache Spark and CEO of [company]Databricks[/company]; and Tom Reilly, Rob Bearden and John Schroeder, the CEOs of Cloudera, [company]Hortonworks[/company] and MapR, respectively.

WibiData’s open source HBase project now supports real-time predictions

Hadoop startup WibiData has updated Kiji, its open source project that aims to make HBase a better (or easier) database for serving real-time applications. Among the updates in its latest SDK is an improved version of the KijiScoring feature. “Developers can now pass per-request settings to producer functions, greatly expanding the flexibility of real-time predictive model scoring. For example, a user’s current geolocation from mobile application can be factored in when re-computing which offers or recommendations to serve a user,” explains a press release.
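As a purely hypothetical illustration (the names below are not the actual KijiScoring API), per-request scoring amounts to letting the caller pass fresh context, such as geolocation, into the model alongside the stored profile:

```java
import java.util.Map;

// Invented interface sketching the idea the release describes: a scoring
// function that sees both the persisted profile and per-request settings
// when it recomputes a model score.
interface Scorer {
    double score(Map<String, Object> storedProfile,
                 Map<String, Object> requestContext);
}

class OfferScorer implements Scorer {
    @Override
    public double score(Map<String, Object> profile,
                        Map<String, Object> request) {
        double base = (double) profile.getOrDefault("affinity", 0.0);
        // Per-request signal: boost offers when the caller supplies a
        // current geolocation for the user.
        return request.get("geolocation") != null ? base * 1.2 : base;
    }
}
```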

Twitter buys BackType to dig deeper with big data

Twitter announced Tuesday it has acquired BackType, an analytics platform aimed at helping companies and brands gauge their social media impact. The likely rationale for the deal is BackType’s Storm real-time big-data processing platform, which could help Twitter offer more robust analytics.