How Twitter processes tons of mobile application data each day

It’s only been seven months since Twitter released its Answers tool, which was designed to provide users with mobile application analytics. But in that time, Answers has grown to roughly five billion daily sessions, in which “hundreds of millions of devices send millions of events every second to the Answers endpoint,” the company explained in a blog post on Tuesday. Clearly, that’s a lot of data to process, and in the blog post Twitter detailed how it configured its architecture to handle the task.

The backbone of Answers was built to handle four jobs: receiving the mobile application data, archiving it, processing it in real time, and processing it in large chunks (otherwise known as batch processing).

Each time an organization uses the Answers tool to learn more about how its mobile app is functioning, Twitter logs and compresses that data (which gets sent in batches) in order to conserve the device’s battery power while also avoiding unnecessary strain on the network that routes the data from the device to Twitter’s servers.
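The batching-and-compression idea is easy to illustrate. Below is a minimal Python sketch of how a client library might buffer events and ship them as a single gzipped payload; the endpoint URL, batch size, and function names are all hypothetical, not Twitter’s actual SDK.

```python
# Hypothetical sketch of client-side event batching. The endpoint URL,
# batch size, and function names are illustrative assumptions.
import gzip
import json
import urllib.request

ANSWERS_ENDPOINT = "https://example.com/answers/events"  # placeholder URL
BATCH_SIZE = 100

_buffer = []

def log_event(event: dict) -> None:
    """Queue an event; flush once the batch is full."""
    _buffer.append(event)
    if len(_buffer) >= BATCH_SIZE:
        flush()

def flush() -> None:
    """Compress the whole batch and send it as one request, trading
    per-event radio chatter for a single gzipped payload."""
    if not _buffer:
        return
    payload = gzip.compress(json.dumps(_buffer).encode("utf-8"))
    req = urllib.request.Request(
        ANSWERS_ENDPOINT,
        data=payload,
        headers={"Content-Encoding": "gzip",
                 "Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # ship the batch to the endpoint
    _buffer.clear()
```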

The information flows into a Kafka queue, which Twitter said serves as temporary storage. From there, the data gets passed into Amazon Simple Storage Service (Amazon S3), where Twitter retains it in a more permanent location than Kafka. Twitter uses Storm both to process the data that flows into Kafka and to write the information stored in Kafka out to Amazon S3.
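In production this hop runs as a Storm topology, but the shape of the work, draining Kafka and persisting chunks to S3, can be sketched in plain Python. This sketch assumes the kafka-python and boto3 libraries; the topic and bucket names are invented for illustration.

```python
# Minimal sketch of the archival path: consume events from Kafka and
# persist them to S3 in large chunks. Topic, bucket, and key names are
# assumptions, not Twitter's actual configuration.
import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "answers-events",                  # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="s3-archiver",
)
s3 = boto3.client("s3")

batch, batch_num = [], 0
for message in consumer:
    batch.append(message.value)        # raw bytes of each event record
    if len(batch) >= 10_000:           # archive in large chunks
        s3.put_object(
            Bucket="answers-archive",  # hypothetical bucket
            Key=f"archive/batch-{batch_num:08d}.json",
            Body=b"\n".join(batch),    # newline-delimited records
        )
        batch, batch_num = [], batch_num + 1
```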

Data pipeline

With the data stored in Amazon S3, Twitter then uses Amazon Elastic MapReduce (EMR) for batch processing.

From the blog post:
“We write our MapReduce in Cascading and run them via Amazon EMR. Amazon EMR reads the data that we’ve archived in Amazon S3 as input and writes the results back out to Amazon S3 once processing is complete. We detect the jobs’ completion via a scheduler topology running in Storm and pump the output from Amazon S3 into a Cassandra cluster in order to make it available for sub-second API querying.”
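That hand-off can be approximated outside of Storm. Here is a hedged Python sketch, using boto3 and the DataStax Cassandra driver, of detecting a finished job and pumping its output into Cassandra; the bucket, keyspace, and column names are assumptions, though the _SUCCESS completion marker is a standard Hadoop-family convention.

```python
# Sketch of the hand-off the quote describes: poll S3 for a job's
# completion marker, then load its output into Cassandra for querying.
# Bucket, prefix, keyspace, and column names are invented for this example.
import json
import boto3
from cassandra.cluster import Cluster  # DataStax Python driver

s3 = boto3.client("s3")
session = Cluster(["127.0.0.1"]).connect("answers")  # hypothetical keyspace

def job_finished(bucket: str, prefix: str) -> bool:
    """True once the batch job has written its _SUCCESS marker."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/_SUCCESS")
    return resp.get("KeyCount", 0) > 0

def load_output(bucket: str, prefix: str) -> None:
    """Copy each output row from S3 into a Cassandra table."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=f"{prefix}/part-")
    for obj in resp.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        for line in body.splitlines():
            row = json.loads(line)     # assumed newline-delimited JSON
            session.execute(
                "INSERT INTO daily_metrics (app_id, day, metric, value) "
                "VALUES (%s, %s, %s, %s)",
                (row["app_id"], row["day"], row["metric"], row["value"]),
            )

if job_finished("answers-archive", "output/2015-02-17"):
    load_output("answers-archive", "output/2015-02-17")
```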

While this batch processing is going on, Twitter is also processing data in real time, because “some computations run hourly, while others require a full day’s data as input,” it said. To address the computations that need to be performed more quickly and require less data than the bigger batch-processing jobs, Twitter uses an instance of Storm that processes the data sitting in Kafka, the results of which get funneled into an independent Cassandra cluster for real-time querying.

From the blog post:
“To compensate for the fact that we have less time, and potentially fewer resources, in the speed layer than the batch, we use probabilistic algorithms like Bloom Filters and HyperLogLog (as well as a few home grown ones). These algorithms enable us to make order-of-magnitude gains in space and time complexity over their brute force alternatives, at the price of a negligible loss of accuracy.”
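To make that trade-off concrete, here is a self-contained Python sketch of a Bloom filter, one of the probabilistic structures the quote mentions. It answers “have we seen this item before?” in constant space, never returns a false negative, and accepts a small tunable false-positive rate; the sizing constants below are arbitrary, not Twitter’s.

```python
# Self-contained Bloom filter sketch: constant space, no false negatives,
# and a tunable false-positive rate -- the space/accuracy trade described
# in the quote. Sizes here are arbitrary illustration values.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)   # 1 MiBit = 128 KiB of state

    def _positions(self, item: str):
        # Derive k bit positions from salted digests of the item.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen"; False is always definitive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("device-123")
print("device-123" in seen)  # True
print("device-999" in seen)  # almost certainly False
```

At these settings the filter occupies 128 KiB and can track on the order of a hundred thousand IDs at under a 1 percent false-positive rate, regardless of how long the IDs themselves are.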

The complete data-processing system looks like this, tied together by Twitter’s APIs:

Twitter Answers architecture

Because of the way the system is architected, and because the data analyzed in real time is kept separate from the historical data, Twitter said no data will be lost if something goes wrong during real-time processing: all of that data is also stored where Twitter does its batch processing.

If there are problems affecting batch processing, Twitter said its APIs “will seamlessly query for more data from the speed layer,” and it can essentially configure the system to take in “two or three days of data” instead of just one day. This should give Twitter engineers enough time to look at what went wrong while still providing users with the type of analytics derived from batch processing.
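That failover amounts to a per-day merge at query time: prefer the batch view, and fall back to the speed layer for any day the batch job hasn’t produced yet. A minimal sketch, with in-memory dictionaries standing in for the two Cassandra clusters:

```python
# Sketch of the querying pattern described above. The two dictionaries
# are hypothetical stand-ins for the batch and speed-layer Cassandra views.
import datetime

BATCH_VIEW = {}   # (app_id, day) -> value, populated by the EMR pipeline
SPEED_VIEW = {}   # (app_id, day) -> value, populated by the Storm pipeline

def query_metric(app_id: str, start: datetime.date, end: datetime.date) -> dict:
    """Prefer batch results; fall back to the speed layer per missing day."""
    out = {}
    day = start
    while day <= end:
        key = (app_id, day)
        # If the batch job for this day hasn't landed, the speed layer
        # (configured to retain two or three days of data) answers instead.
        out[day.isoformat()] = BATCH_VIEW.get(key, SPEED_VIEW.get(key))
        day += datetime.timedelta(days=1)
    return out
```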
