Twitter details how new home-grown system coordinates data analytics

Twitter has unveiled its new real-time analytics orchestration system, dubbed TSAR (TimeSeries AggregatoR), that will help ease the burden of its engineers. The framework was designed for the purpose of aggregating and automating all the data gathering and calculations of its various analytic systems like Hadoop and Storm into one common framework. With TSAR, as detailed in a blog post, Twitter engineers won’t have to worry about coordinating the actions of each analytics tool as well as allow for different product teams to innovate upon the platform.

While data-management systems typically deal with the problem of sorting out information, they all think about data in different ways. Whereas MySQL stores data in a relational data store, Hadoop, which does batch processing, takes another view of the same set of data so that it can do its own calculations, explained Aaron Siegel, one of the key engineers responsible for TSAR’s development. What TSAR does is coordinate how each one of these systems talk to each other so that if a product team wanted a specific bit of information, TSAR could do the heavy lifting of having to figure out which data system needs to do what.

“We want to enable product teams to focus on building their products,” said Siegel. “They don’t have to think ‘Where is my data coming from?’ They can just build great products and have the data be available.”

Diagram of TSAR and how it fits into Twitter's infrastructure

Diagram of TSAR and how it fits into Twitter’s infrastructure

Before TSAR, it was an operational nightmare to have to manually orchestrate all the various Hadoop and Storm jobs in addition to other data processes, said staff software engineer Reza Lotun. Essentially, every time a product team wanted certain types of insight into Twitter’s vast amounts of data, the engineering team would have to create a custom analytics pipeline to deliver what the product team was looking for. TSAR’s ability to automate these tasks dramatically cuts down on this effort.

Here’s a sample of what the engineering team had to do before TSAR, as detailed in the blog post:

In addition to simply writing the business logic of the impression counts job, one has to build infrastructure to deploy the Hadoop and Storm jobs, build a query service that combines the results of the two pipelines, deploy a process to load data into Manhattan/MySQL etc. A production pipeline requires monitoring and alerting around its various components and we also want checks for data quality.

The TSAR framework was built on top of the Summingbird system, said software engineer and blog post author Anirudh Todi. While Summingbird provided the computational framework that was essentially a hybrid system in which the batch processing power of Hadoop could function together with the stream processing functionality of Storm, TSAR improves upon the system by making it easier for all the different analytics platforms to communicate back and forth.

From the blog post:

TSAR infers and defines the key-value pair data models and relational database schema descriptions automatically via a configuration file and a job specific thrift struct provided by the user. TSAR automates Twitter best practices using a general-purpose reusable aggregation framework. Note that TSAR is not tied to any specific sink. Sinks can easily be added to TSAR by the user and TSAR will transparently begin persisting aggregated data to these sinks.

Currently, a Twitter engineering team is working on a real-time visualization tool to be built on top of TSAR, said Siegel, and other products are on the horizon.

Post and thumbnail images courtesy of Shutterstock user Carlos Amarillo.