How Twitter processes tons of mobile application data each day

It’s only been seven months since Twitter released its Answers tool, which was designed to provide users with mobile application analytics. But in that time, usage has grown to roughly five billion daily sessions, in which “hundreds of millions of devices send millions of events every second to the Answers endpoint,” the company explained in a blog post on Tuesday. Clearly, that’s a lot of data to process, and in the blog post Twitter detailed how it configured its architecture to handle the task.

The backbone of Answers was created to handle how the mobile application data is received, archived, processed in real time and processed in chunks (otherwise known as batch processing).

Each time an organization uses the Answers tool to learn more about how its mobile app is functioning, Twitter logs and compresses all that data (which gets sent in batches) in order to conserve the device’s battery power while also not putting unnecessary strain on the network that routes the data from the device to Twitter’s servers.
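Twitter hasn’t published its client code, but the batch-and-compress idea can be sketched simply. Everything below — the `EventLogger` class, `BATCH_SIZE`, and the JSON-plus-gzip encoding — is an illustrative assumption, not Twitter’s actual SDK:

```python
import gzip
import json

# Hypothetical batch size; a real client would also flush on a timer
# or when the app goes to the background.
BATCH_SIZE = 100

class EventLogger:
    """Buffers analytics events on the device, then emits one compressed
    payload per batch, trading a little latency for battery and network."""

    def __init__(self):
        self.buffer = []

    def log(self, event):
        """Buffer an event; return a compressed batch once the buffer is full."""
        self.buffer.append(event)
        if len(self.buffer) >= BATCH_SIZE:
            return self.flush()
        return None

    def flush(self):
        """Compress the buffered events into a single upload payload."""
        payload = gzip.compress(json.dumps(self.buffer).encode("utf-8"))
        self.buffer = []
        return payload  # in practice, POSTed to the analytics endpoint
```

One payload per hundred events means one radio wake-up instead of a hundred, which is where the battery savings come from.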

The information flows into a Kafka queue, which Twitter said can be used as a temporary place to store data. The data then gets passed into Amazon Simple Storage Service (Amazon S3), where Twitter retains it in more permanent storage than Kafka provides. Twitter uses Storm to process the data that flows into Kafka and also to write the information stored in Kafka to [company]Amazon[/company] S3.

Data pipeline


With the data stored in Amazon S3, Twitter then uses Amazon Elastic MapReduce for batch processing.

From the blog post:
[blockquote person=”Twitter” attribution=”Twitter”]We write our MapReduce in Cascading and run them via Amazon EMR. Amazon EMR reads the data that we’ve archived in Amazon S3 as input and writes the results back out to Amazon S3 once processing is complete. We detect the jobs’ completion via a scheduler topology running in Storm and pump the output from Amazon S3 into a Cassandra cluster in order to make it available for sub-second API querying.[/blockquote]
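The blog post doesn’t show the scheduler topology itself. As a rough sketch of the completion-detection step, one common approach is to poll for the `_SUCCESS` marker file that Hadoop-family jobs conventionally write once all output has been committed; the function names below are hypothetical, not Twitter’s code:

```python
import os
import time

def job_complete(output_dir):
    """Hadoop/EMR jobs conventionally write an _SUCCESS marker file
    into the output directory when all results have been committed."""
    return os.path.exists(os.path.join(output_dir, "_SUCCESS"))

def wait_for_job(output_dir, poll_seconds=60, on_complete=None):
    """Poll for the marker, then hand the output off, e.g. to a loader
    that pumps it into Cassandra for sub-second API querying."""
    while not job_complete(output_dir):
        time.sleep(poll_seconds)
    if on_complete:
        on_complete(output_dir)
```

In Twitter’s described setup, the polling would look at an S3 prefix rather than a local directory, and the `on_complete` step would be the Cassandra load.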

At the same time as this batch processing is going on, Twitter is also processing data in real time, because “some computations run hourly, while others require a full day’s worth of data as input,” it said. To handle the computations that need to be performed more quickly and require less data than the bigger batch processing jobs, Twitter uses an instance of Storm that processes the data sitting in Kafka, the results of which get funneled into an independent Cassandra cluster for real-time querying.

From the blog post:
[blockquote person=”Twitter” attribution=”Twitter”]To compensate for the fact that we have less time, and potentially fewer resources, in the speed layer than the batch, we use probabilistic algorithms like Bloom Filters and HyperLogLog (as well as a few home grown ones). These algorithms enable us to make order-of-magnitude gains in space and time complexity over their brute force alternatives, at the price of a negligible loss of accuracy.[/blockquote]
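To make that trade-off concrete, here is a minimal Bloom filter, one of the probabilistic structures the quote mentions. It answers “definitely not seen” or “probably seen” in constant space, at the cost of occasional false positives. This toy version is an illustration, not Twitter’s implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: constant-space set membership with a small,
    tunable false-positive rate and no false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int doubles as an arbitrary-size bit array

    def _positions(self, item):
        # Derive num_hashes independent bit positions from one hash family.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))
```

Counting distinct devices exactly would need memory proportional to the number of devices; structures like this (and HyperLogLog for cardinality estimates) keep the speed layer’s memory footprint fixed.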

The complete data-processing system looks like this, and it’s tethered together with Twitter’s APIs:

Twitter Answers architecture


Because the system is architected so that the data analyzed in real time is kept separate from the historical data, Twitter said no data will be lost if something goes wrong during the real-time processing: all of that data is also stored where Twitter does its batch processing.

If there are problems affecting batch processing, Twitter said its APIs “will seamlessly query for more data from the speed layer,” and the system can essentially be configured to take in “two or three days of data” instead of just one; this should give Twitter engineers enough time to look at what went wrong while still providing users with the type of analytics derived from batch processing.
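That failover can be sketched as a simple merge rule: serve each requested day from the batch layer when it has finished processing that day, and fall back to the speed layer otherwise. The dictionary-style store interfaces below are assumptions for illustration, not Twitter’s actual API:

```python
def query_metric(metric, day_keys, batch_store, speed_store):
    """Merge a lambda-architecture query: prefer the (exact) batch result
    for each day, falling back to the (approximate) speed-layer result
    for days the batch layer hasn't finished yet."""
    out = {}
    for day in day_keys:
        if (metric, day) in batch_store:
            out[day] = batch_store[(metric, day)]
        else:
            out[day] = speed_store.get((metric, day), 0)
    return out
```

If a batch run fails, its days simply stay absent from the batch store, and the same query transparently pulls those days from the speed layer instead.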

DataStax’s first acquisition is a graph-database company

DataStax, the rising NoSQL database vendor that hawks a commercial version of the open-source Apache Cassandra distributed database, plans to announce on Tuesday that it has acquired graph-database specialist Aurelius, which maintains the open-source graph database Titan.

All of Aurelius’s eight-person engineering staff will be joining DataStax, said Martin Van Ryswyk, DataStax’s executive vice president of engineering. This makes for DataStax’s first acquisition since being founded in 2010. The company did not disclose the purchase price, but Van Ryswyk said that a “big chunk” of DataStax’s recent $106 million funding round was used to help finance the purchase.

Although DataStax has been making a name for itself amid the NoSQL market, where it competes with companies like MongoDB and Couchbase, it’s apparent that the company is branching out a little bit by purchasing a graph-database shop.

Cassandra is a powerful and scalable database used for online or transactional purposes (Netflix and Spotify are users), but it lacks some of the features that make graph databases attractive for some organizations, explained DataStax co-founder and chief customer officer Matt Pfeil. These features include the ability to map out relationships between data points, which is helpful for social networks like Pinterest or [company]Facebook[/company], which use graph architectures to learn about user interests and activities.
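A tiny adjacency-list sketch shows the kind of relationship mapping a graph database models natively; this is a toy illustration, not Titan’s or DSE Graph’s API:

```python
from collections import defaultdict

class PropertyGraph:
    """Toy labeled-edge graph: nodes related by named edges, the core
    structure graph databases are built to store and traverse."""

    def __init__(self):
        self.edges = defaultdict(list)

    def relate(self, src, label, dst):
        self.edges[src].append((label, dst))

    def neighbors(self, node, label):
        return [dst for (lbl, dst) in self.edges[node] if lbl == label]

# Mapping user interests the way a social network might:
g = PropertyGraph()
g.relate("alice", "follows", "bob")
g.relate("alice", "likes", "cycling")
g.relate("bob", "likes", "cycling")
g.relate("bob", "likes", "jazz")

# Interests of the people alice follows — a two-hop traversal:
interests = {topic
             for friend in g.neighbors("alice", "follows")
             for topic in g.neighbors(friend, "likes")}
```

Queries like this multi-hop traversal are awkward to express over Cassandra’s column-family model but are the bread and butter of a graph engine, which is the gap the Aurelius acquisition targets.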

Financial institutions are also interested in graph databases as a way to detect fraud and malicious behavior in their infrastructure, Pfeil said.

As DataStax “started to move up the stack,” the company noticed that its customers were using graph database technology, and DataStax felt it could come up with a product that could give customers what they wanted, said Pfeil.

DataStax Enterprise


Customers don’t just want one database technology, they want a “multi-dimensional approach” that includes Cassandra, search capabilities, analytics and graph technology, and they are willing to plunk down cash for commercial support, explained Van Ryswyk.

Because some open-source developers were already figuring out ways to use Cassandra and the Titan database together, it made sense for DataStax and the Aurelius team to work together on making the enterprise versions of the technologies compatible with each other, Van Ryswyk said.

Together, DataStax and the newly acquired Aurelius team will develop a commercial graph product called DataStax Enterprise (DSE) Graph, which they will try to bring “to the level of scalability that people expect of Cassandra,” said Van Ryswyk. There is no release date yet for the technology, but Pfeil said work on the new product is already under way.

If you’re interested in learning more about what’s going on with big data in the enterprise and what other innovative companies are doing, you’ll want to check out this year’s Structure Data conference from March 18-19 in New York City.

Mesosphere’s new data center mother brain will blow your mind

Mesosphere has been making a name for itself in the world of data centers and cloud computing since 2013 with its distributed-system smarts and a series of open-source releases, each designed to tackle the challenges of running tons of workloads across multiple machines. On Monday, the startup plans to announce that its much-anticipated data center operating system — the culmination of its many technologies — has been released as a private beta and will be available to the public in early 2015.

As part of the new operating system’s launch, [company]Mesosphere[/company] also plans to announce that it has raised a $36 million Series B investment round, which brings its total funding to $50 million. Khosla Ventures, a new investor, drove the financing along with Andreessen Horowitz, Fuel Capital, SV Angel and other unnamed entities.

Mesosphere’s new data center operating system, dubbed DCOS, tackles the complexity of treating all of the machines inside a data center as one giant computer. Similar to how an operating system on a personal computer distributes the necessary resources to all the installed applications, DCOS can supposedly do the same thing across the data center.

The idea comes from the fact that today’s powerful data-crunching applications and services — like Kafka, Spark and Cassandra — span multiple servers, unlike more old-school applications like [company]Microsoft[/company] Excel. Asking developers and operations staff to configure and maintain each individual machine to accommodate the new distributed applications is quite a lot, as Apache Mesos co-creator and new Mesosphere hire Benjamin Hindman explained in an essay earlier this week.

Mesosphere CEO Florian Leibert – Source: Mesosphere

Because of this complexity, the machines are nowhere near running at full steam, said Mesosphere’s senior vice president of marketing and business development, Matt Trifiro.

“85 percent of a data center’s capacity is typically wasted,” said Trifiro. Although developers and operations staff have come a long way to tether pieces of the underlying system together, there hasn’t yet been a nucleus of sorts that successfully links and controls everything.

“We’ve always been talking about it — this vision,” said Mesosphere CEO Florian Leibert. “Slowly but surely the pieces came together; now is the first time we are showing the total picture.”

Building an OS

The new DCOS is essentially a bundle of all of the components Mesosphere has been rolling out — including the Mesos resource management system, the Marathon framework and Chronos job scheduler — as well as third-party applications like the Hadoop file system and YARN.

The DCOS also includes common OS features one would find in Linux or Windows, like a graphical user interface, a command-line interface and a software-development kit.

These types of interfaces and extras are important for DCOS to be a true operating system, explained Leibert. While Mesos can automate the allocation of all the data center resources to many applications, the additional features provide coders and operations staff a centralized hub from which they can monitor their data center as a whole and even program.

“We took the core [Mesos] kernel and built the consumable systems around it,” said Trifiro. “[We] added Marathon, added Chronos and added the easy install of the entire package.”

To get DCOS up and running in a data center, Mesosphere installs a small agent on all Linux OS-based machines, which in turn allows them to be read as an “uber operating system,” explained Leibert. With all of the machines’ operating systems linked up, it’s supposedly easier for distributed applications, like Google’s Kubernetes, to function and receive what they need.

The new graphical interface and command-line interface allows an organization to see a visual representation of all of their data center machines, all the installed distributed applications and how system resources like CPU and memory are being shared.

If a developer wants to install an application in the data center, he or she simply enters install commands in the command-line interface and DCOS should automatically load it up. A visual representation of the app should then appear, along with an indication of which machine nodes are allocating resources to it.

DCOS interface


The same process goes for installing a distributed database like Cassandra; you can now “have it running in a minute or so,” said Leibert.

Installing Cassandra on DCOS


A scheduler built into DCOS takes into account certain variables a developer might want to include in order to decide which machine should deliver resources to which application; this is helpful because the developer can set up the configuration once and DCOS will automatically follow through with the orders.
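A minimal sketch of that kind of resource-aware placement, with hypothetical field names rather than the actual DCOS scheduler logic:

```python
def schedule(task, machines):
    """Pick a machine whose free CPU and memory satisfy the task's
    requirements, preferring the machine with the most free capacity,
    and reserve the resources on it."""
    candidates = [m for m in machines
                  if m["free_cpu"] >= task["cpu"] and m["free_mem"] >= task["mem"]]
    if not candidates:
        return None  # no machine can satisfy the constraints right now
    best = max(candidates, key=lambda m: (m["free_cpu"], m["free_mem"]))
    best["free_cpu"] -= task["cpu"]
    best["free_mem"] -= task["mem"]
    return best["name"]
```

The developer states the constraints (CPU, memory, or any other variable the scheduler is taught about) and the placement decision is made automatically, which is the “follow through with the orders” part.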

“We basically turn the software developer into a data center programmer,” said Leibert.

And because DCOS gives coders something easier to program against, it’s possible that new distributed applications could be built faster than before, since a developer can now write software for a fleet of machines rather than only one.

As of today, DCOS can run in on-premises environments like bare metal and OpenStack, on major cloud providers — like [company]Amazon[/company], [company]Google[/company] and [company]Microsoft[/company] — and it supports Linux variants like CoreOS and Red Hat.

Changing the notion of a data center

Leibert wouldn’t name which organizations are currently trying out DCOS in beta, but it’s easy to imagine that companies like Twitter, Netflix or Airbnb — all users of Mesos — have considered giving it a test drive. Leibert was a former engineer at Twitter and Airbnb, after all.

Beyond the top webscale companies, Mesosphere wants to court legacy enterprises, like those in the financial-services industry, that have existing data centers that aren’t nearly as efficient as those seen at Google.

Banks, for example, typically use “tens of thousands of machines” in their data centers to perform risk analysis, Leibert said. With DCOS, Leibert claims that banks can run the type of complex workloads they require in a more streamlined manner if they were to link up all those machines.

And for these companies that are under tight regulation, Leibert said that Mesosphere has taken security into account.

“We built a security product into this operating system that is above and beyond any open-source system, even as a commercial plugin,” said Leibert.

As for what lies ahead for DCOS, Leibert said that his team is working on new features like distributed checkpointing, which is basically the ability to take a snapshot of a running application so that you can pause your work; the next time you start it up, the data center remembers where it left off and can deliver the right resources as if there wasn’t a break. This method is apparently good for developers working on activities like genome sequencing, he said.

Support for containers is also something Mesosphere will continue to tout, as the startup has been a believer in the technology “even before the hype of [company]Docker[/company],” said Leibert. Containers, with their ability to isolate workloads even on the same machine, are fundamental to DCOS, he said.

Mesosphere believes there will be new container technology emerging, not just the recently announced CoreOS Rocket container technology, explained Trifiro, but as of now, Docker and native Linux cgroup containers are what customers are calling for. If Rocket gains momentum in the marketplace, Trifiro said, Mesosphere will “absolutely implement it.”

If DCOS ultimately lives up to what it promises, managing data centers could become a far less difficult task. With a giant pool of resources at your disposal and an easier way to write new applications to a tethered-together cluster of computers, it’s possible that next-generation applications could be developed and managed far more easily than they used to be.

Correction: This post was updated at 8:30 a.m. to correctly state Leibert’s previous employers. He worked at Airbnb, not Netflix.