LinkedIn open sources stream-processing engine Samza, its take on Storm

LinkedIn has open sourced a technology called Samza, which the company uses to process data in real time. It sounds an awful lot like Storm, the de facto stream-processing technology for web properties that has a home inside Twitter, only Samza is built on top of Hadoop and utilizes LinkedIn’s homemade Kafka messaging system.

But Storm and Samza are rather similar. As LinkedIn’s Chris Riccomini wrote in the blog post introducing Samza, “[It] helps you build applications that process feeds of messages—update databases, compute counts and other aggregations, transform messages, and a lot more.” Those are some classic Storm applications and, indeed, the Samza documentation includes a page dedicated to comparing the two systems.

When LinkedIn was spreading the word of Samza through various forums and other online communities last month, one commenter on Grokbase noted the possible benefits of Samza:

“Like many we use Storm for near real-time processing our Kafka based streams. In addition we send this data to Hadoop for offline analysis. Consolidating these three environments to one is a win by itself.”

On paper, Samza seems like a good idea because of that consolidation and because of how well it appears to marry its two big components. Its Apache Software Foundation project home page lays out some of its features and highlights how Kafka and YARN (the resource-management framework at the core of Hadoop version 2.0) work together. Among the highlights are:

  • Fault tolerance: Samza will work with YARN to restart your stream processor if there is a machine or processor failure.
  • Durability: Samza uses Kafka to guarantee that messages will be processed in the order they were written to a partition, and that no messages will ever be lost.
  • Scalability: Samza is partitioned and distributed at every level. Kafka provides ordered, partitioned, re-playable, fault-tolerant streams. YARN provides a distributed environment for Samza containers to run in.
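
To make those three bullet points a little more concrete, here is a rough Python sketch of the general idea: an ordered, partitioned, replayable log feeding a task that counts messages and recovers after a failure. It does not use the Samza or Kafka APIs. `PartitionedLog`, `CountingTask`, the `fail_at` switch and the checkpoint-every-two-messages policy are all hypothetical names and choices made up for illustration.

```python
import zlib
from collections import Counter


class PartitionedLog:
    """Toy stand-in for a partitioned, replayable stream (not Kafka's API)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # A stable hash sends every message with the same key to the same partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def read(self, partition, offset):
        # Messages come back in write order, starting at `offset`.
        return list(enumerate(self.partitions[partition]))[offset:]


class CountingTask:
    """Hypothetical task that counts messages per key for one partition."""

    def __init__(self, log, partition):
        self.log = log
        self.partition = partition
        self.counts = Counter()
        # Checkpoint = (next offset to read, counts as of that offset).
        self.checkpoint = (0, Counter())

    def run(self, checkpoint_every=2, fail_at=None):
        # Always resume from the last checkpoint, discarding uncheckpointed work.
        offset, saved = self.checkpoint
        self.counts = Counter(saved)
        for off, (key, _value) in self.log.read(self.partition, offset):
            if off == fail_at:
                raise RuntimeError("simulated processor failure")
            self.counts[key] += 1
            if (off + 1) % checkpoint_every == 0:
                self.checkpoint = (off + 1, Counter(self.counts))
        return self.counts


log = PartitionedLog(num_partitions=1)
for user in ["alice", "bob", "alice", "carol", "alice"]:
    log.append(user, "page_view")

task = CountingTask(log, partition=0)
try:
    task.run(fail_at=3)  # crash partway through the stream
except RuntimeError:
    pass  # a supervisor (YARN's role in the article) would restart the task

counts = task.run()  # replay from the last checkpoint and finish
print(dict(counts))
```

Because the log keeps every message in order and the task saves its offset together with its state, the restarted run replays only what came after the last checkpoint and nothing is dropped or counted twice in this toy version. Real systems have to make the same trade-offs with far more care, so treat this as an illustration of the concept rather than of how Samza is implemented.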

A high-level look at what Samza does

Whether Samza will ultimately attract the same amount of community involvement and innovation as Storm remains to be seen. LinkedIn, one of the better data engineering shops around, will certainly keep working on Samza just as Twitter continues to develop Storm, but the latter’s head start in terms of availability and its ability to run atop either the YARN or Mesos frameworks does make it a little more flexible.

If Samza is indicative of a bigger picture, though, it’s that YARN appears to be living up to all the hype the Hadoop community has been heaping upon it for the past 18 months. It runs Storm, it runs Samza, it could potentially run a lot of things. This matters because a lot of software vendors big, small and in between have banked their big data futures (some, their entire future) on Hadoop as the platform that will ultimately carry the day.

The technology’s previous reliance on MapReduce limited its applicability and opened it up to criticism, but YARN is already opening up Hadoop to some very large-scale stream processing, interactive SQL query, machine learning and graph processing workloads. With each passing day, the idea of Hadoop as the platform underpinning entire libraries of big data applications seems more realistic.

But you don’t have to take my word for any of this. If you’re in London, swing by our Structure: Europe conference taking place Wednesday and Thursday, or watch the live stream wherever you are. We’ll have LinkedIn’s data engineering boss Bhaskar Ghosh, among a slew of other speakers responsible at one point or another for managing some of the biggest systems on the web. We’ll also have technology executives from international corporations such as BMW. And I’m sure they all have an opinion on big data and the right technologies for doing it.

Feature image courtesy of Shutterstock user SP-Photo.