Hadoop software company Cloudera has worked with Google to make Google’s Dataflow programming model run on Apache Spark. Dataflow, which Google announced as a cloud service in June, lets programmers write the same code for both batch- and stream-processing tasks. Spark is becoming a popular environment for running both types of tasks, even in Hadoop environments.
Cloudera has open sourced the code as part of its Cloudera Labs program. [company]Google[/company] had previously open sourced the Dataflow SDK that Cloudera used to carry out this work.
Cloudera’s Josh Wills explains the promise of Dataflow like this in a Tuesday-morning blog post:
[T]he streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data. Crucially, the client API for the batch and the stream processing engines are identical, so that any operation that can be performed in one context can also be performed in the other, and moving a pipeline from batch mode to streaming mode should be as seamless as possible.
Essentially, Dataflow should make it easier to build reliable big data pipelines than previous architectural models, which often involved managing Hadoop MapReduce for batch processing and something like Storm for stream processing. Running Dataflow on Spark means there is a single set of APIs and a single processing engine, which happens to be significantly faster than MapReduce for most jobs.
While Dataflow on Spark won’t end problems of complexity, speed or scale, it’s another step along a path that has resulted in faster, easier big data technologies at every turn. Hadoop was too slow for a lot of applications and too complicated for a lot of users, but those barriers to use are falling away fast.
If you want to hear more about where the data infrastructure space is headed, come to our Structure Data conference March 18-19 in New York. Speakers include Eric Brewer, vice president of infrastructure at Google; Ion Stoica, co-creator of Apache Spark and CEO of [company]Databricks[/company]; and Tom Reilly, Rob Bearden and John Schroeder, the CEOs of Cloudera, [company]Hortonworks[/company] and MapR, respectively.