Cloudera tunes Google’s Dataflow to run on Spark

Hadoop software company Cloudera has worked with Google to make Google’s Dataflow programming model run on Apache Spark. Dataflow, which Google announced as a cloud service in June, lets programmers write the same code for both batch- and stream-processing tasks. Spark is becoming a popular environment for running both types of tasks, even in Hadoop environments.

Cloudera has open sourced the code as part of its Cloudera Labs program. [company]Google[/company] had previously open sourced the Dataflow SDK that Cloudera used to carry out this work.

Cloudera’s Josh Wills explains the promise of Dataflow like this in a Tuesday-morning blog post:

[T]he streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data. Crucially, the client API for the batch and the stream processing engines are identical, so that any operation that can be performed in one context can also be performed in the other, and moving a pipeline from batch mode to streaming mode should be as seamless as possible.

Essentially, Dataflow should make it easier to build reliable big data pipelines than previous architectural models, which often involved managing Hadoop MapReduce for batch processing and something like Storm for stream processing. Running Dataflow on Spark means there is a single set of APIs and a single processing engine, which happens to be significantly faster than MapReduce for most jobs.

While Dataflow on Spark won’t end problems of complexity, speed or scale, it’s another step along a path that has resulted in faster, easier big data technologies at every turn. Hadoop was too slow for a lot of applications and too complicated for a lot of users, but those barriers to use are falling away fast.

If you want to hear more about where the data infrastructure space is headed, come to our Structure Data conference March 18-19 in New York. Speakers include Eric Brewer, vice president of infrastructure at Google; Ion Stoica, co-creator of Apache Spark and CEO of [company]Databricks[/company]; and Tom Reilly, Rob Bearden and John Schroeder, the CEOs of Cloudera, [company]Hortonworks[/company] and MapR, respectively.

Google adds a big data service and lots of monitoring to its cloud

Google rolled out a slew of new cloud services at I/O, including one called Dataflow that’s meant to put standard MapReduce to shame. It’s advertised a much simpler way to build data pipelines that can handle both batch processing and streaming data.

How dataflow powered Mozilla’s real-time download counter

When Mozilla released Firefox 4 in early 2011, it displayed a real-time download counter on its website. How could the site process that much data that quickly? SQLstream CEO Damian Black explained at GigaOM’s Structure conference Thursday that the secret was dataflow.