Spark is a really big deal for big data, and Cloudera gets it

One of the most important pieces of news to come out of this year’s Hadoop World conference might be Cloudera’s decision to provide full enterprise-grade support for Apache Spark: paid product support and inclusion in Cloudera’s Hadoop distribution, not just a technological integration. It’s further proof that the Hadoop workloads of the future could look very different from those of the past and present.

Spark is an in-memory data-processing platform that is compatible with Hadoop data sources but runs much faster than Hadoop MapReduce. It’s particularly well suited for machine learning jobs, as well as interactive data queries, and is easier for many developers because it includes APIs in Scala, Python and Java. Spark is in use at a number of large web companies and web startups already, and a startup called Databricks that aims to commercialize Spark recently launched with $14 million in venture capital in its pocket.

In fact, Databricks — the first partner in the Cloudera Connect: Innovators program — will help provide Cloudera’s support and will also work with the larger Hadoop vendor on future development of the Apache Spark software, Cloudera Co-founder and CTO Amr Awadallah told me. Awadallah described the partnership — and presumably all future Innovator partnerships — as one centered around a core big data technology with a growing community and interest from Cloudera’s customers. Databricks gets paid as part of the arrangement (“It’s kind of like an OEM relationship,” Awadallah said) but the deal only extends to Apache Spark and not any commercial version Databricks might put out.

You can get more details on the partnership, as well as Databricks’ plans, on the company’s blog.

Beyond the Databricks partnership, though, the mere idea of Spark support (and Hortonworks’ quest to make the Storm stream-processing engine enterprise-ready) is important because it’s an acknowledgement that Hadoop, as a platform for running MapReduce, was never the recipe for long-term success. Now that the YARN resource-management layer has achieved production-ready status, Cloudera and Hortonworks aren’t wasting any time making Hadoop ready for the workloads of the future. Frankly, I don’t see how the Hadoop doubters of the past few years can maintain that stance in the face of Hadoop essentially becoming a scalable, open-source data layer that can support, in theory, any type of processing you can throw on top of it.

And although these efforts aren’t exactly what I had in mind when I wrote recently about Hadoop vendors taking advantage of the community, they’re definitely a good start.

As Awadallah noted, MapReduce will still be around for quite a while, if only because so many of today’s workloads and integrations with other products are based on it. But Hadoop’s future as the de facto all-purpose data-processing platform certainly looks a lot better.

Feature image courtesy of Shutterstock user Ase.