For now, Spark looks like the future of big data

Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop wasn’t the star of the show. Based on the news I saw coming out of the event, it’s another Apache project — Spark — that has people excited.

There was, of course, some big Hadoop news this week. Pivotal announced it’s open sourcing its big data technology and essentially building its Hadoop business on top of the [company]Hortonworks[/company] platform. Cloudera announced it earned $100 million in 2014. Lost in the grandstanding was MapR, which announced something potentially compelling in the form of cross-data-center replication for its MapR-DB technology.

But pretty much everywhere else you looked, it was technology companies lining up to support Spark: Databricks (naturally), Intel, Altiscale, MemSQL, Qubole and ZoomData among them.

Spark isn’t inherently competitive with Hadoop — in fact, it was designed to work with Hadoop’s file system and is a major focus of every Hadoop vendor at this point — but it kind of is. Spark is known primarily as an in-memory data-processing framework that’s faster and easier than MapReduce, but it’s actually a lot more. Among the other projects included under the Spark banner are file system, machine learning, stream processing, NoSQL and interactive SQL technologies.

The Spark platform, minus the Tachyon file system and some younger related projects.

The Spark platform, minus the Tachyon file system and some younger related projects.

In the near term, it probably will be that Hadoop pulls Spark into the mainstream because Hadoop is still at least a cheap, trusted big data storage platform. And with Spark still being relatively immature, it’s hard to see too many companies ditching Hadoop MapReduce, Hive or Impala for their big data workloads quite yet. Wait a few years, though, and we might start seeing some more tension between the two platforms, or at least an evolution in how they relate to each other.

This will be especially true if there’s a big breakthrough in RAM technology or prices drop to a level that’s more comparable to disk. Or if Databricks can convince companies they want to run their workloads in its nascent all-Spark cloud environment.

Attendees at our Structure Data conference next month in New York can ask Spark co-creator and Databricks CEO Ion Stoica all about it — what Spark is, why Spark is and where it’s headed. Coincidentally, Spark Summit East is taking place the exact same days in New York, where folks can dive into the nitty gritty of working with the platform.

There were also a few other interesting announcements this week that had nothing to do with Spark, but are worth noting here:

  • [company]Microsoft[/company] added Linux support for its HDInsight Hadoop cloud service, and Python and R programming language support for its Azure ML cloud service. The latter also now lets users deploy deep neural networks with a few clicks. For more on that, check out the podcast interview with Microsoft Corporate Vice President of Machine Learning (and Structure Data speaker) Joseph Sirosh embedded below.
  • [company]HP[/company] likes R, too. It announced a product called HP Haven Predictive Analytics that’s powered by a distributed version of R developed by HP Labs. I’ve rarely heard HP and data science in the same sentence before, but at least it’s trying.
  • [company]Oracle[/company] announced a new analytic tool for Hadoop called Big Data Discovery. It looks like a cross between Platfora and Tableau, and I imagine will be used primarily by companies that already purchase Hadoop in appliance form from Oracle. The rest will probably keep using Platfora and Tableau.
  • [company][/company] furthered its newfound business intelligence platform with a handful of features designed to make the product easier to use on mobile devices. I’m generally skeptical of Salesforce’s prospects in terms of stealing any non-Salesforce-related analytics from Tableau, Microsoft, Qlik or anyone else, but the mobile angle is compelling. The company claims more than half of user engagement with the platform is via mobile device, which its Director of Product Marketing Anna Rosenman explained to me as “a really positive testament that we have been able to replicate a consumer interaction model.”

If I missed anything else that happened this week, or if I’m way off base in my take on Hadoop and Spark, please share in the comments.

[soundcloud url=”″ params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

Survey reveals a few interesting numbers about Apache Spark

A new survey from startups Databricks and Typesafe revealed some interesting insights into how software developers are using the Apache Spark data-processing framework. Spark is an open source project that has attracted a lot of attention — and a lot of investment — over the past couple years as a faster, easier alternative to MapReduce for processing big data.

The survey included responses from more than 2,100 people, although considering the sources of the survey, the results are probably a bit biased toward Spark. Databricks, whose CEO Ion Stoica will be speaking at our Structure Data conference in March, is in the Spark business and its co-founders created the technology. Typesafe is focused on helping developers build next-generation applications, particularly by using the Scala language. One of Spark’s big selling points is its native support for Scala.

Elsewhere in the world, Hadoop, Spark’s much-larger predecessor and the platform for many Spark deployments, is still slowly working its way into the mainstream. This chart from the survey helps explain the type of respondents we’re dealing with:


Here are some of the findings about Spark use, specifically:

  • 13 percent of respondents are currently using Spark in production, while 51 percent are evaluating it and/or planning to use it in 2015. 28 percent said they have never heard of it.
  • The biggest use cases for Spark are faster batch processing (78 percent) and stream processing (60 percent).
  • A majority of respondents, 62 percent, use the Hadoop Distributed File System as data source for Spark. Other popular data sources include “databases” (46 percent), Apache Kafka (41 percent) and Amazon S3 (29 percent).
  • 56 percent of respondents run standalone Spark clusters, while 42 percent run it on Hadoop’s YARN framework. 26 percent run it on Apache Mesos, and 20 percent run it on Apache Cassandra.

You can download the whole thing here.

If I took away one thing from this survey, it’s that early adopters pretty clearly see Spark as the processing engine for a lot of workloads going forward, possibly relegating Hadoop to handling storage, cluster management and perhaps, with MapReduce, existing batch jobs that aren’t too time-sensitive. With a notable exception around interactive SQL queries, this actually sounds a lot like the future Hadoop software vendor Cloudera envisions for Spark.

Databricks announces a Spark cloud and $33M in venture capital

Big data startup Databricks keeps humming along, announcing on Monday a large round of venture capital and a new cloud service that aims to seed adoption of Spark — a framework it says is faster, easier and more versatile than other options.

4 reasons why Spark could jolt Hadoop into hyperdrive

Apache Spark might push MapReduce to the back burner faster than some people might like, but it will also boost the Hadoop overall ecosystem. The project’s co-creator Matei Zaharia explains why Spark is so popular now and where it fits into the big data ecosystem.

Spark is now part of MapR’s Hadoop distro, too

MapR is the latest Hadoop vendor to embrace Apache Spark, adding the entire Spark stack of technologies to its distribution. It’s a smart move by MapR, but just more validation that Spark might be the data-processing framework of the future.

Spark is a really big deal for big data, and Cloudera gets it

Cloudera has partnered with a startup called Databricks to integrate and support the Apache Spark data-processing platform within Cloudera’s Hadoop software. Spark, which is designed for speed and usability, is one of several technologies pushing Hadoop beyond MapReduce.