Survey reveals a few interesting numbers about Apache Spark

A new survey from startups Databricks and Typesafe revealed some interesting insights into how software developers are using the Apache Spark data-processing framework. Spark is an open source project that has attracted a lot of attention, and a lot of investment, over the past couple of years as a faster, easier alternative to MapReduce for processing big data.

The survey included responses from more than 2,100 people, although considering the sources of the survey, the results are probably a bit biased toward Spark. Databricks, whose CEO Ion Stoica will be speaking at our Structure Data conference in March, is in the Spark business and its co-founders created the technology. Typesafe is focused on helping developers build next-generation applications, particularly by using the Scala language. One of Spark’s big selling points is its native support for Scala.

In the wider world, Hadoop, Spark’s much-larger predecessor and the platform for many Spark deployments, is still slowly working its way into the mainstream. A chart included in the survey helps explain the type of respondents we’re dealing with.


Here are some of the findings about Spark use, specifically:

  • 13 percent of respondents are currently using Spark in production, while 51 percent are evaluating it and/or planning to use it in 2015. 28 percent said they have never heard of it.
  • The biggest use cases for Spark are faster batch processing (78 percent) and stream processing (60 percent).
  • A majority of respondents, 62 percent, use the Hadoop Distributed File System as a data source for Spark. Other popular data sources include “databases” (46 percent), Apache Kafka (41 percent) and Amazon S3 (29 percent).
  • 56 percent of respondents run standalone Spark clusters, while 42 percent run Spark on Hadoop’s YARN framework. 26 percent run it on Apache Mesos, and 20 percent run it on Apache Cassandra.

You can download the whole thing here.

If I take away one thing from this survey, it’s that early adopters pretty clearly see Spark as the processing engine for a lot of workloads going forward, possibly relegating Hadoop to handling storage, cluster management and perhaps, with MapReduce, existing batch jobs that aren’t too time-sensitive. With a notable exception around interactive SQL queries, this actually sounds a lot like the future that Hadoop software vendor Cloudera envisions for Spark.