A team of professors who created the in-memory Spark and Shark platforms for analyzing big data has raised $13.9 million to commercialize those products. The company is still in stealth mode, but it’s called Databricks, and Andreessen Horowitz led the round. The only information on the company’s website is, “We are using cutting-edge technology based on years of research to build next-generation software for analyzing and extracting value from data. We created Apache Spark and Shark, and are deeply committed to open-source.”
It also lists Databricks’ very impressive board of directors: Co-founder and CEO Ion Stoica (University of California, Berkeley professor and co-founder and CTO of Conviva); Co-founder and CTO Matei Zaharia (MIT professor); Ben Horowitz (general partner at Andreessen Horowitz and former Opsware co-founder and CEO); and Scott Shenker (University of California, Berkeley professor and former Nicira co-founder and CEO). Stoica, Zaharia and Shenker have all been heavily involved in the creation of Spark and Shark, which are part of the UC-Berkeley AMPLab institution. Spark is also an Apache incubator project.
For those not familiar with Spark, it is a big data platform written in Scala and designed to run very fast. Stoica wasn’t much more forthcoming on details during a recent phone call, but he did explain the promise of Spark as compared with Hadoop MapReduce. Essentially, he said, it’s up to 100 times faster if your dataset can fit in memory, but it’s built to be significantly faster even on disk. It’s also architected differently than MapReduce in ways that make it ideal for machine learning algorithms and data mining workloads, where users might want to iterate on existing results or repeatedly query a dataset with low latency.
Shark is shorthand for “Hive on Spark,” which really means it’s a data warehousing framework compatible with Apache Hive but designed to run atop Spark rather than Hadoop MapReduce. Hive has become very popular as the de facto method of running SQL-like queries over data stored in Hadoop, but recently Hadoop vendors Cloudera and Hortonworks have undertaken their own efforts to either speed up Hive (which is slow because it relies on MapReduce) or eliminate it altogether for interactive queries. The Shark team claims it’s up to 100 times faster than Hive when running in memory.
It’s important to note, though, that Spark isn’t really an alternative to Hadoop as much as it is an alternative to MapReduce (and to Hive, with Shark). Many, many companies are already storing their data in the Hadoop Distributed File System, and Spark is designed to be compatible with HDFS. Especially with the advent of Apache YARN and Apache Mesos (another AMPLab creation), it’s very possible Spark could run alongside Hadoop MapReduce or Hive in the same cluster.
The interesting thing to watch, though, will be how competitive Databricks ends up being with Hadoop vendors such as Cloudera, Hortonworks and MapR. I seriously doubt it wants to get into the business of managing and supporting big data clusters from the servers up, but Databricks certainly could ding licensing and support revenues on Cloudera Impala and other non-MapReduce processing frameworks for Hadoop. If companies have yet to make the big leap into the Hadoop pool, it’s conceivable they could opt to go with a Spark-based stack from the get-go.
But we’ll see for sure what’s up when Databricks takes the wraps off its software in the next few months.
Update: This post was updated at 6:16 a.m. to correct Ion Stoica’s title. He is co-founder and CTO of Conviva.