For now, Spark looks like the future of big data

Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week, but Hadoop wasn’t the star of the show. Based on the news I saw coming out of the event, it’s another Apache project — Spark — that has people excited.

There was, of course, some big Hadoop news this week. Pivotal announced it’s open sourcing its big data technology and essentially building its Hadoop business on top of the Hortonworks platform. Cloudera announced it earned $100 million in 2014. Lost in the grandstanding was MapR, which announced something potentially compelling in the form of cross-data-center replication for its MapR-DB technology.

But pretty much everywhere else you looked, it was technology companies lining up to support Spark: Databricks (naturally), Intel, Altiscale, MemSQL, Qubole and ZoomData among them.

Spark isn’t inherently competitive with Hadoop — in fact, it was designed to work with Hadoop’s file system and is a major focus of every Hadoop vendor at this point — and yet, in practice, it sort of is. Spark is known primarily as an in-memory data-processing framework that’s faster and easier to use than MapReduce, but it’s actually a lot more. The projects under the Spark banner also include file system, machine learning, stream processing, NoSQL and interactive SQL technologies.

The Spark platform, minus the Tachyon file system and some younger related projects.

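Much of Spark’s advantage over MapReduce comes from its API: a job is a single chain of transformations that the engine keeps in memory, rather than separate map and reduce stages that write to disk between steps. Here is a rough sketch of Spark’s classic word-count flow in plain Python (an illustration of the programming model only, not actual PySpark code):

```python
from collections import defaultdict

lines = ["spark is fast", "spark is easy", "hadoop is storage"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]

# map: turn each word into a (word, 1) pair
pairs = [(w, 1) for w in words]

# reduceByKey: sum the counts for each word
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # e.g. {'spark': 2, 'is': 3, ...}
```

In real PySpark the same chain reads roughly `sc.parallelize(lines).flatMap(...).map(...).reduceByKey(add)`, with each step distributed across the cluster and intermediate results kept in memory.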
In the near term, Hadoop will probably be what pulls Spark into the mainstream, because Hadoop is still, at the least, a cheap, trusted big data storage platform. And with Spark still relatively immature, it’s hard to see many companies ditching Hadoop MapReduce, Hive or Impala for their big data workloads quite yet. Wait a few years, though, and we might start seeing more tension between the two platforms, or at least an evolution in how they relate to each other.

This will be especially true if there’s a big breakthrough in RAM technology, or if memory prices drop to a level more comparable to disk. Or if Databricks can convince companies to run their workloads in its nascent all-Spark cloud environment.

Attendees at our Structure Data conference next month in New York can ask Spark co-creator and Databricks CEO Ion Stoica all about it — what Spark is, why Spark is and where it’s headed. Coincidentally, Spark Summit East is taking place the same days in New York, where folks can dive into the nitty-gritty of working with the platform.

There were also a few other interesting announcements this week that had nothing to do with Spark, but are worth noting here:

  • Microsoft added Linux support for its HDInsight Hadoop cloud service, and Python and R programming language support for its Azure ML cloud service. The latter also now lets users deploy deep neural networks with a few clicks. For more on that, check out the podcast interview with Microsoft Corporate Vice President of Machine Learning (and Structure Data speaker) Joseph Sirosh embedded below.
  • HP likes R, too. It announced a product called HP Haven Predictive Analytics that’s powered by a distributed version of R developed by HP Labs. I’ve rarely heard HP and data science mentioned in the same sentence before, but at least it’s trying.
  • Oracle announced a new analytic tool for Hadoop called Big Data Discovery. It looks like a cross between Platfora and Tableau, and I imagine it will be used primarily by companies that already purchase Hadoop in appliance form from Oracle. The rest will probably keep using Platfora and Tableau.
  • Salesforce furthered its newfound business intelligence platform with a handful of features designed to make the product easier to use on mobile devices. I’m generally skeptical of Salesforce’s prospects when it comes to stealing any non-Salesforce-related analytics business from Tableau, Microsoft, Qlik or anyone else, but the mobile angle is compelling. The company claims more than half of user engagement with the platform happens via mobile device, which its director of product marketing, Anna Rosenman, explained to me as “a really positive testament that we have been able to replicate a consumer interaction model.”

If I missed anything else that happened this week, or if I’m way off base in my take on Hadoop and Spark, please share in the comments.

[Embedded podcast: interview with Microsoft Corporate Vice President of Machine Learning Joseph Sirosh]

Survey reveals a few interesting numbers about Apache Spark

A new survey from startups Databricks and Typesafe revealed some interesting insights into how software developers are using the Apache Spark data-processing framework. Spark is an open source project that has attracted a lot of attention — and a lot of investment — over the past couple of years as a faster, easier alternative to MapReduce for processing big data.

The survey included responses from more than 2,100 people, although, considering its sources, the results are probably a bit biased toward Spark. Databricks, whose CEO Ion Stoica will be speaking at our Structure Data conference in March, is in the Spark business, and its co-founders created the technology. Typesafe is focused on helping developers build next-generation applications, particularly using the Scala language. One of Spark’s big selling points is its native support for Scala.

Elsewhere in the world, Hadoop, Spark’s much-larger predecessor and the platform for many Spark deployments, is still slowly working its way into the mainstream. This chart from the survey helps explain the type of respondents we’re dealing with:


Here are some of the findings about Spark use, specifically:

  • 13 percent of respondents are currently using Spark in production, while 51 percent are evaluating it and/or planning to use it in 2015. Another 28 percent said they had never heard of it.
  • The biggest use cases for Spark are faster batch processing (78 percent) and stream processing (60 percent).
  • A majority of respondents, 62 percent, use the Hadoop Distributed File System as a data source for Spark. Other popular data sources include “databases” (46 percent), Apache Kafka (41 percent) and Amazon S3 (29 percent).
  • 56 percent of respondents run standalone Spark clusters, while 42 percent run it on Hadoop’s YARN framework. 26 percent run it on Apache Mesos, and 20 percent run it on Apache Cassandra.

You can download the whole thing here.

If I took away one thing from this survey, it’s that early adopters pretty clearly see Spark as the processing engine for a lot of workloads going forward, possibly relegating Hadoop to handling storage, cluster management and, via MapReduce, existing batch jobs that aren’t too time-sensitive. With a notable exception around interactive SQL queries, this sounds a lot like the future that Hadoop software vendor Cloudera envisions for Spark.

Cloudera tunes Google’s Dataflow to run on Spark

Hadoop software company Cloudera has worked with Google to make Google’s Dataflow programming model run on Apache Spark. Dataflow, which Google announced as a cloud service in June, lets programmers write the same code for both batch- and stream-processing tasks. Spark is becoming a popular environment for running both types of tasks, even in Hadoop environments.

Cloudera has open sourced the code as part of its Cloudera Labs program. Google had previously open sourced the Dataflow SDK that Cloudera used to carry out this work.

Cloudera’s Josh Wills explains the promise of Dataflow like this in a Tuesday-morning blog post:

[T]he streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data. Crucially, the client API for the batch and the stream processing engines are identical, so that any operation that can be performed in one context can also be performed in the other, and moving a pipeline from batch mode to streaming mode should be as seamless as possible.

Essentially, Dataflow should make it easier to build reliable big data pipelines than previous architectural models, which often involved managing Hadoop MapReduce for batch processing and something like Storm for stream processing. Running Dataflow on Spark means there is a single set of APIs and a single processing engine, which happens to be significantly faster than MapReduce for most jobs.
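Wills’ point about identical client APIs is the crux: a pipeline written against the Dataflow model doesn’t care whether its input is bounded or unbounded, so moving a job from batch to streaming doesn’t mean rewriting it. The idea can be sketched in plain Python (a conceptual illustration only, not the actual Dataflow SDK, which is Java):

```python
def pipeline(source):
    """One transform chain that runs over batch or streaming input,
    because it assumes nothing about the source beyond iterability."""
    for event in source:
        value = event * 2      # a transform step
        if value > 4:          # a filter step
            yield value

# Batch mode: the source is a finite collection.
batch_out = list(pipeline([1, 2, 3, 4]))      # [6, 8]

# Streaming mode: the same pipeline consumes a generator that could,
# in principle, be unbounded.
def sensor_stream():
    for reading in (5, 1, 9):  # stand-in for an endless feed
        yield reading

stream_out = list(pipeline(sensor_stream()))  # [10, 18]
```

That is the appeal over the old two-system approach: one definition of the pipeline, two execution modes.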

While Dataflow on Spark won’t end problems of complexity, speed or scale, it’s another step along a path that has resulted in faster, easier big data technologies at every turn. Hadoop was too slow for a lot of applications and too complicated for a lot of users, but those barriers to use are falling away fast.

If you want to hear more about where the data infrastructure space is headed, come to our Structure Data conference March 18-19 in New York. Speakers include Eric Brewer, vice president of infrastructure at Google; Ion Stoica, co-creator of Apache Spark and CEO of Databricks; and Tom Reilly, Rob Bearden and John Schroeder, the CEOs of Cloudera, Hortonworks and MapR, respectively.

The 5 stories that defined the big data market in 2014

There is no other way to put it: 2014 was a huge year for the big data market. It seems years of talk about what’s possible are finally giving way to some real action on the technology front — and there’s a wave of cash following close behind it.

Here are the five stories from the past year that were meaningful in their own rights, but really set the stage for bigger things to come. We’ll discuss many of these topics in depth at our Structure Data conference in March, but until then feel free to let me know in the comments what I missed, where I went wrong or why I’m right.

5. Satya Nadella takes the reins at Microsoft

Microsoft CEO Satya Nadella has long understood how important data is to the company’s long-term survival, and his ascendance to the top spot ensures Microsoft won’t lose sight of that. Since Nadella was appointed CEO in February, we’ve already seen Microsoft embrace the internet of things and roll out new data-centric products such as Cortana, Skype Translate and Azure Machine Learning. Microsoft has been a major player in nearly every facet of IT for decades, and how it executes in today’s data-driven world might dictate how long it remains in the game.

Microsoft CEO Satya Nadella speaks at a Microsoft Cloud event. Photo by Jonathan Vanian/Gigaom

4. Apache Spark goes legit

It was inevitable that the Spark data-processing framework would become a top-level project within the Apache Software Foundation, but the formal designation felt like an official passing of the torch nonetheless. Spark promises to deliver the speed and usability that MapReduce never could for the Hadoop ecosystem, so it’s no wonder Hadoop vendors, open source projects and even some forward-thinking startups are all betting big on the technology. Databricks, the first startup trying to commercialize Spark, has benefited from this momentum as well.

Spark co-creator and Databricks CEO Ion Stoica.

3. IBM bets its future on Watson

Big Blue might have abandoned its server and microprocessor businesses, but IBM is doubling down on cognitive computing and expects its new Watson division to grow into a $10 billion business. The company hasn’t wasted any time trying to get the technology into users’ hands — it has since announced numerous research and commercial collaborations, highlighted applications built atop Watson and even worked Watson tech into the IBM cloud platform and a user-friendly analytics service. IBM’s experiences with Watson won’t only affect its bottom line; they could be a strong indicator of how enterprises will ultimately use artificial intelligence software.

A shot of IBM’s new Watson division headquarters in Manhattan.

2. Google buys DeepMind

It’s hard to find a more exciting technology field than artificial intelligence right now, and deep learning is the force behind a lot of that excitement. Although there were myriad acquisitions, startup launches and research breakthroughs in 2014, it was Google’s acquisition of London-based startup DeepMind in January that set the tone for the year. The price tag, rumored to be anywhere from $450 million to $628 million, got the mainstream technology media paying attention, and it let believers in the technology (including those at competing companies) know just how important deep learning is to Google.

Google’s Jeff Dean talks about early deep learning results at Structure 2013.

1. Hortonworks goes public

Cloudera’s massive (and somewhat convoluted) deal with Intel boosted the company’s valuation past $4 billion and sent industry-watchers atwitter, but the Hortonworks IPO in December was the real game-changer. It came faster and was more successful than most people expected, and it should put pressure on rivals Cloudera and MapR to act in 2015. With a billion-plus-dollar market cap and the public market’s trust, Hortonworks can afford to scale its business and technology — and maybe even steal some valuable mindshare — as the three companies vie to own what could be a humongous software market in a few years’ time.


Hortonworks rings the opening bell on its IPO day.

Honorable mentions

Databricks announces a Spark cloud and $33M in venture capital

Big data startup Databricks keeps humming along, announcing on Monday a large round of venture capital and a new cloud service that aims to seed adoption of Spark — a framework it says is faster, easier and more versatile than other options.