Report: Apache Hadoop: Is one cluster enough?

Our library of 1,700 research reports is available only to our subscribers, but we occasionally release one for our wider audience to benefit from. This is one such report. If you would like access to our entire library, please subscribe here. Subscribers will have access to our 2017 editorial calendar, archived reports and video coverage from our 2016 and 2017 events.
Apache Hadoop: Is one cluster enough? by Paul Miller:
The open-source Apache Hadoop project continues its rapid evolution and is now capable of far more than its traditional use case of running a single MapReduce job on a single large volume of data. Projects like Apache YARN expand the types of workloads for which Hadoop is a viable and compelling solution, leading practitioners to think more creatively about the ways data is stored, processed, and made available for analysis.
Enthusiasm is growing in some quarters for the concept of a “data lake” — a single repository of data stored in the Hadoop Distributed File System (HDFS) and accessed by a number of applications for different purposes. Most of the prominent Hadoop vendors provide persuasive examples of this model at work but, unsurprisingly, the complexities of real-world deployment do not always neatly fit the idealized model of a single (huge) cluster working with a single (huge) data lake.
In this report we discuss some of the circumstances in which more complex requirements may exist, and explore a set of solutions emerging to address them.
To read the full report, click here.

Survey reveals a few interesting numbers about Apache Spark

A new survey from startups Databricks and Typesafe revealed some interesting insights into how software developers are using the Apache Spark data-processing framework. Spark is an open-source project that has attracted a lot of attention — and a lot of investment — over the past couple of years as a faster, easier alternative to MapReduce for processing big data.

The survey included responses from more than 2,100 people, although considering the sources of the survey, the results are probably a bit biased toward Spark. Databricks, whose CEO Ion Stoica will be speaking at our Structure Data conference in March, is in the Spark business, and its co-founders created the technology. Typesafe is focused on helping developers build next-generation applications, particularly using the Scala language. One of Spark’s big selling points is its native support for Scala.

Elsewhere in the world, Hadoop, Spark’s much-larger predecessor and the platform for many Spark deployments, is still slowly working its way into the mainstream. This chart from the survey helps explain the type of respondents we’re dealing with:


Here are some of the findings about Spark use, specifically:

  • 13 percent of respondents are currently using Spark in production, while 51 percent are evaluating it and/or planning to use it in 2015. 28 percent said they have never heard of it.
  • The biggest use cases for Spark are faster batch processing (78 percent) and stream processing (60 percent).
  • A majority of respondents, 62 percent, use the Hadoop Distributed File System as a data source for Spark. Other popular data sources include “databases” (46 percent), Apache Kafka (41 percent) and Amazon S3 (29 percent).
  • 56 percent of respondents run standalone Spark clusters, while 42 percent run it on Hadoop’s YARN framework. 26 percent run it on Apache Mesos, and 20 percent run it on Apache Cassandra.

You can download the whole thing here.

If I took away one thing from this survey, it’s that early adopters pretty clearly see Spark as the processing engine for a lot of workloads going forward, possibly relegating Hadoop to storage, cluster management and, via MapReduce, existing batch jobs that aren’t too time-sensitive. With a notable exception around interactive SQL queries, this actually sounds a lot like the future Hadoop software vendor Cloudera envisions for Spark.

Why data science matters and how technology makes it possible

When Hilary Mason talks about data, it’s a good idea to listen.

She was chief data scientist at Bitly, data scientist in residence at venture capital firm Accel Partners, and is now founder and CEO of research company Fast Forward Labs. More than that, she has been a leading voice of the data science movement over the past several years, highlighting what’s possible when you mix the right skills with a little bit of creativity.

Mason came on the Structure Show podcast this week to discuss what she’s excited about and why data science is a legitimate field. Here are some highlights from the interview, but it’s worth listening to the whole thing for her thoughts on everything from the state of the art in natural language processing to the state of data science within corporate America.

And if you want to see Mason, and a lot of other really smart folks, talk about the future of data in person, come to our Structure Data conference that takes place March 18-19 in New York.



How far big data tech has come, and how fast

“Things that maybe 10 or 15 years ago we could only talk about in a theoretical sense are now commodities that we take completely for granted,” Mason said in response to a question about how the data field has evolved.

When she started at Bitly, she explained, the whole product was just shortened links shared across the web. That was it. So she and her colleagues had a lot of freedom rather early on to carry out data science research in an attempt to find new directions to take the company.

Hilary Mason (center) at Structure Data 2014, with Shivon Zilis (VC, Bloomberg Beta), Sven Strohband (Partner and CTO, Khosla Ventures) and Jalak Jobanputra (Managing Partner, FuturePerfect Ventures).

“That was super fun, and also the first time I realized that the technology we were building and using was actually allowing us to gather more data about natural human behavior than we’ve ever, as a research community, had access to,” Mason said.

“Hadoop existed, but was still extremely hard to use at that point,” she continued. “Now it’s something where I hit a couple buttons and a cloud spins up for me and does my calculations and it’s really lovely.”

Defending data science

It was only a couple years ago that “data scientist” was deemed the sexiest job of the 21st century, but that job title and the field of data science have always been subject to a fair amount of derision. What’s more, there’s now a collection of software vendors claiming they can automate away some of the need for data scientists via their products.

Mason disagrees with the criticism and the idea that you can automate all, or even the most important parts, of a data scientist’s job:

“You have math, you have programming, and then you have what is essentially empathy, domain knowledge and the ability to articulate things clearly. So I think the title is relevant because those three things have not been combined in one job before. And the reason we can do that today, even though none of these things is new, is just that the technology has progressed so much that it’s possible for one person to do all these things — not perfectly, but well enough.”

She continued:

“A lot of people seem to think that data science is just a process of adding up a bunch of data and looking at the results, but that’s actually not at all what the process is. To do this well, you’re really trying to understand something nuanced about the real world, you have some incredibly messy data at hand that might be able to inform you about something, and you’re trying to use mathematics to build a model that connects the two. But that understanding of what the data is really telling you is something that is still a purely human capability.”

The next big things: Deep learning, IoT and intelligent operations

As for other technologies that have Mason excited, she said deep learning is high up on the list, as are new approaches to natural language processing and understanding (those two are actually quite connected in some aspects).

“Also, being able to use AI to automate the bounds of engineering problems,” Mason said. “There are a lot of techniques we already understand pretty well that could be well applied in like operations or data center space where we haven’t seen a lot of that.”

Hilary Mason (second from right) at Structure Data 2014.

Mason thinks one of the latest data technologies on the path to commoditization is stream processing for real-time data, and Fast Forward Labs is presently investigating probabilistic approaches to stream processing — that is, giving up a little bit of accuracy in the name of speed. However, she said, it’s important to think about the right architecture for the job, especially in an era of cheaper sensors and more-powerful, lower-power processors.

“You don’t actually need that much data to go into your permanent data store, where you’re going to spend a lot of computation resources analyzing it,” Mason explained. “If you know what you’re looking for, you can build a probabilistic system that just models the thing you’re trying to model in a very efficient way. And what this also means is that you can push a lot of that computation from a cloud cluster actually onto the device itself, which I think will open up a lot of cool applications, as well.”
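Mason’s idea of trading a little accuracy for speed can be made concrete with a sketch (illustrative only, not Fast Forward Labs’ actual work): a count-min sketch estimates event frequencies over a stream in fixed memory, so it could run on the device itself rather than in a permanent data store. The event names here are invented for the example.

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: fixed memory, bounded overestimation."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, item):
        # One independent hash position per row, derived from a salted digest.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.width

    def add(self, item):
        for row, col in enumerate(self._columns(item)):
            self.table[row][col] += 1

    def estimate(self, item):
        # True count <= estimate; hash collisions can only inflate, never deflate.
        return min(self.table[row][col] for row, col in enumerate(self._columns(item)))

# Process 10,000 events without storing any of them.
cms = CountMinSketch()
for _ in range(9_000):
    cms.add("sensor_ok")
for _ in range(1_000):
    cms.add("sensor_fault")

print(cms.estimate("sensor_fault"))  # at least 1,000, and very close to it
```

The structure never undercounts, and widening the table shrinks the collision error at the cost of memory — exactly the speed-for-accuracy trade Mason describes.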

Cloudera tunes Google’s Dataflow to run on Spark

Hadoop software company Cloudera has worked with Google to make Google’s Dataflow programming model run on Apache Spark. Dataflow, which Google announced as a cloud service in June, lets programmers write the same code for both batch- and stream-processing tasks. Spark is becoming a popular environment for running both types of tasks, even in Hadoop environments.

Cloudera has open sourced the code as part of its Cloudera Labs program. Google had previously open sourced the Dataflow SDK that Cloudera used to carry out this work.

Cloudera’s Josh Wills explains the promise of Dataflow like this in a Tuesday-morning blog post:

[T]he streaming execution engine has strong consistency guarantees and provides a windowing model that is even more advanced than the one in Spark Streaming, but there is still a distinct batch execution engine that is capable of performing additional optimizations to pipelines that do not process streaming data. Crucially, the client API for the batch and the stream processing engines are identical, so that any operation that can be performed in one context can also be performed in the other, and moving a pipeline from batch mode to streaming mode should be as seamless as possible.

Essentially, Dataflow should make it easier to build reliable big data pipelines than with previous architectural models, which often involved managing Hadoop MapReduce for batch processing and something like Storm for stream processing. Running Dataflow on Spark means there is a single set of APIs and a single processing engine, which happens to be significantly faster than MapReduce for most jobs.
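To make the “single set of APIs” point concrete, here is a toy Python sketch of the idea (not the actual Dataflow SDK, which is Java): one pipeline definition, handed unchanged to either a batch runner or a windowed streaming runner. All names are invented for illustration.

```python
# Toy illustration of a unified batch/stream model, not real Dataflow code.
def pipeline(events):
    """The user's logic, written once: keep positive values, count per key."""
    counts = {}
    for key, value in events:
        if value > 0:
            counts[key] = counts.get(key, 0) + 1
    return counts

def run_batch(events):
    # Batch engine: sees the whole bounded dataset at once.
    return pipeline(events)

def run_streaming(event_iter, window_size=3):
    # Streaming engine: applies the same logic to successive fixed-size windows.
    window = []
    for event in event_iter:
        window.append(event)
        if len(window) == window_size:
            yield pipeline(window)
            window = []
    if window:  # flush a final partial window
        yield pipeline(window)

events = [("a", 1), ("b", -1), ("a", 2), ("b", 3), ("a", 4), ("c", 5)]
print(run_batch(events))                  # {'a': 3, 'b': 1, 'c': 1}
print(list(run_streaming(iter(events))))  # one result dict per window
```

The point is that `pipeline` is written once and only the execution engine changes — the seam that both Dataflow and Dataflow-on-Spark exploit, in place of maintaining separate MapReduce and Storm codebases.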

While Dataflow on Spark won’t end problems of complexity, speed or scale, it’s another step along a path that has resulted in faster, easier big data technologies at every turn. Hadoop was too slow for a lot of applications and too complicated for a lot of users, but those barriers to use are falling away fast.

If you want to hear more about where the data infrastructure space is headed, come to our Structure Data conference March 18-19 in New York. Speakers include Eric Brewer, vice president of infrastructure at Google; Ion Stoica, co-creator of Apache Spark and CEO of Databricks; and Tom Reilly, Rob Bearden and John Schroeder, the CEOs of Cloudera, Hortonworks and MapR, respectively.

A startup wants to build a trading platform for sensor data

A startup out of Las Vegas is trying to capitalize on a very difficult, and potentially very lucrative, opportunity within the internet of things. The company, called Terbine, wants to become a data broker for the world of connected devices by building a platform where companies can buy, sell and share the data their sensors are collecting.

Terbine is still very young — the company has just raised seed funding from a firm called Incapture Group — but founder and CEO David Knight has big plans. He’s looking at everything from billboards to drones, from shipping vessels to satellites, as potential sources for a massive database of information about what’s happening in the physical world. He thinks companies will pay big money to be able to monitor pedestrian traffic in key markets thousands of miles away, for example, or to identify the potential closure of shipping lanes because of an oil spill long before it’s reported.

Terbine would play the middleman in all of these transactions, collecting the data, curating and formatting it, and then managing access to it. Knight envisions a market-like approach to access, where some data might be free, but most would be priced based on how timely it is, how rare it is or how relevant it is at any given time. He’s looking at sectors such as energy, agriculture, and oil and gas — which has become much less centralized thanks to fracking — as early targets.

David Knight

“I realize a lot of people are talking about the internet of things,” Knight said, “but so far most of the conversation reminds me of the early days of CB radio.” Back then lots of people had a radio, like lots of people now have sensors, but there was no place to go to connect with the most interesting people.

It’s not an insane idea — even Cisco has pitched the idea of “data infomediaries,” and IBM has suggested companies could make money by recycling data — but so far no one has really been able to pull it off. There are myriad regulatory hurdles to overcome, not to mention the technological challenges of building such an infrastructure. Terbine has already prototyped a platform for the data exchange on Amazon Web Services and is thinking about its edge-network architecture, but actually building it is another story.

There’s also the not-so-small question of how Terbine, or any company attempting to build such a platform, will get companies on board with the data-sharing plan. Many data marketplaces so far have been populated with data that’s either not too interesting or, in the case of some early government efforts, not available in usable formats.


Knight thinks companies will certainly be willing to pay for quality data, but acknowledges that bartering (give some data to get some data) might be a better method for getting them initially involved and proving there’s value in the exchange. He said Terbine also hopes to deploy its own network of sensors with strategic partners so it can ensure certain data it perceives as valuable will be available.

Being headquartered in Las Vegas might be a strategic advantage, Knight said, because of the highly interconnected SuperNAP data center (where he hopes to eventually host Terbine’s platform) and the Department of Energy’s Remote Sensing Laboratory. He’s hopeful the latter could offer a testbed for some of Terbine’s plans, and possibly some talent.

It’s a longshot to be sure, but Knight, who has previously been involved in the quest to bring the Endeavour space shuttle to the California Science Center and is also working on a high-tech virtual reality tour of the craft, says he’s game for it.

“What I really like,” he said, “is being involved with things people say can’t be done.”

Microsoft adds stream processing and pipeline tools to Azure

Microsoft announced a trio of new cloud data services on Wednesday aimed at stream processing and data pipelines. They’re not revolutionary, but they appear to have their own advantages, and they also help ensure Azure keeps up with the Joneses in cloud computing.