Hadoop’s new components go mainstream

Once upon a time (you know, like a year and a half ago), most Hadoop distributions included Hadoop itself, along with several components that work on top of it:

  • HBase, for column family NoSQL database management
  • Hive, providing a SQL-like abstraction layer over Hadoop’s batch-mode MapReduce engine
  • Pig, providing a data transformation scripting language over MapReduce
  • Sqoop, an import/export bridge between the Hadoop Distributed File System (HDFS) and data warehousing platforms
  • Mahout, a machine learning layer over MapReduce
  • Flume, providing for log file integration into HDFS

For quite a while, this roster of components was standard and ubiquitous.  It was also difficult to work with: it lacked a unified user interface, and it was too slow for many applications, with Hive, Pig, Sqoop and Mahout all working exclusively in batch mode.
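To see why batch-mode MapReduce felt slow, it helps to recall the shape of the programming model these tools compile down to: every job is a full map pass, a shuffle, and a reduce pass, with intermediate results written to disk. Below is a minimal, pure-Python sketch of the classic word-count job (no actual Hadoop APIs; the function names are illustrative only):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key, then sum each group's counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["the quick brown fox", "the lazy dog"]
print(reduce_phase(map_phase(lines)))  # {'the': 2, 'quick': 1, ...}
```

A Hive query like `SELECT word, COUNT(*) ... GROUP BY word` would be translated into one or more such jobs, each paying the full disk-bound batch cost.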

In October of 2012, at the Strata + Hadoop World event in New York City, Cloudera introduced its Impala SQL-on-Hadoop engine, which essentially implemented a massively parallel processing (MPP) data warehouse directly over HDFS, bypassing Hadoop’s MapReduce engine entirely.  Suddenly the hegemony of the “gang of six” (i.e. the components in the bulleted list above) had ended.

Within the next year, virtually all Hadoop and data warehouse vendors embarked upon and/or introduced their own SQL-on-Hadoop solutions. Even Hadoop itself sought to break free, as the development of YARN (Yet Another Resource Negotiator) demoted MapReduce to being only one of potentially several hosted data processing frameworks.

That’s not all
The Apache Tez project sought to create a processing layer over YARN, through which Hive and Pig could eventually run MapReduce-free. And Hortonworks’ Stinger initiative, with participation from Microsoft, worked to imbue Hive’s query engine with numerous optimizations and Tez compatibility.

Then there’s Apache Spark, which implements a distributed in-memory engine over Hadoop clusters. Its companion project, Shark, provides a SQL-like query layer on top.
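Part of Spark’s appeal is that intermediate results stay in memory and transformations compose lazily, rather than each step being written back to disk as with chained MapReduce jobs. Here is a rough, pure-Python sketch of that style (this is not the actual Spark API; the toy `RDD` class is illustrative only):

```python
class RDD:
    """Toy stand-in for a Spark-style RDD: transformations are recorded
    lazily and only executed when an action (collect) is called."""
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []

    def map(self, f):
        # Record the transformation; nothing runs yet.
        return RDD(self.data, self.ops + [("map", f)])

    def filter(self, f):
        return RDD(self.data, self.ops + [("filter", f)])

    def collect(self):
        # Action: run the whole recorded pipeline in memory, in one pass.
        out = list(self.data)
        for kind, f in self.ops:
            if kind == "map":
                out = [f(x) for x in out]
            else:
                out = [x for x in out if f(x)]
        return out

rdd = RDD(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the pipeline is distributed across a cluster, but the essential contrast with MapReduce is the same: a chain of transformations executes without materializing each intermediate result to disk.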

And remember that lack of a unified user interface for all these tools?  An open source project incubated at Cloudera, called Hue, introduced a browser-based interface to address this problem. It started out on the primitive side, but it has evolved quite a bit.

Bringing it all together
With so many new projects, backed and even incubated by competitors, this new landscape had the potential to fragment the Hadoop platform very badly. The interesting thing, though, is that it’s rapidly reaching an impressive equilibrium:

  • The third and final phase of the Stinger initiative was completed last month.  YARN and Tez are becoming ubiquitous, and Hive version 0.13 runs on top of both of them.
  • Impala, which continues to be a Cloudera-backed project, is now nonetheless included not only in Cloudera’s Hadoop distribution, but also in MapR’s and even on Amazon Web Services’ Elastic MapReduce (EMR) service.
  • Version 3.0 of Cascading, a toolkit for Java developers to build applications on Hadoop, was released just last week.  With this release, Cascading supports Tez. Concurrent, the company behind Cascading, has committed to supporting Spark in the future.
  • Speaking of Spark, it’s now included with Hadoop distributions from Cloudera, Hortonworks and MapR.
  • And the Hue user interface, once available only in Cloudera’s Hadoop distribution, now ships inside Hortonworks’ and MapR’s distros as well.

With this year’s Hadoop Summit event, to be held in San Jose, CA in less than two weeks, we can reasonably expect more announcements around evolved Hadoop distributions and continued institutionalization of its latter-day components.  With so much new functionality being pushed down into the core platform, it will be interesting to see how vendors in the Hadoop space next attempt to differentiate.