Hadoop 2.0 and loosely-coupled systems

I tend to use this space to report on news and developments in data and analytics during the prior week. But this past week, as with the last week of summer in years past, brought very little news with it. As such, it probably makes the most sense to take stock of developments we’ve seen over the last several weeks and what we can expect to see in the weeks and months to come.

Summer of YARN
A common thread across many of the weekly summaries from these past summer months has been the metamorphosis of Hadoop from a MapReduce-specific data processing framework into a YARN-based, general-purpose computing grid. Together with Hadoop’s other major asset, the HDFS file system, YARN gives Hadoop a formula for utilitarian appeal.

By decoupling cluster resource management from the batch-mode MapReduce processing engine, contributors to the Hadoop project have freed the technology up for far greater versatility. This includes the ability to run Hive and (soon) Pig jobs in an interactive – rather than batch – fashion; compatibility with Apache Spark and its ability to run jobs in memory across the servers in a cluster; and streaming data capabilities using Apache Storm, also running on Hadoop without MapReduce.
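For readers who like to see what that decoupling looks like at the API level, here is a minimal sketch of a client submitting an arbitrary application to YARN's ResourceManager using Hadoop 2.x's org.apache.hadoop.yarn.client.api classes. The application name and the ApplicationMaster command are placeholders of my own, and details like localizing the AM jar are omitted; the point is simply that MapReduce appears nowhere in the submission path.

```java
// Minimal YARN client sketch (not production code): ask the ResourceManager
// for an application id, describe an ApplicationMaster container, and submit.
// "my-non-mapreduce-app" and com.example.MyAppMaster are hypothetical.
import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class GenericYarnSubmitter {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the ResourceManager for a new application id.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("my-non-mapreduce-app"); // hypothetical name

        // Describe the container that will run the ApplicationMaster.
        // It could drive Spark, Storm, Tez, or anything else; a real client
        // would also register LocalResources for the AM jar and set up its classpath.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
                "java -Xmx512m com.example.MyAppMaster")); // hypothetical AM class

        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1)); // memory (MB), vcores
        appContext.setQueue("default");

        // Submit; the ResourceManager takes it from here -- no MapReduce involved.
        ApplicationId appId = yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);
    }
}
```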

Not just copycat
It looks like this kind of MapReduce-free innovation will continue. But the innovation will extend beyond mere conversion of previously MapReduce-specific technologies to run on YARN, to “net new” scenarios for Hadoop. For example, as reported by Gigaom’s Derrick Harris, Docker, the open source container technology that has enjoyed phenomenal growth over the past few months, may soon be YARN-compatible as well.

Docker-YARN compatibility would allow for sophisticated deployment of applications that run on Hadoop yet execute in isolated environments, in turn making multi-user and multi-tenant Hadoop clusters more viable. Hadoop-as-a-Service vendor Altiscale is building out many of these capabilities.

Loosen up
The interesting thing about the introduction of YARN in Hadoop 2.0 is how neatly it implements a key tenet of SOA (service-oriented architecture): it took a tightly coupled system and made it loosely coupled. Hadoop 2.0 offers backwards compatibility with older workloads while opening its architecture to new components as well.

Refactoring tightly coupled systems into elegant, loosely coupled components is a laudable goal. Evolving technologies in this fashion helps build credibility and adoption, because loosely coupled systems tend to satisfy enterprise IT governance criteria.

But there’s one important element to making such conversions truly successful: making systems more modular and generalized works best when customers can still use them the old way. That’s why Hadoop 2.0 includes MapReduce 2.0, so that MapReduce jobs can still run natively, and so that Hive, Pig, Sqoop and Mahout can still run on MapReduce as well.
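To make that backwards compatibility concrete, here is the familiar WordCount job, written against the standard org.apache.hadoop.mapreduce API exactly as it would have been for a 1.x cluster. The only nod to Hadoop 2.0 in this sketch is the mapreduce.framework.name property (normally set cluster-wide in mapred-site.xml rather than in code), which hands the job to MRv2 to run as a YARN application.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Classic mapper, unchanged from the Hadoop 1.x era: emits (word, 1) per token.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Classic reducer: sums the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Normally configured cluster-wide in mapred-site.xml; shown here
        // only to make the MapReduce-on-YARN hand-off explicit.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // MRv2 runs this job as a YARN application; the application code is unchanged.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```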

Accommodating futures, and contemporary use
Every technology must serve two realities: (1) offering new features and a roadmap for future ones, and (2) living in the present, recognizing that most customers will keep using the platform in the fashion to which they’ve become acclimated. By extracting Hadoop’s resource management while leaving its algorithmic engine intact, the Apache Hadoop team has struck that balance.

Hadoop benefits from modernization, but the industry benefits from knowing that its MapReduce ecosystem is substantial and that customers may continue to run in MapReduce mode by default. There’s nothing wrong with that. And there’s a lot that’s right with it.