Apache Mahout, a machine learning library for Hadoop since 2009, is joining the exodus away from MapReduce. The project’s community has decided to rework Mahout to support the increasingly popular Apache Spark in-memory data-processing framework, as well as the H2O engine for running machine learning and mathematical workloads at scale.
While data processing in Hadoop has traditionally been done using MapReduce, the batch-oriented framework has fallen out of vogue as users began demanding lower-latency processing for certain types of workloads — such as machine learning. However, nobody really wants to abandon Hadoop entirely because it’s still great for storing lots of data and many still use MapReduce for most of their workloads. Spark, which was developed at the University of California, Berkeley, has stepped in to fill that void in a growing number of cases where speed and ease of programming really matter.
H2O was developed separately by a startup called 0xdata (pronounced hexadata), although it’s also available as open source software. It’s an in-memory data engine specifically designed for running various types of statistical computations, including deep learning models, on data stored in the Hadoop Distributed File System.
“[H2O] looks like a really good technology layer to drive a lot of what Mahout’s been missing and remove the artificial constraints that have been in Mahout’s way,” Ted Dunning, project management committee member for Apache Mahout and the chief application architect at Hadoop software vendor MapR, told me. “A combination of H2O and Spark could really be something,” he added.
SriSatish Ambati, the founder and CEO of 0xdata, noted that the data science community isn’t married to one computational framework over another as long as it gets the job done. It’s the higher-level stuff, including how people program models, that really matters. H2O natively supports the R programming language, for example, which is rather popular and would be a new capability for Mahout, Dunning said.
One could argue that the Mahout community had to embrace Spark, at least, if it wanted to remain relevant. Already, Cloudera is working on its Oryx machine learning framework, which was designed to overcome Mahout’s shortcomings and will be ported to Spark at some point. The Spark community itself is also working on a set of machine learning libraries called MLlib.