Datameer 5, ExtraHop 4: start your (choice of) engines

I’ve written much in this space, and on the Gigaom blog, about YARN and its impact on how Hadoop will be used. Not very long ago, Hadoop was synonymous with the MapReduce algorithm and non-interactive batch processing. With YARN, Hadoop has morphed into a general purpose distributed processing and distributed storage system. To that, we may say “bully for Hadoop!” But what about tools built on top of it?

One such tool, Datameer, from the company of the same name, implements a full-fledged data discovery environment on top of Hadoop. Gigaom Research’s Sector Roadmap on Data Discovery tools, published in March of this year, and authored by me, featured Datameer as the highest-scoring product of the six that were evaluated with a numeric score. Datameer 4.0 was dependent on MapReduce, and that made its strong showing somewhat counterintuitive. But the disruption vectors in the report supported the outcome.

Get Smart
My own hope was that a future version of Datameer would embrace Hadoop 2.0 and YARN as strongly as the then current version embraced Hadoop 1.0 and MapReduce. Today, Datameer is announcing its 5.0 release and, along with it, a new feature called Smart Execution that exploits YARN’s ability to bring multiple execution engines to bear on the same data.

For those who want a few gory details, Smart Execution is directed acyclic graph (DAG)-based, selecting one of a few execution environments for each section of an analysis request. Sometimes, it will perform its work in-memory on a single compute node, to avoid network bottlenecks. At other times it will parallelize, using a combination of Tez and YARN in certain cases, and good old MapReduce (albeit on YARN) in others. I call these gory details because as a user, you needn’t worry about them. You just give Datameer your data and your analysis tasks, and it will pick the right engine for the job.
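To make the idea concrete, here is a minimal sketch of per-stage engine selection over a DAG. It is not Datameer's implementation: the size thresholds, the `pick_engine` and `plan` helpers, and the rule of choosing by estimated input size alone are all assumptions made for illustration. Only the three engine categories (single-node in-memory, Tez on YARN, MapReduce on YARN) come from the description above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical thresholds; the real selection rules are not described here.
SMALL_BYTES = 256 * 1024**2   # assumed to fit in one node's memory
LARGE_BYTES = 100 * 1024**3   # assumed cutoff between Tez and MapReduce


def pick_engine(input_bytes):
    """Choose an engine label for one stage from its estimated input size."""
    if input_bytes <= SMALL_BYTES:
        return "in-memory"    # single node, no network shuffle
    if input_bytes <= LARGE_BYTES:
        return "tez"          # parallel, Tez on YARN
    return "mapreduce"        # parallel, MapReduce on YARN


def plan(stage_sizes, deps):
    """Return (stage, engine) pairs in dependency order.

    stage_sizes maps stage name -> estimated input bytes.
    deps maps stage name -> set of upstream stage names.
    """
    order = TopologicalSorter(deps).static_order()
    return [(name, pick_engine(stage_sizes[name])) for name in order]


# A made-up three-stage analysis: load, then join, then summarize.
stage_sizes = {
    "load": 500 * 1024**3,
    "join": 20 * 1024**3,
    "summarize": 10 * 1024**2,
}
deps = {"load": set(), "join": {"load"}, "summarize": {"join"}}
result = plan(stage_sizes, deps)
```

In this toy plan the large load stage goes to MapReduce, the mid-sized join to Tez, and the small final summary runs in memory on one node, which mirrors the kind of per-section choice described above without the user having to make it.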

Datameer’s future…and Hadoop’s
Datameer says the architecture it’s used for Smart Execution makes it open and pluggable such that other execution engines can be integrated in the future. It calls Apache Spark out by name as a candidate for such future integration, and it also allows for the possibility of integrating engines that don’t yet exist. By taking this approach, Datameer is playing to Hadoop and YARN’s strengths. By abstracting this technology, and relieving users of the burden of understanding engine selection, Datameer is utilizing Hadoop as the embedded engine it should be.

At this point, it seems that most Hadoop-related engines are being reengineered to run on top of YARN. One such engine is Apache Storm, designed to handle processing of streaming data. Storm’s popularity has grown sharply of late, and yet it had remained in incubation status at the Apache Software Foundation (ASF). That changed on Monday, when the ASF announced that Storm had graduated to top-level project status. Perhaps Datameer v.Next will take on streaming data, and perhaps it will add Apache Storm to its arsenal of YARN-compatible execution engines.

As is usually the case in this industry and others, standards are the most effective force for change. Hadoop is a standard, and so now is YARN. Through the efforts of a team at Yahoo, Storm now runs on the YARN standard. That virtuous cycle is likely to continue.

JSON takes an ExtraHop
Another standard that has effected massive change in the computing industry is that of JavaScript Object Notation (JSON). JSON is the cornerstone of document store databases like MongoDB and CouchDB. It’s also a standard for encoding data that is at this point second nature to most Web developers.

JSON is becoming second nature to a growing number of companies, too. The folks at ExtraHop have a very cool (eponymous) streaming data platform that can extract data directly from network data packets and run analytics on it. Up until now, however, that data would get stored in a proprietary data store that was limited to a 30-day look-back period. Analytics on raw data off the wire is extremely valuable, but less so when non-trivial constraints are placed on storage of that data.

But ExtraHop 4.0 was released on Tuesday and, with it, those constraints have disappeared. First, ExtraHop’s native data store has been enhanced to store much larger volumes of data. Second, ExtraHop 4.0 now implements an Open Data Stream that makes all of its data available in JSON format, and thus readily stored in other databases, including MongoDB.
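As a rough illustration of why JSON output makes downstream storage easy, the sketch below turns JSON records into plain dictionaries, the shape a document database client typically accepts. The record layout, the field names (`ts`, `metric`, `value`), and the one-record-per-line framing are assumptions invented for the example; the actual format of the Open Data Stream is not described here.

```python
import json


def parse_records(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# Made-up sample records; the fields are hypothetical.
sample = [
    '{"ts": 1411500000, "metric": "http.responses", "value": 42}',
    '',
    '{"ts": 1411500001, "metric": "db.errors", "value": 3}',
]
docs = list(parse_records(sample))
```

Each resulting dictionary could then be handed to a document store's insert call as is, with no schema mapping step in between.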

What about Oracle?
JSON compatibility isn’t limited to NoSQL databases like MongoDB. In fact, Oracle 12c can handle JSON natively now. This and other facets of Oracle’s data stack have been proudly trumpeted at Oracle OpenWorld, which has been going on this week in San Francisco. Gigaom Research’s own George Gilbert published an OpenWorld pre-game post on Friday. Expect another post from George shortly on Oracle’s JSON chops, what it means for the juggernaut relational database and what it means for its namesake company, now under new management.