Doing Big Data with Hadoop Is Like Making a Sandwich

Panelists: Cloudera's Amr Awadallah, Pervasive Software's Mike Hoskins, 10gen's Dwight Merriman, Yahoo's Todd Papaioannou and DataStax's Ben Werther

During an afternoon panel entitled “The Many Faces of MapReduce — Hadoop and Beyond,” moderator Gary Orenstein compared the two primary Hadoop components, MapReduce and the Hadoop Distributed File System (HDFS), to the meat and bread of a sandwich. The goal, he said, is to make sure you choose the right combination of the two, depending on your tastes. Now, and certainly in the months to come, users will have a lot more choices for each.

At the MapReduce level, for example, users have to choose between the standard Hadoop MapReduce engine and higher-level languages like Apache Hive, which lets users run SQL-like queries instead of writing MapReduce jobs. Ben Werther, VP of product management at DataStax, noted that most Hadoop users he has come across are using Hive, which is why Brisk, the new Hadoop distribution he announced during the panel, uses Hive natively. Hive isn’t the only alternative language, though. Mike Hoskins, CTO of Pervasive Software, talked about his company’s DataRush product, which uses workflows and multicore optimization to improve the Hadoop experience. There’s also Yahoo’s Pig language.
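To make that contrast concrete, here is a minimal sketch in plain Python of the map, shuffle and reduce steps that a hand-written job spells out, next to the single SQL-like statement that a layer such as Hive would accept instead. This is not Hadoop or Hive code: the function names, the `LOGS` sample data and the `logs` table in the comment are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical sample records; a real job would read these from a distributed file system.
LOGS = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]


def map_phase(records):
    """Emit one (key, 1) pair per record, as a mapper would."""
    for record in records:
        yield record["page"], 1


def shuffle(pairs):
    """Group emitted values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_page(records):
    """Run the three steps in sequence on an in-memory list."""
    return reduce_phase(shuffle(map_phase(records)))


# Roughly the same question expressed as one SQL-like statement
# (illustrative only, against an assumed table named logs):
#   SELECT page, COUNT(*) FROM logs GROUP BY page;

if __name__ == "__main__":
    print(count_by_page(LOGS))
```

The appeal of the higher-level route is visible even at this scale: the three functions collapse into one declarative line, and the engine decides how to distribute the work.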

If MapReduce is the meat, then HDFS is the bread, and there are plenty of choices there, too. There are Hadoop-based options such as DataStax’s new Cassandra-based distribution, but some vendors will try to do it all within the database. Dwight Merriman of 10gen said that his company’s version of MongoDB includes a native version of MapReduce so customers can do their processing right within the database. That might be ideal for users whose jobs are well suited to their database, noted Cloudera’s Amr Awadallah, but users might want to adapt their schemas to the job at hand rather than deal with predetermined “recipes,” as he put it.

Which is why he likes the metaphor of Hadoop as a burrito rather than a sandwich. Hadoop should be a bunch of components thrown into a single product, offering something for everyone. Certainly, with a plethora of related projects and subprojects, both within Apache and under development at individual companies, Hadoop is approaching burrito-like status. Todd Papaioannou, VP of cloud architecture at Yahoo, said this is one reason why Yahoo has recommitted itself to the Apache Hadoop project: to give users a full-featured, well-packaged distribution based in large part on Yahoo’s experiences.

Why all the fuss about Hadoop in the first place? According to Papaioannou, it’s because organizations no longer just throw away their data, but rather see it as an asset. Yahoo itself has about 200 petabytes of data and creates another 50 terabytes per day. Its Hadoop environment includes 43,000 nodes spread across many clusters. DataStax’s Werther said data volumes are growing faster than processor performance is under Moore’s Law, and that the trend spans industries. Papaioannou, for example, cited a peer at IBM who said the majority of his Global 500 customers are either using Hadoop or working toward it.
