Why Yahoo Is Discontinuing Its Hadoop Distribution

Updated: The big data marketplace has contracted a bit, as Yahoo is ceasing development of its Yahoo Distribution of Hadoop and will be folding it back into the Apache Hadoop project. The company announced the decision in a blog post yesterday, citing a goal “to make Apache Hadoop THE open source platform for big data” as a driving force behind its new strategy. It’s probably a wise move, because having three free competing distributions (Yahoo, Apache and Cloudera) unnecessarily compartmentalized features and development efforts, and possibly left new Hadoop users with a tough decision about which distribution to download and start working with.

Although Yahoo maintains several seats on the Apache Hadoop Project Management Committee, it used to work more directly with Apache on its Hadoop releases. Once Yahoo began refocusing its efforts on its own distribution, however, the user community was forced to choose between distributions. Although both distributions were open-source and could, in theory, share many of the same features and road maps, that wasn’t always what happened. Eric Baldeschwieler, author of the blog post announcing this news, wrote of a “turbulent” Apache Hadoop community and hinted that there might have been some hard feelings between Apache devotees and Yahoo:

We believe that more focus on communicating our goals to the Apache Hadoop community, and more willingness to compromise on how we get to those goals, will help us get back to making Hadoop even better.

Update: Whatever was causing the turbulence seems to be in the past, however — at least according to the company line. In an email response to my inquiry, Todd Papaioannou, vice president of cloud architecture for Yahoo, wrote: “Yahoo! and Apache have a strong, collaborative relationship. We are a big supporter and contributor to Apache. We passionately believe in the power and benefits of Open Source and wanted to return to a state where the community can download and use stable releases of Hadoop directly from Apache.”

Notably absent from Baldeschwieler’s post is Cloudera, which also has a strong presence in the Apache Hadoop community as well as its own distribution, the Cloudera Distribution for Apache Hadoop (CDH). I noted in late June that there might end up being a schism between the various free, open-source distributions as users have to choose between Cloudera’s enterprise-tuned version of Apache Hadoop and Yahoo’s increasingly feature-rich distribution. Now, it appears, that issue has been resolved. Although there are proprietary products floating around from vendors such as IBM and Datameer, the core Hadoop-development community will focus its efforts on Apache Hadoop as the foundational product, with CDH playing the part of the commercial-grade Apache distribution.

Update: Cloudera expressed its feelings in a blog post this morning, calling Yahoo’s decision “great news” and explaining the problems of having too many Hadoop distributions floating about:

Currently, many people running Hadoop use patched versions of the Apache Hadoop package that combine features contributed by Yahoo! and others, but may not yet be collectively available in a single Apache release. … Collecting that work into a single source code package and building a system with the best quality and feature set has been hard work.

New users of Hadoop have generally found this assembly work to be too much trouble.

Post author Charles Zedlewski added that the new alignment between Yahoo and Apache will simplify Cloudera’s job of integrating and patching new features as part of its production-grade distribution. Zedlewski compared Cloudera to Canonical and Red Hat in the Linux world.

To hear more about Hadoop and big data, in general, be sure to attend our Structure Big Data conference on March 23 in New York City. Yahoo and Cloudera will be among the dozens of big data vendors and thought leaders presenting.
