Yahoo Launches New Secure, Smarter Hadoop

Yahoo is using its annual Hadoop Summit today to roll out new features for its distribution of Hadoop, the open-source framework for storing and processing huge amounts of data. The new features tackle security and workflow management, two areas that Yahoo believes need to improve as Hadoop continues to spread among mainstream users. But will Yahoo’s features make it harder for startups like Cloudera and Karmasphere to earn a living?

On the security front, Yahoo has integrated the Kerberos authentication standard into its distribution, resulting in the aptly named Hadoop with Security. This lets users consolidate data from multiple applications onto the same Hadoop cluster while restricting access to each class of data to authorized users. This isn’t a mainstream problem yet, but Yahoo runs a large Hadoop infrastructure (34,000 servers and 170 petabytes of data spread across the globe), so Shelton Shugar, SVP of cloud computing at Yahoo, thinks his company is “probably at the forefront of running into this [problem].” He adds that it will become a big issue for enterprises as their usage expands in scope beyond small development teams and single applications.

The other newly available download is a workflow-management tool called Oozie, which Shugar calls the “elephant tamer.” Oozie should be in high demand from users outside Yahoo because it lets them manage and maintain a variety of Hadoop job types and data dependencies without writing their own applications to do so. Shugar says it’s the de facto tool for extract, transform, load (ETL) processing at Yahoo.

Both of these Yahoo innovations raise the question of how the Hadoop market will play out. Cloudera offers its own commercial Hadoop distribution and support services, and plans to release proprietary products in the near future. Karmasphere offers a desktop-based product for building, deploying and managing Hadoop applications. Other startups, like Datameer, are incorporating Hadoop into the guts of business intelligence products without requiring the user to learn any Hadoop programming.

There is currently a market for value-added commercial products for Hadoop (GigaOM Pro sub req’d), but one wonders whether first-time users are more likely to pay for Hadoop software or experiment with Yahoo’s growing set of free tools (which might actually end up in commercial distributions, too). Shugar says Yahoo is investing serious resources in balancing CPU and storage requirements to maximize infrastructure usage in the face of skyrocketing storage needs, and is also looking to improve internal programmer support to help get data in and out of Hadoop via metadata.

As more Yahoo software makes its way into the Apache Hadoop community and big data analysis requirements grow, it might be difficult to justify paying for value-added products rather than just downloading the increasingly feature-packed Yahoo distribution and learning Hadoop development. Should the startups building their businesses around Hadoop worry?

Image courtesy of Flickr user Erik Eldridge