Been there, forked that: What the Unix-Linux schism can teach us about Hadoop’s future

Hadoop is fast becoming the preferred way to store and process big data. By T-Systems’ estimates, in five years, 80 percent of all new data will first land in Hadoop’s distributed file system (HDFS) or in alternative object storage architectures.
Yet with the excitement around this open source framework, enterprise users risk overlooking that not all Hadoop flavors are created equal. Choosing one implementation over another can mean veering off the path of genuine open source software and instead heading down the dead-end street of expensive vendor lock-in and stunted innovation.

A little history lesson

The enterprise tech world has been there before. Remember the Unix vs. Linux schism? The former started as a project at Bell Labs and UC Berkeley in the 1970s. Unix was acclaimed for its performance, stability and scalability. It was cutting-edge back then when it came to multi-user and multitasking capabilities, support of IP networks, tools and GUI.
Unix also was a cash cow large software vendors desperately wanted to milk. They developed powerful, yet proprietary versions of Unix during the 1980s. The list of derivatives is long and comprises HP-UX by Hewlett-Packard, DG/UX by Data General, AIX by IBM, IRIX by Silicon Graphics, as well as Sun’s Solaris. The consequence was a fragmentation of Unix that held users captive once they’d settled on a flavor. And once switching becomes difficult, painful and expensive, innovation suffers, because vendors no longer have to compete to keep their customers.
Linux followed a different path. This open source operating system has thrived ever since it was born in the early 1990s, thanks to a global community of developers. It quickly caught up to its older, more established rival in terms of performance and feature set because it was not trapped in proprietary silos. And it handily beat Unix in terms of capital expenses and operating costs. Since Linux runs on off-the-shelf hardware, it has important similarities to today’s Hadoop world of inexpensive commodity hardware.

Hadoop as the OS for big data

Fast forward to 2013, and you’ll see quite a few lessons that apply to Hadoop and enterprise customers who are thinking about implementing it to meet their big data needs.
Hadoop is like the emerging “operating system” for big data. It enables an organization to use distributed hardware resources to store and compute on large data sets. The world of big data and its appropriate frameworks and tools changes fast, and just like Linux, Hadoop thrives because of a community of hundreds of developers who put their time, resources and passion into making it better all the time.
Systems integrators, service providers and enterprise customers are well advised to carefully pick Hadoop distributions that are truly open source, with room to grow as the technology keeps maturing. One example is Palo Alto-based startup Hortonworks, host of the sixth annual Hadoop Summit being held this week in San Jose. Another example is Cloudera, though some of its functionality is for now held back from true open source licensing.
At the other end of the spectrum are vendors that only pay lip service to Hadoop. They embrace and extend it with their own proprietary twists and in the process create fragmentation. While I cannot name any particular vendor because of my professional obligations, you can imagine this category to include some of the large-cap IT vendors who are now muscling into big data.
As the big data technologist for Deutsche Telekom, possibly Silicon Valley’s single largest European customer of IT products and services, I’ve seen enough Hadoop offerings and implementations to suggest a quick, two-step reality check.

How to test your Hadoop distro’s open source street cred

Step one: Can you open the hood and see the engine? Free software doesn’t equal open source. What might be free today can become a costly piece of software once the vendor realizes it is sitting on a gold mine and raises prices, knowing its users are locked in. Open source, on the other hand, means the source code is open for everybody to see and is released under one of several recognized open source licenses. Open source means you can look under the hood, tinker with, enhance and maintain your big data “operating system”.
Step two: Does the vendor have skin in the game of give and take? That is the second telltale sign of whether a vendor really means it when it proclaims its love for Hadoop. The Hadoop community breaks into three types called reviewers, contributors and committers. The committers are the seasoned members who set the roadmap, coordinate development and make sure all the pieces eventually click and run. If you want to know whether a vendor really means it, check how many of its employees are Hadoop committers.
If you don’t see a significant number of committers in a company’s ranks, you can be fairly certain that this particular vendor is just going through the motions with Hadoop. They may be checking a mandatory list of features and tools to lure customers to their version of Hadoop, but are most likely pursuing a strategy of forking a great idea for their own gain.
Do your organization and the value hidden in your datasets a favor and don’t become a Hadoop hostage. You risk paying ransom for years or even decades to come while missing out on waves of innovation that will make the entire economy relentlessly data-driven.
Picking the open source or the proprietary path for Hadoop is not just a decision every IT department has to make for itself. Taken together, these decisions will determine whether Hadoop will go the way of Unix or Linux, becoming a lock-in legacy or a break-out success. I’m personally rooting for the latter because Hadoop is a powerful framework with lots of promise.
Juergen Urbanski is VP of Big Data Architectures & Technologies at T-Systems, the enterprise arm of Deutsche Telekom. He writes here in a personal capacity.