The history of Hadoop: From 4 nodes to the future of data

Explained Bisciglia:

“It was somewhat challenging as an interviewer when you’re talking to these undergrads who are really smart kids and you ask them to come up with an algorithm or do some data structures and then you say, ‘Well, what would you do with a thousand times as much data?’ and they just go blank. And it’s not because they’re not smart, it’s just, well, what context do they really have to think about it at that scale?”

Only, the course didn’t really scale because it wasn’t feasible to deploy Hadoop clusters at universities across the country. So Google teamed with IBM and the National Science Foundation to buy a soon-to-be decommissioned data center, install 2,000 Hadoop nodes in it, and offer up grants to researchers and universities instead. Managing this cluster made Bisciglia realize how hard it was to manage Hadoop at any real scale, and how much he wished there was someone he could call to help.

“That was kind of when the light went off that that company didn’t exist and it needed to be started,” he said.

Then-Google engineer Bisciglia (second from right) at Structure 2008.

… to the enterprise

With Google’s blessing, Bisciglia spent time thinking about his idea and incorporated a company called Cloudera in March 2008. He reconnected with open-source acquaintance and now-Cloudera CEO Mike Olson shortly thereafter and the two took the idea to Accel Partners, where Ping Li connected them with former Facebook data engineer Jeff Hammerbacher and former Yahoo engineering VP Amr Awadallah, both of whom were doing entrepreneur-in-residence stints. The four of them officially founded Cloudera in August 2008 and it closed its first funding round in April 2009.

Feeling he had fulfilled his mission at Yahoo and drawn by the prospect of helping mainstream companies adopt Hadoop as a general-purpose computing platform, Cutting left Yahoo for Cloudera in August 2009 (although he remains chairman of the Apache Software Foundation and a member of the Apache Hadoop project management committee). Bisciglia would leave Cloudera in 2010 and ultimately co-found another Hadoop-based company called Wibidata, which focuses on building analytic applications atop the platform.

Cloudera was the first commercial Hadoop company and the vehicle through which many CIOs and other non-web-engineers were first introduced to Hadoop. Cloudera’s attention from investors and the technology press helped propel Hadoop into its status as a darling of the IT world, even beyond the early adopter web companies (such as Facebook, which invented the popular SQL-like Hive framework for Hadoop) and into the world’s largest enterprises. “I think every large company is at least experimenting with it somewhere,” Baldeschwieler said, citing increased interest in fields such as health care, telecommunications, biotech and retail.

From left to right: Olson, Awadallah, Bisciglia, Hammerbacher

All this attention on Hadoop and Cloudera hasn’t been lost on the rest of the IT industry, and as more companies have become interested in using Hadoop, more software vendors have signed up to help them. There are now all sorts of startups building higher-level applications, frameworks and management software for Hadoop. Large companies such as EMC and Intel are pitching their own Hadoop distributions backed by their massive budgets and tied into their legacy businesses. Everyone selling any type of database, business intelligence software or anything else related to data at least connects to Hadoop in some capacity.

The most significant company to emerge from all this activity was Hortonworks. Yahoo spun off Hortonworks as a separate company in June 2011 in order to capitalize on its Hadoop skill set and to serve as a self-sustaining entity that would build Hadoop software, sell Hadoop services and take over Yahoo’s role as the unofficial steward of the Apache Hadoop project. Baldeschwieler was its co-founder and initial CEO, and he brought much of his team with him.

Those involved with forming Hortonworks insist it was more about creating a win-win for Yahoo and the entire Hadoop-using world than it was just about making money. Stata, who shepherded the spinoff before leaving Yahoo in 2012, said there was a real danger of Yahoo’s core — and very important — team getting split up as every Hadoop startup under the sun offered them high salaries and 2 percent of their companies. Yahoo couldn’t offer them the “moral equivalent” of that situation, he explained, but Hortonworks could.

Aside from offering up another Hadoop distribution for the world to choose from — this one entirely open source — Hortonworks has proven a willing partner to many IT vendors trying to build respectable Hadoop businesses of their own. It works closely with companies such as Microsoft, Teradata and Rackspace to develop their internal Hadoop knowledge and to create products with Hortonworks’ technology at their core.

But running Hadoop is hard for mere mortals

But there’s an elephant in the room (actually, two — and no pun intended) when it comes to Hadoop’s future among mainstream IT shops. One is that even companies that want Hadoop because they’ve heard they need it can’t always think of applications for it. Another is that large, distributed systems are complex and many traditional companies want infrastructure that meets their requirements around things such as security and reliability but that can be managed without an entire team of people.

Speaking with guys who cut their teeth on Hadoop within large web companies makes the scope of this latter challenge pretty clear. If you’re managing something like a 42,000-node Hadoop cluster at Yahoo, Stata said, “you do it in your sleep and you don’t even think about it.” There’s not much concern about Hadoop failing or about other operational problems because they’re always easy enough to fix, oftentimes within minutes.

“But if you’re a corporate IT shop, you want turnkey,” Stata said. “You want to set it up and it never comes down.”

L to R: Ben Werther, Todd Papaioannou, Dwight Merriman, Mike Hoskins and Awadallah talk about making big data easier at Structure: Data 2011. Source: Pinar Ozger

For example, Baldeschwieler said, Yahoo is so good at managing Hadoop and so confident in its employees’ skills that one administrator will manage several thousand nodes. In a corporate IT situation, he added, it might take three specialists just to manage a 10-node cluster.

“[In a web company,] you don’t think a lot about how much work it is to glue things together and make them work smoothly. You’re willing to just throw some engineers at that,” Cutting explained. “Most companies want things to work a little more straightforwardly out of the box. So you need a lot more glue, a lot more polish, a lot more documentation.”

And that’s what companies like Cloudera and Hortonworks spend most of their time working on. It’s a good thing they have guys like Cutting on board. “What excites me is the adoption of software,” he said. “I’m happy to do things that people might think are boring technologically … if they help get things adopted and get people using more technology.”

“Uh, I need a Hadoop …”

Once they sell the IT department on Hadoop, though (or if companies decide to use cloud-based versions such as Amazon Elastic MapReduce to avoid management altogether), the next challenge is helping companies find meaningful applications for the technology. Sometimes, Hadoop proves itself as a developer’s guerrilla project on a few spare servers and companies come in wanting to scale that effort. However, Baldeschwieler joked, “When somebody tells me ‘I’m very interested in big data and I think we can find more information in our data,’ that’s going to be a long sales cycle.”

At that point, his team will look for easy opportunities that can visibly save a company money but “don’t require re-engineering their universes.” One major retailer brought in Hadoop for the relatively pedestrian task of offloading data from its expensive storage system, Baldeschwieler explained, and once it had seen Hadoop in action decided to put it to work analyzing customer data to build better models about their shopping behavior.

It’s the Yahoo situation all over again, where a 10-node cluster grows to dozens and then someone will call and say “we’re going to 750 nodes,” he said.

Still, with the exception of certain business-intelligence products and efforts to bring familiar SQL functionality to Hadoop, thinking up and developing applications usually involves a long engagement process. “I think it’s still a couple years out before you can buy a generic application that runs on Hadoop,” Bisciglia said. However, there are some startups — such as Opower in energy management, Apixio in health care and PacketLoop in web security — that are trying to solve this problem with cloud-based services that run Hadoop under the covers without users ever knowing the difference.

Bisciglia’s company Wibidata helps companies analyze their user data to build things such as recommendation engines and fraud-detection systems, and it relies on a NoSQL database called HBase that runs on the Hadoop Distributed File System. Because it helps bring transactional processing to a Hadoop platform built for batch processing, Bisciglia said, “HBase is gonna be what takes Hadoop from an ETL and BI platform into a real-time application platform.”

Hadoop almost never was …

As big as it has become, though, Hadoop almost never came to be. Qi Lu, then the boss of Yahoo’s search division (he’d eventually become EVP of the Search and Advertising Technology Group at Yahoo and is now president of Microsoft’s online services division), said he saw Google’s Jeff Dean present the MapReduce paper at the OSDI conference in 2005 and decided Yahoo needed its own open source version of Google’s search infrastructure. The aim of this effort, said Baldeschwieler — who was then chief architect for development and web search and whose team was tasked with building it — was to make Yahoo a place where skilled data scientists and engineers would want to come work.

Around the same time, then-Yahoo Chief Scientist Jan Pedersen and Stata (who was then chief architect for algorithmic web search) got to talking and decided they should hire Cutting and focus on Nutch as Yahoo’s MapReduce strategy. Stata, who was on the Nutch Foundation board of directors, was particularly familiar with the technology and with Cutting, and he was eventually able to convince Lu this was the right plan of attack.

Stata (foreground) relaxing while judging a Yahoo hack day in 2007. Source: Yodel Anecdotal

Lu, however, still had to take the case to the executive board at Yahoo, which he said wasn’t too keen on the prospect of signing away the company’s intellectual property to an open source project. He credits then-colleague and Yahoo executive Jeff Weiner (now CEO at LinkedIn) with helping him convince the board to give the idea a chance. Later on, it was Lu who thrust Hadoop into the fire of Yahoo’s money-making workloads before it was fully baked, opting for increased revenue over technology maturity and betting the heavy use would make Hadoop better faster.

And even with Hadoop established and Cutting on board, Baldeschwieler and his team were a rightfully confident bunch and didn’t want to abandon the big data stack they were working on. The way Stata tells it, he first convinced Baldeschwieler to get on board with Hadoop, and then the two played good cop, bad cop with the rest of the team. Stata would try to push Cutting’s project as a mandate from above, then Baldeschwieler would go back to his team with a position like “Hey, I know we can build something better, but …”

“Eric, I think, was a willing partner in that,” Stata joked, and ultimately they decided to merge their efforts with Cutting’s work as the focal point.

Now, with so much money at stake for the companies trying to sell Hadoop commercially, the big fear is a fragmentation of Hadoop similar to what happened to the Unix operating system in the 1990s. Already, Cloudera, Hortonworks, EMC Greenplum, MapR, Intel and IBM (and, in a way, Amazon Web Services) sell foundational-level Hadoop distributions that vary from one another and from Apache Hadoop in significant ways. Facebook, Twitter, Quantcast and other web companies also do their own Hadoop development work and often release it as open source via GitHub.

The companies involved haven’t always done their best to quell these fears. Hortonworks and Cloudera have engaged in very public disputes over who the real open source champion is and whose employees have been more integral to the development of Apache Hadoop. And both of those companies are quick to point the finger at other companies they claim are doing Hadoop a disservice by developing proprietary software at the MapReduce and HDFS levels.

A chart from Hortonworks in 2011, part of a back and forth between the company and Cloudera

Cutting, who has the unenviable task of being chief architect at Cloudera while also serving as chairman of the Apache Software Foundation, tries to keep a level perspective. “My personal take is that there’s a big enough pie here that we can focus on growing the size of the pie rather than fighting over our slice of it,” he said. “… There is a fair amount of effort wasted on backstabbing each other. I try to stay out of it mostly — not always successfully.”

… but always will be

Everyone seems to agree that the trick to keeping Hadoop from fragmenting is making sure everything worth doing finds its way back to Apache.

“You have people like Facebook or Quantcast who are willing to push the envelope and try some things well ahead of everybody else in various directions,” Baldeschwieler said, “and that’s a great contribution to Hadoop itself because some of their ideas work really well and they talk about them.” Even competitors help inform each others’ — and ultimately the Apache project’s — road maps. “That’s our laboratory,” he added.

Cutting agrees, and he’s confident Hadoop can overcome any political struggles and become the world’s de facto platform for storing large amounts of data. “We’re past the teething stage by a long shot,” Cutting said. “We’re still not, obviously, in every industrial data center. My prediction — I don’t make a lot of predictions — is that that’s where we’re headed. I don’t see any major obstacles in the way.”

And while the people who built Hadoop are now busy trying to make it usable by the whole world, their progeny at companies like Yahoo and Facebook will keep pushing the limits of what it can do.

“In terms of the torture test for Hadoop, Yahoo remains that. There’s more interesting and diverse applications there than any of our other customers,” Baldeschwieler said. “… Ultimately, people don’t adopt Hadoop because it’s the best solution for processing small data. They adopt Hadoop because it’s demonstrated that it solves the biggest problems and they won’t outgrow it.”