Report: Extending Hadoop Towards the Data Lake

Extending Hadoop Towards the Data Lake by Paul Miller:
The data lake has increasingly become an aspect of Hadoop’s appeal. Referred to in some contexts as an “enterprise data hub,” it now garners interest not only from Hadoop’s existing adopters but also from a far broader set of potential beneficiaries. It is the vision of a single, comprehensive pool of data, managed by Hadoop and accessed as required by diverse applications such as Spark, Storm, and Hive, that offers opportunities to reduce duplication of data, increase efficiency, and create an environment in which data from very different sources can meaningfully be analyzed together.
Fully embracing the opportunity promised by a comprehensive data lake requires a shift in attitude and careful integration with the existing systems and workflows that Hadoop often augments rather than replaces. Existing enterprise concerns about governance and security will certainly not disappear, so suitable workflows must be developed to safeguard data while making it available for newly feasible forms of analysis.
Early adopters in a range of industries are already finding ways to exploit the potential of their data lakes, operationalizing internal analytic processes and integrating rich real-time analyses with more established batch processing tasks. They are integrating Hadoop into existing organizational workflows and addressing challenges around the completeness, cleanliness, validity, and protection of their data.
In this report, we explore a number of the key issues frequently identified as significant in these successful implementations of a data lake.

Cloudera CEO declares victory over big data competition

Cloudera CEO Tom Reilly doesn’t often mince words when it comes to describing his competition in the Hadoop space, or Cloudera’s position among those other companies. In October 2013, Reilly told me he didn’t consider Hortonworks or MapR to be Cloudera’s real competition, but rather larger data-management companies such as IBM and EMC-VMware spinoff Pivotal. And now, Reilly says, “We declare victory over at least one of our competitors.”

He was referring to Pivotal, and the Open Data Platform, or ODP, alliance it helped launch a couple of weeks ago along with Hortonworks, IBM, Teradata and several other big data vendors. In an interview last week, Reilly called that alliance “a ruse and, frankly, a graceful exit for Pivotal,” which laid off a number of employees working on its Hadoop distribution and is now outsourcing most of its core Hadoop development and support to Hortonworks.

You can read more from Reilly below, including his takes on Hortonworks, Hadoop revenues and Spark, as well as some expanded thoughts on the ODP. For more information about the Open Data Platform from the perspectives of the members, you can read our coverage of its launch in mid-February as well as my subsequent interview with Hortonworks CEO Rob Bearden, who explains in some detail how that alliance will work.

If you want to hear about the fast-changing, highly competitive and multi-billion-dollar business of big data straight from horses’ mouths, make sure to attend our Structure Data conference March 18 and 19 in New York. Speakers include Cloudera’s Reilly and Hortonworks’ Bearden, as well as MapR CEO John Schroeder, Databricks CEO (and Spark co-creator) Ion Stoica, and other big data executives and users, including those from large firms such as Lockheed Martin and Goldman Sachs.


You down with ODP? No, not me

While Hortonworks explains the Open Data Platform essentially as a way for member companies to build on top of Hadoop without, I guess, formally paying Hortonworks for support or embracing its entire Hadoop distribution, Reilly describes it as little more than a marketing ploy. Aside from calling it a graceful exit for Pivotal (and, arguably, IBM), he takes issue with even calling it “open.” If the ODP were truly open, he said, companies wouldn’t have to pay for membership, Cloudera would have been invited and, when it asked about the alliance, it wouldn’t have been required to sign a non-disclosure agreement.

What’s more, Reilly isn’t certain why the ODP is really necessary technologically. It’s presently composed of four of the most mature Hadoop components, he explained, and a lot of companies are actually trying to move off of MapReduce (to Spark or other processing engines) and, in some cases, even the Hadoop Distributed File System. Hortonworks, which supplied the ODP core and presumably will handle much of the future engineering work, will be stuck doing the other members’ bidding as they decide which of several viable SQL engines and other components to include, he added.

“I don’t think we could have scripted [the Open Data Platform news] any better,” Reilly said. He added, “[T]he formation of the ODP … is a big shift in the landscape. We think it’s a shift to our advantage.”

(If you want a possibly more nuanced take on the ODP, check out this blog post by Altiscale CEO Raymie Stata. Altiscale is an ODP member, but Stata has been involved with the Apache Software Foundation and Hadoop since his days as Yahoo CTO and is a generally trustworthy source on the space.)

Hortonworks CEO Rob Bearden at Structure Data 2014.


Really, Hortonworks isn’t a competitor?

Asked about the competitive landscape among Hadoop vendors, Reilly doubled down on his assessment from last October, calling Cloudera’s business model “a much more aggressive play [and] a much bolder vision” than what Hortonworks and MapR are doing. They’re often “submissive” to partners and treat Hadoop like an “add-on” rather than a focal point. If anything, Hortonworks has burdened itself by going public and by signing on to help prop up the legacy technologies that IBM and Pivotal are trying to sell, Reilly said.

Still, he added, Cloudera’s “enterprise data hub” strategy is more akin to the IBM and Pivotal business models of trying to become the centerpiece of customers’ data architectures by selling databases, analytics software and other components besides just Hadoop.

If you don’t buy that logic, Reilly has another argument that boils down to money. Cloudera earned more than $100 million last year (that’s GAAP revenue, he confirmed), while Hortonworks earned $46 million and, he suggested, MapR likely earned a similar number. Combine that with Cloudera’s huge investment from Intel in 2014 — it’s now “the largest privately funded enterprise software company in history,” Reilly said — and Cloudera owns the Hadoop space.

“We intend to take advantage” of this war chest to acquire companies and invest in new products, Reilly said. And although he wouldn’t get into specifics, he noted, “There’s no shortage of areas to look in.”

Diane Bryant, senior vice president and general manager of Intel's Data Center Group, at Structure 2014.


The future is in applications

Reilly said that more than 60 percent of Cloudera sales are now “enterprise data hub” deployments, which is his way of saying its customers are becoming more cognizant of Hadoop as an application platform rather than just a tool. Yes, it can still store lots of data and transform it into something SQL databases can read, but customers are now building new applications for things like customer churn and network optimization with Hadoop as the core. Between 15 and 20 financial services companies are using Cloudera to detect money laundering, he said, and Cloudera has trained its salesforce on a handful of the most popular use cases.

One of the technologies helping make Hadoop look a lot better for new application types is Spark, which simplifies the programming of data-processing jobs and runs them a lot faster than MapReduce does. Thanks to the YARN cluster-management framework, users can store data in Hadoop and process it using Spark, MapReduce and other processing engines. Reilly reiterated Cloudera’s big investment and big bet on Spark, saying that he expects a lot of workloads will eventually run on it.

Databricks CEO (and AMPLab co-director) Ion Stoica.

Databricks CEO (and Spark co-creator) Ion Stoica.

A year into the Intel deal and …

“It is a tremendous partnership,” Reilly said.

Intel has been integral in helping Cloudera form partnerships with companies such as Microsoft and EMC, as well as with customers such as MasterCard, he said. The latter deal is particularly interesting because Cloudera and Intel’s joint engineering on hardware-based encryption helped Cloudera deploy a PCI-compliant Hadoop cluster, and MasterCard is now out pushing that system to its own clients via its MasterCard Advisors professional services arm.

Reilly added that Cloudera and Intel are also working together on new chips designed specifically for analytic workloads, which will take advantage of non-RAM memory types.

Asked whether Cloudera’s push to deploy more workloads in cloud environments is at odds with Intel’s goal to sell more chips, Reilly pointed to Intel’s recent strategy of designing chips especially for cloud computing environments. The company is operating under the assumption that data has gravity and that certain data that originates in the cloud, such as internet-of-things or sensor data, will stay there, while large enterprises will continue to store a large portion of their data locally.

Wherever they run, Reilly said, “[Intel] just wants more workloads.”

Hortonworks did $12.7M in Q4, on its path to a billion, CEO says

Hadoop vendor Hortonworks announced its first quarterly earnings as a publicly held company Tuesday, claiming $12.7 million in fourth-quarter revenue and $46 million in revenue during fiscal year 2014. The numbers represent 55 percent quarter-over-quarter and 91 percent year-over-year increases, respectively. The company had a net loss of $90.6 million in the fourth quarter and $177.3 million for the year.

However, Hortonworks contends that revenue is not the most important number in assessing its business. Rather, as CEO Rob Bearden explained around the time the company filed its S-1 pre-IPO statement in November, Hortonworks thinks its total billings are a more accurate representation of its health. That’s because the company relies fairly heavily on professional services, meaning the company often doesn’t get paid until a job is done.

The company’s billings in the fourth quarter totaled $31.9 million, a 148 percent year-over-year increase. Its fiscal year billings were $87.1 million, a 134 percent increase over 2013.
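As a rough check on those growth figures, the prior-period billings they imply can be backed out by dividing each current figure by one plus its growth rate. The sketch below assumes the reported percentages are simple period-over-period increases; the function name `implied_prior` is hypothetical, and the results are approximations derived from rounded figures, not numbers Hortonworks reported.

```python
def implied_prior(current_millions: float, pct_increase: float) -> float:
    """Back out a prior-period figure from a current figure and its percent increase.

    Assumes a simple increase: current = prior * (1 + pct_increase / 100).
    """
    return current_millions / (1 + pct_increase / 100)


# Figures quoted in the article, in millions of dollars.
q4_2014_billings = 31.9   # up 148 percent year over year
fy_2014_billings = 87.1   # up 134 percent over 2013

q4_prior = implied_prior(q4_2014_billings, 148)
fy_prior = implied_prior(fy_2014_billings, 134)

print(f"Implied Q4 2013 billings: ${q4_prior:.1f}M")
print(f"Implied FY 2013 billings: ${fy_prior:.1f}M")
```

Because the inputs are rounded, the implied 2013 figures are only good to about one decimal place.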

If you buy Bearden’s take on the importance of billings over revenue, then Hortonworks looks a lot more comparable in size to its largest rival, Cloudera. Last week, Cloudera announced more than $100 million in revenue in 2014, as well as an 85 percent increase in subscription software customers up to 525 in total.

Hortonworks, for its part, added 99 customers paying for enterprise support of its Hadoop platform in the fourth quarter alone, bringing its total to 332. Among those customers are Expedia, Macy’s, Blackberry and Spotify, all four of which moved directly to Hortonworks from Cloudera, a Hortonworks spokesperson said.

There are, however, some key differences between the Hortonworks and Cloudera business models, as well as that of fellow vendor MapR, that affect how comparable any of these metrics really are. While Hortonworks is focused on free open source software and relies on support contracts for revenue, Cloudera and MapR offer both free Hadoop distributions and more feature-rich paid versions. In late 2013, Cloudera CEO Tom Reilly told me his company was interested in securing big deployments rather than chasing cheap support contracts.


Rob Bearden at Structure Data 2014

I had a broad discussion with Bearden last week about the Hadoop market and some of Hortonworks’ recent moves in that space, including the somewhat-controversial Open Data Platform alliance it helped to create along with Pivotal, IBM, GE and others. Here are the highlights from that interview. (If you want to hear more from Bearden and perhaps ask him some of your own questions, make sure to attend our Structure Data conference March 18 and 19 in New York. Other notable Hadoop-market speakers include Cloudera CEO Tom Reilly, MapR CEO John Schroeder and Databricks CEO (and Spark co-creator) Ion Stoica.)

Explain the rationale behind the Open Data Platform

Bearden wouldn’t comment specifically on criticisms — made most loudly by Cloudera’s Mike Olson and Doug Cutting, as well as some industry analysts — that the Open Data Platform, or ODP, is somehow antithetical to open source or the Apache Software Foundation. “What I would say,” he noted, “is the people who are committed to true open source and an open platform for the community are pretty excited about the thing.”

He also chalked up a lot of the criticism of the ODP to misunderstanding about how it really will work in practice. “One of the things I don’t think is very clear on the Open Data Platform alliance is that we’re actually going to provide what we’ll refer to as the core for that alliance, that is based on core Hadoop — so HDFS and YARN and Ambari,” Bearden explained. “We’re providing that, which is obviously directly from Apache, and it’s the exact same bit line that [the Hortonworks Data Platform] is based on.”

Pivotal CEO Paul Maritz at Structure Data 2014.

Paul Maritz, CEO of Hortonworks partner and ODP member Pivotal, at Structure Data 2014.

So, the core Hadoop distribution that platform members will use is based on Apache code, and anything that ODP members want to add on top of it will also have to go through Apache. These could be existing Apache projects, or they could be new projects the members decide to start on their own, Bearden said.

“We’re actually strengthening the position of the Apache Software Foundation,” he said. He added later in the interview, on the same point, that people shouldn’t view the ODP as much different than they view Hortonworks (or, in many respects, Cloudera or MapR). “[The Apache Software Foundation] is the engineering arm,” he said, “and this entity will become the productization and packaging arm for [Apache].”

So, it’s Cloudera vs. MapR vs. Hortonworks et al?

I asked Bearden whether the formation of the ODP officially makes the Hadoop market a case of Cloudera and MapR versus the Hortonworks ecosystem. That seems like the case to me, considering that the ODP is essentially providing the core for a handful of potentially big players in the Hadoop space. And even if they’re not ODP members, companies such as Microsoft and Rackspace have built their Hadoop products largely on top of the Hortonworks platform and with its help.

Bearden wouldn’t bite. At least not yet.

“I wouldn’t say it’s the other guys versus all of us,” he said. “I would say what’s happened is the community has realized this is what they want and it fits in our model that we’re driving very cleanly. . . . And we’re not doing anything up the stack to try and disintermediate them, and we de-risk it because we’re all open.”

The “this” he’s referring to is the ability of its partners to stop spending resources keeping up with the core Hadoop technology and instead focus on how they can monetize their own intellectual property. “To do that, the more data they put under management, the faster and the more-stable and enterprise-viable [the platform on which they have that data], the faster they monetize and the bigger they monetize the rest of their platform,” Bearden said.

Microsoft CEO Satya Nadella speaks at a Microsoft cloud event. Photo by Jonathan Vanian/Gigaom

Microsoft CEO Satya Nadella speaks at a Microsoft cloud event about that company’s newfound embrace of open source.

Are you standing by your prediction of a billion-dollar company?

“I am not backing off that at all,” Bearden said, in reference to his prediction at Structure Data last year that Hadoop will soon become a multi-billion-dollar market and Hortonworks will be a billion-dollar company in terms of revenue. He said it’s fair to look at revenue alone in assessing the businesses in this space, but it’s not the be-all and end-all.

“It’s less about [pure money] and more about what is the ecosystem doing to really start adopting this,” he said. “Are they trying to fight it and reject it, or are they really starting to embrace it and pull it through? Same with the big customers. . . .
“When those things are happening, the money shows up. It just does.”

Hadoop is actually just a part — albeit a big one — of a major evolution in the data-infrastructure space, he explained. And as companies start replacing the pieces of their data environments, they’ll do so with the open source options that now dominate new technologies. These include Hadoop, NoSQL databases, Storm, Kafka, Spark and the like.

In fact, Bearden said, “Open source companies can be very successful in terms of revenue growth and in terms of profitability faster than the old proprietary platforms got there.”

Time will tell.

Update: This post was updated at 8:39 p.m. PT to correct the amount of Hortonworks’ fourth quarter revenue and losses. Revenue was $12.7 million, not $12.5 million as originally reported, and losses were $90.6 million for the quarter and $177.3 million for the year. The originally reported numbers were for gross loss.

Microsoft’s machine learning guru on why data matters sooooo much


Not surprisingly, Joseph Sirosh has big ambitions for his product portfolio at Microsoft, which includes Azure ML, HDInsight and other tools. Chief among them is making it easy for mere mortals to consume these data services from the applications they’re familiar with. Take Excel, for example.

If a financial analyst can, with a few clicks, send data to a forecast service in the cloud, then get the numbers back, visualized on the same spreadsheet, that’s a pretty powerful story, said Sirosh, who is corporate VP of machine learning for Microsoft.

But as valuable as those applications and services are, more and more of the value to be derived from computation over time will be the data itself, not all those tech underpinnings. “In the future a huge part of the value generated from computing will come from the data as opposed to storage and operating systems and basic infrastructure,” he noted on this week’s podcast. Which is why one topic under discussion at next month’s Structure Data show will be who owns all the data flowing betwixt and between various systems, the internet of things and so on.

When it comes to getting corporations running these new systems, Microsoft may have an ace in the hole because so many of them already use key Microsoft tools: Active Directory, SQL Server, Excel. That gives them a pretty good on-ramp to Microsoft Azure and its resident services. Sirosh makes a compelling case and we’ll talk to him more on stage at Structure Data next month in New York City.

In the first half of the show, Derrick Harris and I talk about how the Hadoop world has returned to its feisty and oh so interesting roots. When Pivotal announced its plan to offload support of Hadoop to Hortonworks and work with that company, along with IBM and GE, on the Open Data Platform, Cloudera’s Mike Olson, the company’s chief strategy officer and founding CEO, responded in a blog post with his take.

Also on the docket: @WalmartLabs’ massive OpenStack production private cloud implementation.


Joseph Sirosh



Hosts: Barb Darrow and Derrick Harris.



No, you don’t need a ton of data to do deep learning 

VMware wants all those cloud workloads “marooned” in AWS

Don’t like your cloud vendor? Wait a second.

Hilary Mason on taking big data from theory to reality

On the importance of building privacy into apps and Reddit AMAs

Cloudera claims more than $100M in revenue in 2014

Hadoop vendor Cloudera announced on Tuesday that the company’s “[p]reliminary unaudited total revenue surpassed $100 million” in 2014. That the company, which is still privately held, would choose to disclose even that much information about its finances speaks to the fast maturation, growing competition and big egos in the Hadoop space.

While $100 million is a nice, round benchmark number, the number by itself doesn’t mean much of anything. We still don’t know how much profit Cloudera made last year or, more likely, how big of a loss it sustained. What we do know, however, is that it earned more than bitter rivals Hortonworks (it claimed $33.4 million through the first nine months of 2014, and will release its first official earnings report next week) and probably MapR (I’ve reached out to MapR about this and will update this if I’m wrong). However, Cloudera claims 525 customers are paying for its software (an 85 percent improvement since 2013), while MapR in December claimed more than 700 paying customers.

Cloudera also did about as much business as EMC-VMware spinoff Pivotal claims its big data business did in 2014. On Tuesday, Pivotal open sourced much of its Hadoop and database technology, and teamed up with Hortonworks and a bunch of software vendors large and small to form a new Hadoop alliance called the Open Data Platform. Cloudera’s Mike Olson, the company’s chief strategy officer and founding CEO, called the move, essentially, disingenuous and more an attempt to save Pivotal’s business than a real attempt to advance open source Hadoop software.

Hortonworks CEO Rob Bearden at Structure Data 2014.


All of this grandstanding and positioning is part of a quest to secure business in a Hadoop market that analysts predict will be worth billions in the years to come, and also an attempt by each company to prove to potential investors that its business model is the best. Hortonworks surprised a lot of people by going public in December, and the stock has remained stable since then (although its share price dropped more than two percent on Tuesday despite the news with Pivotal). Many people suspect Cloudera and MapR will go public this year, and Pivotal at some point as well.

This much action should make for an entertaining and informative Structure Data conference, which is now less than a month away. We’ll have the CEOs of Cloudera, Hortonworks and MapR all on stage talking about the business of big data, as well as the CEO of Apache Spark startup Databricks, which might prove to be a great partner for Hadoop vendors as well as a thorn in their sides. Big users, including Goldman Sachs, ESPN and Lockheed Martin, will also be talking about the technologies and objectives driving their big data efforts.

Pivotal open sources its Hadoop and Greenplum tech, and then some

Pivotal, the cloud computing and big data company that spun out from EMC and VMware in 2013, is open sourcing its entire portfolio of big data technologies and is teaming up with Hortonworks, IBM, GE, and several other companies on a Hadoop effort called the Open Data Platform.

Rumors about the fate of the company’s data business have been circulating since a round of layoffs began in November, but, according to Pivotal, the situation isn’t as dire as some initial reports suggested.

There is a lot of information coming out of the company about this, but here are the key parts:

  • Pivotal is still selling licenses and support for its Greenplum, HAWQ and GemFire database products, but it is also releasing the core code bases for those technologies as open source.
  • Pivotal is still offering its own Hadoop distribution, Pivotal HD, but has slowed development on core components of MapReduce, YARN, Ambari and the Hadoop Distributed File System. Those four pieces are the starting point for a new association called the Open Data Platform, which includes Pivotal, GE, Hortonworks, IBM, Infosys, SAS, Altiscale, EMC, Verizon Enterprise Solutions, VMware, Teradata and “a large international telecommunications firm,” and which promises to build its Hadoop technologies using a standard core of code.
  • Pivotal is working with Hortonworks to make Pivotal’s big data technologies run on the Hortonworks Data Platform, and eventually on the Open Data Platform core. Pivotal will continue offering enterprise support for Pivotal HD, although it will outsource to Hortonworks support requests involving the guts of Hadoop (e.g., MapReduce and HDFS).

Sunny Madra, vice president of the data and mobile product group at Pivotal, said the company has a relatively successful big data business already — $100 million overall, $40 million of which came from the Big Data Suite license bundle it announced last year — but suggested that it sees the writing on the wall. Open source software is a huge industry trend, and he thinks pushing against it is as fruitless as pushing against cloud computing several years ago.

“We’re starting to see open source pop up as an RFP within enterprises,” he said. “. . . If you’re picking software [today] . . . you’d look to open source.”


The Pivotal Big Data Suite.

Madra pointed to Pivotal’s revenue numbers as proof the company didn’t open source its software because no one wanted to pay for it. “We wouldn’t have a $100 million business . . . if we couldn’t sell this,” he said. Maybe, but maybe not: Hortonworks isn’t doing $100 million a year, but word was that Cloudera was doing it years ago (on Tuesday, Cloudera did claim more than $100 million in revenue in 2014). Depending on how one defines “big data,” companies like Microsoft and Oracle are probably making much more money.

However, there were some layoffs late last year, which Madra attributed to consolidation of people, offices and efforts rather than a failing business. Pivotal wanted to close some global offices and bring the data team and Cloud Foundry teams under the same leadership, and to focus its development resources on its own intellectual property around Hadoop. “Do we really need a team going and testing our own distribution,” he asked, “troubleshooting it, certifying it against technologies and all that goes along with that?”

EMC first launched the Pivotal HD Hadoop distribution, as well as the HAWQ SQL-on-Hadoop engine, with much ado just over two years ago.

The deal with Hortonworks helps alleviate that engineering burden in the short term, and the Open Data Platform is supposed to help solve it over a longer period. Madra explained the goal of the organization as Linux-like, meaning that customers should be able to switch from one Hadoop distribution to the next and know the kernel will be the same, just like they do with the various flavors of the Linux operating system.

Mike Olson, Cloudera’s chief strategy officer and founding CEO, offered a harsh rebuttal to the Open Data Platform in a blog post on Tuesday, questioning the utility and politics of vendor-led consortia like this. He simultaneously praised Hortonworks for its commitment to open source Hadoop and bashed Pivotal on the same issue, but wrote, among other things, of the Open Data Platform: “The Pivotal and Hortonworks alliance, notwithstanding the marketing, is antithetical to the open source model and the Apache way.”

The Pivotal HD and HAWQ architecture

Much of this has been open sourced or replaced.

As part of Pivotal’s Tuesday news, the company also announced additions to its Big Data Suite package, including the Redis key-value store, RabbitMQ messaging queue and Spring XD data pipeline framework, as well as the ability to run the various components on the company’s Cloud Foundry platform. Madra actually attributes a lot of Pivotal’s decision to open source its data technologies, as well as its execution, to the relative success the company has had with Cloud Foundry, which has always involved an open source foundation as well as a commercial offering.

“Had we not had the learnings that we had in Cloud Foundry, then I think it would have been a lot more challenging,” he said.

Whether or not one believes Pivotal’s spin on the situation, though, the company is right in realizing that it’s open source or bust in the big data space right now. They have different philosophies and strategies around it, but major Hadoop vendors Cloudera, Hortonworks and MapR are all largely focused on open-source technology. The most popular Hadoop-ecosystem technologies, including Spark, Storm and Kafka, are open source, as well. (CEOs, founders and creators from many of these companies and projects will be speaking at our Structure Data conference next month in New York.)

Pivotal might eventually sell billions of dollars worth of software licenses for its suite of big data products — there’s certainly a good story there if it can align the big data and Cloud Foundry businesses into a cohesive platform — but it probably has reached its plateau without having an open source story to tell.

Update: This post was updated at 12:22 p.m. PT to add information about Cloudera’s revenue.

Exclusive: Pivotal CEO says open source Hadoop tech is coming

Pivotal, the cloud computing spinoff from EMC and VMware that launched in 2013, is preparing to blow up its big data business by open sourcing a whole lot of it.

Rumors of changes began circulating in November, after CRN reported that Pivotal was in the process of laying off about 60 people, many of whom worked on the big data products. The flames were stoked again on Friday by a report in VentureBeat claiming the company might cease development of its Hadoop distribution and/or open source various pieces of its database technology such as Greenplum and HAWQ.

Gigaom has confirmed at least part of this is true, via an emailed statement credited to Pivotal CEO Paul Maritz:

“We are anticipating an interesting set of announcements on Feb 17th. However rumors to the effect that Pivotal is getting out of the Hadoop space are categorically false. The announcements, which involve multiple parties, will greatly add to the momentum in the Hadoop and Open Source space, and will have several dimensions that we believe customers will find very compelling.”

Those announcements will take place via webcast.

Paul Maritz at Structure Data 2014. (© Photo by Jakub Mosur).


Multiple external sources have told Gigaom that Pivotal does indeed plan to open source its Hadoop technology, and that it will work with former rival (but, more recently, partner) Hortonworks to maintain and develop it. IBM was also mentioned as a partner.

Members of the Hadoop team were let go around November when active development stopped, the sources said, and some senior big data personnel — including Senior Vice President of R&D Hugh Williams and Chief Scientist Milind Bhandarkar — departed the company in December, according to their LinkedIn profiles. Both of them claim to be working on new startup projects.

When EMC first introduced its Hadoop distribution, called Pivotal HD, in February 2013 (it was one of the technologies that Pivotal the company inherited), executive Scott Yara touted the size of EMC’s Hadoop engineering team and the quality of its technology over that of its smaller rivals Cloudera, MapR and Hortonworks. However, Pivotal has been getting noticeably more in touch with its open source side recently, including with the Hortonworks partnership referenced above (around the Apache Ambari project) and a big commitment to the open source Tachyon in-memory file system project.

The current Pivotal HD Enterprise architecture.

Pivotal has been a big proponent of the “data lake” strategy, whereby companies store all their data in a big Hadoop cluster and use various higher-level programs to access and analyze it. Last April, the company took a somewhat brave step toward ensuring its customers could do that by relaxing its product licensing and making Pivotal HD storage free.

Whatever happens with Pivotal’s technology, it’s not shocking that the company would decide to take the open source path. Its flagship technology is the open source Cloud Foundry cloud computing platform, and Cloudera, Hortonworks and MapR have cornered the market on Hadoop sales, by all accounts. If Pivotal has some good code in its base, it’s probably best to get it into the open source world and ride the momentum rather than try to fight against it.

For more on the fast-moving Hadoop space, be sure to attend our Structure Data conference March 18-19 in New York. We’ll have the CEOs of Cloudera, Hortonworks and MapR on stage to talk about the business, as well as Databricks CEO Ion Stoica discussing the Apache Spark project that is presently kicking parts of Hadoop into hyperdrive.

Hortonworks and 3 large users launch a project to secure Hadoop

Hadoop vendor Hortonworks, along with customers Target, Merck and Aetna, and software vendor SAS, has started a new group designed to ensure that data stored inside Hadoop systems is only used how it’s supposed to be used and seen by whom it’s supposed to be seen. The effort, called the Data Governance Initiative, will function as an open source project and will address the concerns of enterprises that want to store more data in Hadoop but fear the system won’t match industry regulations or stand up to audits.

The group is similar in spirit to the Open Compute Foundation, which launched in 2011. Facebook spearheaded the Open Compute Project effort and drove a lot of early innovation, but has seen lots of contributions and involvement from technology companies such as Microsoft and end-user companies such as Goldman Sachs. Tim Hall, Hortonworks’ vice president of project management, said Target, Merck and Aetna will be active contributors to the new Hadoop organization — sharing their business and technical expertise in the markets in which they operate, as well as developing and deploying code.

Among the reasons for creating the Data Governance Initiative were questions about the sustainability of the Hortonworks open source business model, some of which were brought to light by the revenue numbers it published as part of its initial public offering process, Hall acknowledged. The idea is that this group will demonstrate Hortonworks’ commitment to enterprise concerns and work with large companies to solve them. It will also show how Hortonworks can drive Hadoop innovation without abandoning its open source model.

“We want to make sure folks understand it’s not just these software companies we can work with,” Hall said, referencing the initial phases of Hadoop development led by companies such as Yahoo and Facebook.

A high-level view of Apache Falcon.

Hortonworks plans to publish more information about the Data Governance Initiative’s technical roadmap and early work in February, but the Apache Falcon and Apache Ranger projects that Hortonworks backs will be key components, and there will be an emphasis on maintaining policies as data moves between Hadoop and other data systems. Code will be contributed back to the Apache Software Foundation.

Hall said any company is welcome to join, including Hadoop rivals such as MapR and Cloudera (which has its own pet projects around Hadoop security), but, he noted, “It’s up to the other vendors to recognize the value that’s being created here.”

“There’s no reason why Cloudera couldn’t wire up their [Apache] Sentry project to this,” Hall added. “. . . We’d be happy to have them participate in this once it goes into incubator status.”

Of course, Hadoop competition being what it is, he might well suspect that won’t happen anytime soon. Cloudera actually published a well-timed blog post on Wednesday morning touting the security features of its Hadoop distribution.

You can hear all about the Hadoop space at our Structure Data conference in March, where Hortonworks CEO Rob Bearden, Cloudera CEO Tom Reilly and MapR CEO John Schroeder will each share their visions of where the technology is headed.

Update: This post was updated at 12:20 to correct the name of the organization. It is the Data Governance Initiative, not the Data Governance Institute.

The 5 stories that defined the big data market in 2014

There is no other way to put it: 2014 was a huge year for the big data market. It seems years of talk about what’s possible are finally giving way to some real action on the technology front — and there’s a wave of cash following close behind it.

Here are the five stories from the past year that were meaningful in their own rights, but really set the stage for bigger things to come. We’ll discuss many of these topics in depth at our Structure Data conference in March, but until then feel free to let me know in the comments what I missed, where I went wrong or why I’m right.

5. Satya Nadella takes the reins at Microsoft

Microsoft CEO Satya Nadella has long understood the importance of data to the company’s long-term survival, and his ascendance to the top spot ensures Microsoft won’t lose sight of that. Since Nadella was appointed CEO in February, we’ve already seen Microsoft embrace the internet of things, and roll out new data-centric products such as Cortana, Skype Translate and Azure Machine Learning. Microsoft has been a major player in nearly every facet of IT for decades and how it executes in today’s data-driven world might dictate how long it remains in the game.

Microsoft CEO Satya Nadella speaks at a Microsoft Cloud event. Photo by Jonathan Vanian/Gigaom

4. Apache Spark goes legit

It was inevitable that the Spark data-processing framework would become a top-level project within the Apache Software Foundation, but the formal designation felt like an official passing-of-the-torch nonetheless. Spark promises to do for the Hadoop ecosystem all the things MapReduce never could around speed and usability, so it’s no wonder Hadoop vendors, open source projects and even some forward-thinking startups are all betting big on the technology. Databricks, the first startup trying to commercialize Spark, has benefited from this momentum, as well.

Spark co-creator and Databricks CEO Ion Stoica.

3. IBM bets its future on Watson

Big Blue might have abandoned its server and microprocessor businesses, but IBM is doubling down on cognitive computing and expects its new Watson division to grow into a $10 billion business. The company hasn’t wasted any time trying to get the technology into users’ hands — it has since announced numerous research and commercial collaborations, highlighted applications built atop Watson and even worked Watson tech into the IBM cloud platform and a user-friendly analytics service. IBM’s experiences with Watson won’t only affect its bottom line; they could be a strong indicator of how enterprises will ultimately use artificial intelligence software.

A shot of IBM’s new Watson division headquarters in Manhattan.

2. Google buys DeepMind

It’s hard to find a more exciting technology field than artificial intelligence right now, and deep learning is the force behind a lot of that excitement. Although there were a myriad of acquisitions, startup launches and research breakthroughs in 2014, it was Google’s acquisition of London-based startup DeepMind in January that set the tone for the year. The price tag, rumored to be anywhere from $450 million to $628 million, got the mainstream technology media paying attention, and it also let deep learning believers (including those at competing companies) know just how important deep learning is to Google.

Google’s Jeff Dean talks about early deep learning results at Structure 2013.

1. Hortonworks goes public

Cloudera’s massive (and somewhat convoluted) deal with Intel boosted the company’s valuation past $4 billion and sent industry-watchers atwitter, but the Hortonworks IPO in December was really a game-changer. It came faster than most people expected, was more successful than many people expected, and should put the pressure on rivals Cloudera and MapR to act in 2015. With a billion-plus-dollar market cap and public market trust, Hortonworks can afford to scale its business and technology — and maybe even steal some valuable mindshare — as the three companies vie to own what could be a humongous software market in a few years’ time.


Hortonworks rings the opening bell on its IPO day.

Honorable mentions

Red Hat’s success aside, it’s hard to profit from free

Red Hat, which just reported a profit of $47.9 million (or 26 cents a share) on revenue of $456 million for its third quarter, has managed to pull off a tricky feat: It’s been able to make money off of free (well, open-source) software. (Its profit for the year-ago quarter was $52 million.)

In a blog post, Red Hat CEO Jim Whitehurst said the old days when IT pros risked their careers by betting on open source rather than proprietary software are over. That old adage that you can’t be fired for buying IBM should be updated, I guess.

In what looks something like a victory lap, Whitehurst wrote that every company now runs some sort of open source software. He wrote:

Many of us remember the now infamous “Halloween Documents,” the classic quote from former Microsoft CEO Steve Ballmer describing Linux as a “cancer,” and comments made by former Microsoft CEO Bill Gates, saying, “So certainly we think of [Linux] as a competitor in the student and hobbyist market. But I really do not think in the commercial market, we’ll see it [compete with Windows] in any significant way.”

He contrasted that with Ballmer successor Satya Nadella’s professed love of Linux. To be fair, Azure was well down the road to embracing open source late in Ballmer’s reign, but Microsoft’s transition from open-source basher to open-source lover is still noteworthy, and indicative of open-source software’s widespread adoption. If you can’t beat ’em, join ’em.

Open source is great, but profitable?

Red Hat CEO Jim Whitehurst

So everyone agrees that open source is goodness. But not everyone is sure that many companies will be able to replicate Red Hat’s success profiting from it.

Sure, Microsoft wants people to run Linux and Java and whatever on Azure because that gives Azure a critical mass of new-age users who are not necessarily enamored of .NET and Windows. And, Microsoft has lots of revenue opportunities once those developers and companies are on Azure. (The fact that Microsoft is open-sourcing .NET is icing on the open-source cake.)

But how does a company that is 100 percent focused on, say, selling support and services and enhancements to Apache Hadoop make money? A couple of these companies are extremely well-funded, and it’s unclear where the cash burn ends and the profits can begin.

Replicating Red Hat — no easy task

Gigaom Research Analyst Andrew Brust has a good take on Hortonworks as a potential tracking stock for those who want to see if the open-source-plus-IPO model will pay off. As he states:

“Hadoop is becoming a universal data layer, increasingly embedded in other software. Open source may not be the fastest road to monetizing software, but it is a super highway for establishing standards that gain rapid industry-wide support.”

In an interesting blog post published about a month before the Hortonworks IPO, Host Analytics CEO Dave Kellogg said Red Hat’s model may be hard for Hortonworks and others to replicate. In his view, Red Hat’s model of selling professional services, support and maintenance for the Red Hat Enterprise Linux (RHEL) operating system and JBoss middleware works because these products are relatively low-level infrastructure. In his words:

  • The lower-level the category the more customers want support on it.
  • The more you can commoditize the layers below you, the more the market likes it. Red Hat does this for servers.
  • The lower-level the category the more the market actually “wants” it standardized in order to minimize entropy. This is why low-level infrastructure categories become natural monopolies or oligopolies.

And even given Red Hat’s success, it is still a small company compared to commercial software giants like Oracle, Microsoft and IBM, as Kellogg also pointed out.

RHT Market Cap data by YCharts

So, the big question is whether a new generation of open-source-rooted companies — in big data, in analytics, in middleware — can wring profits out of what is essentially free stuff. I’m not convinced.

That is not to say there can’t be a highly profitable exit, something along the lines of Oracle’s $7.4 billion pick-up of Java and MySQL via Sun Microsystems. As one wag said at the time: “Doesn’t [Oracle Chairman] Larry Ellison know he could have just downloaded MySQL for free?”