Cloudera CEO declares victory over big data competition

Cloudera CEO Tom Reilly doesn’t often mince words when it comes to describing his competition in the Hadoop space, or Cloudera’s position among those other companies. In October 2013, Reilly told me he didn’t consider Hortonworks or MapR to be Cloudera’s real competition, but rather larger data-management companies such as IBM and EMC-VMware spinoff Pivotal. And now, Reilly says, “We declare victory over at least one of our competitors.”

He was referring to Pivotal, and the Open Data Platform, or ODP, alliance it helped launch a couple of weeks ago along with Hortonworks, IBM, Teradata and several other big data vendors. In an interview last week, Reilly called that alliance “a ruse and, frankly, a graceful exit for Pivotal,” which laid off a number of employees working on its Hadoop distribution and is now outsourcing most of its core Hadoop development and support to Hortonworks.

You can read more from Reilly below, including his takes on Hortonworks, Hadoop revenues and Spark, as well as some expanded thoughts on the ODP. For more information about the Open Data Platform from the perspectives of the members, you can read our coverage of its launch in mid-February as well as my subsequent interview with Hortonworks CEO Rob Bearden, who explains in some detail how that alliance will work.

If you want to hear about the fast-changing, highly competitive and multi-billion-dollar business of big data straight from horses’ mouths, make sure to attend our Structure Data conference March 18 and 19 in New York. Speakers include Cloudera’s Reilly and Hortonworks’ Bearden, as well as MapR CEO John Schroeder, Databricks CEO (and Spark co-creator) Ion Stoica, and other big data executives and users, including those from large firms such as Lockheed Martin and Goldman Sachs.

GIGAOM STRUCTURE DATA 2014

You down with ODP? No, not me

While Hortonworks explains the Open Data Platform essentially as a way for member companies to build on top of Hadoop without, I guess, formally paying Hortonworks for support or embracing its entire Hadoop distribution, Reilly describes it as little more than a marketing ploy. Aside from calling it a graceful exit for Pivotal (and, arguably, IBM), he takes issue with even calling it “open.” If the ODP were truly open, he said, companies wouldn’t have to pay for membership, Cloudera would have been invited and, when it asked about the alliance, it wouldn’t have been required to sign a non-disclosure agreement.

What’s more, Reilly isn’t certain why the ODP is really necessary technologically. It’s presently composed of four of the most mature Hadoop components, he explained, and a lot of companies are actually trying to move off of MapReduce (to Spark or other processing engines) and, in some cases, even the Hadoop Distributed File System. Hortonworks, which supplied the ODP core and presumably will handle much of the future engineering work, will be stuck doing the other members’ bidding as they decide which of several viable SQL engines and other components to include, he added.

“I don’t think we could have scripted [the Open Data Platform news] any better,” Reilly said. He added, “[T]he formation of the ODP … is a big shift in the landscape. We think it’s a shift to our advantage.”

(If you want a possibly more nuanced take on the ODP, check out this blog post by Altiscale CEO Raymie Stata. Altiscale is an ODP member, but Stata has been involved with the Apache Software Foundation and Hadoop since his days as Yahoo CTO and is a generally trustworthy source on the space.)

Hortonworks CEO Rob Bearden at Structure Data 2014.

Really, Hortonworks isn’t a competitor?

Asked about the competitive landscape among Hadoop vendors, Reilly doubled down on his assessment from last October, calling Cloudera’s business model “a much more aggressive play [and] a much bolder vision” than what Hortonworks and MapR are doing. They’re often “submissive” to partners and treat Hadoop like an “add-on” rather than a focal point. If anything, Hortonworks has burdened itself by going public and by signing on to help prop up the legacy technologies that IBM and Pivotal are trying to sell, Reilly said.

Still, he added, Cloudera’s “enterprise data hub” strategy is more akin to the IBM and Pivotal business models of trying to become the centerpiece of customers’ data architectures by selling databases, analytics software and other components besides just Hadoop.

If you don’t buy that logic, Reilly has another argument that boils down to money. Cloudera earned more than $100 million last year (that’s GAAP revenue, he confirmed), while Hortonworks earned $46 million and, he suggested, MapR likely earned a similar number. Combine that with Cloudera’s huge investment from Intel in 2014 — it’s now “the largest privately funded enterprise software company in history,” Reilly said — and Cloudera owns the Hadoop space.

“We intend to take advantage” of this war chest to acquire companies and invest in new products, Reilly said. And although he wouldn’t get into specifics, he noted, “There’s no shortage of areas to look in.”

Diane Bryant, senior vice president and general manager of Intel's Data Center Group, at Structure 2014.

The future is in applications

Reilly said that more than 60 percent of Cloudera sales are now “enterprise data hub” deployments, which is his way of saying its customers are becoming more cognizant of Hadoop as an application platform rather than just a tool. Yes, it can still store lots of data and transform it into something SQL databases can read, but customers are now building new applications for things like customer churn and network optimization with Hadoop as the core. Between 15 and 20 financial services companies are using Cloudera to detect money laundering, he said, and Cloudera has trained its salesforce on a handful of the most popular use cases.

One of the technologies helping make Hadoop look a lot better for new application types is Spark, which simplifies the programming of data-processing jobs and runs them a lot faster than MapReduce does. Thanks to the YARN cluster-management framework, users can store data in Hadoop and process it using Spark, MapReduce and other processing engines. Reilly reiterated Cloudera’s big investment and big bet on Spark, saying that he expects a lot of workloads will eventually run on it.

Databricks CEO (and AMPLab co-director) Ion Stoica.

Databricks CEO (and Spark co-creator) Ion Stoica.

A year into the Intel deal and …

“It is a tremendous partnership,” Reilly said.

Intel has been integral in helping Cloudera form partnerships with companies such as Microsoft and EMC, as well as with customers such as MasterCard, he said. The latter deal is particularly interesting because Cloudera and Intel’s joint engineering on hardware-based encryption helped Cloudera deploy a PCI-compliant Hadoop cluster and MasterCard is now out pushing that system to its own clients via its MasterCard Advisors professional services arm.

Reilly added that Cloudera and Intel are also working together on new chips designed specifically for analytic workloads, which will take advantage of non-RAM memory types.

Asked whether Cloudera’s push to deploy more workloads in cloud environments is at odds with Intel’s goal to sell more chips, Reilly pointed to Intel’s recent strategy of designing chips especially for cloud computing environments. The company is operating under the assumption that data has gravity and that certain data that originates in the cloud, such as internet-of-things or sensor data, will stay there, while large enterprises will continue to store a large portion of their data locally.

Wherever they run, Reilly said, “[Intel] just wants more workloads.”

Cloudera claims more than $100M in revenue in 2014

Hadoop vendor Cloudera announced on Tuesday that the company’s “[p]reliminary unaudited total revenue surpassed $100 million” in 2014. That the company, which is still privately held, would choose to disclose even that much information about its finances speaks to the fast maturation, growing competition and big egos in the Hadoop space.

While $100 million is a nice, round benchmark number, the number by itself doesn’t mean much of anything. We still don’t know how much profit Cloudera made last year or, more likely, how big of a loss it sustained. What we do know, however, is that it earned more than bitter rivals Hortonworks (it claimed $33.4 million through the first nine months of 2014, and will release its first official earnings report next week) and probably MapR (I’ve reached out to MapR about this and will update this if I’m wrong). However, Cloudera claims 525 customers are paying for its software (an 85 percent improvement since 2013), while MapR in December claimed more than 700 paying customers.

Cloudera also did about as much business as EMC-VMware spinoff Pivotal claims its big data business did in 2014. On Tuesday, Pivotal open sourced much of its Hadoop and database technology, and teamed up with Hortonworks and a bunch of software vendors large and small to form a new Hadoop alliance called the Open Data Platform. Cloudera’s Mike Olson, the company’s chief strategy officer and founding CEO, called the move, essentially, disingenuous and more an attempt to save Pivotal’s business than a real attempt to advance open source Hadoop software.

Hortonworks CEO Rob Bearden at Structure Data 2014.

All of this grandstanding and positioning is part of a quest to secure business in a Hadoop market that analysts predict will be worth billions in the years to come, and also an attempt by each company to prove to potential investors that its business model is the best. Hortonworks surprised a lot of people by going public in December, and the stock has remained stable since then (although its share price dropped more than two percent on Tuesday despite the news with Pivotal). Many people suspect Cloudera and MapR will go public this year, and Pivotal at some point as well.

This much action should make for an entertaining and informative Structure Data conference, which is now less than a month away. We’ll have the CEOs of Cloudera, Hortonworks and MapR all on stage talking about the business of big data, as well as the CEO of Apache Spark startup Databricks, which might prove to be a great partner for Hadoop vendors as well as a thorn in their sides. Big users, including Goldman Sachs, ESPN and Lockheed Martin, will also be talking about the technologies and objectives driving their big data efforts.

Meet Myriad, a new project for running Hadoop on Mesos

Hadoop vendor MapR and data center automation startup Mesosphere have created an open source technology called Myriad, which is supposed to make it easier to run Hadoop workloads on top of the popular Mesos cluster-management software. More specifically, Myriad allows the YARN resource scheduler — the linchpin of Hadoop 2.0 that lets the platform run processing frameworks other than MapReduce — to run on top of Mesos, effectively creating an auto-scaling Hadoop cluster that’s relatively future-proof.

“Before, you had to make a choice, and now you can just run YARN on Mesos,” explained Mesosphere founder and CEO Florian Leibert. “… I think the goal here is to have more workloads in a shared environment.”

What he means is that companies will no longer have to run Hadoop on one set of resources while running web servers, Spark and any number of other workloads on other resources managed by Mesos. Essentially, all of these things will now be available as data center services residing on the same set of machines. Mesos has always supported Hadoop as a workload type (and companies including Twitter and Airbnb have taken advantage of this), but YARN has appeal as the default resource manager for newer distributions of Hadoop because it’s designed specifically for that platform and, well, is one of the foundations of those newer distributions.

The old static partition.

With Myriad, YARN can still manage the resource allocation to Hadoop jobs, while Mesos handles other tasks as well as the task of scaling out the YARN cluster itself. So instead of the current state of affairs, where YARN clusters are statically defined and new nodes must be manually configured, Mesos can spin up new YARN nodes automatically based on the policies in place and the available resources of the cluster.
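
For readers who think better in code, here is a rough sketch of the difference between a statically sized YARN cluster and one that a Mesos-like layer grows on demand. It is a toy model, not Myriad's actual code or API: the `Cluster` class, the `rebalance` function and the `max_yarn_nodes` policy cap are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Toy model of a shared data center (hypothetical, not Myriad's API)."""
    total_nodes: int   # every machine managed by the Mesos-like layer
    yarn_nodes: int    # machines currently acting as YARN nodes
    other_nodes: int   # machines busy with non-Hadoop workloads

    @property
    def free_nodes(self) -> int:
        # Machines that no workload has claimed yet.
        return self.total_nodes - self.yarn_nodes - self.other_nodes


def rebalance(cluster: Cluster, yarn_demand: int, max_yarn_nodes: int) -> int:
    """Grow the YARN pool toward demand and return the number of nodes added.

    Growth is limited by an assumed policy cap (max_yarn_nodes) and by the
    machines that are actually free. Shrinking is left out to keep it short.
    """
    target = min(yarn_demand, max_yarn_nodes)
    wanted = target - cluster.yarn_nodes
    added = max(0, min(wanted, cluster.free_nodes))
    cluster.yarn_nodes += added
    return added


if __name__ == "__main__":
    shared = Cluster(total_nodes=10, yarn_nodes=2, other_nodes=5)
    added = rebalance(shared, yarn_demand=8, max_yarn_nodes=6)
    print(f"added {added} YARN nodes, {shared.free_nodes} machines left free")
```

In the static world, `yarn_nodes` would stay fixed until an administrator configured more machines by hand. Here the pool grows as far as the policy and the free capacity allow, which is the behavior the Myriad team describes.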

Mesosphere engineer Adam Bordelon said Myriad works now and that eBay and Twitter have been testing it out. eBay actually contributed quite a lot to the first version of the code. However, he noted, Myriad is still early in its development and needs quite a few more features, including around security.

“I imagine within a month or two,” he said, “it should be in production somewhere.”

Despite the fact that two commercial companies are driving Myriad at this point, Bordelon said the goal is definitely to build a community around the project. It’s currently hosted in the Mesosphere GitHub repository, but the team is working on a proposal to make it an Apache Incubator project.

“It is definitely a community effort,” he said.

The new YARN-on-Mesos architecture.

Jim Scott, MapR’s director of enterprise strategy and architecture, said that Hadoop was pitched in part as a tool for eliminating data silos. However, he added, “As we start [to] see those data silo walls come down, we’re starting to see other walls come up.” One of those walls is the relegation of Hadoop to its own dedicated cluster far away, logically at least, from everything else.

“This is the enabling function, in my mind,” he said, “that makes it so people can tear that wall down.”

MapR CEO John Schroeder will be among many speakers talking about the evolution of Hadoop and big data architectures at our Structure Data conference in New York next month. Others include Cloudera CEO Tom Reilly, Hortonworks CEO Rob Bearden, Google VP of Infrastructure Eric Brewer, Databricks CEO Ion Stoica and Amazon Web Services GM of Data Science Matt Wood.

And for more on Mesos, Mesosphere and why they have some engineers so excited, check out our May 2014 Structure Show podcast interview with Mesosphere CEO Leibert.

Why the Hortonworks IPO could be a bellwether for Hadoop

Hortonworks’ IPO filing on Monday shows that Hadoop is still a resource- and risk-intensive business, but also suggests it’s one that public market investors will be willing to back. It might also start the ball rolling for long-anticipated moves in Hadoop.

MapR raises $110M to fuel its enterprise Hadoop push

MapR has raised $110 million, $80 million of which is equity financing, in order to fuel its growing Hadoop business in the face of better-known rivals Cloudera and Hortonworks. Like those companies, MapR says it has the winning strategy and aims to be a public company.

Hadoop maturity summits

Last week’s Hadoop Summit brought announcements that further galvanize Hadoop’s versatility and mainstream status.

Spark is now part of MapR’s Hadoop distro, too

MapR is the latest Hadoop vendor to embrace Apache Spark, adding the entire Spark stack of technologies to its distribution. It’s a smart move by MapR, but just more validation that Spark might be the data-processing framework of the future.

MapR now supports YARN, puts HP Vertica on top of Hadoop

MapR is continuing along its path to Hadoop glory with new support for the YARN resource manager and a direct integration with the HP Vertica analytic database. In such a competitive space, every little edge matters.

Maybe big data is the killer app for Google’s cloud

Hadoop is popular and so is cloud computing, so it comes as no surprise that a battle would break out to establish the best place for running Hadoop. Lately, Google has been scoring some victories on the user side.