Cloudera claims more than $100M in revenue in 2014

Hadoop vendor Cloudera announced on Tuesday that the company’s “[p]reliminary unaudited total revenue surpassed $100 million” in 2014. That the company, which is still privately held, would choose to disclose even that much information about its finances speaks to the fast maturation, growing competition and big egos in the Hadoop space.

While $100 million is a nice, round benchmark number, the number by itself doesn’t mean much of anything. We still don’t know how much profit Cloudera made last year or, more likely, how big a loss it sustained. What we do know, however, is that it earned more than bitter rivals Hortonworks (which claimed $33.4 million through the first nine months of 2014, and will release its first official earnings report next week) and probably MapR (I’ve reached out to MapR about this and will update this if I’m wrong). However, Cloudera claims 525 customers are paying for its software (an 85 percent improvement since 2013), while MapR in December claimed more than 700 paying customers.

Cloudera also did about as much business as EMC-VMware spinoff Pivotal claims its big data business did in 2014. On Tuesday, Pivotal open sourced much of its Hadoop and database technology, and teamed up with Hortonworks and a bunch of software vendors large and small to form a new Hadoop alliance called the Open Data Platform. Cloudera’s Mike Olson, the company’s chief strategy officer and founding CEO, called the move, essentially, disingenuous and more an attempt to save Pivotal’s business than a real attempt to advance open source Hadoop software.

Hortonworks CEO Rob Bearden at Structure Data 2014.

All of this grandstanding and positioning is part of a quest to secure business in a Hadoop market that analysts predict will be worth billions in the years to come, and also an attempt by each company to prove to potential investors that its business model is the best. Hortonworks surprised a lot of people by going public in December, and the stock has remained stable since then (although its share price dropped more than two percent on Tuesday despite the news with Pivotal). Many people suspect Cloudera and MapR will go public this year, and Pivotal at some point as well.

This much action should make for an entertaining and informative Structure Data conference, which is now less than a month away. We’ll have the CEOs of Cloudera, Hortonworks and MapR all on stage talking about the business of big data, as well as the CEO of Apache Spark startup Databricks, which might prove to be a great partner for Hadoop vendors as well as a thorn in their sides. Big users, including Goldman Sachs, ESPN and Lockheed Martin, will also be talking about the technologies and objectives driving their big data efforts.

Pivotal open sources its Hadoop and Greenplum tech, and then some

Pivotal, the cloud computing and big data company that spun out from EMC and VMware in 2013, is open sourcing its entire portfolio of big data technologies and is teaming up with Hortonworks, IBM, GE, and several other companies on a Hadoop effort called the Open Data Platform.

Rumors about the fate of the company’s data business have been circulating since a round of layoffs began in November, but, according to Pivotal, the situation isn’t as dire as some initial reports suggested.

There is a lot of information coming out of the company about this, but here are the key parts:

  • Pivotal is still selling licenses and support for its Greenplum, HAWQ and GemFire database products, but it is also releasing the core code bases for those technologies as open source.
  • Pivotal is still offering its own Hadoop distribution, Pivotal HD, but has slowed development on core components of MapReduce, YARN, Ambari and the Hadoop Distributed File System. Those four pieces are the starting point for a new association called the Open Data Platform, which includes Pivotal, GE, Hortonworks, IBM, Infosys, SAS, Altiscale, EMC, Verizon Enterprise Solutions, VMware, Teradata and “a large international telecommunications firm,” and which promises to build its Hadoop technologies using a standard core of code.
  • Pivotal is working with Hortonworks to make Pivotal’s big data technologies run on the Hortonworks Data Platform, and eventually on the Open Data Platform core. Pivotal will continue offering enterprise support for Pivotal HD, although it will outsource to Hortonworks support requests involving the guts of Hadoop (e.g., MapReduce and HDFS).

Sunny Madra, vice president of the data and mobile product group at Pivotal, said the company has a relatively successful big data business already — $100 million overall, $40 million of which came from the Big Data Suite license bundle it announced last year — but suggested that it sees the writing on the wall. Open source software is a huge industry trend, and he thinks pushing against it is as fruitless as pushing against cloud computing several years ago.

“We’re starting to see open source pop up as an RFP within enterprises,” he said. “. . . If you’re picking software [today] . . . you’d look to open source.”


The Pivotal Big Data Suite.

Madra pointed to Pivotal’s revenue numbers as proof the company didn’t open source its software because no one wanted to pay for it. “We wouldn’t have a $100 million business . . . if we couldn’t sell this,” he said. Maybe, but maybe not: Hortonworks isn’t doing $100 million a year, but word was that Cloudera was doing it years ago (on Tuesday, Cloudera did claim more than $100 million in revenue in 2014). Depending on how one defines “big data,” companies like Microsoft and Oracle are probably making much more money.

However, there were some layoffs late last year, which Madra attributed to consolidation of people, offices and efforts rather than a failing business. Pivotal wanted to close some global offices and bring the data team and Cloud Foundry teams under the same leadership, and to focus its development resources on its own intellectual property around Hadoop. “Do we really need a team going and testing our own distribution?” he asked, along with troubleshooting it, certifying it against technologies and all that goes along with that.

EMC first launched the Pivotal HD Hadoop distribution, as well as the HAWQ SQL-on-Hadoop engine, with much ado just over two years ago.

The deal with Hortonworks helps alleviate that engineering burden in the short term, and the Open Data Platform is supposed to help solve it over a longer period. Madra explained the goal of the organization as Linux-like, meaning that customers should be able to switch from one Hadoop distribution to the next and know the kernel will be the same, just like they do with the various flavors of the Linux operating system.

Mike Olson, Cloudera’s chief strategy officer and founding CEO, offered a harsh rebuttal to the Open Data Platform in a blog post on Tuesday, questioning the utility and politics of vendor-led consortia like this. He simultaneously praised Hortonworks for its commitment to open source Hadoop and bashed Pivotal on the same issue, but wrote, among other things, of the Open Data Platform: “The Pivotal and Hortonworks alliance, notwithstanding the marketing, is antithetical to the open source model and the Apache way.”

The Pivotal HD and Hawq architecture

Much of this has been open sourced or replaced.

As part of Pivotal’s Tuesday news, the company also announced additions to its Big Data Suite package, including the Redis key-value store, RabbitMQ messaging queue and Spring XD data pipeline framework, as well as the ability to run the various components on the company’s Cloud Foundry platform. Madra actually attributes a lot of Pivotal’s decision to open source its data technologies, as well as its execution, to the relative success the company has had with Cloud Foundry, which has always involved an open source foundation as well as a commercial offering.

“Had we not had the learnings that we had in Cloud Foundry, then I think it would have been a lot more challenging,” he said.

Whether or not one believes Pivotal’s spin on the situation, though, the company is right in realizing that it’s open source or bust in the big data space right now. They have different philosophies and strategies around it, but major Hadoop vendors Cloudera, Hortonworks and MapR are all largely focused on open-source technology. The most popular Hadoop-ecosystem technologies, including Spark, Storm and Kafka, are open source, as well. (CEOs, founders and creators from many of these companies and projects will be speaking at our Structure Data conference next month in New York.)

Pivotal might eventually sell billions of dollars worth of software licenses for its suite of big data products — there’s certainly a good story there if it can align the big data and Cloud Foundry businesses into a cohesive platform — but it probably has reached its plateau without having an open source story to tell.

Update: This post was updated at 12:22 p.m. PT to add information about Cloudera’s revenue.

Facebook’s latest homemade hardware is a 128-port modular switch

Facebook has been building its own servers and storage gear for years, and last June announced its first-ever networking gear in the form of a top-of-rack switch called “Wedge.” On Wednesday, the company furthered its networking story with a new switch platform called “6-pack,” which is essentially a bunch of Wedge switches crammed together inside a single box.

The purpose of 6-pack was to build a modular platform that can handle the increase in network traffic that Facebook’s recently deployed “Fabric” data center architecture enables. The Facebook blog post announcing 6-pack goes into many more details of the design, but here is the gist:

“It is a full mesh non-blocking two-stage switch that includes 12 independent switching elements. Each independent element can switch 1.28Tbps. We have two configurations: One configuration exposes 16x40GE ports to the front and 640G (16x40GE) to the back, and the other is used for aggregation and exposes all 1.28T to the back. Each element runs its own operating system on the local server and is completely independent, from the switching aspects to the low-level board control and cooling system. This means we can modify any part of the system with no system-level impact, software or hardware. We created a unique dual backplane solution that enabled us to create a non-blocking topology.”
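The figures in that description can be checked with some quick arithmetic. The sketch below uses only the numbers quoted above plus the 128-port figure from the headline; the split between front-facing and aggregation elements is an inference from those numbers (assuming all 128 ports are 40GE front ports), not something Facebook states here.

```python
# Figures quoted in the Facebook blog post.
PORT_SPEED_GBPS = 40          # 40GE ports
FRONT_PORTS_PER_ELEMENT = 16  # "16x40GE ports to the front"
BACK_PORTS_PER_ELEMENT = 16   # "640G (16x40GE) to the back"
ELEMENTS = 12                 # "12 independent switching elements"
TOTAL_FRONT_PORTS = 128       # from the headline; assumed to be all 40GE front ports


def element_capacity_gbps() -> int:
    """Capacity of one element in the front-facing configuration."""
    front = FRONT_PORTS_PER_ELEMENT * PORT_SPEED_GBPS
    back = BACK_PORTS_PER_ELEMENT * PORT_SPEED_GBPS
    return front + back


def front_facing_elements() -> int:
    """Elements needed to expose every front port (an inference, not a quoted figure)."""
    return TOTAL_FRONT_PORTS // FRONT_PORTS_PER_ELEMENT


if __name__ == "__main__":
    print("Gbps per element:", element_capacity_gbps())
    print("front-facing elements:", front_facing_elements())
    print("aggregation elements:", ELEMENTS - front_facing_elements())
```

Under those assumptions, 640G to the front plus 640G to the back matches the quoted 1.28Tbps per element, and eight front-facing elements would leave four of the 12 for aggregation.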

In an interview about 6-pack, lead engineer Yuval Bachar described its place in the network fabric as the level above the top-of-rack Wedge switches. Facebook might have hundreds of 6-pack appliances within a given data center managing traffic coming from its untold thousands of server racks.

A 6-pack line card.

“We just add those Lego blocks, as many as we need, to build this,” he said.

Matt Corddry, Facebook’s director of engineering and hardware team lead, said all the focus on building networking gear is because Facebook’s user base keeps growing as more of the world comes online, and the content users are sharing is becoming much richer, in the form of videos, high-resolution photos and the like.

That might be the broader goal, but Facebook also has a business-level goal behind its decision to build its own gear in the first place, and to launch the open source Open Compute Project. Essentially, Facebook wants to push hardware vendors to deliver the types of technology it needs. If it can’t get them to build custom gear, it and dozens of other large-scale Open Compute partners with immense buying power can at least push the Dells and HPs and Ciscos of the world in the right direction.

Corddry said there’s nothing to report yet about Wedge or 6-pack being used anywhere outside Facebook but, he noted, “Our plan is to release the full package of Wedge in the near future to Open Compute.”


If you’re interested in hearing more about Facebook’s data center fabric, check out our recent Structure Show podcast interview with Facebook’s director of network engineering, Najam Ahmad.

Meet Myriad, a new project for running Hadoop on Mesos

Hadoop vendor MapR and data center automation startup Mesosphere have created an open source technology called Myriad, which is supposed to make it easier to run Hadoop workloads on top of the popular Mesos cluster-management software. More specifically, Myriad allows the YARN resource scheduler — the linchpin of Hadoop 2.0 that lets the platform run processing frameworks other than MapReduce — to run on top of Mesos, effectively creating an auto-scaling Hadoop cluster that’s relatively future-proof.

“Before, you had to make a choice, and now you can just run YARN on Mesos,” explained Mesosphere founder and CEO Florian Leibert. “… I think the goal here is to have more workloads in a shared environment.”

What he means is that companies will no longer have to run Hadoop on one set of resources, while running the web servers, Spark and any other number of workloads on other resources managed by Mesos. Essentially, all of these things will now be available as data center services residing on the same set of machines. Mesos has always supported Hadoop as a workload type — and companies including Twitter and Airbnb have taken advantage of this — but YARN has appeal as the default resource manager for newer distributions of Hadoop because it’s designed specifically for that platform and, well, is one of the foundations of those newer distributions.

The old static partition.

With Myriad, YARN can still manage the resource allocation to Hadoop jobs, while Mesos handles other tasks as well as the task of scaling out the YARN cluster itself. So instead of the current state of affairs, where YARN clusters are statically defined and new nodes must be manually configured, Mesos can spin up new YARN nodes automatically based on the policies in place and the available resources of the cluster.
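To make the scale-out idea concrete, here is a toy sketch of the kind of decision described above: given free resources on each machine and a simple policy, choose where to start new YARN nodes. This is not Myriad's code or API. The `Offer` and `NodeProfile` types, the `plan_scale_out` function and the policy parameters are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    """Free resources on one machine (hypothetical shape)."""
    host: str
    cpus: float
    mem_mb: int


@dataclass
class NodeProfile:
    """Resources one new YARN worker node would need (hypothetical)."""
    cpus: float
    mem_mb: int


def plan_scale_out(offers, profile, pending_containers,
                   containers_per_node, max_new_nodes):
    """Pick hosts for new YARN nodes, limited by demand, policy and free resources."""
    if pending_containers <= 0:
        return []
    # Ceiling division: nodes needed to absorb the pending work.
    wanted = -(-pending_containers // containers_per_node)
    # Policy cap on how many nodes may be added in one round.
    wanted = min(wanted, max_new_nodes)
    chosen = []
    for offer in offers:
        if len(chosen) == wanted:
            break
        # Only hosts with enough free CPU and memory qualify.
        if offer.cpus >= profile.cpus and offer.mem_mb >= profile.mem_mb:
            chosen.append(offer.host)
    return chosen


if __name__ == "__main__":
    offers = [Offer("a", 8, 32768), Offer("b", 2, 4096), Offer("c", 16, 65536)]
    profile = NodeProfile(cpus=4, mem_mb=8192)
    print(plan_scale_out(offers, profile, pending_containers=10,
                         containers_per_node=4, max_new_nodes=5))
```

In this example the policy asks for three nodes, but only hosts "a" and "c" have enough free resources, so the plan is limited by what the cluster can actually offer.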

Mesosphere engineer Adam Bordelon said Myriad works now and that eBay and Twitter have been testing it out. eBay actually contributed quite a lot to the first version of the code. However, he noted, Myriad is still early in its development and needs quite a few more features, including around security.

“I imagine within a month or two,” he said, “it should be in production somewhere.”

Although two commercial companies are driving Myriad at this point, Bordelon said the goal is definitely to build a community around the project. It’s currently hosted in the Mesosphere GitHub repository, but the team is working on a proposal to make it an Apache Incubator project.

“It is definitely a community effort,” he said.

The new YARN-on-Mesos architecture.

Jim Scott, MapR’s director of enterprise strategy and architecture, said that Hadoop was pitched in part as a tool for eliminating data silos. However, he added, “As we start [to] see those data silo walls come down, we’re starting to see other walls come up.” One of those walls is the relegation of Hadoop to its own dedicated cluster far away, logically at least, from everything else.

“This is the enabling function, in my mind,” he said, “that makes it so people can tear that wall down.”

MapR CEO John Schroeder will be among many speakers talking about the evolution of Hadoop and big data architectures at our Structure Data conference in New York next month. Others include Cloudera CEO Tom Reilly, Hortonworks CEO Rob Bearden, Google VP of Infrastructure Eric Brewer, Databricks CEO Ion Stoica and Amazon Web Services GM of Data Science Matt Wood.

And for more on Mesos, Mesosphere and why they have some engineers so excited, check out our May 2014 Structure Show podcast interview with Mesosphere CEO Leibert.

Datadog buys Mortar Data, will close its Hadoop cloud service

Datadog, a startup that monitors the performance of users’ cloud computing servers, has acquired big data startup Mortar Data and intends to shutter the company’s existing cloud service.

Mortar launched in 2012 and was initially focused on providing a simple way to run Hadoop jobs in the Amazon Web Services cloud using languages such as Python and Pig. The company eventually launched an open source project to house a lot of its work, including some around building reusable frameworks for common big data applications such as recommendation engines. Recently, Mortar added support for numerous new open source technologies, including Spotify’s Luigi pipeline tool and Apache Spark.

The company had raised $3.2 million in equity and debt financing, including a $1.8 million seed round, according to Crunchbase.

Datadog will be closing down Mortar’s service, and the Mortar team and platform will work inside Datadog analyzing operational data that can be turned into additional analytics for users. Datadog was already a Mortar user. In fact, Chief Product Officer Amit Agarwal said in an interview about the acquisition, “Mortar was our Hadoop, for the most part . . . We already had a common law relationship.”

Datadog has raised more than $50 million in venture capital since launching in 2010, including a recently announced $31 million round.

Mortar Co-founder and CEO K Young said his company is “taking a lot of care” right now to ensure users have a smooth transition away from the hosted Mortar platform. Part of that process involves releasing a lot more components into the open source repository, he said.

Mortar Data had a solid idea and was one of the first to market with a cloud-based Hadoop service, but apparently the company’s approach didn’t resonate with consumers. Or perhaps its business model didn’t scale. I used to compare Mortar to Infochimps and Continuuity, the former of which was acquired by CSC in 2013 while trying to raise money and the latter of which changed its name to Cask and open sourced its technology.

Even if you’re focused on data scientists or developers, it’s difficult to compete in the world of big data infrastructure without very deep pockets. Similar startups that have launched since Mortar, including Altiscale and Qubole, raised significantly more capital and appear to be doing decent business. Databricks, the unofficial corporate arm of the Apache Spark project, has also raised boatloads of capital and is banking on a cloud computing service, as well.

That’s not to mention the difficulty of selling users on your service when the cloud providers themselves, particularly Amazon, Microsoft and Google, continue rolling out bigger, cheaper and faster data services.

We’ll be talking more about the still-emerging big data market at our Structure Data conference next month in New York. Executives from the world’s leading big data software vendors and cloud providers, as well as many cutting-edge users, will be discussing where the business is headed and what technologies are the next big things.

Hitachi Data Systems to buy Pentaho for $500-$600M

Storage vendor Hitachi Data Systems is set to buy analytics company Pentaho at a price rumored to be between $500 million and $600 million (closer to $500 million, from what I’ve heard). It’s an interesting deal because of its size and because Hitachi wants to move beyond enterprise storage and into analytics for the internet of things. Pentaho sells business intelligence software, and can transform even big data from stream-processing engines for real-time analysis. According to a Hitachi press release, “The result will be unique, comprehensive solutions to address specific challenges through a shared analytics platform.”

Exclusive: Pivotal CEO says open source Hadoop tech is coming

Pivotal, the cloud computing spinoff from EMC and VMware that launched in 2013, is preparing to blow up its big data business by open sourcing a whole lot of it.

Rumors of changes began circulating in November, after CRN reported that Pivotal was in the process of laying off about 60 people, many of whom worked on the big data products. The flames were stoked again on Friday by a report in VentureBeat claiming the company might cease development of its Hadoop distribution and/or open source various pieces of its database technology such as Greenplum and HAWQ.

Gigaom has confirmed at least part of this is true, via an emailed statement credited to Pivotal CEO Paul Maritz:

“We are anticipating an interesting set of announcements on Feb 17th. However rumors to the effect that Pivotal is getting out of the Hadoop space are categorically false. The announcements, which involve multiple parties, will greatly add to the momentum in the Hadoop and Open Source space, and will have several dimensions that we believe customers will find very compelling.”

Those announcements will take place via webcast.

Paul Maritz at Structure Data 2014. (© Photo by Jakub Mosur).

Multiple external sources have told Gigaom that Pivotal does indeed plan to open source its Hadoop technology, and that it will work with former rival (but, more recently, partner) Hortonworks to maintain and develop it. IBM was also mentioned as a partner.

Members of the Hadoop team were let go around November when active development stopped, the sources said, and some senior big data personnel — including Senior Vice President of R&D Hugh Williams and Chief Scientist Milind Bhandarkar — departed the company in December, according to their LinkedIn profiles. Both of them claim to be working on new startup projects.

When EMC first introduced its Hadoop distribution, called Pivotal HD, in February 2013 (it was one of the technologies that Pivotal the company inherited), executive Scott Yara touted the size of EMC’s Hadoop engineering team and the quality of its technology over that of its smaller rivals Cloudera, MapR and Hortonworks. However, Pivotal has been getting noticeably more in touch with its open source side recently, including with the Hortonworks partnership referenced above (around the Apache Ambari project) and a big commitment to the open source Tachyon in-memory file system project.

The current Pivotal HD Enterprise architecture.

Pivotal has been a big proponent of the “data lake” strategy whereby companies store all their data in a big Hadoop cluster and use various higher-level programs to access and analyze it. Last April, the company took a somewhat brave step toward ensuring its customers could do that by relaxing its product licensing and making Pivotal HD storage free.

Whatever happens with Pivotal’s technology, it’s not shocking that the company would decide to take the open source path. Its flagship technology is the open source Cloud Foundry cloud computing platform, and Cloudera, Hortonworks and MapR have cornered the market on Hadoop sales, by all accounts. If Pivotal has some good code in its base, it’s probably best to get it into the open source world and ride the momentum rather than try to fight against it.

For more on the fast-moving Hadoop space, be sure to attend our Structure Data conference March 18-19 in New York. We’ll have the CEOs of Cloudera, Hortonworks and MapR on stage to talk about the business, as well as Databricks CEO Ion Stoica discussing the Apache Spark project that is presently kicking parts of Hadoop into hyperdrive.

Why opening up its Cosmos big data system would be the right move for Microsoft

There has been a rumor floating around since August (first, and subsequently, reported by Mary Jo Foley at ZDNet) that Microsoft is preparing to release its Cosmos big data system as a service on its Azure cloud platform, likely as an alternative to Hadoop. That would not only be a bold move for Microsoft, but also, probably, a smart one.

We’ll get to that shortly, but first a brief history of Cosmos. It’s Microsoft’s internal big data system, used to store and process data for applications such as Bing, Internet Explorer and email. Cosmos’ batch-computing element is called Dryad, and it’s similar to (although reportedly much more flexible than) the MapReduce framework associated with Hadoop as well as early-2000s Google, where MapReduce was invented. Cosmos also features a SQL-like query engine called SCOPE.

As of early 2011, Microsoft claimed it was storing about 62 petabytes of data in Cosmos. Around that timeframe, Microsoft was also previewing commercial versions of Cosmos/Dryad and pitching it as a better alternative to Hadoop. On the whole, reviews were positive. You can read more about Cosmos here and here.

A graphic showing Cosmos' place in the application architecture, circa 2011.

Around October 2011, Microsoft began investing heavily in making Hadoop run on Windows, an early (and wise) indication of the company’s move toward embracing both open source technologies and technologies that companies and developers have shown an interest in using. The Dryad work was moved to the Windows high-performance computing product line, where it eventually died. In February 2012, Microsoft went whole hog on Hadoop, announcing a partnership with Hortonworks and forthcoming products for both Windows Servers and Azure.

The company has been pretty quiet about Cosmos, Dryad and the whole shebang in the meantime, but in August, ZDNet‘s Foley reported on a Microsoft job posting that suggests the company is developing a version of Cosmos for external consumption. In late January, she expanded on the original report with information that Microsoft is recruiting testers for the Cosmos service, as well as citing improvements to the system’s SQL engine and new storage and computing components.

Microsoft declined to comment about any plans to release any products based on Cosmos.

Microsoft CEO Satya Nadella speaks at a Microsoft cloud event. Photo by Jonathan Vanian/Gigaom

Microsoft CEO Satya Nadella talks about open source at a Microsoft cloud event.

If Microsoft is indeed preparing a Cosmos service on Azure, it’s easy to see why. Data processing and analysis is going to be a major driver of IT spending in the coming decades, and smart companies are going to cover all their bases when it comes to serving customers’ needs. Hadoop is just the platform that every cloud provider, database vendor and analytics vendor needs to support because the community is so large and so many workloads already run on it.

But that doesn’t mean Hadoop is necessarily the best technology for every job, especially for cloud providers that want to control every aspect of a new service from the core code up to the user interface. Google’s Compute Engine platform supports Hadoop, but the company all but said “Hadoop is passé” when it rolled out its post-Hadoop Cloud Dataflow service in June. Databricks, a startup based around the Apache Spark technology, works very closely to integrate Spark with the Hadoop ecosystem but is banking its business on a cloud service that’s all about Spark.

If the Apache Storm stream-processing project were as popular as Hadoop is, perhaps Amazon Web Services would have built something around it rather than starting with its own stream-processing technologies, Kinesis and Lambda. Microsoft, in fact, is also now touting its own stream-processing engine called Trill that already underpins the company’s Azure Stream Analytics service, as well as streaming workloads for the backend systems powering Bing and Halo.

Comparing Trill to other streaming engines.

We will discuss the business of big data in detail at our Structure Data conference, which takes place March 18 and 19 in New York. Speakers include the CEOs of Hadoop vendors Cloudera, Hortonworks and MapR, as well as executives from Google, Microsoft and Amazon Web Services. And, of course, we will have some of the world’s most-advanced users talking about the tools they use and what they would like to see from the companies selling software.

And new data services, especially among cloud providers, are also about showing off a company’s technological chops, much like their boasts about the number of data centers they own. Engineers like to work on the biggest and best systems around, and developers like to build applications on them. Much like Google has open sourced bits and pieces of its infrastructure in the form of Kubernetes and some Cloud Dataflow libraries, it won’t surprise me if Microsoft decides to open source parts of Cosmos and Trill at some point, perhaps to help drive more interest around its recently open sourced .NET development framework.

There is too much money up for grabs in the cloud computing and big data markets to leave good technology locked up inside a company’s internal towers. As Microsoft, Google and Amazon seek to grab as many cloud workloads as they can and to hire as many talented engineers as they can — in a competitive market that also includes very open-source-friendly companies such as Facebook and Netflix — expect to see a lot more openness about the stuff they’re building, as well as a lot more services based on it.

Cloudera acquires self-service data-modeling startup

Hadoop vendor Cloudera is moving closer to the business intelligence space by acquiring a startup called The company’s software analyzes users’ offline queries in order to determine which ones are most important to the business and how they might be improved upon.

Here’s how the website, well, explains the way its software works:

  • Step 1. only needs your queries to profile and understand the business logic. Intuitive dashboards act as mission control for the workload, showing what queries are critical to your business, what these queries do and which data is accessed most often.
  • Step 2. Generate schemas from your actual data usage to map query access patterns. monitors these patterns on a rolling basis, identifying optimizations in the data model and making modernization recommendations.
  • Step 3. Turn recommendations into new data models without having to hand code. batch transforms existing objects at the click of a button, then monitors their ongoing performance.
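The profiling idea in the first two steps can be illustrated with a small sketch. This is not the startup's software: it is a deliberately naive example, with hypothetical function names, that groups similar SQL queries and counts which tables are accessed most often. A real profiler would use a proper SQL parser rather than regular expressions.

```python
import re
from collections import Counter

# Naive pattern: the identifier following FROM or JOIN. Real SQL needs a parser.
TABLE_PATTERN = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def normalize(query: str) -> str:
    """Collapse literals and whitespace so similar queries share one fingerprint."""
    q = re.sub(r"'[^']*'", "?", query)          # string literals
    q = re.sub(r"\b\d+(?:\.\d+)?\b", "?", q)    # numeric literals
    return re.sub(r"\s+", " ", q).strip().lower()


def profile(queries):
    """Return (query fingerprint counts, table access counts)."""
    fingerprints = Counter(normalize(q) for q in queries)
    tables = Counter(
        t.lower() for q in queries for t in TABLE_PATTERN.findall(q)
    )
    return fingerprints, tables


if __name__ == "__main__":
    workload = [
        "SELECT * FROM orders WHERE id = 42",
        "select *  from orders where id = 7",
        "SELECT o.id FROM orders o JOIN customers c ON o.cid = c.id WHERE c.name = 'Ann'",
    ]
    fingerprints, tables = profile(workload)
    print(fingerprints.most_common(1))
    print(tables.most_common())
```

The first two queries differ only in a literal and in spacing, so they collapse into one fingerprint, which is the sort of grouping that lets a tool rank access patterns across millions of queries.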

A screenshot of

For Cloudera, the acquisition represents a chance to help its customers make better use of data by optimizing the queries for better performance and perhaps helping customers find the right data store for the job. That could be in Cloudera Impala or a NoSQL database or even a standard relational database. Anupam Singh,’s co-founder and CEO described part of the company’s rationale and process in a blog post announcing the acquisition:’s first customer executes nearly 8.4 million (yes, million!) SQL queries annually against various data stores. This begs the question: How many of these queries have access patterns that could benefit from a new data model? The customer did not have a clear answer, and we saw an opportunity. Today,’s profiler is used to identify the most common data access patterns and’s transformation engine is used to generate the schema design for modern data stores such as Impala.

If Cloudera is serious about becoming an enterprise data platform company that lives up to the “enterprise data hub” software it’s selling, these are the types of acquisitions it needs to make and nurture. The name of the game isn’t just getting more data into Hadoop and focusing all development around Hadoop, but working to improve the whole ecosystem of technologies around and above Hadoop, as well.

In the broader Hadoop market, the acquisition is just another move in a game of strategy that has been going on for several years and likely will go on for several more. Last week, for example, Cloudera rival Hortonworks, which had a successful initial public offering in December, announced a Hadoop data governance initiative along with customers Target, Merck and Aetna, and invited Cloudera to join (hint: it probably won’t). Last year around this time, Cloudera secured a massive investment from Intel and several others that put more than half a billion dollars in the company’s war chest.

Cloudera CEO Tom Reilly (left) at Structure Data 2014. (c) Jakub Moser

We’ll hear a lot more about all of this at our Structure Data conference in March, which features talks from Cloudera CEO Tom Reilly, Hortonworks CEO Rob Bearden and MapR CEO John Schroeder. The acquired startup is Cloudera’s fourth acquisition, with the most recent before it being an “acquihire” (to use a Silicon Valley term of art) of data science specialists Datapad in September. The startup was founded in October 2013 and had raised money from Mayfield Fund.

Hortonworks and 3 large users launch a project to secure Hadoop

Hadoop vendor Hortonworks, along with customers Target, Merck and Aetna, and software vendor SAS, has started a new group designed to ensure that data stored inside Hadoop systems is only used how it’s supposed to be used and seen by whom it’s supposed to be seen. The effort, called the Data Governance Initiative, will function as an open source project and will address the concerns of enterprises that want to store more data in Hadoop but fear the system won’t match industry regulations or stand up to audits.

The group is similar in spirit to the Open Compute Foundation, which launched in 2011. Facebook spearheaded the Open Compute Project effort and drove a lot of early innovation, but the project has since seen lots of contributions and involvement from technology companies such as Microsoft and end-user companies such as Goldman Sachs. Tim Hall, Hortonworks’ vice president of project management, said Target, Merck and Aetna will be active contributors to the new Hadoop organization, sharing their business and technical expertise in the markets in which they operate, as well as developing and deploying code.

Part of the rationale for creating the Data Governance Initiative was questions about the sustainability of the Hortonworks open source business model, some of which were brought to light by the revenue numbers it published as part of its initial public offering process, Hall acknowledged. The idea is that this group will demonstrate Hortonworks’ commitment to enterprise concerns and work with large companies to solve them. It will also show how Hortonworks can drive Hadoop innovation without abandoning its open source model.

“We want to make sure folks understand it’s not just these software companies we can work with,” Hall said, referencing the initial phases of Hadoop development led by companies such as Yahoo and Facebook.

A high-level view of Apache Falcon.

Hortonworks plans to publish more information about the Data Governance Initiative’s technical roadmap and early work in February, but the Apache Falcon and Apache Ranger projects that Hortonworks backs will be key components, and there will be an emphasis on maintaining policies as data moves between Hadoop and other data systems. Code will be contributed back to the Apache Software Foundation.

Hall said any company is welcome to join, including Hadoop rivals such as MapR and Cloudera (which has its own pet projects around Hadoop security), but, he noted, “It’s up to the other vendors to recognize the value that’s being created here.”

“There’s no reason why Cloudera couldn’t wire up their [Apache] Sentry project to this,” Hall added. “. . . We’d be happy to have them participate in this once it goes into incubator status.”

Of course, Hadoop competition being what it is, he might well suspect that won’t happen anytime soon. Cloudera actually published a well-timed blog post on Wednesday morning touting the security features of its Hadoop distribution.

You can hear all about the Hadoop space at our Structure Data conference in March, where Hortonworks CEO Rob Bearden, Cloudera CEO Tom Reilly and MapR CEO John Schroeder will each share their visions of where the technology is headed.

Update: This post was updated at 12:20 to correct the name of the organization. It is the Data Governance Initiative, not the Data Governance Institute.