Pivotal open sources its Hadoop and Greenplum tech, and then some

Pivotal, the cloud computing and big data company that spun out from EMC and VMware in 2013, is open sourcing its entire portfolio of big data technologies and is teaming up with Hortonworks, IBM, GE, and several other companies on a Hadoop effort called the Open Data Platform.

Rumors about the fate of the company’s data business have been circulating since a round of layoffs began in November, but, according to Pivotal, the situation isn’t as dire as some initial reports suggested.

There is a lot of information coming out of the company about this, but here are the key parts:

  • Pivotal is still selling licenses and support for its Greenplum, HAWQ and GemFire database products, but it is also releasing the core code bases for those technologies as open source.
  • Pivotal is still offering its own Hadoop distribution, Pivotal HD, but has slowed its own development of the core components: MapReduce, YARN, Ambari and the Hadoop Distributed File System. Those four pieces are the starting point for a new association called the Open Data Platform, which includes Pivotal, GE, Hortonworks, IBM, Infosys, SAS, Altiscale, EMC, Verizon Enterprise Solutions, VMware, Teradata and “a large international telecommunications firm,” and which promises to build its members’ Hadoop technologies on a standard core of code.
  • Pivotal is working with Hortonworks to make Pivotal’s big data technologies run on the Hortonworks Data Platform, and eventually on the Open Data Platform core. Pivotal will continue offering enterprise support for Pivotal HD, although it will hand off to Hortonworks any support requests involving the guts of Hadoop (e.g., MapReduce and HDFS).

Sunny Madra, vice president of the data and mobile product group at Pivotal, said the company has a relatively successful big data business already — $100 million overall, $40 million of which came from the Big Data Suite license bundle it announced last year — but suggested that it sees the writing on the wall. Open source software is a huge industry trend, and he thinks pushing against it is as fruitless as pushing against cloud computing several years ago.

“We’re starting to see open source pop up as an RFP within enterprises,” he said. “. . . If you’re picking software [today] . . . you’d look to open source.”

[Figure: The Pivotal Big Data Suite.]

Madra pointed to Pivotal’s revenue numbers as proof the company didn’t open source its software because no one wanted to pay for it. “We wouldn’t have a $100 million business . . . if we couldn’t sell this,” he said. Maybe, but maybe not: Hortonworks isn’t doing $100 million a year, but word was that Cloudera hit that mark years ago (on Tuesday, Cloudera did claim more than $100 million in revenue in 2014). Depending on how one defines “big data,” companies like Microsoft and Oracle are probably making much more money.

However, there were some layoffs late last year, which Madra attributed to a consolidation of people, offices and efforts rather than a failing business. Pivotal wanted to close some global offices, bring the data and Cloud Foundry teams under the same leadership, and focus its development resources on its own intellectual property around Hadoop. “Do we really need a team going and testing our own distribution,” he asked, “troubleshooting it, certifying it against technologies and all that goes along with that?”

EMC first launched the Pivotal HD Hadoop distribution, as well as the HAWQ SQL-on-Hadoop engine, with much fanfare just over two years ago.

The deal with Hortonworks helps alleviate that engineering burden in the short term, and the Open Data Platform is supposed to help solve it over the longer term. Madra described the organization’s goal as making Hadoop Linux-like: customers should be able to switch from one Hadoop distribution to the next and know the kernel will be the same, just as they can with the various flavors of the Linux operating system.

Mike Olson, Cloudera’s chief strategy officer and founding CEO, offered a harsh rebuttal to the Open Data Platform in a blog post on Tuesday, questioning the utility and politics of vendor-led consortia like this. He simultaneously praised Hortonworks for its commitment to open source Hadoop and bashed Pivotal on the same issue, but wrote, among other things, of the Open Data Platform: “The Pivotal and Hortonworks alliance, notwithstanding the marketing, is antithetical to the open source model and the Apache way.”

[Figure: The Pivotal HD and HAWQ architecture. Much of this has been open sourced or replaced.]

As part of Pivotal’s Tuesday news, the company also announced additions to its Big Data Suite package, including the Redis key-value store, RabbitMQ messaging queue and Spring XD data pipeline framework, as well as the ability to run the various components on the company’s Cloud Foundry platform. Madra attributes much of Pivotal’s decision to open source its data technologies, as well as its execution of that plan, to the relative success the company has had with Cloud Foundry, which has always paired an open source foundation with a commercial offering.

“Had we not had the learnings that we had in Cloud Foundry, then I think it would have been a lot more challenging,” he said.

Whether or not one believes Pivotal’s spin on the situation, though, the company is right in realizing that it’s open source or bust in the big data space right now. The major Hadoop vendors, Cloudera, Hortonworks and MapR, have different philosophies and strategies around open source, but all are largely focused on open-source technology. The most popular Hadoop-ecosystem technologies, including Spark, Storm and Kafka, are open source as well. (CEOs, founders and creators from many of these companies and projects will be speaking at our Structure Data conference next month in New York.)

Pivotal might eventually sell billions of dollars’ worth of software licenses for its suite of big data products (there’s certainly a good story there if it can align the big data and Cloud Foundry businesses into a cohesive platform), but it probably would have hit a plateau without an open source story to tell.

Update: This post was updated at 12:22 p.m. PT to add information about Cloudera’s revenue.

Hadoop’s new role: Adjunct data warehouse

There was a time, a little over two years ago, when SQL-on-Hadoop was about cracking open access to Hadoop data for those with SQL skillsets, and eliminating the exclusive access that Hadoop/MapReduce specialists had to the data. Yes, some architectural details – like whether the SQL engine was hitting the data nodes in the Hadoop cluster directly – were important too. But, for the most part, solutions in the space were neatly summed up by the name: SQL, on Hadoop.
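To make that concrete, here is a minimal sketch (mine, not the report’s) of what that shift looks like in practice: an analyst with nothing but SQL skills querying data stored in Hadoop through HiveServer2, using the pyhive package. The host, table and column names are hypothetical.

```python
# A minimal sketch, assuming a reachable HiveServer2 endpoint and the
# `pyhive` package; the host, table and column names are hypothetical.
from pyhive import hive

conn = hive.connect(host="hadoop-edge.example.com", port=10000)
cursor = conn.cursor()

# Plain SQL over files in HDFS; the engine handles the distributed
# execution, so no MapReduce expertise is required.
cursor.execute(
    "SELECT page, COUNT(*) AS hits "
    "FROM web_logs "
    "WHERE log_date = '2015-02-17' "
    "GROUP BY page "
    "ORDER BY hits DESC LIMIT 10"
)
for page, hits in cursor.fetchall():
    print(page, hits)
```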

Today, SQL-on-Hadoop solutions are best judged not by their SQL engines per se, but by the collaborative scenarios they enable between Hadoop and the conventional data warehouse. Hadoop can be seen as a usurper, peer or peripheral of the data warehouse; the SQL-on-Hadoop engine you use determines which of these three roles (one, or more than one) Hadoop can fill.

In Gigaom Research’s just-published Sector Roadmap: Hadoop/Data Warehouse Interoperability, analyst George Gilbert investigates the SQL-on-Hadoop market, evaluating six solutions, each along six “disruption vectors” or key trends that will affect the market and players over the next year: schema flexibility, data engine interoperability, pricing model, enterprise manageability, workload role optimization and query engine maturity.

Scenarios

As a backdrop to the evaluation of various SQL-on-Hadoop products along these vectors, Gilbert identifies three key analytics usage scenarios. The first is the core data warehouse, a familiar concept for many tech professionals: a relatively expensive appliance-based database platform serving up highly curated data, with the data’s structure optimized for the kinds of queries the business believes it needs to run.

The second is the so-called “data lake” (called an “enterprise data hub” by some vendors). Here, Hadoop serves as a collecting point for disparate data sources along the full spectrum of unstructured, semi-structured and fully structured data. Hadoop 2.0’s YARN resource manager facilitates the use of a variety of analysis engines to explore the lake’s data in an ad hoc fashion, and the data warehouse is relieved of this responsibility, free to serve the production queries for which it was designed and tuned.
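As an illustration of that ad hoc exploration, here is a sketch (again mine, not the report’s) using PySpark, one of the engines YARN can schedule alongside others on the same cluster. It reads raw JSON straight off HDFS, inferring a schema on read rather than requiring upfront curation; the paths and field names are hypothetical.

```python
# A minimal data-lake exploration sketch; paths and field names are
# hypothetical. Submitted to a Hadoop cluster, this job would run under
# YARN alongside other analysis engines.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-exploration").getOrCreate()

# Read raw, semi-structured events directly from HDFS; Spark infers a
# schema on read, so no upfront modeling is needed.
events = spark.read.json("hdfs:///lake/raw/clickstream/2015/02/")
events.printSchema()

# Poke at the data interactively, without touching the warehouse.
events.groupBy("device_type").count().show()
```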

The third scenario Gilbert identifies is one he calls the “adjunct data warehouse,” wherein various data warehouse tasks – including ETL and reporting – are offloaded from the conventional data warehouse to Hadoop. In fact, the adjunct data warehouse can and should be used to perform these functions on data first explored in the data lake.
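A sketch of that offload pattern (mine, not the report’s): the heavy extract-and-transform work runs on cheap Hadoop capacity via PySpark, and only the curated aggregate is loaded into the core warehouse. The paths, table and column names, credentials and JDBC URL are all hypothetical, and the warehouse’s JDBC driver would need to be on the Spark classpath.

```python
# A minimal "adjunct data warehouse" ETL sketch; all names, credentials
# and the JDBC URL are hypothetical, and the warehouse's JDBC driver
# must be on the Spark classpath.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("adjunct-etl").getOrCreate()

# Extract: raw events first explored in the data lake.
raw = spark.read.json("hdfs:///lake/raw/orders/2015/02/")

# Transform: cleanse and aggregate on Hadoop instead of burning
# expensive warehouse-appliance cycles.
daily = (raw.filter(F.col("status") == "complete")
            .groupBy("order_date", "region")
            .agg(F.sum("amount").alias("revenue"),
                 F.count(F.lit(1)).alias("orders")))

# Load: push only the curated result into the core warehouse over JDBC.
(daily.write.format("jdbc")
      .option("url", "jdbc:postgresql://dw.example.com/analytics")
      .option("dbtable", "daily_revenue")
      .option("user", "etl")
      .option("password", "secret")  # hypothetical credentials
      .mode("append")
      .save())
```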

In effect, the core data warehouse, adjunct data warehouse and data lake constitute a data processing hierarchy, with a corresponding hierarchy of cost. The hierarchical selection of platforms enables tasks of lower production value (though, arguably, higher business value) to be processed on cheaper platforms – yielding much higher efficiency for enterprise organizations.

Enterprise ROI

How much cheaper? Gilbert notes that Hadoop costs at least an order of magnitude less, per terabyte of data, than appliance-based data warehouses. Because Hadoop is what makes the data lake and adjunct data warehouse scenarios possible, implementing them gives Hadoop a significant and demonstrable return on investment for enterprise customers.

An open question is whether and when Hadoop can and will serve in a core data warehouse capacity as well. And if it does, will that help the data warehouse vendors, the Hadoop distribution vendors or both? Indeed, this dynamic may be a predictor of future acquisitions of the distribution vendors by the legacy players — or perhaps even the reverse.

Big data pipeline service Treasure Data raises $15M

Treasure Data, a Mountain View, California, startup offering a cloud-based big data platform, has raised a $15 million series B round of venture capital. Scale Venture Partners led the round, which also included AME Cloud Ventures and the company’s existing investors. The company has now raised about $23 million since it was founded in late 2011.

Treasure Data is similar to other cloud-based startups such as Altiscale and Qubole, but it talks less about Hadoop and more about big data generally. Its service is built on Hadoop, along with other foundational pieces, on the Amazon Web Services cloud, but it tries to abstract away those technological underpinnings beneath SQL and the user interface.

The company claims a sweet spot in connected devices and other sources of streaming data. One customer, Pioneer, uses Treasure Data to collect and process telematics data streaming from cars equipped with a certain type of on-board computer.

Treasure Data is also quite popular in Japan, where its founders Hiro Yoshikawa and Kaz Ota are from, and where the company maintains an office. It has a white-label deal in place with Yahoo Japan, via which Yahoo essentially resells a version of Treasure Data as a cloud service for analyzing marketing data.

Apache Drill drills up to top-level project

Apache Drill has graduated to Top-Level Project status. It’s now deemed ready for primetime, but will it succeed there? And does the big data world need another SQL-on-Hadoop engine?

SQL-on-Hadoop startup Splice Machine adds $3M in funding

Splice Machine, a San Francisco-based startup promising to turn HBase into a relational database that can even handle transactional workloads, has added $3 million to its series B round of venture capital. Correlation Ventures led the latest cash infusion, which comes in addition to the $15 million that Interwest Partners and Mohr Davidow Ventures invested in Splice Machine in February. The SQL-on-Hadoop space hasn’t been too good to startups (see, e.g., the fates of Hadapt, Drawn to Scale and even Karmasphere), but perhaps Splice Machine, which has the advantage of operating in a more mature Hadoop market, will be an exception.