The Hadoop wars, HP cloud(s) and IBM’s big win

[soundcloud url="https://api.soundcloud.com/tracks/194323297" params="color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]

If you are confused about Hewlett-Packard’s cloud plan of action, this week’s guest will lay it out for you. Bill Hilf, the SVP of Helion product management, makes his second appearance on the show (does that make him our Alec Baldwin?) to talk about HP’s many-clouds-one-management layer strategy.

The lineup? Helion Eucalyptus for private clouds that need Amazon Web Services API compatibility; Helion OpenStack for the rest; and [company]HP[/company] Cloud Development Platform (aka Cloud Foundry) for platform as a service. Oh, and there’s HP Public Cloud, which I will let him tell you about himself.

But first, Derrick Harris and I are all over IBM’s purchase of AlchemyAPI, the cool deep learning startup that does stuff like identifying celebs and wanna-be celebs from their photos. It’s a win for [company]IBM[/company] because all that coolness will be sucked into Watson and expand the API set Watson can parlay for more useful work. (I mean, winning Jeopardy is not really a business model, as IBM Watson exec Mike Rhodin himself has pointed out.)

At first glance, a system that can tell the difference between Will Ferrell and Chad Smith might seem similarly narrow, but on reflection you can see how that fine-grained, self-teaching technology could find broader uses.

AlchemyAPI CEO Elliot Turner and IBM Watson sales chief Stephen Gold shared the stage at Structure Data last year. Who knows what deals might be spawned at this year’s event?

Also, we’re happy to follow the escalating smack talk in the Hadoop arena, as Cloudera CEO Tom Reilly this week declared victory over the new [company]Hortonworks[/company]-IBM-Pivotal-backed Open Data Platform effort, which we’re now fondly referring to as the ABC, or “Anyone But Cloudera,” alliance.

It’s a lively show so have a listen and (hopefully) enjoy.

SHOW NOTES

Hosts: Barb Darrow, Derrick Harris and Jonathan Vanian

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

PREVIOUS EPISODES:

Mark Cuban on net neutrality: the FCC can’t protect competition. 

Microsoft’s machine learning guru on why data matters sooooo much 

No, you don’t need a ton of data to do deep learning 

VMware wants all those cloud workloads “marooned” in AWS

Don’t like your cloud vendor? Wait a second.

Cloudera CEO declares victory over big data competition

Cloudera CEO Tom Reilly doesn’t often mince words when it comes to describing his competition in the Hadoop space, or Cloudera’s position among those other companies. In October 2013, Reilly told me he didn’t consider Hortonworks or MapR to be Cloudera’s real competition, but rather larger data-management companies such as IBM and EMC-VMware spinoff Pivotal. And now, Reilly says, “We declare victory over at least one of our competitors.”

He was referring to Pivotal, and the Open Data Platform, or ODP, alliance it helped launch a couple of weeks ago along with [company]Hortonworks[/company], [company]IBM[/company], [company]Teradata[/company] and several other big data vendors. In an interview last week, Reilly called that alliance “a ruse and, frankly, a graceful exit for Pivotal,” which laid off a number of employees working on its Hadoop distribution and is now outsourcing most of its core Hadoop development and support to Hortonworks.

You can read more from Reilly below, including his takes on Hortonworks, Hadoop revenues and Spark, as well as some expanded thoughts on the ODP. For more information about the Open Data Platform from the perspectives of the members, you can read our coverage of its launch in mid-February as well as my subsequent interview with Hortonworks CEO Rob Bearden, who explains in some detail how that alliance will work.

If you want to hear about the fast-changing, highly competitive and multi-billion-dollar business of big data straight from the horses’ mouths, make sure to attend our Structure Data conference, March 18 and 19 in New York. Speakers include Cloudera’s Reilly and Hortonworks’ Bearden, as well as MapR CEO John Schroeder, Databricks CEO (and Spark co-creator) Ion Stoica, and other big data executives and users, including those from large firms such as [company]Lockheed Martin[/company] and [company]Goldman Sachs[/company].

You down with ODP? No, not me

While Hortonworks explains the Open Data Platform essentially as a way for member companies to build on top of Hadoop without, I guess, formally paying Hortonworks for support or embracing its entire Hadoop distribution, Reilly describes it as little more than a marketing ploy. Aside from calling it a graceful exit for Pivotal (and, arguably, IBM), he takes issue with even calling it “open.” If the ODP were truly open, he said, companies wouldn’t have to pay for membership, Cloudera would have been invited and, when it asked about the alliance, it wouldn’t have been required to sign a non-disclosure agreement.

What’s more, Reilly isn’t certain why the ODP is really necessary technologically. It’s presently composed of four of the most mature Hadoop components, he explained, and a lot of companies are actually trying to move off of MapReduce (to Spark or other processing engines) and, in some cases, even the Hadoop Distributed File System. Hortonworks, which supplied the ODP core and presumably will handle much of the future engineering work, will be stuck doing the other members’ bidding as they decide which of several viable SQL engines and other components to include, he added.

“I don’t think we could have scripted [the Open Data Platform news] any better,” Reilly said. He added, “[T]he formation of the ODP … is a big shift in the landscape. We think it’s a shift to our advantage.”

(If you want a possibly more nuanced take on the ODP, check out this blog post by Altiscale CEO Raymie Stata. Altiscale is an ODP member, but Stata has been involved with the Apache Software Foundation and Hadoop since his days as Yahoo CTO and is a generally trustworthy source on the space.)

Hortonworks CEO Rob Bearden at Structure Data 2014.

Really, Hortonworks isn’t a competitor?

Asked about the competitive landscape among Hadoop vendors, Reilly doubled down on his assessment from last October, calling Cloudera’s business model “a much more aggressive play [and] a much bolder vision” than what Hortonworks and MapR are doing. They’re often “submissive” to partners and treat Hadoop like an “add-on” rather than a focal point. If anything, Hortonworks has burdened itself by going public and by signing on to help prop up the legacy technologies that IBM and Pivotal are trying to sell, Reilly said.

Still, he added, Cloudera’s “enterprise data hub” strategy is more akin to the IBM and Pivotal business models of trying to become the centerpiece of customers’ data architectures by selling databases, analytics software and other components besides just Hadoop.

If you don’t buy that logic, Reilly has another argument that boils down to money. Cloudera earned more than $100 million last year (that’s GAAP revenue, he confirmed), while Hortonworks earned $46 million and, he suggested, MapR likely earned a similar number. Combine that with Cloudera’s huge investment from Intel in 2014 — it’s now “the largest privately funded enterprise software company in history,” Reilly said — and Cloudera owns the Hadoop space.

“We intend to take advantage” of this war chest to acquire companies and invest in new products, Reilly said. And although he wouldn’t get into specifics, he noted, “There’s no shortage of areas to look in.”

Diane Bryant, senior vice president and general manager of Intel’s Data Center Group, at Structure 2014.

The future is in applications

Reilly said that more than 60 percent of Cloudera sales are now “enterprise data hub” deployments, which is his way of saying its customers are becoming more cognizant of Hadoop as an application platform rather than just a tool. Yes, it can still store lots of data and transform it into something SQL databases can read, but customers are now building new applications for things like customer churn and network optimization with Hadoop as the core. Between 15 and 20 financial services companies are using Cloudera to help detect money laundering, he said, and Cloudera has trained its salesforce on a handful of the most popular use cases.

One of the technologies helping make Hadoop look a lot better for new application types is Spark, which simplifies the programming of data-processing jobs and runs them a lot faster than MapReduce does. Thanks to the YARN cluster-management framework, users can store data in Hadoop and process it using Spark, MapReduce and other processing engines. Reilly reiterated Cloudera’s big investment and big bet on Spark, saying that he expects a lot of workloads will eventually run on it.
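To make that contrast concrete, here is a minimal PySpark sketch of the pattern described above: the data stays put in HDFS while Spark, rather than MapReduce, does the processing on a YARN cluster. The file paths and job details are illustrative assumptions, not anything Cloudera ships.

```python
# A minimal PySpark sketch, assuming hypothetical HDFS paths.
# Submit to a YARN cluster with: spark-submit --master yarn wordcount.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("hdfs-wordcount")
sc = SparkContext(conf=conf)

# Read a text file that already lives in HDFS (path is illustrative).
lines = sc.textFile("hdfs:///data/logs/events.txt")

# A job that would take a full MapReduce map-and-reduce pass, in a few lines.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/output/wordcounts")
sc.stop()
```

Because YARN schedules the resources, the same cluster can run this Spark job alongside traditional MapReduce jobs, which is the flexibility Reilly is betting on.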

Databricks CEO (and Spark co-creator) Ion Stoica.

A year into the Intel deal and …

“It is a tremendous partnership,” Reilly said.

[company]Intel[/company] has been integral in helping Cloudera form partnerships with companies such as Microsoft and EMC, as well as with customers such as MasterCard, he said. The latter deal is particularly interesting because Cloudera and Intel’s joint engineering on hardware-based encryption helped Cloudera deploy a PCI-compliant Hadoop cluster, and MasterCard is now out pushing that system to its own clients via its MasterCard Advisors professional services arm.

Reilly added that Cloudera and Intel are also working together on new chips designed specifically for analytic workloads, which will take advantage of non-RAM memory types.

Asked whether Cloudera’s push to deploy more workloads in cloud environments is at odds with Intel’s goal to sell more chips, Reilly pointed to Intel’s recent strategy of designing chips especially for cloud computing environments. The company is operating under the assumption that data has gravity and that certain data that originates in the cloud, such as internet-of-things or sensor data, will stay there, while large enterprises will continue to store a large portion of their data locally.

Wherever they run, Reilly said, “[Intel] just wants more workloads.”

Microsoft’s machine learning guru on why data matters sooooo much

[soundcloud url="https://api.soundcloud.com/tracks/191875439" params="color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false" width="100%" height="166" iframe="true" /]

Not surprisingly, Joseph Sirosh has big ambitions for his product portfolio at Microsoft, which includes Azure ML, HDInsight and other tools. Chief among them is making it easy for mere mortals to consume these data services from the applications they’re familiar with. Take Excel, for example.

If a financial analyst can, with a few clicks, send data to a forecast service in the cloud, then get the numbers back, visualized on the same spreadsheet, that’s a pretty powerful story, said Sirosh, who is corporate VP of machine learning at Microsoft.
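Here is a rough sketch of that round trip from the developer’s side: push a series of numbers out of a spreadsheet to a cloud-hosted forecast endpoint and read the predictions back. The URL, API key and payload shape below are hypothetical placeholders, not Azure ML’s actual API.

```python
# A hedged sketch of "send data to a forecast service, get numbers back".
# The endpoint, key and JSON fields are assumptions for illustration only.
import requests

ENDPOINT = "https://example.cloud/api/forecast"   # hypothetical service URL
API_KEY = "YOUR-API-KEY"                          # hypothetical credential

# Monthly sales figures as they might be copied out of an Excel range.
history = [112.0, 118.5, 121.0, 130.2, 128.7, 140.1]

response = requests.post(
    ENDPOINT,
    json={"series": history, "horizon": 3},       # ask for a 3-period forecast
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()

forecast = response.json()["forecast"]            # assumed response field
print("Next three periods:", forecast)
```

In practice, a spreadsheet add-in would hide all of this behind the “few clicks” Sirosh describes.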

But as valuable as those applications and services are, more and more of the value derived from computing over time will come from the data itself, not all those tech underpinnings. “In the future a huge part of the value generated from computing will come from the data as opposed to storage and operating systems and basic infrastructure,” he noted on this week’s podcast. Which is why one topic under discussion at next month’s Structure Data show will be who owns all the data flowing betwixt and between various systems, the internet of things and so on.

When it comes to getting corporations running these new systems, [company]Microsoft[/company] may have an ace in the hole because so many of them already use key Microsoft tools — Active Directory, SQL Server, Excel. That gives them a pretty good on-ramp to Microsoft Azure and its resident services. Sirosh makes a compelling case, and we’ll talk to him more on stage at Structure Data next month in New York City.

In the first half of the show, Derrick Harris and I talk about how the Hadoop world has returned to its feisty and oh-so-interesting roots. When Pivotal announced its plan to offload support of Hadoop to [company]Hortonworks[/company] and work with that company, along with [company]IBM[/company] and [company]GE[/company], on the Open Data Platform, Cloudera chief strategy officer Mike Olson responded with a blog post giving his take.

Also on the docket: @WalmartLabs’ massive production OpenStack private cloud implementation.

Joseph Sirosh

SHOW NOTES

Hosts: Barb Darrow and Derrick Harris.

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed

PREVIOUS EPISODES:

No, you don’t need a ton of data to do deep learning 

VMware wants all those cloud workloads “marooned” in AWS

Don’t like your cloud vendor? Wait a second.

Hilary Mason on taking big data from theory to reality

On the importance of building privacy into apps and Reddit AMAs

Cloudera claims more than $100M in revenue in 2014

Hadoop vendor Cloudera announced on Tuesday that the company’s “[p]reliminary unaudited total revenue surpassed $100 million” in 2014. That the company, which is still privately held, would choose to disclose even that much information about its finances speaks to the fast maturation, growing competition and big egos in the Hadoop space.

While $100 million is a nice, round benchmark, the number by itself doesn’t mean much of anything. We still don’t know how much profit Cloudera made last year or, more likely, how big of a loss it sustained. What we do know, however, is that it earned more than bitter rivals [company]Hortonworks[/company] (which claimed $33.4 million through the first nine months of 2014, and will release its first official earnings report next week) and probably MapR (I’ve reached out to MapR about this and will update if I’m wrong). That said, Cloudera claims 525 customers are paying for its software (an 85 percent increase since 2013), while MapR in December claimed more than 700 paying customers.

Cloudera also did about as much business as EMC-VMware spinoff Pivotal claims its big data business did in 2014. On Tuesday, Pivotal open sourced much of its Hadoop and database technology, and teamed up with Hortonworks and a bunch of software vendors large and small to form a new Hadoop alliance called the Open Data Platform. Cloudera’s Mike Olson, the company’s chief strategy officer and founding CEO, called the move, essentially, disingenuous and more an attempt to save Pivotal’s business than a real attempt to advance open source Hadoop software.

Hortonworks CEO Rob Bearden at Structure Data 2014.

All of this grandstanding and positioning is part of a quest to secure business in a Hadoop market that analysts predict will be worth billions in the years to come, and also an attempt by each company to prove to potential investors that its business model is the best. Hortonworks surprised a lot of people by going public in December, and the stock has remained stable since then (although its share price dropped more than two percent on Tuesday despite the news with Pivotal). Many people suspect Cloudera and MapR will go public this year, and Pivotal at some point as well.

This much action should make for an entertaining and informative Structure Data conference, which is now less than a month away. We’ll have the CEOs of Cloudera, Hortonworks and MapR all on stage talking about the business of big data, as well as the CEO of Apache Spark startup Databricks, which might prove to be a great partner for Hadoop vendors as well as a thorn in their sides. Big users, including Goldman Sachs, ESPN and Lockheed Martin, will also be talking about the technologies and objectives driving their big data efforts.

Cloudera acquires self-service data-modeling startup Xplain.io

Hadoop vendor Cloudera is moving closer to the business intelligence space by acquiring a startup called Xplain.io. The company’s software analyzes users’ offline queries in order to determine which ones are most important to the business and how they might be improved upon.

Here’s how the Xplain.io website, well, explains the way its software works:

  • Step 1. Xplain.io only needs your queries to profile and understand the business logic. Intuitive dashboards act as mission control for the workload, showing what queries are critical to your business, what these queries do and which data is accessed most often.
  • Step 2. Generate schemas from your actual data usage to map query access patterns. Xplain.io monitors these patterns on a rolling basis, identifying optimizations in the data model and making modernization recommendations.
  • Step 3. Turn recommendations into new data models without having to hand code. Xplain.io batch transforms existing objects at the click of a button, then monitors their ongoing performance.

A screenshot of Xplain.io.

For Cloudera, the acquisition represents a chance to help its customers make better use of data by optimizing their queries for better performance and perhaps helping them find the right data store for the job. That could be Cloudera Impala or a NoSQL database or even a standard relational database. Anupam Singh, Xplain.io’s co-founder and CEO, described part of the company’s rationale and process in a blog post announcing the acquisition:

Xplain.io’s first customer executes nearly 8.4 million (yes, million!) SQL queries annually against various data stores. This begs the question: How many of these queries have access patterns that could benefit from a new data model? The customer did not have a clear answer, and we saw an opportunity. Today, Xplain.io’s profiler is used to identify the most common data access patterns and Xplain.io’s transformation engine is used to generate the schema design for modern data stores such as Impala.
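To give a flavor of what query profiling means in practice, here is a deliberately simple sketch of the general idea: scan a log of SQL statements, tally which tables are touched most often, and surface the hottest access patterns as candidates for a new data model. This is a toy illustration under assumed inputs, not Xplain.io’s actual profiler.

```python
# A toy query-log profiler. The log file name, its one-statement-per-line
# format and the table-extraction regex are all assumptions for illustration.
import re
from collections import Counter

table_counts = Counter()

with open("query_log.sql") as f:
    for statement in f:
        # Very rough extraction of table names following FROM / JOIN keywords.
        for match in re.finditer(r"\b(?:FROM|JOIN)\s+([\w.]+)", statement, re.IGNORECASE):
            table_counts[match.group(1).lower()] += 1

# The most frequently accessed tables are candidates for a new, denormalized
# data model targeted at an analytic store such as Impala.
for table, hits in table_counts.most_common(10):
    print(f"{table}: {hits} queries")
```

The real product goes much further, generating target schemas for modern data stores rather than just counting table references, but the profiling-then-transformation flow is the same shape.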

If Cloudera is serious about becoming an enterprise data platform company that lives up to the “enterprise data hub” software it’s selling, these are the types of acquisitions it needs to make and nurture. The name of the game isn’t just getting more data into Hadoop and focusing all development around Hadoop, but working to improve the whole ecosystem of technologies around and above Hadoop, as well.

In the broader Hadoop market, the Xplain.io acquisition is just another move in a game of strategy that has been going on for several years and likely will go for several more to come. Last week, for example, Cloudera rival Hortonworks — which had a successful initial public offering in December — announced a Hadoop data governance initiative along with customers Target, Merck and Aetna, and invited Cloudera to join (hint: it probably won’t). Last year around this time, Cloudera secured a massive investment from Intel and several others that resulted in more than half a billion dollars in the company’s war chest.

Cloudera CEO Tom Reilly (left) at Structure Data 2014.

We’ll hear a lot more about all of this at our Structure Data conference in March, which features talks from Cloudera CEO Tom Reilly, Hortonworks CEO Rob Bearden and MapR CEO John Schroeder.

Xplain.io is Cloudera’s fourth acquisition, with its most recent being an “acquihire” (to use a Silicon Valley term of art) of data science specialists Datapad in September. Xplain.io was founded in October 2013 and had raised money from Mayfield Fund.

Hortonworks and 3 large users launch a project to secure Hadoop

Hadoop vendor Hortonworks, along with customers Target, Merck and Aetna, and software vendor SAS, has started a new group designed to ensure that data stored inside Hadoop systems is used only how it’s supposed to be used and seen only by those who are supposed to see it. The effort, called the Data Governance Initiative, will function as an open source project and will address the concerns of enterprises that want to store more data in Hadoop but fear the system won’t comply with industry regulations or stand up to audits.

The group is similar in spirit to the Open Compute Project, which launched in 2011. Facebook spearheaded that effort and drove a lot of early innovation, but the project has since seen lots of contributions and involvement from technology companies such as Microsoft and end-user companies such as Goldman Sachs. Tim Hall, Hortonworks’ vice president of product management, said Target, Merck and Aetna will be active contributors to the new Hadoop organization — sharing their business and technical expertise in the markets in which they operate, as well as developing and deploying code.

Part of the rationale for creating the Data Governance Initiative was to address questions about the sustainability of the Hortonworks open source business model, some of which were brought to light by the revenue numbers it published as part of its initial public offering process, Hall acknowledged. The idea is that this group will demonstrate Hortonworks’ commitment to enterprise concerns and work with large companies to solve them. It will also show how Hortonworks can drive Hadoop innovation without abandoning its open source model.

“We want to make sure folks understand it’s not just these software companies we can work with,” Hall said, referencing the initial phases of Hadoop development led by companies such as Yahoo and Facebook.

A high-level view of Apache Falcon.

Hortonworks plans to publish more information about the Data Governance Initiative’s technical roadmap and early work in February, but the Apache Falcon and Apache Ranger projects that Hortonworks backs will be key components, and there will be an emphasis on maintaining policies as data moves between Hadoop and other data systems. Code will be contributed back to the Apache Software Foundation.

Hall said any companies are welcome to join — including Hadoop rivals such as MapR and Cloudera, which has its own pet projects around Hadoop security —  but, he noted, “It’s up to the other vendors to recognize the value that’s being created here.”

“There’s no reason why Cloudera couldn’t wire up their [Apache] Sentry project to this,” Hall added. “. . . We’d be happy to have them participate in this once it goes into incubator status.”

Of course, Hadoop competition being what it is, he might well suspect that won’t happen anytime soon. Cloudera actually published a well-timed blog post on Wednesday morning touting the security features of its Hadoop distribution.

You can hear all about the Hadoop space at our Structure Data conference in March, where Hortonworks CEO Rob Bearden, Cloudera CEO Tom Reilly and MapR CEO John Schroeder will each share their visions of where the technology is headed.

Update: This post was updated at 12:20 to correct the name of the organization. It is the Data Governance Initiative, not the Data Governance Institute.

The 5 stories that defined the big data market in 2014

There is no other way to put it: 2014 was a huge year for the big data market. It seems years of talk about what’s possible are finally giving way to some real action on the technology front — and there’s a wave of cash following close behind it.

Here are the five stories from the past year that were meaningful in their own rights, but really set the stage for bigger things to come. We’ll discuss many of these topics in depth at our Structure Data conference in March, but until then feel free to let me know in the comments what I missed, where I went wrong or why I’m right.

5. Satya Nadella takes the reins at Microsoft

Microsoft CEO Satya Nadella has long understood the importance of data to the company’s long-term survival, and his ascendance to the top spot ensures Microsoft won’t lose sight of that. Since Nadella was appointed CEO in February, we’ve already seen Microsoft embrace the internet of things, and roll out new data-centric products such as Cortana, Skype Translate and Azure Machine Learning. Microsoft has been a major player in nearly every facet of IT for decades and how it executes in today’s data-driven world might dictate how long it remains in the game.

Satya Nadella speaks at a Microsoft Cloud event.

4. Apache Spark goes legit

It was inevitable that the Spark data-processing framework would become a top-level project within the Apache Software Foundation, but the formal designation felt like an official passing-of-the-torch nonetheless. Spark promises to do for the Hadoop ecosystem all the things MapReduce never could around speed and usability, so it’s no wonder Hadoop vendors, open source projects and even some forward-thinking startups are all betting big on the technology. Databricks, the first startup trying to commercialize Spark, has benefited from this momentum, as well.

Spark co-creator and Databricks CEO Ion Stoica.

3. IBM bets its future on Watson

Big Blue might have abandoned its server and microprocessor businesses, but IBM is doubling down on cognitive computing and expects its new Watson division to grow into a $10 billion business. The company hasn’t wasted any time trying to get the technology into users’ hands — it has since announced numerous research and commercial collaborations, highlighted applications built atop Watson and even worked Watson tech into the IBM cloud platform and a user-friendly analytics service. IBM’s experiences with Watson won’t only affect its bottom line; they could be a strong indicator of how enterprises will ultimately use artificial intelligence software.

A shot of IBM’s new Watson division headquarters in Manhattan.

2. Google buys DeepMind

It’s hard to find a more exciting technology field than artificial intelligence right now, and deep learning is the force behind a lot of that excitement. Although there were a myriad of acquisitions, startup launches and research breakthroughs in 2014, it was Google’s acquisition of London-based startup DeepMind in January that set the tone for the year. The price tag, rumored to be anywhere from $450 million to $628 million, got the mainstream technology media paying attention, and it also let deep learning believers (including those at competing companies) know just how important deep learning is to Google.

Google’s Jeff Dean talks about early deep learning results at Structure 2013.

1. Hortonworks goes public

Cloudera’s massive (and somewhat convoluted) deal with Intel boosted the company’s valuation past $4 billion and sent industry-watchers atwitter, but the Hortonworks IPO in December was really a game-changer. It came faster than most people expected, was more successful than many people expected, and should put the pressure on rivals Cloudera and MapR to act in 2015. With a billion-plus-dollar market cap and public market trust, Hortonworks can afford to scale its business and technology — and maybe even steal some valuable mindshare — as the three companies vie to own what could be a humongous software market in a few years’ time.

Hortonworks rings the opening bell on its IPO day.

Honorable mentions

Why the Hortonworks IPO could be a bellwether for Hadoop

Hortonworks’ IPO filing on Monday shows that Hadoop is still a resource- and risk-intensive business, but also suggests it’s one that public market investors will be willing to back. It might also start the ball rolling for long-anticipated moves in Hadoop.

Cloudera now lets you manage cloud instances just like the real thing

Cloudera has announced a new product called Director that will make it easier for customers to manage their Hadoop clusters on the Amazon Web Services cloud. It’s likely the first of many moves Cloudera will make to expand its presence outside of customer data centers.