Turning data scientists into action heroes: The rise of self-service Hadoop

Mike is chief operating officer at Altiscale.
The unfortunate truth about data science professionals is that they spend a shockingly small amount of time actually exploring data. Instead, they are stuck devoting significant amounts of time to wrangling data and pouring resources into the tedious act of prepping and managing it.
While Hadoop excels at turning massive amounts of data into valuable insights, it’s also a notorious culprit for sucking up resources. In fact, these hurdles are serious bottlenecks to big data success, with research firm Gartner predicting that through 2018, 70 percent of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges.
Whether it’s time stuck in a queue behind higher priority jobs or functioning as a Hadoop operations person (building their own clusters, accessing data sources, and running and troubleshooting jobs), data scientists are wasting time on administrative tasks. Sure, it’s necessary to do some heavy lifting to successfully perform analysis on data. But it isn’t the best use of a data scientist’s time, and it’s a drain on an organization’s resources.
That said, how can data scientists stop serving as substitute Hadoop administrators and become analytics action heroes?
Just as the business intelligence industry has moved to a more self-service model, the Hadoop industry is also moving to a self-service model. Operational challenges are moving to the background, so that data scientists are liberated to spend more time building models, exploring data, testing hypotheses, and developing new analytics.
Self-service Hadoop solutions simplify, streamline, and automate the steps needed to create a data exploration environment. Self-service is achieved when a provider (one who runs and operates a scalable, secure Hadoop environment) delivers a data science platform for the analytics team.
With a self-service environment, data scientists can focus on the data analysis, while being confident that the data and Hadoop operations are well taken care of. And these environments can be kept separate from production environments, ensuring that test data science jobs don’t interfere with a production Hadoop environment that is core to business operations, thereby reducing risk of operational mishaps.
As we see a rise in self-service Hadoop, organizations will realize the benefits of analytics action heroes and their super power contributions. Here are a few reasons why:

  • Faster understanding of trends and correlations that drive business action: Self-service tools eliminate the complex and time-consuming steps of procuring and provisioning hardware, installing and configuring Hadoop, and managing clusters in production. By automating the handling of issues that customers run into in production, such as job failures, resource contention, performance optimization and infrastructure upgrades, self-service tools let data analytics projects run with more ease and speed.
  • Freedom to take risks with more agile data science and analytics teams: Using the latest self-service technology in the Hadoop ecosystem, organizations can gain a competitive edge not previously possible. Teams can experiment with advanced technology in a production environment, without the overhead associated with maintaining an on-premise solution. This allows data scientists to develop cutting-edge products that leverage features in the most advanced software available.
  • Increased time for Hadoop experts to focus on value-added tasks: Operational stability frees up internal resources so Hadoop experts can focus on unearthing data insights and other value-added tasks such as data modeling. Simply put, with more time spent examining the data rather than wrangling it, organizations can uncover insights that drive business forward and deliver on the true promise of big data.

Hadoop has unlimited potential to drive business forward. Yet it can quickly become a drain on internal operational resources when running in production and at scale. To fully realize big data’s potential, organizations need to devote more time to data science and less to Hadoop infrastructure, and self-service tools make this a reality.

Airbnb open sources SQL tool built on Facebook’s Presto database

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

He shared the example of folks at Airbnb tasked with determining the effectiveness of the automated emails the company sends out when someone books a room, resets a password or takes any of a number of other actions. Data scientists used to have to dive into Hive (the SQL-like data warehouse framework for Hadoop that Facebook open sourced in 2008) to answer that type of question, which meant slow turnaround times because of human and technological factors. Now, lots of employees can access that same data via Airpal in just minutes, he said.
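
To make the email example concrete, here is a minimal sketch of the kind of query a non-specialist might write. Everything in it is hypothetical: the `automated_emails` table, its columns and the sample rows are invented for illustration, and Python's built-in SQLite stands in for Presto so the snippet runs anywhere. It is not Airbnb's schema or Airpal's interface.

```python
import sqlite3

# Hypothetical data: one row per automated email sent.
# Table and column names are illustrative, not Airbnb's real schema.
ROWS = [
    ("booking_confirmation", 1),
    ("booking_confirmation", 1),
    ("booking_confirmation", 0),
    ("password_reset", 1),
    ("password_reset", 0),
    ("password_reset", 0),
    ("password_reset", 0),
]

# Plain standard-SQL aggregation: open rate per email type.
QUERY = """
SELECT email_type,
       COUNT(*) AS sent,
       SUM(opened) AS opened,
       ROUND(100.0 * SUM(opened) / COUNT(*), 1) AS open_rate_pct
FROM automated_emails
GROUP BY email_type
ORDER BY open_rate_pct DESC
"""


def open_rates(rows):
    """Load the sample rows into an in-memory database and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE automated_emails (email_type TEXT, opened INTEGER)"
        )
        conn.executemany("INSERT INTO automated_emails VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in open_rates(ROWS):
        print(row)
```

The query uses only a `GROUP BY` with basic aggregates, which is roughly the level of SQL the article suggests a newcomer could pick up.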

The Airpal user interface.

As cool as Airpal might be for Airbnb users, though, it really owes its existence to Presto. Back when everyone was using Hive for data analysis inside Hadoop — it was and continues to be widely used within web companies — only 10 to 15 people within Airbnb understood the data and could write queries using its somewhat complicated version of SQL. Because Hive is based on MapReduce, the batch-processing engine most commonly associated with Hadoop, Hive is also slow (although new improvements have increased its speed drastically).

Airbnb also used Amazon’s Redshift cloud data warehouse for a while, said software engineer Andy Kramolisch, and while it was fast, it wasn’t as user-friendly as the company would have liked. It also required replicating data from Hive, meaning more work for Airbnb and more data for the company to manage. (If you want to hear more about all this Hadoop and big data stuff from leaders at Google, Cloudera and elsewhere, come to our Structure Data conference March 18-19 in New York.)

A couple years ago, Facebook created and then open sourced Presto as a means to solve Hive’s speed problems. It still accesses data from Hive, but is designed to deliver results at interactive speeds rather than in minutes or, depending on the query, much longer. It also uses standard ANSI SQL, which Kramolisch said is easier to learn than the Hive Query Language and its “lots of hidden gotchas.”

Still, Mayfield noted, it’s not as if everyone inside Airbnb, or any company, is going to be running SQL queries using Airpal — no matter how easy the tooling gets. In those cases, he said, the company tries to provide dashboards, visualizations and other tools to help employees make sense of the data they need to understand.

“I think it would be rad if the CEO was writing SQL queries,” he said, “but …”

The Hadoop wars, HP cloud(s) and IBM’s big win

[Podcast audio: https://api.soundcloud.com/tracks/194323297]

If you are confused about Hewlett-Packard’s cloud plan of action, this week’s guest will lay it out for you. Bill Hilf, the SVP of Helion product management, makes his second appearance on the show (does that make him our Alec Baldwin?) to talk about HP’s many-clouds-one-management layer strategy.

The lineup? Helion Eucalyptus for private clouds which need Amazon Web Services API compatibility; Helion OpenStack for the rest; and HP Cloud Development Platform (aka Cloud Foundry) for platform as a service. Oh, and there’s HP Public Cloud, which I will let him tell you about himself.

But first Derrick Harris and I are all over IBM’s purchase of AlchemyAPI, the cool deep learning startup that does stuff like identifying celebs and wanna-be celebs from their photos. It’s a win for IBM because all that coolness will be sucked into Watson and expand the API set Watson can parlay for more useful work. (I mean, winning Jeopardy is not really a business model, as IBM Watson exec Mike Rhodin himself has pointed out.)

At first glance it might seem that a system that can tell the difference between Will Ferrell and Chad Smith might be similarly narrow, but after consideration you can see how that fine-grained, self-teaching technology could find broader uses.

AlchemyAPI CEO Elliot Turner and IBM Watson sales chief Stephen Gold shared the stage at Structure Data last year. Who knows what deals might be spawned at this year’s event?


Also, we’re happy to follow the escalating smack talk in the Hadoop arena as Cloudera CEO Tom Reilly this week declared victory over the new Hortonworks-IBM-Pivotal-backed Open Data Platform effort, which we’re now fondly referring to as the ABC or “Anyone But Cloudera” alliance.

It’s a lively show so have a listen and (hopefully) enjoy.


Hosts: Barb Darrow, Derrick Harris and Jonathan Vanian

Cloudera CEO declares victory over big data competition

Cloudera CEO Tom Reilly doesn’t often mince words when it comes to describing his competition in the Hadoop space, or Cloudera’s position among those other companies. In October 2013, Reilly told me he didn’t consider Hortonworks or MapR to be Cloudera’s real competition, but rather larger data-management companies such as IBM and EMC-VMware spinoff Pivotal. And now, Reilly says, “We declare victory over at least one of our competitors.”

He was referring to Pivotal, and the Open Data Platform, or ODP, alliance it helped launch a couple weeks ago along with Hortonworks, IBM, Teradata and several other big data vendors. In an interview last week, Reilly called that alliance “a ruse and, frankly, a graceful exit for Pivotal,” which laid off a number of employees working on its Hadoop distribution and is now outsourcing most of its core Hadoop development and support to Hortonworks.

You can read more from Reilly below, including his takes on Hortonworks, Hadoop revenues and Spark, as well as some expanded thoughts on the ODP. For more information about the Open Data Platform from the perspectives of the members, you can read our coverage of its launch in mid-February as well as my subsequent interview with Hortonworks CEO Rob Bearden, who explains in some detail how that alliance will work.

If you want to hear about the fast-changing, highly competitive and multi-billion-dollar business of big data straight from horses’ mouths, make sure to attend our Structure Data conference March 18 and 19 in New York. Speakers include Cloudera’s Reilly and Hortonworks’ Bearden, as well as MapR CEO John Schroeder, Databricks CEO (and Spark co-creator) Ion Stoica, and other big data executives and users, including those from large firms such as Lockheed Martin and Goldman Sachs.


You down with ODP? No, not me

While Hortonworks explains the Open Data Platform essentially as a way for member companies to build on top of Hadoop without, I guess, formally paying Hortonworks for support or embracing its entire Hadoop distribution, Reilly describes it as little more than a marketing ploy. Aside from calling it a graceful exit for Pivotal (and, arguably, IBM), he takes issue with even calling it “open.” If the ODP were truly open, he said, companies wouldn’t have to pay for membership, Cloudera would have been invited and, when it asked about the alliance, it wouldn’t have been required to sign a non-disclosure agreement.

What’s more, Reilly isn’t certain why the ODP is really necessary technologically. It’s presently composed of four of the most mature Hadoop components, he explained, and a lot of companies are actually trying to move off of MapReduce (to Spark or other processing engines) and, in some cases, even the Hadoop Distributed File System. Hortonworks, which supplied the ODP core and presumably will handle much of the future engineering work, will be stuck doing the other members’ bidding as they decide which of several viable SQL engines and other components to include, he added.

“I don’t think we could have scripted [the Open Data Platform news] any better,” Reilly said. He added, “[T]he formation of the ODP … is a big shift in the landscape. We think it’s a shift to our advantage.”

(If you want a possibly more nuanced take on the ODP, check out this blog post by Altiscale CEO Raymie Stata. Altiscale is an ODP member, but Stata has been involved with the Apache Software Foundation and Hadoop since his days as Yahoo CTO and is a generally trustworthy source on the space.)

Hortonworks CEO Rob Bearden at Structure Data 2014.

Really, Hortonworks isn’t a competitor?

Asked about the competitive landscape among Hadoop vendors, Reilly doubled down on his assessment from October 2013, calling Cloudera’s business model “a much more aggressive play [and] a much bolder vision” than what Hortonworks and MapR are doing. They’re often “submissive” to partners and treat Hadoop like an “add-on” rather than a focal point. If anything, Hortonworks has burdened itself by going public and by signing on to help prop up the legacy technologies that IBM and Pivotal are trying to sell, Reilly said.

Still, he added, Cloudera’s “enterprise data hub” strategy is more akin to the IBM and Pivotal business models of trying to become the centerpiece of customers’ data architectures by selling databases, analytics software and other components besides just Hadoop.

If you don’t buy that logic, Reilly has another argument that boils down to money. Cloudera earned more than $100 million last year (that’s GAAP revenue, he confirmed), while Hortonworks earned $46 million and, he suggested, MapR likely earned a similar number. Combine that with Cloudera’s huge investment from Intel in 2014 — it’s now “the largest privately funded enterprise software company in history,” Reilly said — and Cloudera owns the Hadoop space.

“We intend to take advantage” of this war chest to acquire companies and invest in new products, Reilly said. And although he wouldn’t get into specifics, he noted, “There’s no shortage of areas to look in.”

Diane Bryant, senior vice president and general manager of Intel's Data Center Group, at Structure 2014.

The future is in applications

Reilly said that more than 60 percent of Cloudera sales are now “enterprise data hub” deployments, which is his way of saying its customers are becoming more cognizant of Hadoop as an application platform rather than just a tool. Yes, it can still store lots of data and transform it into something SQL databases can read, but customers are now building new applications for things like customer churn and network optimization with Hadoop as the core. Between 15 and 20 financial services companies are using Cloudera to detect money laundering, he said, and Cloudera has trained its salesforce on a handful of the most popular use cases.

One of the technologies helping make Hadoop look a lot better for new application types is Spark, which simplifies the programming of data-processing jobs and runs them a lot faster than MapReduce does. Thanks to the YARN cluster-management framework, users can store data in Hadoop and process it using Spark, MapReduce and other processing engines. Reilly reiterated Cloudera’s big investment and big bet on Spark, saying that he expects a lot of workloads will eventually run on it.
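
The claim that Spark simplifies programming is easier to see with a toy example. The sketch below is plain Python, not Hadoop or Spark code: it counts words first in explicit map, shuffle and reduce phases, the way a MapReduce job is structured, and then as a single chained expression, which is closer in spirit to how Spark programs read. The function names and sample data are made up for illustration, and the example says nothing about speed.

```python
from collections import Counter, defaultdict
from itertools import chain

# Made-up input: each string plays the role of one line in a file.
LINES = ["spark and hadoop", "hadoop and yarn", "spark"]


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    # Group values by key, as the framework does between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    # Sum the grouped counts for each word.
    return {key: sum(values) for key, values in grouped.items()}


def word_count_mapreduce(lines):
    # Wire the three phases together by hand.
    pairs = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(pairs))


def word_count_chained(lines):
    # Same result expressed as one pipeline of transformations.
    return dict(Counter(word for line in lines for word in line.split()))


if __name__ == "__main__":
    print(word_count_mapreduce(LINES))
    print(word_count_chained(LINES))
```

Both functions return the same counts; the difference is how much plumbing the programmer has to write by hand.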

Databricks CEO (and AMPLab co-director) Ion Stoica.

A year into the Intel deal and …

“It is a tremendous partnership,” Reilly said.

Intel has been integral in helping Cloudera form partnerships with companies such as Microsoft and EMC, as well as with customers such as MasterCard, he said. The latter deal is particularly interesting because Cloudera and Intel’s joint engineering on hardware-based encryption helped Cloudera deploy a PCI-compliant Hadoop cluster, and MasterCard is now out pushing that system to its own clients via its MasterCard Advisors professional services arm.

Reilly added that Cloudera and Intel are also working together on new chips designed specifically for analytic workloads, which will take advantage of non-RAM memory types.

Asked whether Cloudera’s push to deploy more workloads in cloud environments is at odds with Intel’s goal to sell more chips, Reilly pointed to Intel’s recent strategy of designing chips especially for cloud computing environments. The company is operating under the assumption that data has gravity and that certain data that originates in the cloud, such as internet-of-things or sensor data, will stay there, while large enterprises will continue to store a large portion of their data locally.

Wherever they run, Reilly said, “[Intel] just wants more workloads.”

Hortonworks did $12.7M in Q4, on its path to a billion, CEO says

Hadoop vendor Hortonworks announced its first quarterly earnings as a publicly held company Tuesday, claiming $12.7 million in fourth-quarter revenue and $46 million in revenue during fiscal year 2014. The numbers represent 55 percent quarter-over-quarter and 91 percent year-over-year increases, respectively. The company had a net loss of $90.6 million in the fourth quarter and $177.3 million for the year.

However, Hortonworks contends that revenue is not the most important number in assessing its business. Rather, as CEO Rob Bearden explained around the time the company filed its S-1 pre-IPO statement in November, Hortonworks thinks its total billings are a more accurate representation of its health. That’s because the company relies fairly heavily on professional services, meaning the company often doesn’t get paid until a job is done.

The company’s billings in the fourth quarter totaled $31.9 million, a 148 percent year-over-year increase. Its fiscal year billings were $87.1 million, a 134 percent increase over 2013.
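
Those growth rates imply prior-period figures that the article does not state. A quick back-of-the-envelope sketch, using only the numbers reported above and treating the rounded percentages as exact (an assumption), works them out. The function and variable names are illustrative.

```python
def prior_value(current, pct_increase):
    # Invert "current is pct_increase percent higher than prior".
    return current / (1 + pct_increase / 100)


# Figures as reported in the article, in millions of dollars.
Q4_BILLINGS, Q4_GROWTH = 31.9, 148
FY_BILLINGS, FY_GROWTH = 87.1, 134
FY_REVENUE = 46.0

# Implied year-ago billings (approximate, since the percentages are rounded).
implied_q4_2013 = prior_value(Q4_BILLINGS, Q4_GROWTH)   # roughly 12.9
implied_fy_2013 = prior_value(FY_BILLINGS, FY_GROWTH)   # roughly 37.2

# How much larger fiscal-year billings are than recognized revenue.
billings_to_revenue = FY_BILLINGS / FY_REVENUE          # roughly 1.9

if __name__ == "__main__":
    print(round(implied_q4_2013, 1))
    print(round(implied_fy_2013, 1))
    print(round(billings_to_revenue, 1))
```

On these assumptions, fiscal-year billings run at a bit under twice recognized revenue, which is the gap behind the billings-versus-revenue argument.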

If you buy Bearden’s take on the importance of billings over revenue, then Hortonworks looks a lot more comparable in size to its largest rival, Cloudera. Last week, Cloudera announced more than $100 million in revenue in 2014, as well as an 85 percent increase in subscription software customers up to 525 in total.

Hortonworks, for its part, added 99 customers paying for enterprise support of its Hadoop platform in the fourth quarter alone, bringing its total to 332. Among those customers are Expedia, Macy’s, Blackberry and Spotify, all four of which moved directly to Hortonworks from Cloudera, a Hortonworks spokesperson said.

There are, however, some key differences between the Hortonworks and Cloudera business models, as well as that of fellow vendor MapR, that affect how comparable any of these metrics really are. While Hortonworks is focused on free open source software and relies on support contracts for revenue, Cloudera and MapR offer both free Hadoop distributions and more feature-rich paid versions. In late 2013, Cloudera CEO Tom Reilly told me his company was interested in securing big deployments rather than chasing cheap support contracts.

Rob Bearden at Structure Data 2014

I had a broad discussion with Bearden last week about the Hadoop market and some of Hortonworks’ recent moves in that space, including the somewhat-controversial Open Data Platform alliance it helped to create along with Pivotal, IBM, GE and others. Here are the highlights from that interview. (If you want to hear more from Bearden and perhaps ask him some of your own questions, make sure to attend our Structure Data conference March 18 and 19 in New York. Other notable Hadoop-market speakers include Cloudera CEO Tom Reilly, MapR CEO John Schroeder and Databricks CEO (and Spark co-creator) Ion Stoica.)

Explain the rationale behind the Open Data Platform

Bearden wouldn’t comment specifically on criticisms — made most loudly by Cloudera’s Mike Olson and Doug Cutting, as well as some industry analysts — that the Open Data Platform, or ODP, is somehow antithetical to open source or the Apache Software Foundation. “What I would say,” he noted, “is the people who are committed to true open source and an open platform for the community are pretty excited about the thing.”

He also chalked up a lot of the criticism of the ODP to misunderstanding about how it really will work in practice. “One of the things I don’t think is very clear on the Open Data Platform alliance is that we’re actually going to provide what we’ll refer to as the core for that alliance, that is based on core Hadoop — so HDFS and YARN and Ambari,” Bearden explained. “We’re providing that, which is obviously directly from Apache, and it’s the exact same bit line that [the Hortonworks Data Platform] is based on.”

Paul Maritz, CEO of Hortonworks partner and ODP member Pivotal, at Structure Data 2014.

So, the core Hadoop distribution that platform members will use is based on Apache code, and anything that ODP members want to add on top of it will also have to go through Apache. These could be existing Apache projects, or they could be new projects the members decide to start on their own, Bearden said.

“We’re actually strengthening the position of the Apache Software Foundation,” he said. He added later in the interview, on the same point, that people shouldn’t view the ODP as much different than they view Hortonworks (or, in many respects, Cloudera or MapR). “[The Apache Software Foundation] is the engineering arm,” he said, “and this entity will become the productization and packaging arm for [Apache].”

So, it’s Cloudera vs. MapR vs. Hortonworks et al?

I asked Bearden whether the formation of the ODP officially makes the Hadoop market a case of Cloudera and MapR versus the Hortonworks ecosystem. That seems like the case to me, considering that the ODP is essentially providing the core for a handful of potentially big players in the Hadoop space. And even if they’re not ODP members, companies such as [company]Microsoft[/company] and [company]Rackspace[/company] have built their Hadoop products largely on top of the Hortonworks platform and with its help.

Bearden wouldn’t bite. At least not yet.

“I wouldn’t say it’s the other guys versus all of us,” he said. “I would say what’s happened is the community has realized this is what they want and it fits in our model that we’re driving very cleanly. . . . And we’re not doing anything up the stack to try and disintermediate them, and we de-risk it because we’re all open.”

The “this” he’s referring to is the ability of its partners to stop spending resources keeping up with the core Hadoop technology and instead focus on how they can monetize their own intellectual property. “To do that, the more data they put under management, the faster and the more-stable and enterprise-viable [the platform on which they have that data], the faster they monetize and the bigger they monetize the rest of their platform,” Bearden said.

Microsoft CEO Satya Nadella speaks at a Microsoft cloud event about that company’s newfound embrace of open source. Photo by Jonathan Vanian/Gigaom

Are you standing by your prediction of a billion-dollar company?

“I am not backing off that at all,” Bearden said, in reference to his prediction at Structure Data last year that Hadoop will soon become a multi-billion-dollar market and Hortonworks will be a billion-dollar company in terms of revenue. He said it’s fair to look at revenue alone in assessing the businesses in this space, but it’s not the be-all and end-all.

“It’s less about [pure money] and more about what is the ecosystem doing to really start adopting this,” he said. “Are they trying to fight it and reject it, or are they really starting to embrace it and pull it through? Same with the big customers. . . .

“When those things are happening, the money shows up. It just does.”

Hadoop is actually just a part — albeit a big one — of a major evolution in the data-infrastructure space, he explained. And as companies start replacing the pieces of their data environments, they’ll do so with the open source options that now dominate new technologies. These include Hadoop, NoSQL databases, Storm, Kafka, Spark and the like.

In fact, Bearden said, “Open source companies can be very successful in terms of revenue growth and in terms of profitability faster than the old proprietary platforms got there.”

Time will tell.

Update: This post was updated at 8:39 p.m. PT to correct the amount of Hortonworks’ fourth quarter revenue and losses. Revenue was $12.7 million, not $12.5 million as originally reported, and losses were $90.6 million for the quarter and $177.3 million for the year. The originally reported numbers were for gross loss.

For now, Spark looks like the future of big data

Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week but Hadoop wasn’t the star of the show. Based on the news I saw coming out of the event, it’s another Apache project — Spark — that has people excited.

There was, of course, some big Hadoop news this week. Pivotal announced it’s open sourcing its big data technology and essentially building its Hadoop business on top of the Hortonworks platform. Cloudera announced it earned $100 million in 2014. Lost in the grandstanding was MapR, which announced something potentially compelling in the form of cross-data-center replication for its MapR-DB technology.

But pretty much everywhere else you looked, it was technology companies lining up to support Spark: Databricks (naturally), Intel, Altiscale, MemSQL, Qubole and ZoomData among them.

Spark isn’t inherently competitive with Hadoop — in fact, it was designed to work with Hadoop’s file system and is a major focus of every Hadoop vendor at this point — but it kind of is. Spark is known primarily as an in-memory data-processing framework that’s faster and easier than MapReduce, but it’s actually a lot more. Among the other projects included under the Spark banner are file system, machine learning, stream processing, NoSQL and interactive SQL technologies.

The Spark platform, minus the Tachyon file system and some younger related projects.

In the near term, Hadoop will probably pull Spark into the mainstream, because Hadoop is still, at the very least, a cheap, trusted big data storage platform. And with Spark still being relatively immature, it’s hard to see too many companies ditching Hadoop MapReduce, Hive or Impala for their big data workloads quite yet. Wait a few years, though, and we might start seeing some more tension between the two platforms, or at least an evolution in how they relate to each other.

This will be especially true if there’s a big breakthrough in RAM technology or prices drop to a level that’s more comparable to disk. Or if Databricks can convince companies they want to run their workloads in its nascent all-Spark cloud environment.

Attendees at our Structure Data conference next month in New York can ask Spark co-creator and Databricks CEO Ion Stoica all about it — what Spark is, why Spark is and where it’s headed. Coincidentally, Spark Summit East is taking place the exact same days in New York, where folks can dive into the nitty gritty of working with the platform.

There were also a few other interesting announcements this week that had nothing to do with Spark, but are worth noting here:

  • Microsoft added Linux support for its HDInsight Hadoop cloud service, and Python and R programming language support for its Azure ML cloud service. The latter also now lets users deploy deep neural networks with a few clicks. For more on that, check out the podcast interview with Microsoft Corporate Vice President of Machine Learning (and Structure Data speaker) Joseph Sirosh embedded below.
  • HP likes R, too. It announced a product called HP Haven Predictive Analytics that’s powered by a distributed version of R developed by HP Labs. I’ve rarely heard HP and data science in the same sentence before, but at least it’s trying.
  • Oracle announced a new analytic tool for Hadoop called Big Data Discovery. It looks like a cross between Platfora and Tableau, and I imagine will be used primarily by companies that already purchase Hadoop in appliance form from Oracle. The rest will probably keep using Platfora and Tableau.
  • Salesforce.com furthered its newfound business intelligence platform with a handful of features designed to make the product easier to use on mobile devices. I’m generally skeptical of Salesforce’s prospects in terms of stealing any non-Salesforce-related analytics from Tableau, Microsoft, Qlik or anyone else, but the mobile angle is compelling. The company claims more than half of user engagement with the platform is via mobile device, which its Director of Product Marketing Anna Rosenman explained to me as “a really positive testament that we have been able to replicate a consumer interaction model.”

If I missed anything else that happened this week, or if I’m way off base in my take on Hadoop and Spark, please share in the comments.

[Podcast audio: https://api.soundcloud.com/tracks/191875439]

Google open sources a MapReduce framework for C/C++

Google announced on Wednesday that the company is open sourcing a MapReduce framework that will let users run native C and C++ code in their Hadoop environments. Depending on how much traction MapReduce for C, or MR4C, gets and by whom, it could turn out to be a pretty big deal.

Hadoop is famously, or infamously, written in Java and as such can suffer from performance issues compared with native C++ code. That’s why Google’s original MapReduce system was written in C++, as is the Quantcast File System, Quantcast’s homegrown alternative to the Hadoop Distributed File System. And, as the blog post announcing MR4C notes, “many software companies that deal with large datasets have built proprietary systems to execute native code in MapReduce frameworks.”

This is the same sort of rationale behind Facebook’s HipHop efforts and database startup MemSQL, whose system converts SQL to C++ before executing it.


MR4C was developed by satellite imagery company Skybox Imaging, which Google acquired last June, and was optimized for geospatial data and computer vision code libraries. Of course, open sourcing MR4C presents the opportunity to open up this capability to a broader range of users, either those working in fields dominated by C libraries or those who just don’t like or aren’t comfortable writing programs in Java. When Google announced its open-source Kubernetes container-management system last year, it was quickly ported from Google Compute Engine to run in several other environments.

It will be interesting to see how much traction MR4C gets at this point, especially given the surge in interest around Apache Spark. Spark is a faster data-processing framework than MapReduce, already has a lot of interest, and natively supports Scala, Python and Java, although it does not support C/C++.

The future of Hadoop and big data processing will certainly be a big topic of conversation at our Structure Data conference next month in New York, which features Google VP of infrastructure Eric Brewer, Spark co-creator (and Databricks CEO) Ion Stoica and the CEOs of all three major Hadoop vendors.

BlueTalon raises $5M, sets sights on securing Hadoop

BlueTalon, a database security startup that launched last year around the goal of enabling secure data collaboration, has shifted its focus to Hadoop and is set to release its first product. The company has also raised an additional $5 million in venture capital, it announced on Wednesday, from Signia Venture Partners, Biosys Capital, Bloomberg Beta, Stanford-StartX Fund, Divergent Ventures, Berggruen Holdings and seed investor Data Collective.

Eric Tilenius, BlueTalon’s CEO, told me during an interview that the company decided to pivot and focus on Hadoop because it kept running into a “gaping hole” while speaking with potential customers. They had a pressing issue of how to put more data into Hadoop so they could take advantage of cheap storage and processing, while not simultaneously opening themselves up to data breaches or regulatory issues. One large company, he said, had a team in place to manually vet every request to access data stored in Hadoop.

Hadoop is an unstoppable force but security is the immovable object. Companies want to do away with the untold millions or billions of dollars in enterprise data warehouse contracts “that shouldn’t and don’t need to be there,” Tilenius said, but there hasn’t yet been a way to do it easily or efficiently from a security standpoint.


BlueTalon’s software works by letting administrators set up policies via a graphical user interface, and then enforces those policies across a range of storage engines. Policies can go down to the cell level, or range from user access to use access. Depending on who’s trying to access what, it might deny access, mask certain content or perhaps issue a token. The software also audits system activity and can let security personnel or administrators see who has been accessing, or trying to access, information.
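
To make the deny, mask and audit behavior described above concrete, here is a minimal Python sketch of a column-level policy check. It is an illustration only, not BlueTalon’s product or API: the `Policy` and `PolicyEngine` names, the role and column fields, and the default-deny rule are all assumptions made for this example, and token issuance is left out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Policy:
    """One hypothetical rule: what a role may do with a column."""
    role: str
    column: str
    action: str  # "allow", "mask" or "deny"


@dataclass
class PolicyEngine:
    """Hypothetical engine that enforces policies and records every attempt."""
    policies: list
    audit_log: list = field(default_factory=list)

    def decide(self, role, column):
        # The first matching policy wins; anything not covered is denied
        # (an assumption of this sketch, not a documented product behavior).
        for policy in self.policies:
            if policy.role == role and policy.column == column:
                return policy.action
        return "deny"

    def read(self, user, role, row, column):
        action = self.decide(role, column)
        # Log the attempt before enforcing, so denied requests are audited too.
        self.audit_log.append((user, column, action))
        if action == "deny":
            raise PermissionError(f"{user} may not read {column}")
        if action == "mask":
            return "*" * len(str(row[column]))
        return row[column]


engine = PolicyEngine([
    Policy("analyst", "name", "allow"),
    Policy("analyst", "ssn", "mask"),
])
row = {"name": "Ada", "ssn": "123-45-6789", "salary": 100}
print(engine.read("alice", "analyst", row, "name"))
print(engine.read("alice", "analyst", row, "ssn"))
```

In this sketch a request for the `salary` column would raise `PermissionError`, because no policy covers it, and the attempt would still appear in `audit_log`. That mirrors the idea of letting administrators see who has been trying to access what.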

The BlueTalon Policy Engine, as the product is called, will be released within the next month and will support the Hive, Impala, and ODBC and JDBC environments. The company hopes to add support for the Hadoop Distributed File System by the end of the second quarter.

BlueTalon is hardly the first company or technology to tackle security and data governance in Hadoop, but Tilenius thinks his company has the easiest and most fine-grained approach available. There are numerous open-source projects and both Cloudera and Hortonworks have made security acquisitions recently, but much of that work is focused around higher-level policies or encryption.

However they choose to get it, though, it’s undeniable that companies — especially large ones — are going to want some sort of improved security for Hadoop. We’ll be sure to ask more about it at our Structure Data conference next month, which features technology leaders from Goldman Sachs, Lockheed Martin and more, as well as executives from across the big data landscape.

Update: This post was corrected at 11:21 a.m. PT to correct the day on which BlueTalon announced its funding.

Cloudera claims more than $100M in revenue in 2014

Hadoop vendor Cloudera announced on Tuesday that the company’s “[p]reliminary unaudited total revenue surpassed $100 million” in 2014. That the company, which is still privately held, would choose to disclose even that much information about its finances speaks to the fast maturation, growing competition and big egos in the Hadoop space.

While $100 million is a nice, round benchmark number, the number by itself doesn’t mean much of anything. We still don’t know how much profit Cloudera made last year or, more likely, how big of a loss it sustained. What we do know, however, is that it earned more than bitter rivals Hortonworks (it claimed $33.4 million through the first nine months of 2014, and will release its first official earnings report next week) and probably MapR (I’ve reached out to MapR about this and will update this if I’m wrong). However, Cloudera claims 525 customers are paying for its software (an 85 percent improvement since 2013), while MapR in December claimed more than 700 paying customers.

Cloudera also did about as much business as EMC-VMware spinoff Pivotal claims its big data business did in 2014. On Tuesday, Pivotal open sourced much of its Hadoop and database technology, and teamed up with Hortonworks and a bunch of software vendors large and small to form a new Hadoop alliance called the Open Data Platform. Cloudera’s Mike Olson, the company’s chief strategy officer and founding CEO, called the move, essentially, disingenuous and more an attempt to save Pivotal’s business than a real attempt to advance open source Hadoop software.

Hortonworks CEO Rob Bearden at Structure Data 2014.

All of this grandstanding and positioning is part of a quest to secure business in a Hadoop market that analysts predict will be worth billions in the years to come, and also an attempt by each company to prove to potential investors that its business model is the best. Hortonworks surprised a lot of people by going public in December, and the stock has remained stable since then (although its share price dropped more than two percent on Tuesday despite the news with Pivotal). Many people suspect Cloudera and MapR will go public this year, and Pivotal at some point as well.

This much action should make for an entertaining and informative Structure Data conference, which is now less than a month away. We’ll have the CEOs of Cloudera, Hortonworks and MapR all on stage talking about the business of big data, as well as the CEO of Apache Spark startup Databricks, which might prove to be a great partner for Hadoop vendors as well as a thorn in their sides. Big users, including Goldman Sachs, ESPN and Lockheed Martin, will also be talking about the technologies and objectives driving their big data efforts.

Pivotal open sources its Hadoop and Greenplum tech, and then some

Pivotal, the cloud computing and big data company that spun out from EMC and VMware in 2013, is open sourcing its entire portfolio of big data technologies and is teaming up with Hortonworks, IBM, GE, and several other companies on a Hadoop effort called the Open Data Platform.

Rumors about the fate of the company’s data business have been circulating since a round of layoffs began in November, but, according to Pivotal, the situation isn’t as dire as some initial reports suggested.

There is a lot of information coming out of the company about this, but here are the key parts:

  • Pivotal is still selling licenses and support for its Greenplum, HAWQ and GemFire database products, but it is also releasing the core code bases for those technologies as open source.
  • Pivotal is still offering its own Hadoop distribution, Pivotal HD, but has slowed development on core components of MapReduce, YARN, Ambari and the Hadoop Distributed File System. Those four pieces are the starting point for a new association called the Open Data Platform, which includes Pivotal, GE, Hortonworks, IBM, Infosys, SAS, Altiscale, EMC, Verizon Enterprise Solutions, VMware, Teradata and “a large international telecommunications firm,” and which promises to build its Hadoop technologies using a standard core of code.
  • Pivotal is working with Hortonworks to make Pivotal’s big data technologies run on the Hortonworks Data Platform, and eventually on the Open Data Platform core. Pivotal will continue offering enterprise support for Pivotal HD, although it will outsource to Hortonworks support requests involving the guts of Hadoop (e.g., MapReduce and HDFS).

Sunny Madra, vice president of the data and mobile product group at Pivotal, said the company has a relatively successful big data business already — $100 million overall, $40 million of which came from the Big Data Suite license bundle it announced last year — but suggested that it sees the writing on the wall. Open source software is a huge industry trend, and he thinks pushing against it is as fruitless as pushing against cloud computing several years ago.

“We’re starting to see open source pop up as an RFP within enterprises,” he said. “. . . If you’re picking software [today] . . . you’d look to open source.”


The Pivotal Big Data Suite.

Madra pointed to Pivotal’s revenue numbers as proof the company didn’t open source its software because no one wanted to pay for it. “We wouldn’t have a $100 million business . . . if we couldn’t sell this,” he said. Maybe, but maybe not: Hortonworks isn’t doing $100 million a year, but word was that Cloudera was doing it years ago (on Tuesday, Cloudera did claim more than $100 million in revenue in 2014). Depending how one defines “big data,” companies like Microsoft and Oracle are probably making much more money.

However, there were some layoffs late last year, which Madra attributed to consolidation of people, offices and efforts rather than a failing business. Pivotal wanted to close some global offices and bring the data team and Cloud Foundry teams under the same leadership, and to focus its development resources on its own intellectual property around Hadoop. “Do we really need a team going and testing our own distribution?” he asked, along with troubleshooting it, certifying it against technologies and all that goes along with that.

EMC first launched the Pivotal HD Hadoop distribution, as well as the HAWQ SQL-on-Hadoop engine, with much ado just over two years ago.

The deal with Hortonworks helps alleviate that engineering burden in the short term, and the Open Data Platform is supposed to help solve it over a longer period. Madra explained the goal of the organization as Linux-like, meaning that customers should be able to switch from one Hadoop distribution to the next and know the kernel will be the same, just like they do with the various flavors of the Linux operating system.

Mike Olson, Cloudera’s chief strategy officer and founding CEO, offered a harsh rebuttal to the Open Data Platform in a blog post on Tuesday, questioning the utility and politics of vendor-led consortia like this. He simultaneously praised Hortonworks for its commitment to open source Hadoop and bashed Pivotal on the same issue, but wrote, among other things, of the Open Data Platform: “The Pivotal and Hortonworks alliance, notwithstanding the marketing, is antithetical to the open source model and the Apache way.”

The Pivotal HD and HAWQ architecture

Much of this has been open sourced or replaced.

As part of Pivotal’s Tuesday news, the company also announced additions to its Big Data Suite package, including the Redis key-value store, RabbitMQ messaging queue and Spring XD data pipeline framework, as well as the ability to run the various components on the company’s Cloud Foundry platform. Madra actually attributes a lot of Pivotal’s decision to open source its data technologies, as well as its execution, to the relative success the company has had with Cloud Foundry, which has always involved an open source foundation as well as a commercial offering.

“Had we not had the learnings that we had in Cloud Foundry, then I think it would have been a lot more challenging,” he said.

Whether or not one believes Pivotal’s spin on the situation, though, the company is right in realizing that it’s open source or bust in the big data space right now. They have different philosophies and strategies around it, but major Hadoop vendors Cloudera, Hortonworks and MapR are all largely focused on open-source technology. The most popular Hadoop-ecosystem technologies, including Spark, Storm and Kafka, are open source, as well. (CEOs, founders and creators from many of these companies and projects will be speaking at our Structure Data conference next month in New York.)

Pivotal might eventually sell billions of dollars worth of software licenses for its suite of big data products — there’s certainly a good story there if it can align the big data and Cloud Foundry businesses into a cohesive platform — but it probably has reached its plateau without having an open source story to tell.

Update: This post was updated at 12:22 p.m. PT to add information about Cloudera’s revenue.