Airbnb open sources SQL tool built on Facebook’s Presto database

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

He shared the example of folks at Airbnb tasked with determining the effectiveness of the automated emails the company sends out when someone books a room, resets a password or takes any of a number of other actions. Data scientists used to have to dive into Hive — the SQL-like data warehouse framework for Hadoop that Facebook open sourced in 2008 — to answer that type of question, which meant slow turnaround times because of human and technological factors. Now, lots of employees can access that same data via Airpal in just minutes, he said.

The Airpal user interface.

As cool as Airpal might be for Airbnb users, though, it really owes its existence to Presto. Back when everyone was using Hive for data analysis inside Hadoop — it was and continues to be widely used within web companies — only 10 to 15 people within Airbnb understood the data and could write queries using its somewhat complicated version of SQL. Because Hive is based on MapReduce, the batch-processing engine most commonly associated with Hadoop, Hive is also slow (although new improvements have increased its speed drastically).

Airbnb also used Amazon’s Redshift cloud data warehouse for a while, said software engineer Andy Kramolisch, and while it was fast, it wasn’t as user-friendly as the company would have liked. It also required replicating data from Hive, meaning more work for Airbnb and more data for the company to manage. (If you want to hear more about all this Hadoop and big data stuff from leaders at Google, Cloudera and elsewhere, come to our Structure Data conference March 18-19 in New York.)

A couple years ago, Facebook created and then open sourced Presto as a means to solve Hive’s speed problems. It still accesses data from Hive, but is designed to deliver results at interactive speeds rather than in minutes or, depending on the query, much longer. It also uses standard ANSI SQL, which Kramolisch said is easier to learn than the Hive Query Language and its “lots of hidden gotchas.”
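Presto’s use of standard ANSI SQL is easy to illustrate. The query below is the kind of aggregate an Airpal user might run; as a stand-in for a Presto cluster, this sketch executes it with Python’s built-in sqlite3, and the `bookings` table and its columns are invented for illustration.

```python
import sqlite3

# Hypothetical bookings table standing in for a Hive/Presto-backed dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (city TEXT, nights INTEGER)")
conn.executemany("INSERT INTO bookings VALUES (?, ?)",
                 [("Paris", 3), ("Paris", 2), ("Tokyo", 4)])

# Standard ANSI SQL, the dialect Presto exposes, with no
# engine-specific extensions required.
rows = conn.execute("""
    SELECT city, COUNT(*) AS stays, SUM(nights) AS total_nights
    FROM bookings
    GROUP BY city
    ORDER BY total_nights DESC
""").fetchall()
print(rows)  # [('Paris', 2, 5), ('Tokyo', 1, 4)]
```

The same statement would run unchanged against any ANSI-compliant engine, which is the portability point Kramolisch was making about HiveQL’s gotchas.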

Still, Mayfield noted, it’s not as if everyone inside Airbnb, or any company, is going to be running SQL queries using Airpal — no matter how easy the tooling gets. In those cases, he said, the company tries to provide dashboards, visualizations and other tools to help employees make sense of the data they need to understand.

“I think it would be rad if the CEO was writing SQL queries,” he said, “but …”

Cloudera CEO declares victory over big data competition

Cloudera CEO Tom Reilly doesn’t often mince words when it comes to describing his competition in the Hadoop space, or Cloudera’s position among those other companies. In October 2013, Reilly told me he didn’t consider Hortonworks or MapR to be Cloudera’s real competition, but rather larger data-management companies such as IBM and EMC-VMware spinoff Pivotal. And now, Reilly says, “We declare victory over at least one of our competitors.”

He was referring to Pivotal, and the Open Data Platform, or ODP, alliance it helped launch a couple weeks ago along with Hortonworks, IBM, Teradata and several other big data vendors. In an interview last week, Reilly called that alliance “a ruse and, frankly, a graceful exit for Pivotal,” which laid off a number of employees working on its Hadoop distribution and is now outsourcing most of its core Hadoop development and support to Hortonworks.

You can read more from Reilly below, including his takes on Hortonworks, Hadoop revenues and Spark, as well as some expanded thoughts on the ODP. For more information about the Open Data Platform from the perspectives of the members, you can read our coverage of its launch in mid-February as well as my subsequent interview with Hortonworks CEO Rob Bearden, who explains in some detail how that alliance will work.

If you want to hear about the fast-changing, highly competitive and multi-billion-dollar business of big data straight from the horses’ mouths, make sure to attend our Structure Data conference March 18 and 19 in New York. Speakers include Cloudera’s Reilly and Hortonworks’ Bearden, as well as MapR CEO John Schroeder, Databricks CEO (and Spark co-creator) Ion Stoica, and other big data executives and users, including those from large firms such as Lockheed Martin and Goldman Sachs.


You down with ODP? No, not me

While Hortonworks explains the Open Data Platform essentially as a way for member companies to build on top of Hadoop without, I guess, formally paying Hortonworks for support or embracing its entire Hadoop distribution, Reilly describes it as little more than a marketing ploy. Aside from calling it a graceful exit for Pivotal (and, arguably, IBM), he takes issue with even calling it “open.” If the ODP were truly open, he said, companies wouldn’t have to pay for membership, Cloudera would have been invited and, when it asked about the alliance, it wouldn’t have been required to sign a non-disclosure agreement.

What’s more, Reilly isn’t certain why the ODP is really necessary technologically. It’s presently composed of four of the most mature Hadoop components, he explained, and a lot of companies are actually trying to move off of MapReduce (to Spark or other processing engines) and, in some cases, even the Hadoop Distributed File System. Hortonworks, which supplied the ODP core and presumably will handle much of the future engineering work, will be stuck doing the other members’ bidding as they decide which of several viable SQL engines and other components to include, he added.

“I don’t think we could have scripted [the Open Data Platform news] any better,” Reilly said. He added, “[T]he formation of the ODP … is a big shift in the landscape. We think it’s a shift to our advantage.”

(If you want a possibly more nuanced take on the ODP, check out this blog post by Altiscale CEO Raymie Stata. Altiscale is an ODP member, but Stata has been involved with the Apache Software Foundation and Hadoop since his days as Yahoo CTO and is a generally trustworthy source on the space.)

Hortonworks CEO Rob Bearden at Structure Data 2014.

Really, Hortonworks isn’t a competitor?

Asked about the competitive landscape among Hadoop vendors, Reilly doubled down on his assessment from last October, calling Cloudera’s business model “a much more aggressive play [and] a much bolder vision” than what Hortonworks and MapR are doing. They’re often “submissive” to partners and treat Hadoop like an “add-on” rather than a focal point. If anything, Hortonworks has burdened itself by going public and by signing on to help prop up the legacy technologies that IBM and Pivotal are trying to sell, Reilly said.

Still, he added, Cloudera’s “enterprise data hub” strategy is more akin to the IBM and Pivotal business models of trying to become the centerpiece of customers’ data architectures by selling databases, analytics software and other components besides just Hadoop.

If you don’t buy that logic, Reilly has another argument that boils down to money. Cloudera earned more than $100 million last year (that’s GAAP revenue, he confirmed), while Hortonworks earned $46 million and, he suggested, MapR likely earned a similar number. Combine that with Cloudera’s huge investment from Intel in 2014 — it’s now “the largest privately funded enterprise software company in history,” Reilly said — and Cloudera owns the Hadoop space.

“We intend to take advantage” of this war chest to acquire companies and invest in new products, Reilly said. And although he wouldn’t get into specifics, he noted, “There’s no shortage of areas to look in.”

Diane Bryant, senior vice president and general manager of Intel’s Data Center Group, at Structure 2014.

The future is in applications

Reilly said that more than 60 percent of Cloudera sales are now “enterprise data hub” deployments, which is his way of saying its customers are becoming more cognizant of Hadoop as an application platform rather than just a tool. Yes, it can still store lots of data and transform it into something SQL databases can read, but customers are now building new applications for things like customer churn and network optimization with Hadoop as the core. Between 15 and 20 financial services companies are using Cloudera to power money-laundering detection, he said, and Cloudera has trained its salesforce on a handful of the most popular use cases.

One of the technologies helping make Hadoop look a lot better for new application types is Spark, which simplifies the programming of data-processing jobs and runs them a lot faster than MapReduce does. Thanks to the YARN cluster-management framework, users can store data in Hadoop and process it using Spark, MapReduce and other processing engines. Reilly reiterated Cloudera’s big investment and big bet on Spark, saying that he expects a lot of workloads will eventually run on it.
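The difference in programming models is easy to sketch. The first version below mimics MapReduce’s explicit map, shuffle and reduce phases; the second collapses the same word count into the kind of short transformation chain Spark encourages. Both are plain Python, not actual Hadoop or Spark code, and the sample lines are invented.

```python
from collections import Counter
from functools import reduce
from itertools import groupby

lines = ["to be or not to be", "to do is to be"]

# MapReduce style: emit (word, 1) pairs, shuffle (sort/group) by key, then reduce.
mapped = [(w, 1) for line in lines for w in line.split()]
shuffled = groupby(sorted(mapped), key=lambda kv: kv[0])
counts_mr = {k: reduce(lambda acc, kv: acc + kv[1], group, 0)
             for k, group in shuffled}

# Spark style: the same job reads as one chain of transformations, roughly
# textFile(...).flatMap(split).map(lambda w: (w, 1)).reduceByKey(add).
counts_spark = Counter(w for line in lines for w in line.split())

print(counts_mr["to"])  # 4
```

Both produce identical counts; the point is how much ceremony the map/shuffle/reduce decomposition adds for even a trivial job.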

Databricks CEO (and AMPLab co-director) Ion Stoica.

A year into the Intel deal and …

“It is a tremendous partnership,” Reilly said.

Intel has been integral in helping Cloudera form partnerships with companies such as Microsoft and EMC, as well as with customers such as MasterCard, he said. The latter deal is particularly interesting because Cloudera and Intel’s joint engineering on hardware-based encryption helped Cloudera deploy a PCI-compliant Hadoop cluster, and MasterCard is now out pushing that system to its own clients via its MasterCard Advisors professional services arm.

Reilly added that Cloudera and Intel are also working together on new chips designed specifically for analytic workloads, which will take advantage of non-RAM memory types.

Asked whether Cloudera’s push to deploy more workloads in cloud environments is at odds with Intel’s goal to sell more chips, Reilly pointed to Intel’s recent strategy of designing chips especially for cloud computing environments. The company is operating under the assumption that data has gravity and that certain data that originates in the cloud, such as internet-of-things or sensor data, will stay there, while large enterprises will continue to store a large portion of their data locally.

Wherever they run, Reilly said, “[Intel] just wants more workloads.”

Russia’s Runa Capital invests $3.4M in database firm MariaDB

Moscow-based Runa Capital has invested €3 million ($3.4 million) in MariaDB, the open-source database company that offers what began as a MySQL fork (Google and Wikipedia are big-name users). Runa, which is headed up by founders of Acronis and Parallels, is already a backer of the Nginx web server and platform-as-a-service outfit Jelastic. In a statement, MariaDB CEO Patrik Sallner said his firm was looking forward to collaborating with Runa and its other open-source portfolio companies in its enterprise push.

Hortonworks did $12.7M in Q4, on its path to a billion, CEO says

Hadoop vendor Hortonworks announced its first quarterly earnings as a publicly held company Tuesday, claiming $12.7 million in fourth-quarter revenue and $46 million in revenue during fiscal year 2014. The numbers represent 55 percent quarter-over-quarter and 91 percent year-over-year increases, respectively. The company had a net loss of $90.6 million in the fourth quarter and $177.3 million for the year.

However, Hortonworks contends that revenue is not the most important number in assessing its business. Rather, as CEO Rob Bearden explained around the time the company filed its S-1 pre-IPO statement in November, Hortonworks thinks its total billings are a more accurate representation of its health. That’s because the company relies fairly heavily on professional services, meaning it often doesn’t get paid until a job is done.

The company’s billings in the fourth quarter totaled $31.9 million, a 148 percent year-over-year increase. Its fiscal year billings were $87.1 million, a 134 percent increase over 2013.
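As a quick sanity check on those growth rates: a 148 percent increase means the new figure is 2.48 times the old one, which lets us back out the implied 2013 billings from the numbers the company reported.

```python
# Reported figures from the article, in millions of dollars.
q4_billings = 31.9   # Q4, up 148% year over year -> 2.48x the prior Q4
fy_billings = 87.1   # fiscal year, up 134% over 2013 -> 2.34x the prior year

q4_prior = q4_billings / 2.48
fy_prior = fy_billings / 2.34

print(round(q4_prior, 1))  # ~12.9 (implied Q4 2013 billings, $M)
print(round(fy_prior, 1))  # ~37.2 (implied FY 2013 billings, $M)
```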

If you buy Bearden’s take on the importance of billings over revenue, then Hortonworks looks a lot more comparable in size to its largest rival, Cloudera. Last week, Cloudera announced more than $100 million in revenue in 2014, as well as an 85 percent increase in subscription software customers, to 525 in total.

Hortonworks, for its part, added 99 customers paying for enterprise support of its Hadoop platform in the fourth quarter alone, bringing its total to 332. Among those customers are Expedia, Macy’s, Blackberry and Spotify, all four of which moved directly to Hortonworks from Cloudera, a Hortonworks spokesperson said.

There are, however, some key differences between the Hortonworks and Cloudera business models, as well as that of fellow vendor MapR, that affect how comparable any of these metrics really are. While Hortonworks is focused on free open source software and relies on support contracts for revenue, Cloudera and MapR offer both free Hadoop distributions as well as more feature-rich paid versions. In late 2013, Cloudera CEO Tom Reilly told me his company was interested in securing big deployments rather than chasing cheap support contracts.

Rob Bearden at Structure Data 2014

I had a broad discussion with Bearden last week about the Hadoop market and some of Hortonworks’ recent moves in that space, including the somewhat-controversial Open Data Platform alliance it helped to create along with Pivotal, IBM, GE and others. Here are the highlights from that interview. (If you want to hear more from Bearden and perhaps ask him some of your own questions, make sure to attend our Structure Data conference March 18 and 19 in New York. Other notable Hadoop-market speakers include Cloudera CEO Tom Reilly, MapR CEO John Schroeder and Databricks CEO (and Spark co-creator) Ion Stoica.)

Explain the rationale behind the Open Data Platform

Bearden wouldn’t comment specifically on criticisms — made most loudly by Cloudera’s Mike Olson and Doug Cutting, as well as some industry analysts — that the Open Data Platform, or ODP, is somehow antithetical to open source or the Apache Software Foundation. “What I would say,” he noted, “is the people who are committed to true open source and an open platform for the community are pretty excited about the thing.”

He also chalked up a lot of the criticism of the ODP to misunderstanding about how it really will work in practice. “One of the things I don’t think is very clear on the Open Data Platform alliance is that we’re actually going to provide what we’ll refer to as the core for that alliance, that is based on core Hadoop — so HDFS and YARN and Ambari,” Bearden explained. “We’re providing that, which is obviously directly from Apache, and it’s the exact same bit line that [the Hortonworks Data Platform] is based on.”

Pivotal CEO Paul Maritz at Structure Data 2014.

So, the core Hadoop distribution that platform members will use is based on Apache code, and anything that ODP members want to add on top of it will also have to go through Apache. These could be existing Apache projects, or they could be new projects the members decide to start on their own, Bearden said.

“We’re actually strengthening the position of the Apache Software Foundation,” he said. He added later in the interview, on the same point, that people shouldn’t view the ODP as much different than they view Hortonworks (or, in many respects, Cloudera or MapR). “[The Apache Software Foundation] is the engineering arm,” he said, “and this entity will become the productization and packaging arm for [Apache].”

So, it’s Cloudera vs. MapR vs. Hortonworks et al?

I asked Bearden whether the formation of the ODP officially makes the Hadoop market a case of Cloudera and MapR versus the Hortonworks ecosystem. That seems like the case to me, considering that the ODP is essentially providing the core for a handful of potentially big players in the Hadoop space. And even if they’re not ODP members, companies such as Microsoft and Rackspace have built their Hadoop products largely on top of the Hortonworks platform and with its help.

Bearden wouldn’t bite. At least not yet.

“I wouldn’t say it’s the other guys versus all of us,” he said. “I would say what’s happened is the community has realized this is what they want and it fits in our model that we’re driving very cleanly. . . . And we’re not doing anything up the stack to try and disintermediate them, and we de-risk it because we’re all open.”

The “this” he’s referring to is the ability of its partners to stop spending resources keeping up with the core Hadoop technology and instead focus on how they can monetize their own intellectual property. “To do that, the more data they put under management, the faster and the more-stable and enterprise-viable [the platform on which they have that data], the faster they monetize and the bigger they monetize the rest of their platform,” Bearden said.

Microsoft CEO Satya Nadella speaks at a Microsoft cloud event. Photo by Jonathan Vanian/Gigaom

Are you standing by your prediction of a billion-dollar company?

“I am not backing off that at all,” Bearden said, in reference to his prediction at Structure Data last year that Hadoop will soon become a multi-billion-dollar market and Hortonworks will be a billion-dollar company in terms of revenue. He said it’s fair to look at revenue alone in assessing the businesses in this space, but it’s not the be-all, end-all.

“It’s less about [pure money] and more about what is the ecosystem doing to really start adopting this,” he said. “Are they trying to fight it and reject it, or are they really starting to embrace it and pull it through? Same with the big customers. . . .
“When those things are happening, the money shows up. It just does.”

Hadoop is actually just a part — albeit a big one — of a major evolution in the data-infrastructure space, he explained. And as companies start replacing the pieces of their data environments, they’ll do so with the open source options that now dominate new technologies. These include Hadoop, NoSQL databases, Storm, Kafka, Spark and the like.

In fact, Bearden said, “Open source companies can be very successful in terms of revenue growth and in terms of profitability faster than the old proprietary platforms got there.”

Time will tell.

Update: This post was updated at 8:39 p.m. PT to correct the amount of Hortonworks’ fourth quarter revenue and losses. Revenue was $12.7 million, not $12.5 million as originally reported, and losses were $90.6 million for the quarter and $177.3 million for the year. The originally reported numbers were for gross loss.

Advertising analytics platform Metamarkets raises another $15M

Metamarkets, a San Francisco startup providing streaming analytics to advertisers, has raised a $15 million round of venture capital. Data Collective led the round, which also included John Battelle and City National Bank and existing investors Khosla Ventures, IA Ventures, True Ventures and Village Ventures. Metamarkets (see disclosure) has now raised more than $43 million since launching in 2010.

While the company’s product, like most software as a service, is tied to a rather specific use case, its technology is not. To many engineers working on big data systems, Metamarkets is also known as the creator of Druid, an open source data store the company created in order to handle the speed and scale its analytics platform requires. Last week, Metamarkets announced that Druid is now available under the permissive Apache 2 open source license.

Metamarkets CEO Mike Driscoll has a clear view of where the software world is headed, and he thinks his company is one of many helping take it there. Essentially, it’s a world where applications are delivered as cloud services, and infrastructure technologies become commodities, often created and then open sourced by the companies building those applications.

If you look at the technology landscape today, it’s hard to argue with that premise. Large companies such as Google, Facebook and Yahoo have created a large number of today’s popular data analysis, data storage, programming and other technologies, and now Metamarkets, Airbnb, Twitter and others are getting in on the act. It’s not wholly inconceivable that future developers, even at the enterprise level, will be able to find anything they need as open source, meaning money only has to change hands for services, support and, of course, applications.

Disclosure: Metamarkets is a portfolio company of True Ventures, which is also an investor in Gigaom.

For now, Spark looks like the future of big data

Titles can be misleading. For example, the O’Reilly Strata + Hadoop World conference took place in San Jose, California, this week, but Hadoop wasn’t the star of the show. Based on the news I saw coming out of the event, it’s another Apache project — Spark — that has people excited.

There was, of course, some big Hadoop news this week. Pivotal announced it’s open sourcing its big data technology and essentially building its Hadoop business on top of the Hortonworks platform. Cloudera announced it earned $100 million in 2014. Lost in the grandstanding was MapR, which announced something potentially compelling in the form of cross-data-center replication for its MapR-DB technology.

But pretty much everywhere else you looked, it was technology companies lining up to support Spark: Databricks (naturally), Intel, Altiscale, MemSQL, Qubole and ZoomData among them.

Spark isn’t inherently competitive with Hadoop — in fact, it was designed to work with Hadoop’s file system and is a major focus of every Hadoop vendor at this point — but it kind of is. Spark is known primarily as an in-memory data-processing framework that’s faster and easier than MapReduce, but it’s actually a lot more. Among the other projects included under the Spark banner are file system, machine learning, stream processing, NoSQL and interactive SQL technologies.

The Spark platform, minus the Tachyon file system and some younger related projects.

In the near term, Hadoop will probably be what pulls Spark into the mainstream, because Hadoop remains, at the least, a cheap, trusted big data storage platform. And with Spark still relatively immature, it’s hard to see too many companies ditching Hadoop MapReduce, Hive or Impala for their big data workloads quite yet. Wait a few years, though, and we might start seeing some more tension between the two platforms, or at least an evolution in how they relate to each other.

This will be especially true if there’s a big breakthrough in RAM technology or prices drop to a level that’s more comparable to disk. Or if Databricks can convince companies they want to run their workloads in its nascent all-Spark cloud environment.

Attendees at our Structure Data conference next month in New York can ask Spark co-creator and Databricks CEO Ion Stoica all about it — what Spark is, why Spark is and where it’s headed. Coincidentally, Spark Summit East is taking place the exact same days in New York, where folks can dive into the nitty gritty of working with the platform.

There were also a few other interesting announcements this week that had nothing to do with Spark, but are worth noting here:

  • Microsoft added Linux support for its HDInsight Hadoop cloud service, and Python and R programming language support for its Azure ML cloud service. The latter also now lets users deploy deep neural networks with a few clicks. For more on that, check out the podcast interview with Microsoft Corporate Vice President of Machine Learning (and Structure Data speaker) Joseph Sirosh embedded below.
  • HP likes R, too. It announced a product called HP Haven Predictive Analytics that’s powered by a distributed version of R developed by HP Labs. I’ve rarely heard HP and data science in the same sentence before, but at least it’s trying.
  • Oracle announced a new analytic tool for Hadoop called Big Data Discovery. It looks like a cross between Platfora and Tableau, and I imagine will be used primarily by companies that already purchase Hadoop in appliance form from Oracle. The rest will probably keep using Platfora and Tableau.
  • Salesforce.com furthered its newfound business intelligence platform with a handful of features designed to make the product easier to use on mobile devices. I’m generally skeptical of Salesforce’s prospects in terms of stealing any non-Salesforce-related analytics from Tableau, Microsoft, Qlik or anyone else, but the mobile angle is compelling. The company claims more than half of user engagement with the platform is via mobile device, which its Director of Product Marketing Anna Rosenman explained to me as “a really positive testament that we have been able to replicate a consumer interaction model.”

If I missed anything else that happened this week, or if I’m way off base in my take on Hadoop and Spark, please share in the comments.

[soundcloud url=”https://api.soundcloud.com/tracks/191875439″ params=”color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]

The Druid real-time database moves to an Apache license

Druid, an open source database designed for real-time analysis, is moving to the Apache 2 software license in order to hopefully spur more use of and innovation around the project. It was open sourced in late 2012 under the GPL license, which is generally considered more restrictive than the Apache license in terms of how software can be reused.

Druid was created by advertising analytics startup Metamarkets (see disclosure) and is used by numerous large web companies, including eBay, Netflix, PayPal, Time Warner Cable and Yahoo. Because of the nature of Metamarkets’ business, Druid requires data to include a timestamp and is probably best described as a time-series database. It’s designed to ingest terabytes of data per hour and is often used for things such as analyzing user or network activity over time.
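To make the time-series idea concrete, here is a minimal sketch, in plain Python rather than Druid itself, of the kind of timestamp-keyed hourly rollup a store like Druid performs at ingest. The event tuples and metric names are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Every row carries a timestamp; a rollup pre-aggregates raw events
# into coarser time buckets (here, hours) as they are ingested.
events = [
    ("2015-03-01T10:05:00", "page_view", 1),
    ("2015-03-01T10:40:00", "page_view", 1),
    ("2015-03-01T11:02:00", "page_view", 1),
]

rollup = defaultdict(int)
for ts, metric, value in events:
    hour = datetime.fromisoformat(ts).replace(minute=0, second=0)
    rollup[(hour.isoformat(), metric)] += value

print(rollup[("2015-03-01T10:00:00", "page_view")])  # 2
```

Queries over "user or network activity over time" then scan these pre-aggregated buckets instead of the raw events, which is where much of the speed comes from.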

Mike Driscoll, Metamarkets’ co-founder and CEO, is confident now is the time for open source tools to really catch on — even more so than they already have in the form of Hadoop and various NoSQL data stores — because of the ubiquity of software as a service and the emergence of new resource managers such as Apache Mesos. In the former case, open source technologies underpin multiuser applications that require a high degree of scale and flexibility on the infrastructure level, while in the latter case databases like Druid are just delivered as a service internally from a company’s pool of resources.

However it happens, Driscoll said, “I don’t think proprietary databases have long for this world.”

Disclosure: Metamarkets is a portfolio company of True Ventures, which is also an investor in Gigaom.

Google open sources a MapReduce framework for C/C++

Google announced on Wednesday that the company is open sourcing a MapReduce framework that will let users run native C and C++ code in their Hadoop environments. Depending on how much traction MapReduce for C, or MR4C, gets and by whom, it could turn out to be a pretty big deal.

Hadoop is famously, or infamously, written in Java and as such can suffer from performance issues compared with native C++ code. That’s why Google’s original MapReduce system was written in C++, as is the Quantcast File System, that company’s homegrown alternative to the Hadoop Distributed File System. And, as the blog post announcing MR4C notes, “many software companies that deal with large datasets have built proprietary systems to execute native code in MapReduce frameworks.”

This is the same sort of rationale behind Facebook’s HipHop efforts and database startup MemSQL, whose system converts SQL to C++ before executing it.


MR4C was developed by satellite imagery company Skybox Imaging, which Google acquired last June, and was optimized for geospatial data and computer vision code libraries. Of course, open sourcing MR4C presents the opportunity to open up this capability to a broader range of users, either those working in fields dominated by C libraries or those who just don’t like or aren’t comfortable writing programs in Java. When Google announced its open-source Kubernetes container-management system last year, it was quickly ported from Google Compute Engine to run in several other environments.

It will be interesting to see how much traction MR4C gets at this point, especially given the surge in interest around Apache Spark. Spark is a faster data-processing framework than MapReduce, already has a lot of interest, and natively supports Scala, Python and Java, although it does not support C/C++.

The future of Hadoop and big data processing will certainly be a big topic of conversation at our Structure Data conference next month in New York, which features Google VP of infrastructure Eric Brewer, Spark co-creator (and Databricks CEO) Ion Stoica and the CEOs of all three major Hadoop vendors.

BlueTalon raises $5M, sets sights on securing Hadoop

BlueTalon, a database security startup that launched last year around the goal of enabling secure data collaboration, has shifted its focus to Hadoop and is set to release its first product. The company has also raised an additional $5 million in venture capital, it announced on Wednesday, from Signia Venture Partners, Biosys Capital, Bloomberg Beta, Stanford-StartX Fund, Divergent Ventures, Berggruen Holdings and seed investor Data Collective.

Eric Tilenius, BlueTalon’s CEO, told me during an interview that the company decided to pivot and focus on Hadoop because it kept running into a “gaping hole” while speaking with potential customers. They had a pressing issue of how to put more data into Hadoop so they could take advantage of cheap storage and processing, while not simultaneously opening themselves up to data breaches or regulatory issues. One large company, he said, had a team in place to manually vet every request to access data stored in Hadoop.

Hadoop is an unstoppable force but security is the immovable object. Companies want to do away with the untold millions or billions of dollars in enterprise data warehouse contracts “that shouldn’t and don’t need to be there,” Tilenius said, but there hasn’t yet been a way to do it easily or efficiently from a security standpoint.


BlueTalon’s software works by letting administrators set up policies via a graphical user interface, and then enforces those policies across a range of storage engines. Policies can go down to the cell level, and can govern both who may access data and how that data may be used. Depending on who’s trying to access what, the software might deny access, mask certain content or perhaps issue a token. It also audits system activity and can let security personnel or administrators see who has been accessing, or trying to access, information.
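The deny/mask behavior described above can be sketched in a few lines. This is not BlueTalon’s API (the policy table, roles and field names are all hypothetical), just an illustration of role-based, field-level enforcement:

```python
# Hypothetical policies: each (role, field) pair maps to an action.
POLICIES = {
    ("analyst", "ssn"): "mask",
    ("analyst", "email"): "allow",
    ("intern", "ssn"): "deny",
    ("intern", "email"): "mask",
}

def enforce(role, record):
    """Apply per-field policies to a record for the given role."""
    out = {}
    for field, value in record.items():
        action = POLICIES.get((role, field), "deny")  # default deny
        if action == "allow":
            out[field] = value
        elif action == "mask":
            out[field] = "***"
        # "deny": the field is dropped from the result entirely
    return out

row = {"ssn": "123-45-6789", "email": "guest@example.com"}
print(enforce("analyst", row))  # {'ssn': '***', 'email': 'guest@example.com'}
print(enforce("intern", row))   # {'email': '***'}
```

A real policy engine would sit between the query layer and storage and rewrite or filter results in flight, but the decision logic per cell looks much like this.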

The BlueTalon Policy Engine, as the product is called, will be released within the next month and will support Hive, Impala, and ODBC and JDBC environments. The company hopes to add support for the Hadoop Distributed File System by the end of the second quarter.

BlueTalon is hardly the first company or technology to tackle security and data governance in Hadoop, but Tilenius thinks his company has the easiest and most fine-grained approach available. There are numerous open-source projects, and both Cloudera and Hortonworks have made security acquisitions recently, but much of that work is focused on higher-level policies or encryption.

However they choose to get it, though, it’s undeniable that companies — especially large ones — are going to want some sort of improved security for Hadoop. We’ll be sure to ask more about it at our Structure Data conference next month, which features technology leaders from Goldman Sachs, Lockheed Martin and more, as well as executives from across the big data landscape.

Update: This post was corrected at 11:21 a.m. PT to correct the day on which BlueTalon announced its funding.