Airbnb open sources SQL tool built on Facebook’s Presto engine

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

He shared the example of folks at Airbnb tasked with determining the effectiveness of the automated emails the company sends out when someone books a room, resets a password or takes any of a number of other actions. Data scientists used to have to dive into Hive — the SQL-like data warehouse framework for Hadoop that Facebook open sourced in 2008 — to answer that type of question, which meant slow turnaround times because of human and technological factors. Now, lots of employees can access that same data via Airpal in just minutes, he said.

The Airpal user interface.

As cool as Airpal might be for Airbnb users, though, it really owes its existence to Presto. Back when everyone was using Hive for data analysis inside Hadoop — it was and continues to be widely used within web companies — only 10 to 15 people within Airbnb understood the data and could write queries using Hive’s somewhat complicated version of SQL. And because Hive is based on MapReduce, the batch-processing engine most commonly associated with Hadoop, it is also slow (although recent improvements have increased its speed drastically).

Airbnb also used Amazon’s Redshift cloud data warehouse for a while, said software engineer Andy Kramolisch, and while it was fast, it wasn’t as user-friendly as the company would have liked. It also required replicating data from Hive, meaning more work for Airbnb and more data for the company to manage.

A couple of years ago, Facebook created and then open sourced Presto as a means to solve Hive’s speed problems. It still accesses data from Hive, but it is designed to deliver results at interactive speeds rather than in minutes or, depending on the query, much longer. It also uses standard ANSI SQL, which Kramolisch said is easier to learn than the Hive Query Language, with its “lots of hidden gotchas.”
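
Because Presto accepts standard SQL over an ordinary client connection, any database-style client can submit queries to it directly. The sketch below shows what that looks like in Python using the open source PyHive client; the coordinator host, schema and table names are hypothetical placeholders, and Airpal itself is a web interface on top of Presto rather than a Python client, so this only illustrates the flavor of ANSI SQL Presto accepts.

```python
# A minimal sketch of querying Presto with plain ANSI SQL from Python, using
# the open source PyHive client (assumed installed via `pip install 'pyhive[presto]'`).
# The coordinator host, catalog, schema and table names are hypothetical.
from pyhive import presto

conn = presto.connect(
    host="presto.example.com",   # hypothetical Presto coordinator
    port=8080,
    username="analyst",
    catalog="hive",              # Presto reads tables registered in the Hive metastore
    schema="default",
)

cur = conn.cursor()
# Standard ANSI SQL: date literals, GROUP BY, ORDER BY -- no HiveQL-specific syntax.
cur.execute("""
    SELECT booking_date, count(*) AS bookings
    FROM bookings                            -- hypothetical table
    WHERE booking_date >= date '2015-01-01'
    GROUP BY booking_date
    ORDER BY booking_date
""")

for booking_date, bookings in cur.fetchall():
    print(booking_date, bookings)
```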

Still, Mayfield noted, it’s not as if everyone inside Airbnb, or any company, is going to be running SQL queries using Airpal — no matter how easy the tooling gets. In those cases, he said, the company tries to provide dashboards, visualizations and other tools to help employees make sense of the data they need to understand.

“I think it would be rad if the CEO was writing SQL queries,” he said, “but …”

Hadoop’s new role: Adjunct data warehouse

There was a time, a little over two years ago, when SQL-on-Hadoop was about cracking open access to Hadoop data for those with SQL skill sets and eliminating the exclusive hold that Hadoop/MapReduce specialists had over that data. Yes, some architectural details – like whether the SQL engine was hitting the data nodes in the Hadoop cluster directly – were important too. But, for the most part, solutions in the space were neatly summed up by the name: SQL, on Hadoop.

Today, SQL-on-Hadoop solutions are best judged not by their SQL engines per se, but by the collaborative scenarios they enable between Hadoop and the conventional data warehouse. Hadoop can be seen as a usurper, peer or peripheral of the data warehouse; the SQL-on-Hadoop engine you use determines which one (or more) of these three roles Hadoop can fulfill.

In Gigaom Research’s just-published Sector Roadmap: Hadoop/Data Warehouse Interoperability, analyst George Gilbert investigates the SQL-on-Hadoop market, evaluating six solutions, each along six “disruption vectors” or key trends that will affect the market and players over the next year: schema flexibility, data engine interoperability, pricing model, enterprise manageability, workload role optimization and query engine maturity.

Scenarios

As a backdrop to the evaluation of various SQL-on-Hadoop products along these vectors, Gilbert identifies three key analytics usage scenarios. The first is the core data warehouse, a familiar concept for many tech professionals: a relatively expensive appliance-based database platform serving up highly curated data, with the data’s structure optimized for the kinds of queries the business believes it needs to run.

The second is the so-called “data lake” (called an “enterprise data hub” by some vendors). Here, Hadoop serves as a collecting point for disparate data sources along the full spectrum of unstructured, semi-structured and fully structured data. Hadoop 2.0’s YARN resource manager facilitates the use of a variety of analysis engines to explore the lake’s data in an ad hoc fashion, and the data warehouse is relieved of this responsibility, free to serve the production queries for which it was designed and tuned.

The third scenario Gilbert identifies is one he calls the “adjunct data warehouse,” wherein various data warehouse tasks – including ETL and reporting – are offloaded from the conventional data warehouse to Hadoop. In fact, the adjunct data warehouse can and should be used to perform these functions on data first explored in the data lake.

In effect, the core data warehouse, adjunct data warehouse and data lake constitute a data processing hierarchy, with a corresponding hierarchy of cost. The hierarchical selection of platforms enables tasks of lower production value (though, arguably, higher business value) to be processed on cheaper platforms – yielding much higher efficiency for enterprise organizations.

Enterprise ROI

How much cheaper? Gilbert notes that Hadoop costs at least an order of magnitude less, per terabyte of data, than appliance-based data warehouses. Because Hadoop enables the data lake and adjunct data warehouse scenarios, implementing them gives Hadoop a significant and demonstrable return on investment for enterprise customers.
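
To put the order-of-magnitude claim in concrete terms, here is a back-of-the-envelope calculation. The per-terabyte prices and data volume are assumed round numbers for the sake of the arithmetic, not figures from the Gigaom Research report.

```python
# Back-of-the-envelope comparison of storage cost at the two tiers.
# All prices and volumes are hypothetical round numbers for illustration only.
appliance_cost_per_tb = 20_000   # assumed $/TB for an appliance-based warehouse
hadoop_cost_per_tb = 1_500       # assumed $/TB for a Hadoop cluster
offloaded_tb = 500               # assumed volume moved to the adjunct warehouse

appliance_total = appliance_cost_per_tb * offloaded_tb   # $10,000,000
hadoop_total = hadoop_cost_per_tb * offloaded_tb         # $750,000

print(f"Appliance warehouse: ${appliance_total:,}")
print(f"Hadoop:              ${hadoop_total:,}")
print(f"Cost ratio: {appliance_total / hadoop_total:.1f}x")  # ~13.3x, i.e. more than 10x
```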

An open question is whether and when Hadoop can and will serve in a core data warehouse capacity as well. And if it does, will that help the data warehouse vendors, the Hadoop distribution vendors or both? Indeed, this dynamic may be a predictor of future acquisitions of the distribution vendors by the legacy players — or perhaps even the reverse.

Teradata dives further into Hadoop with RainStor acquisition

Data warehouse vendor Teradata has made its fourth acquisition of the year, announcing on Wednesday it has bought data-archiving specialist RainStor for an undisclosed amount.

RainStor builds an archival system that can sit on top of Hadoop and, it claims, compress data volumes by up to 95 percent. The company has raised roughly $26 million since it was founded in 2004, with the last round — $12 million — coming in October 2012. The deal itself is neither earth-shaking nor bank-breaking (in a press release, Teradata describes the acquisition price as “not material”), but it does further clarify Teradata’s strategy for staying relevant in an increasingly scale-out, open source world.
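
For a rough sense of what a 95 percent reduction means in practice, the short sketch below works through a hypothetical 200 TB archive; the volume is an illustrative assumption, not a customer figure.

```python
# Arithmetic behind RainStor's "up to 95 percent" compression claim.
# The 200 TB raw volume is a hypothetical example, not a real deployment.
raw_tb = 200
claimed_reduction = 0.95                       # best-case reduction
archived_tb = raw_tb * (1 - claimed_reduction)
print(f"{raw_tb} TB raw -> {archived_tb:.0f} TB archived")  # 200 TB raw -> 10 TB archived
```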

Taken as a whole with the company’s other acquisitions, including Hadapt and Think Big Analytics, it’s pretty clear that Teradata wants to play a bigger role in companies’ big data environments than just that of a data warehouse and business intelligence provider. If customers are intent on storing and analyzing more data more cheaply in Hadoop or NoSQL data stores, Teradata would rather help them do that, and accept a smaller profit margin, than lose that data and those workloads altogether.

The big question now is how long the Hadoop market will continue to play nice with existing data-management vendors. With one successful IPO so far and others presumably on the way, it’s conceivable that companies such as Cloudera, Hortonworks and MapR will attempt to grab a bigger piece of the pie as their war chests grow.

LinkedIn explains its complex Gobblin big data framework

LinkedIn shed more light Tuesday on a big-data framework dubbed Gobblin that helps the social network take in tons of data from a variety of sources so that it can be analyzed in its Hadoop-based data warehouses.

IBM builds up its cloud with Netezza as a service and NoSQL as software

IBM announced a promising new collection of cloud data services on Monday, adding to an already impressive set of services on its Bluemix platform. At this point, though, IBM’s biggest challenge isn’t selling enterprise users on the cloud, but convincing them it’s still the best choice.

Teradata embraces the big data ecosystem, buys Think Big Analytics

Data warehouse vendor Teradata continues to step up its game in the broader big data market, this time by acquiring consulting firm Think Big Analytics, which specializes in helping clients deploy open source technologies and build analytics applications.