Airbnb open sources SQL tool built on Facebook’s Presto engine

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

He shared the example of folks at Airbnb tasked with determining the effectiveness of the automated emails the company sends out when someone books a room, resets a password or takes any of a number of other actions. Data scientists used to have to dive into Hive — the SQL-like data warehouse framework for Hadoop that Facebook open sourced in 2008 — to answer that type of question, which meant slow turnaround times because of human and technological factors. Now, lots of employees can access that same data via Airpal in just minutes, he said.
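
A rough sketch of the kind of query an Airpal user might run for that email question (the table and column names here are hypothetical, not Airbnb's actual schema):

    -- Hypothetical schema: measure click-through by automated email type.
    SELECT
      email_type,
      COUNT(*) AS emails_sent,
      AVG(CASE WHEN clicked THEN 1.0 ELSE 0.0 END) AS click_through_rate
    FROM transactional_emails
    WHERE sent_date >= DATE '2015-01-01'
    GROUP BY email_type
    ORDER BY emails_sent DESC;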

The Airpal user interface.

As cool as Airpal might be for Airbnb users, though, it really owes its existence to Presto. Back when everyone was using Hive for data analysis inside Hadoop — it was and continues to be widely used within web companies — only 10 to 15 people within Airbnb understood the data and could write queries using its somewhat complicated version of SQL. And because Hive is based on MapReduce, the batch-processing engine most commonly associated with Hadoop, it is also slow (although recent improvements have increased its speed drastically).

Airbnb also used Amazon’s Redshift cloud data warehouse for a while, said software engineer Andy Kramolisch, and while it was fast, it wasn’t as user-friendly as the company would have liked. It also required replicating data from Hive, meaning more work for Airbnb and more data for the company to manage. (If you want to hear more about all this Hadoop and big data stuff from leaders at Google, Cloudera and elsewhere, come to our Structure Data conference March 18-19 in New York.)

A couple years ago, Facebook created and then open sourced Presto as a means to solve Hive’s speed problems. It still accesses data from Hive, but is designed to deliver results at interactive speeds rather than in minutes or, depending on the query, much longer. It also uses standard ANSI SQL, which Kramolisch said is easier to learn than the Hive Query Language and its “lots of hidden gotchas.”
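
To make the contrast concrete, here is one widely known HiveQL gotcha of that era alongside the ANSI SQL a Presto user would write. These are illustrative quirks against a hypothetical table, not examples Kramolisch specifically named:

    -- HiveQL quirk: SORT BY orders rows only within each reducer, so the
    -- output is not globally sorted; ORDER BY is global, but in Hive of
    -- that era it funneled every row through a single reducer.
    SELECT listing_id, COUNT(*) AS views
    FROM page_views
    GROUP BY listing_id
    SORT BY views DESC;

    -- The equivalent ANSI SQL in Presto behaves the way newcomers expect:
    SELECT listing_id, COUNT(*) AS views
    FROM page_views
    GROUP BY listing_id
    ORDER BY views DESC;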

Still, Mayfield noted, it’s not as if everyone inside Airbnb, or any company, is going to be running SQL queries using Airpal — no matter how easy the tooling gets. In those cases, he said, the company tries to provide dashboards, visualizations and other tools to help employees make sense of the data they need to understand.

“I think it would be rad if the CEO was writing SQL queries,” he said, “but …”

Apache Hive creators raise $13M for their Hadoop service, Qubole

Qubole, the Hadoop-as-a-service startup from Ashish Thusoo and Joydeep Sen Sarma, has raised a $13 million Series B round of venture capital led by Norwest Venture Partners. Thusoo and Sen Sarma created the Apache Hive data warehouse framework for Hadoop while at Facebook several years ago, and launched Qubole in mid-2012. The company has now raised $20 million from investors.

Qubole is hosted on the Amazon Web Services cloud, but can also run on Google Compute Engine, and acts like one might expect a cloud-native Hadoop service to act. It has a graphical user interface, connectors to several common data sources (including cloud object stores), and it takes advantage of cloud capabilities such as autoscaling and spot pricing for compute. The company claims it processes 83 petabytes of data per month and that its customers used 4.96 million cloud compute hours in November.

What’s interesting about Qubole is that although it originally boasted optimized versions of Hive and other MapReduce-based tools, the company also lets users analyze data using the Facebook-created Presto SQL-on-Hadoop engine, and is working on a service around the increasingly popular and very fast Apache Spark framework.

Ashish Thusoo at Structure Data 2013.

Qubole’s announcement follows that of a $30 million round for Altiscale on Wednesday and a $3 million round for a newer company called Xplenty in October.

In an interview about Altiscale’s funding, its founder and CEO, Raymie Stata, said his company most often runs up against Qubole and Treasure Data, and occasionally Xplenty, in customer deals. They’re all a little different in terms of capabilities, user experience and probably even target user, but they’re all much more fully featured and user-centric than Amazon Elastic MapReduce, which is the default Hadoop cloud service.

That space could be setting itself up for consolidation as investors keep putting money into it and bigger Hadoop vendors keep trying to bolster their cloud computing stories. Cloudera, Hortonworks, MapR, IBM, Pivotal, Oracle and the list goes on — they all see a future where more workloads will move to the cloud, but they’re all rooted in the software world. At some point they’re going to have to build up their cloud technologies and knowledge, or buy them.

eBay open sources a big, fast SQL-on-Hadoop database

eBay has open sourced a database technology, called Kylin, that takes advantage of distributed processing and the HBase data store in order to return faster results for SQL queries over Hadoop data.
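
The rough idea: if a Kylin cube has been pre-built over a star schema with, say, seller as a dimension and price as a measure, an aggregate query like the following (hypothetical tables) can be answered from the pre-computed cube in HBase rather than by scanning raw Hadoop data:

    -- Answered from Kylin's pre-computed cube rather than a full scan,
    -- assuming seller_id is a cube dimension and SUM(price) a measure.
    SELECT seller_id, SUM(price) AS total_sales
    FROM fact_sales
    GROUP BY seller_id
    ORDER BY total_sales DESC;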

Advice for enterprises looking to manage new analytics capabilities

Analytics is all the rage, with Hadoop and big data leading the hype. But the new technologies are not yet mature, a market full of startups has yet to shake out, and most enterprises have a mishmash of solutions ranging from legacy warehouses to marketing-department SaaS subscriptions and early development projects.

I talked this week with Andrew Brust, research director for big data and analytics at Gigaom Research, about the state of the market and any recommendations he might have for CIOs grappling with the optimal deployment of the technology. Among the management-oriented suggestions Andrew offered are the following key points:

  • Don’t overregulate all use of analytics. Andrew likens the current acquisition of SaaS data sources throughout the enterprise to the adoption of PCs in the 1980s. There was pent-up demand for ready access to computing power that traditional IT wasn’t providing. Eventually PCs became so pervasive and so central to the business that centralized management was required and became the norm. Through those Wild West early years of PC use, however, companies not only gained immediate value and, sometimes, an early-adopter advantage, but through trial and error many of those renegade PC users also discovered and honed valuable new applications that were subsequently adopted more broadly in their organizations. Andrew sees the same dynamic at work today, and he expects departmental experimentation will likewise lead to valuable new applications of analytics and big, streaming data flows.
  • Leave room for ad hoc data use. Just like the early days of end users exploiting VisiCalc, Lotus 1-2-3, and Microsoft Excel, many applications of new analytics are of a one-off or experimental nature. They are uniquely suited to the needs of an individual employee or a small workgroup, and are best developed by individual workers who aren’t encumbered with all sorts of heavy and unnecessary data governance requirements. This casual use of analytics is fundamental to a healthy organization. Already there are a number of new self-service tools that are making a new level of casual analysis viable, and as the technology matures, more nontechnical users will gain access to an entirely new level of data and analysis. That will be a good thing.
  • Recognize the threshold for when more data governance is required. Undoubtedly, there are data governance requirements for sensitive data and for data that must be integrated for broader use within the organization. And, as one recent study points out, CEO-led interest in the innovative use of analytics is correlated with greater use of the capability throughout an enterprise. Andrew says there is a recognizable gradation and threshold as to when informal data use needs to be regulated: “you’ll know it when you see it.” IT organizations must be proactive in identifying and handling such situations, although Andrew is skeptical of such heavy-handed techniques as naming a Chief Data Officer to oversee all data use.
  • The best metrics often bubble up democratically, rather than being imposed from above. A corollary to allowing casual data use throughout the organization is that individuals and small departments often know best how to do their jobs. In some ways they are thus the ones who best know how to measure their efforts as well. Although, as a recent Zendesk analysis of customer service confirmed, companies that measure performance get better performance, there is a risk that imposing too many metrics from above may stifle and limit individual contributions. Andrew points to another historical trend—the rise and fall of the “balanced scorecard”—as an example of the impracticality of too heavy a hand in imposing top-down metrics on too many aspects of a company’s operations. This is therefore an area where allowing bottom-up data experimentation can lead to better organizational practices. The best individual data findings are often adopted at the workgroup or departmental level, and some of the very best of those may percolate up for use corporate-wide. Although line-of-business managers may be best at identifying these more broadly applicable uses of data, Andrew points out that IT departments may sometimes spot the same, based on patterns of data use that can be tracked within an analytics system.
  • Know your organization’s appetite for experimental technology. Andrew notes that the Hadoop environment is rapidly maturing, but we are still in the early days of the technology. Set against the promise of open source as a defense against vendor lock-in is the usual trade-off: proprietary, or at least vendor-dependent, enhancements that offer greater functionality than the purely open source baseline. The immediate payoff from those vendor enhancements may justify the risk that the solution does not survive in the longer term. Still, Andrew offers a couple of suggestions for IT departments wary of going down that path. Apache Hive has a widely used SQL-like language that works on Hadoop; it provides only traditional batch query, but may be a match for the ready skillset in some organizations (see the sketch after this list). Apache Spark is another open source enhancement to Hadoop that provides in-memory analysis appropriate for some applications (e.g., market analytics), but not all. Spark is being widely adopted by leading Hadoop vendors (e.g., Cloudera, Hortonworks), and so offers a degree of safety in a fragmented market. Finally, enterprises with large data warehouse operations that are hesitant to inch too far out on the early-adopter limb can probably turn to their data warehouse vendors for Hadoop tools, rather than opting for more advanced capabilities from less stable startup suppliers.
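
On that last point, the skillset argument is straightforward: a HiveQL-style aggregate like the one below (hypothetical table) also runs unchanged on Spark SQL, so a team’s existing Hive knowledge carries over if in-memory analysis becomes worth the move:

    -- Runs as-is on both Hive and Spark SQL; only the execution engine
    -- (batch MapReduce vs. in-memory) differs underneath.
    SELECT region, AVG(order_total) AS avg_order_value
    FROM orders
    GROUP BY region;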

Hortonworks’ effort to speed up Hive is coming along nicely

Hortonworks is making progress on its mission (via a project called Stinger) to speed up SQL-like queries in Hadoop using Apache Hive. New features in the latest version of Hortonworks’ Hadoop distribution have made Hive queries tens of times faster in some cases, and the company is aiming for 100x improvements soon. Hortonworks has also added support for new SQL data types. Competitor Cloudera opted to forgo Hive in favor of its own Impala technology for interactive queries.
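
As a rough illustration of those data-type additions (the table and columns below are hypothetical): Stinger-era Hive releases added SQL types such as VARCHAR and DATE, alongside the ORC file format that underpins much of the performance work.

    -- VARCHAR and DATE arrived in Stinger-era Hive releases; ORC is the
    -- columnar format central to the speedups.
    CREATE TABLE bookings (
      guest_name    VARCHAR(100),
      checkin_date  DATE,
      nightly_rate  DOUBLE
    )
    STORED AS ORC;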