Airbnb open sources SQL tool built on Facebook’s Presto database

Apartment-sharing startup Airbnb has open sourced a tool called Airpal that the company built to give more of its employees access to the data they need for their jobs. Airpal is built atop the Presto SQL engine that Facebook created in order to speed access to data stored in Hadoop.

Airbnb built Airpal about a year ago so that employees across divisions and roles could get fast access to data rather than having to wait for a data analyst or data scientist to run a query for them. According to product manager James Mayfield, it’s designed to make it easier for novices to write SQL queries by giving them access to a visual interface, previews of the data they’re accessing, and the ability to share and reuse queries.

It sounds a little like the types of tools we often hear about inside data-driven companies like Facebook, as well as the new SQL platform from a startup called Mode.

At this point, Mayfield said, “Over a third of all the people working at Airbnb have issued a query through Airpal.” He added, “The learning curve for SQL doesn’t have to be that high.”

He shared the example of folks at Airbnb tasked with determining the effectiveness of the automated emails the company sends out when someone books a room, resets a password or takes any of a number of other actions. Data scientists used to have to dive into Hive — the SQL-like data warehouse framework for Hadoop that Facebook open sourced in 2008 — to answer that type of question, which meant slow turnaround times because of human and technological factors. Now, lots of employees can access that same data via Airpal in just minutes, he said.
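To make the email example concrete, here's a minimal sketch of the kind of aggregation involved. The schema and figures are invented, not Airbnb's actual data model; Presto speaks standard ANSI SQL, so a query like this would look much the same there, with sqlite3 standing in locally.

```python
import sqlite3

# Hypothetical events table: one row per email event.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_events (email_type TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO email_events VALUES (?, ?)",
    [("booking_confirmation", "sent")] * 4
    + [("booking_confirmation", "opened")] * 3
    + [("password_reset", "sent")] * 2
    + [("password_reset", "opened")] * 1,
)

# Open rate per email type: opened events divided by sent events.
rows = conn.execute("""
    SELECT email_type,
           1.0 * SUM(CASE WHEN event = 'opened' THEN 1 ELSE 0 END)
               / SUM(CASE WHEN event = 'sent' THEN 1 ELSE 0 END) AS open_rate
    FROM email_events
    GROUP BY email_type
    ORDER BY email_type
""").fetchall()

for email_type, open_rate in rows:
    print(email_type, round(open_rate, 2))
```

The point of a tool like Airpal is that a non-analyst can run exactly this sort of grouped aggregate without waiting on the data science team.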

The Airpal user interface.


As cool as Airpal might be for Airbnb users, though, it really owes its existence to Presto. Back when everyone was using Hive for data analysis inside Hadoop — it was and continues to be widely used within web companies — only 10 to 15 people within Airbnb understood the data and could write queries using its somewhat complicated version of SQL. Because Hive is based on MapReduce, the batch-processing engine most commonly associated with Hadoop, Hive is also slow (although new improvements have increased its speed drastically).

Airbnb also used Amazon's Redshift cloud data warehouse for a while, said software engineer Andy Kramolisch, and while it was fast, it wasn't as user-friendly as the company would have liked. It also required replicating data from Hive, meaning more work for Airbnb and more data for the company to manage. (If you want to hear more about all this Hadoop and big data stuff from leaders at Google, Cloudera and elsewhere, come to our Structure Data conference March 18-19 in New York.)

A couple years ago, Facebook created and then open sourced Presto as a means to solve Hive’s speed problems. It still accesses data from Hive, but is designed to deliver results at interactive speeds rather than in minutes or, depending on the query, much longer. It also uses standard ANSI SQL, which Kramolisch said is easier to learn than the Hive Query Language and its “lots of hidden gotchas.”

Still, Mayfield noted, it’s not as if everyone inside Airbnb, or any company, is going to be running SQL queries using Airpal — no matter how easy the tooling gets. In those cases, he said, the company tries to provide dashboards, visualizations and other tools to help employees make sense of the data they need to understand.

“I think it would be rad if the CEO was writing SQL queries,” he said, “but …”

Data might be the new oil, but a lot of us just need gasoline

One of the biggest tropes in the era of big data is that data is the new oil — it's very valuable to the companies that have it, but only after it has been mined and processed. The analogy makes some sense, but it ignores the fact that many people and companies don't have the means to collect the data they need or the ability to process it once they have it. A lot of us just need gasoline.

Which is why I was excited to see the new Data for Everyone initiative that crowdsourcing startup CrowdFlower released on Wednesday. It’s a library of interesting and free datasets that have been gathered by CrowdFlower’s users over the years and verified by the company’s crowdsourced labor force. Topics range from Twitter sentiment on various subjects to a collection of labeled medical images.

Data for Everyone is far from comprehensive, and it's no one-stop shop for data democratization, but it is a good approach to a problem that lots of folks have been trying to solve for years: giving people who want to analyze valuable data access to that data in a meaningful way. Unfortunately, early attempts at data marketplaces such as Infochimps and Quandl, and even earlier incarnations of the federal Data.gov service, often included poorly formatted data or suffered from a dearth of interesting datasets.

An example of what's available in Data for Everyone.


It’s often said that data analysts spend 85 percent of their time formatting data and only 15 percent of it actually analyzing data — a situation that is simply untenable for people whose jobs don’t revolve around data, even as tools for data analysis continue to improve. All the Tableau software or Watson Analytics or DataHero or PowerBI services in the world don’t do a whole lot to help mortals analyze data when it’s riddled with errors or formatted so sloppily it takes a day just to get it ready to upload.
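A toy illustration of why "formatting" eats so much of that time: the same field can arrive in several sloppy shapes and must be normalized before any charting tool can touch it. The records below are invented.

```python
import csv
import io

# Messy input: stray whitespace, a currency string, a "1.5M" shorthand,
# a missing value, inconsistent capitalization.
raw = io.StringIO(
    'company,funding\n'
    '" Acme ","$1,200,000"\n'
    'Beta Corp,1.5M\n'
    'gamma,\n'
)

def parse_funding(text):
    """Normalize '$1,200,000', '1.5M' or '' to a float (or None)."""
    text = text.strip().lstrip("$").replace(",", "")
    if not text:
        return None
    if text.upper().endswith("M"):
        return float(text[:-1]) * 1_000_000
    return float(text)

cleaned = [
    {"company": row["company"].strip().title(),
     "funding": parse_funding(row["funding"])}
    for row in csv.DictReader(raw)
]
print(cleaned)
```

Three rows take a dozen lines of cleanup code; multiply that by a real dataset's quirks and the 85/15 split starts to look plausible.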

Hopefully, we’ll start to see more high-quality data markets pop up, as well as better tools for collecting data from services such as Twitter. They don’t necessarily need to be so easy a 10-year-old can use them, but they do need to be easy enough that someone with basic programming or analytic skills can get up and running without quitting their day job. Data for Everyone looks like one, as does the new Wolfram Data Drop, also announced on Wednesday.

Because while it’s getting a lot easier for large companies and professional data scientists to collect their data and analyze it for purposes ranging from business intelligence to training robotic brains — topics we’ll be discussing at our Structure Data conference later this month — the little guy, strapped for time and resources, still needs more help.

The rise of self-service analytics, in 3 charts

I’m trying really hard to write less about business intelligence and analytics software. We get it: Data is important to businesses, and the easier you can make it for people to analyze it, the more they’ll use your software to do it. What more is there to say?

But every time I see Tableau Software's earnings reports, I'm struck by the reality of how big a shift the business intelligence market is undergoing right now. In the fourth quarter, Tableau grew its revenue 75 percent year over year. People and departments are lining up to buy what's often called self-service analytics software — that is, applications so easy that even lay business users can work with them without much training — and they're doing it at the expense of incumbent software vendors.

Some analysts and market insiders will say the new breed of BI vendors is more about easy “data discovery” and that their products lack the governance and administrative controls of incumbent products. That's like saying Taylor Swift is very cute and very good at making music people like, but she's not as serious as Alanis Morissette or as artistic as Björk. Those things can come in time; meanwhile, I'd rather be T-Swift, raking in millions and looking to do it for some time to come.

[dataset id=”914729″]

Above is a quick comparison of annual revenue for three companies — the only three “leaders” in Gartner's 2014 Magic Quadrant for Business Intelligence and Analytics Platforms that are both publicly traded and focused solely on BI. Guess which two fall into the next-generation, self-service camp and are also Gartner's two highest-ranked. Guess which one is often credited with reimagining the data-analysis experience and making a product people legitimately like using.

[dataset id=”914747″]

Narrowing it to just last year, Tableau's revenue grew 92 percent between the first and fourth quarters, while Qlik's grew 65 percent. MicroStrategy stayed relatively flat and is trending downward; its fourth quarter was actually down year over year.
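Those growth figures are simple percentage changes between a starting and an ending quarter. The revenue numbers below are illustrative placeholders, not the companies' actual filings; only the formula is the point.

```python
def growth_pct(start, end):
    """Percentage change from a starting figure to an ending figure."""
    return (end - start) / start * 100

# Hypothetical quarterly revenue, in $M: a 92 percent rise,
# matching the shape of Tableau's Q1-to-Q4 jump.
q1, q4 = 100.0, 192.0
print(round(growth_pct(q1, q4)))
```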

[dataset id=”914758″]

And what does Wall Street think about what's happening? Tableau has the least revenue for now, though probably not for much longer, and has a market cap greater than those of Qlik and MicroStrategy combined.

Here are a few more data points that show how impressive Tableau's ongoing coup really is. Tibco Software, another Gartner leader and formerly a public company, recently sold to private equity firm Vista for $4.2 billion after disappointing shareholders with weak sales. Hitachi Data Systems is buying Pentaho, a BI vendor hanging just outside the border of Gartner's “leader” category, for just more than $500 million, I'm told.

A screenshot from a sample PowerBI dashboard.


It's worth noting, though, that Tableau isn't guaranteed anything. As we speak, startups such as Platfora, ClearStory and SiSense are trying to match or outdo Tableau on simplicity while adding their own new features elsewhere. The multi-billion-dollar players are also stepping up their games in this space. Microsoft and IBM recently launched the natural-language-based PowerBI and Watson Analytics services, which Microsoft says represent the third wave of BI software (Tableau is in the second wave, by its assessment), and Salesforce.com invested a lot of resources to make its BI foray.

Whatever you want to call it — data discovery, self-service analytics, business intelligence — we’ll be talking more about it at our Structure Data conference next month. Speakers include Tableau Vice President of Analytics (and R&D leader) Jock Mackinlay, as well as Microsoft Corporate Vice President of Machine Learning Joseph Sirosh, who’ll be discussing self-service machine learning.

Hands on with Watson Analytics: Pretty useful when it’s working

Last month, IBM made available the beta version of its Watson Analytics data analysis service, an offering first announced in September. It's one of IBM's only recent forays into anything resembling consumer software, and it's supposed to make it easy for anyone to analyze data, relying on natural language processing (thus the Watson branding) to drive the query experience.

When the servers running Watson Analytics are working, it actually delivers on that goal.

Analytic power to the people

Because I was impressed that IBM decided to launch a cloud service using the freemium business model — and carrying the Watson branding, no less — I wanted to see firsthand how well Watson Analytics works. So I uploaded a CSV file including data from Crunchbase on all companies categorized as “big data,” and I got to work.

Seems like a good starting point.

Choose one and get results. The little icon in the bottom left corner makes it easy to change chart type. Notice the various insights included in the bar at the top. Some are more useful than others.

But which companies have raised the most money? Cloudera, by a long shot.


I know Cloudera had a huge investment round in 2014. I wonder how that skews the results for 2014, so I filter it out.


And, voila! For what it’s worth, Cloudera also skews funding totals however you sort them — by year founded, city, month of funding, you name it.
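The "filter out the outlier" step is worth making concrete: one giant round can dominate every total it touches, so sums with and without it tell very different stories. The figures below are placeholders, not real Crunchbase values.

```python
# Hypothetical funding rounds, amounts in $M. One outlier round
# dwarfs the rest, the way Cloudera's 2014 round did in the real data.
rounds = [
    {"company": "BigCo", "year": 2014, "amount": 900.0},
    {"company": "Acme Data", "year": 2014, "amount": 40.0},
    {"company": "Beta Analytics", "year": 2014, "amount": 25.0},
]

total_all = sum(r["amount"] for r in rounds)
total_filtered = sum(r["amount"] for r in rounds if r["company"] != "BigCo")

print(total_all, total_filtered)
```

With the outlier in, 2014 looks like a $965M year; without it, $65M — the same filter Watson Analytics applies with a click.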


Watson Analytics also includes tools for building dashboards and for predictive analysis. The latter could be particularly useful, although that might depend on the dataset. I analyzed Crunchbase data to try to determine which factors are most predictive of a company's operating status (whether it has shut down, has been acquired or is still running), and the results were pretty obvious (if you can't read the image, it lists “last funding” as a big predictor).
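A toy version of that question shows why the finding reads as obvious. The records are invented; the point is that recency of last funding separates the classes almost perfectly, so any predictive tool will flag it first.

```python
# Hypothetical companies with their last funding year and status.
companies = [
    {"last_funding_year": 2014, "status": "operating"},
    {"last_funding_year": 2013, "status": "operating"},
    {"last_funding_year": 2014, "status": "operating"},
    {"last_funding_year": 2009, "status": "closed"},
    {"last_funding_year": 2008, "status": "closed"},
]

def closed_rate(rows):
    """Fraction of companies in rows that have shut down."""
    return sum(r["status"] == "closed" for r in rows) / len(rows)

recent = [c for c in companies if c["last_funding_year"] >= 2012]
stale = [c for c in companies if c["last_funding_year"] < 2012]

print(closed_rate(recent), closed_rate(stale))
```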


If I have one big complaint about Watson Analytics, it’s that it’s still a bit buggy — the tool to download charts as images doesn’t seem to work, for example, and I had to reload multiple pages because of server errors. I’d be pretty upset if I were using the paid version, which allows for more storage and larger files, and experienced the same issues. Adding variables to a view without starting over could be easier, too.

Regarding the cloud connection, I rather like what Tableau did with its public version by pairing a locally hosted application with cloud-based storage. If you're not going to ensure a consistent backend, it seems better to guarantee some level of performance by relying on the user's machine.

All in all, though, Watson Analytics seems like a good start to a mass-market analytics service. The natural language aspect makes it at least as intuitive as other services I've used (a list that includes DataHero, Tableau Public and Google Fusion Tables, among others), and it's easy enough to run and visualize simple analyses. But Watson Analytics plays in a crowded space that includes the aforementioned products, as well as Microsoft Excel and PowerBI, and Salesforce Wave.

If IBM can work out some of the kinks and add some more business-friendly features — such as the upcoming abilities to refine datasets and connect to data sources — it could be onto something. Depending on how demand for mass-market analytics tools shapes up, there could be plenty of business to go around for everyone, or a couple companies that master the user experience could own the space.

How the right tools can create data-driven companies, even at Facebook

The co-founders of analytics startup Interana came on the Structure Show podcast this week to talk about how to spread data analysis throughout customer accounts, the types of things you can do with event data, and the experience of starting a company with your spouse.

IBM goes freemium with new natural language analytics service

IBM is getting into the freemium space, targeting individual business users with a new data analysis service called Watson Analytics. It helps users analyze data using natural language queries, and could help IBM fend off the myriad products threatening its analytics business from the bottom up.

DataHero raises $3.1M and revamps its analytics service

DataHero, a startup targeting individuals who want an analytics experience much simpler than what most BI software can provide, has raised an extended seed round of $3.15 million and has redesigned its product based on behavioral analysis.