Google might be upsetting a lot of people with some of its recent “spring cleaning,” but its latest batch of updates to BigQuery should make data analysts happy, at least.
With the latest updates — announced in a blog post by BigQuery Product Manager Ju-kay Kwek on Thursday — users can now join large tables, import and query timestamped data, and aggregate large collections of distinct values. It’s hardly the equivalent of Google launching Compute Engine last summer, but as (arguably) the inspiration for the SQL-on-Hadoop trend that’s sweeping the big data world right now, every improvement to BigQuery is notable.
BigQuery is a cloud service that lets users analyze terabyte-sized data sets using SQL-like queries. It’s based on Google’s Dremel querying system, which can analyze data where it’s located (i.e., in the Google File System or BigTable) and which Google uses internally to analyze a variety of different data sets. Google claims queries in BigQuery run at interactive speeds, which is something that MapReduce — the previous-generation tool for dealing with such large data sets — simply couldn’t handle within a reasonable time frame or level of complexity. Of course, if you want to schedule batch jobs, BigQuery lets you do that, too, for a lower price.
This constraint — and therefore the potential benefits of something like Dremel and its commercial incarnation, BigQuery — wasn’t lost on the Hadoop community, which itself had been largely reliant on MapReduce processing for years. In the past year, we’ve seen numerous startups and large vendors pushing their own Dremel-like (or MPP-like) technologies for data sitting in the Hadoop Distributed File System. If you happen to be in New York next week, you can hear some of the pioneers in this space talk about it at our Structure: Data conference.
Background aside, the ability to join large data sets in BigQuery is probably the most-important of the three new functions. Joins are an essential aspect of data analysis in most environments because pieces of data that are relevant to each other don’t always reside within the same table or even within the same cluster. And joining tables of the size BigQuery is designed for can take a long time without the right query engine in place.
Kwek offers an anecdote from Google that shows why joins, and the new aggregation function, are important:
[W]hen our App Engine team needed to reconcile app billing and usage information, Big JOIN allowed the team to merge 2TB of usage data with 10GB of configuration data in 60 seconds. Big Group Aggregations enabled them to immediately segment those results by customer. Using the integrated Tableau client the team was able to quickly visualize and detect some unexpected trends.