I’ve written a lot about SQL-on-Hadoop, but so have lots of other folks who cover Hadoop and big data. The story is actually pretty simple. It goes something like this:
History of SQL-on-Hadoop, Part I
1. In an effort to give people with SQL skill sets access to Hadoop data, Apache Hive was born. ODBC and JDBC drivers for Hive, in turn, gave BI tools connectivity to Hadoop data, too.
2. But Hive worked over MapReduce, and was therefore batch-oriented and too slow to work well with BI tools. An interactive tool was needed.
3. In the fall of 2012, Cloudera introduced its Impala product: an open-source Massively Parallel Processing (MPP) SQL query engine that works directly over HDFS. It was interactive. It was faster than Hive, and it picked up adoption with BI tool vendors very quickly. At this point it is even included in Amazon’s Hadoop distribution, and MapR’s as well.
4. In the two years that have elapsed since Impala was released, virtually every major database and Hadoop vendor has introduced its own SQL-on-Hadoop solution. Hortonworks has even led an effort, dubbed the Stinger Initiative, to enable Hive as an interactive SQL-on-Hadoop engine by integrating it with Tez and YARN in Hadoop 2.0.
Little Drill all growed up
Got all that? If so, then you might be wondering why another open-source development team, anchored by personnel at MapR, has been working on yet another SQL-on-Hadoop engine, called Apache Drill.
What’s worth noting about Drill is that it’s not just another SQL-on-Hadoop product. Instead, it’s a schema-agnostic, format-agnostic SQL query “head.” Yes, it can query data stored in HDFS, including CSV and JSON files, as well as HBase tables. It also has a dialect of SQL that can handle hierarchy (necessary for JSON and HBase data).
Perhaps more important than all of that, though, is that Drill can query these files cold. In other words, it can take a SQL query that references specific “columns” and run it directly on a file it’s never seen before.
The incredible un-lightness of schema
Let me explicate that a bit, so that you don’t have to feel like you’re not catching on (or, perhaps more likely, that I’m just being unclear). Most SQL-on-Hadoop systems, including Hive, require tables to be created, using SQL Data Definition Language (DDL) commands, which explicitly declare the schema of the table. These commands specify the name and data type of all columns and essentially map these columns to specific parts of the source data file.
This is often a big step: It requires the person wanting to query the data to do a lot of planning before the first query can even be executed. That is a disincentive to querying the data at all. And even for resilient folks who wouldn’t be put off by this step, changing the schema later is non-trivial. All of this is precisely the antithesis of what data discovery and analytics should be about.
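To make that up-front cost concrete, here is a minimal sketch of the declare-first workflow. It is not Hive: it uses Python's built-in sqlite3 module as a stand-in, and the sample data, the table name people, and its column names are all hypothetical. SQLite also copies the rows in, whereas Hive maps a table definition onto files. The point is only the ordering: the schema has to exist before the first query does.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for a raw file (assumption, not from the article).
RAW_CSV = "1,Ada,London\n2,Grace,New York\n"

conn = sqlite3.connect(":memory:")

# Step 1: declare the schema with DDL, naming and typing every column.
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, city TEXT)")

# Step 2: map each field of the source data onto a declared column.
rows = list(csv.reader(io.StringIO(RAW_CSV)))
conn.executemany("INSERT INTO people (id, name, city) VALUES (?, ?, ?)", rows)

# Step 3: only now can the first query run.
result = conn.execute("SELECT name, city FROM people ORDER BY id").fetchall()
print(result)
```

If the analyst later decides the third field is really a region rather than a city, the table definition has to change before the query can, which is the friction described above.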
Drill, on the other hand, does not require schema to be declared. It lets users refer to columns by number and alias them (i.e., provide their own column names on the fly). This is a very powerful construct. It allows files to be queried with SQL casually. And it allows the same files to be requeried, using different schema premises, without disruption.
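The idea of addressing columns by position and naming them at query time can be sketched in a few lines of Python. This is an illustration of the concept only, not Drill's implementation; the helper query_cold, the sample data, and the aliases are hypothetical. In Drill itself, a delimited text file is typically exposed as an array named columns, so the equivalent query would look something like SELECT columns[1] AS name, columns[2] AS city against the file.

```python
import csv
import io

# Hypothetical sample data standing in for a file never seen before (assumption).
RAW_CSV = "1,Ada,London\n2,Grace,New York\n"


def query_cold(text, projection):
    """Project fields by position and name them on the fly.

    projection maps an alias to a zero-based column position, so no
    schema has to be declared before the data is read.
    """
    for fields in csv.reader(io.StringIO(text)):
        yield {alias: fields[pos] for alias, pos in projection.items()}


# First pass: treat column 1 as a name and column 2 as a city.
first_pass = list(query_cold(RAW_CSV, {"name": 1, "city": 2}))

# Second pass over the same data with a different schema premise.
second_pass = list(query_cold(RAW_CSV, {"person_id": 0, "location": 2}))

print(first_pass)
print(second_pass)
```

Nothing about the file changes between the two passes; only the aliases in the query do, which is what makes requerying under a different premise cheap.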
So Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-pretty-much-anything, immediately, and without formality. It’s less about making files look like relational tables and more about using SQL as a skeleton key to read files in their native form. With enough support for new formats, Drill could become SQL-on-anything. And a simple SELECT * query lets the data discovery begin.
Big boy pants
Drill, however, is still fragile. Installing it can be tricky, as can getting it to work with different file types. Its implementation of SQL is still abridged, and its tooling is still early-stage. It’s a tool that removes friction from data discovery but still introduces some of its own into getting up and running.
That’s to be expected for an Apache Incubator project. Now that Drill is top-level, it should enjoy broader adoption and participation, which can provide an easier path to fit, finish, and bullet-proofing. But being anointed a top-level project also raises the pressure to deliver a high level of refinement. With that in mind, Drill 0.7 should be out shortly, and a 1.0 release is planned for early next year.
Let’s see what happens now. I wish the Drill team every success. You should too.
If you want to try Drill yourself, the easiest way to do so may be via MapR’s Sandbox, which is a virtual machine image (in VMware and VirtualBox formats) that includes a single-node installation of the full MapR distribution of Hadoop, including Drill. If you want Drill to succeed, you should work with it and provide feedback to the team so Drill’s developers can make it better.