There was a time, a little over two years ago, when SQL-on-Hadoop was about cracking open access to Hadoop data for those with SQL skillsets and ending the exclusive hold that Hadoop/MapReduce specialists had on that data. Yes, some architectural details – like whether the SQL engine was hitting the data nodes in the Hadoop cluster directly – were important too. But, for the most part, solutions in the space were neatly summed up by the name: SQL, on Hadoop.
Today, SQL-on-Hadoop solutions are best judged not by their SQL engines per se, but by the collaborative scenarios they enable between Hadoop and the conventional data warehouse. Hadoop can be seen as a usurper, peer or peripheral of the data warehouse; the SQL-on-Hadoop engine you use determines which one (or more) of these three roles Hadoop can fulfill.
In Gigaom Research’s just-published Sector Roadmap: Hadoop/Data Warehouse Interoperability, analyst George Gilbert investigates the SQL-on-Hadoop market, evaluating six solutions, each along six “disruption vectors” or key trends that will affect the market and players over the next year: schema flexibility, data engine interoperability, pricing model, enterprise manageability, workload role optimization and query engine maturity.
As a backdrop to the evaluation of various SQL-on-Hadoop products along these vectors, Gilbert identifies three key analytics usage scenarios. The first is the core data warehouse, a familiar concept for many tech professionals: a relatively expensive appliance-based database platform serving up highly curated data, with the data’s structure optimized for the kinds of queries the business believes it needs to run.
The second is the so-called “data lake” (called an “enterprise data hub” by some vendors). Here, Hadoop serves as a collecting point for disparate data sources along the full spectrum of unstructured, semi-structured and fully structured data. Hadoop 2.0’s YARN resource manager facilitates the use of a variety of analysis engines to explore the lake’s data in an ad hoc fashion, and the data warehouse is relieved of this responsibility, free to serve the production queries for which it was designed and tuned.
The third scenario Gilbert identifies is one he calls the “adjunct data warehouse,” wherein various data warehouse tasks – including ETL and reporting – are offloaded from the conventional data warehouse to Hadoop. In fact, the adjunct data warehouse can and should be used to perform these functions on data first explored in the data lake.
In effect, the core data warehouse, adjunct data warehouse and data lake constitute a data processing hierarchy, with a corresponding hierarchy of cost. The hierarchical selection of platforms enables tasks of lower production value (though, arguably, higher business value) to be processed on cheaper platforms – yielding much higher efficiency for enterprise organizations.
How much cheaper? Gilbert notes that Hadoop costs at least an order of magnitude less, per terabyte of data, than appliance-based data warehouses. Because Hadoop enables the data lake and adjunct data warehouse scenarios, implementing them gives Hadoop a significant and demonstrable return on investment for enterprise customers.
An open question is whether and when Hadoop can and will serve in a core data warehouse capacity as well. And if it does, will that help the data warehouse vendors, the Hadoop distribution vendors or both? Indeed, this dynamic may be a predictor of future acquisitions of the distribution vendors by the legacy players — or perhaps even the reverse.