Report: Apache Hadoop: Is one cluster enough?

Our library of 1,700 research reports is available only to our subscribers. We occasionally release selected reports so that our larger audience can benefit from them. This is one such report. If you would like access to our entire library, please subscribe here. Subscribers will have access to our 2017 editorial calendar, archived reports, and video coverage from our 2016 and 2017 events.
Apache Hadoop: Is one cluster enough? by Paul Miller:
The open-source Apache Hadoop project continues its rapid evolution and is now capable of far more than its traditional use case of running a single MapReduce job on a single large volume of data. Projects like Apache YARN expand the types of workloads for which Hadoop is a viable and compelling solution, leading practitioners to think more creatively about the ways data is stored, processed, and made available for analysis.
Enthusiasm is growing in some quarters for the concept of a “data lake” — a single repository of data stored in the Hadoop Distributed File System (HDFS) and accessed by a number of applications for different purposes. Most of the prominent Hadoop vendors provide persuasive examples of this model at work but, unsurprisingly, the complexities of real-world deployment do not always neatly fit the idealized model of a single (huge) cluster working with a single (huge) data lake.
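As a rough sketch of that idealized model, the PySpark snippet below (with hypothetical HDFS paths and column names) shows two different workloads reading the same files in a shared lake rather than each maintaining its own copy.

```python
# Minimal sketch of the data-lake pattern: two workloads share one HDFS dataset.
# The HDFS paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("shared-data-lake-sketch").getOrCreate()

# Both workloads read the same files in place; neither makes a private copy.
events = spark.read.parquet("hdfs:///lake/raw/web_events")

# Workload 1: ad hoc exploration by an analyst.
events.groupBy("country").count().show()

# Workload 2: a scheduled aggregation feeding a downstream report.
daily = (events
         .groupBy(F.to_date("event_time").alias("day"))
         .agg(F.countDistinct("user_id").alias("daily_users")))
daily.write.mode("overwrite").parquet("hdfs:///lake/curated/daily_users")

spark.stop()
```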
In this report we discuss some of the circumstances in which more complex requirements may exist, and explore a set of solutions emerging to address them.
To read the full report, click here.

Report: Extending Hadoop Towards the Data Lake

Extending Hadoop Towards the Data Lake by Paul Miller:
The data lake has become an increasingly important part of Hadoop’s appeal. Referred to in some contexts as an “enterprise data hub,” it now garners interest not only from Hadoop’s existing adopters but also from a far broader set of potential beneficiaries. It is the vision of a single, comprehensive pool of data, managed by Hadoop and accessed as required by diverse applications such as Spark, Storm, and Hive, that offers opportunities to reduce duplication of data, increase efficiency, and create an environment in which data from very different sources can meaningfully be analyzed together.
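One way that shared access is often wired up, sketched below in PySpark with hypothetical table, path, and column names (and assuming a Hive-enabled Spark build), is to register the lake’s existing files as an external table so that SQL users and programmatic jobs operate on a single copy of the data.

```python
# Sketch: expose one set of lake files both to SQL tools and to DataFrame code.
# Table name, columns, and path are hypothetical; assumes Spark built with Hive support.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("lake-shared-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Register the existing lake files as an external table; no data is moved or duplicated.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake_clicks (
        user_id STRING,
        url     STRING,
        ts      TIMESTAMP
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///lake/raw/clicks'
""")

# A SQL user (via Hive or Spark SQL) and a programmatic Spark job now see the same data.
spark.sql("SELECT url, COUNT(*) AS hits FROM lake_clicks GROUP BY url").show()
clicks_df = spark.table("lake_clicks")  # the same table, via the DataFrame API

spark.stop()
```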
Fully embracing the opportunity promised by a comprehensive data lake requires a shift in attitude and careful integration with the existing systems and workflows that Hadoop often augments rather than replaces. Existing enterprise concerns about governance and security will certainly not disappear, so suitable workflows must be developed to safeguard data while making it available for newly feasible forms of analysis.
Early adopters in a range of industries are already finding ways to exploit the potential of their data lakes, operationalizing internal analytic processes and integrating rich real-time analyses with more established batch processing tasks. They are integrating Hadoop into existing organizational workflows and addressing challenges around the completeness, cleanliness, validity, and protection of their data.
In this report, we explore a number of the key issues frequently identified as significant in these successful implementations of a data lake.
To read the full report, click here.

Hadoop’s New Role: Adjunct Data Warehouse

There was a time, a little over two years ago, when SQL-on-Hadoop was about cracking open access to Hadoop data for those with SQL skillsets and eliminating the exclusive access that Hadoop/MapReduce specialists had to the data. Yes, some architectural details – like whether the SQL engine was hitting the data nodes in the Hadoop cluster directly – were important too. But, for the most part, solutions in the space were neatly summed up by the name: SQL, on Hadoop.
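A minimal illustration of that original promise, with Spark SQL standing in for any of the SQL-on-Hadoop engines and a hypothetical HDFS path: a plain SQL query runs directly against files already sitting in the cluster, and the analyst writes no MapReduce code at all.

```python
# Sketch: plain SQL over files already in HDFS, no MapReduce code required.
# Spark SQL stands in for any SQL-on-Hadoop engine; the path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-hadoop-sketch").getOrCreate()

top_pages = spark.sql("""
    SELECT url, COUNT(*) AS hits
    FROM parquet.`hdfs:///lake/raw/web_events`
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10
""")
top_pages.show()

spark.stop()
```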

Today, SQL-on-Hadoop solutions are best judged not by their SQL engines per se, but instead by the collaborative scenarios they enable between Hadoop and the conventional data warehouse. Hadoop can be seen as a usurper, peer or peripheral of the data warehouse; the SQL-on-Hadoop engine you use determines which one (or more) of these three roles Hadoop can be implemented to fulfill.

In Gigaom Research’s just-published Sector Roadmap: Hadoop/Data Warehouse Interoperability, analyst George Gilbert investigates the SQL-on-Hadoop market, evaluating six solutions against six “disruption vectors,” the key trends that will affect the market and its players over the next year: schema flexibility, data engine interoperability, pricing model, enterprise manageability, workload role optimization, and query engine maturity.

Scenarios

As a backdrop to the evaluation of various SQL-on-Hadoop products along these vectors, Gilbert identifies three key analytics usage scenarios. The first is the core data warehouse, a familiar concept for many tech professionals: a relatively expensive, appliance-based database platform serving up highly curated data, with the data’s structure optimized for the kinds of queries the business believes it needs to run.

The second is the so-called “data lake” (called an “enterprise data hub” by some vendors). Here, Hadoop serves as a collecting point for disparate data sources along the full spectrum of unstructured, semi-structured, and fully structured data. Hadoop 2.0’s YARN resource manager facilitates the use of a variety of analysis engines to explore the lake’s data in an ad hoc fashion, and the data warehouse is relieved of this responsibility, free to serve the production queries for which it was designed and tuned.
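As a rough sketch of how one engine among several shares a YARN-managed cluster, the snippet below submits an exploratory Spark job to YARN. It assumes a node configured with the cluster’s Hadoop settings; the queue name, resource figures, and paths are hypothetical.

```python
# Sketch: an ad hoc Spark job scheduled by YARN alongside other engines' workloads.
# Assumes HADOOP_CONF_DIR points at the cluster's configuration; the queue name,
# executor count, and HDFS path are hypothetical.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ad-hoc-lake-exploration")
         .master("yarn")                           # let YARN allocate the containers
         .config("spark.yarn.queue", "adhoc")      # share the cluster with other workloads
         .config("spark.executor.instances", "4")
         .getOrCreate())

events = spark.read.parquet("hdfs:///lake/raw/web_events")
events.filter("country = 'DE'").groupBy("url").count().show()

spark.stop()
```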

The third scenario Gilbert identifies is one he calls the “adjunct data warehouse,” wherein various data warehouse tasks – including ETL and reporting – are offloaded from the conventional data warehouse to Hadoop. In fact, the adjunct data warehouse can and should be used to perform these functions on data first explored in the data lake.
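A hedged sketch of such an offloaded ETL step, in PySpark with hypothetical paths, columns, and warehouse connection details: raw lake data is cleaned and aggregated on Hadoop, and only a small curated extract is handed to the core warehouse.

```python
# Sketch of offloading an ETL step from the warehouse to Hadoop: clean and
# aggregate raw lake data in Spark, then hand only a small extract to the
# core data warehouse. Paths, columns, and the JDBC target are hypothetical;
# the JDBC write also assumes the appropriate driver is on the classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("adjunct-dw-etl-sketch").getOrCreate()

raw = spark.read.json("hdfs:///lake/raw/orders")       # semi-structured source data

curated = (raw
           .filter(F.col("status") == "completed")     # basic cleansing
           .withColumn("order_date", F.to_date("created_at"))
           .groupBy("order_date", "region")
           .agg(F.sum("amount").alias("revenue")))

# Keep the curated result in the lake for Hadoop-side reporting...
curated.write.mode("overwrite").parquet("hdfs:///lake/curated/daily_revenue")

# ...and push only the aggregated extract to the core warehouse (credentials omitted).
(curated.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://warehouse.example.com/dw")
    .option("dbtable", "daily_revenue")
    .mode("overwrite")
    .save())

spark.stop()
```

The point of the pattern is that the heavy transformation runs on the cheaper Hadoop tier, while the warehouse receives only the compact, query-ready result.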


In effect, the core data warehouse, adjunct data warehouse and data lake constitute a data processing hierarchy, with a corresponding hierarchy of cost. The hierarchical selection of platforms enables tasks of lower production value (though, arguably, higher business value) to be processed on cheaper platforms – yielding much higher efficiency for enterprise organizations.

Enterprise ROI

How much cheaper? Gilbert notes that Hadoop costs at least an order of magnitude less, per terabyte of data, than appliance-based data warehouses. Because Hadoop is what makes the data lake and adjunct data warehouse scenarios possible, implementing them gives Hadoop a significant and demonstrable return on investment for enterprise customers.

An open question is whether and when Hadoop can and will serve in a core data warehouse capacity as well. And if it does, will that help the data warehouse vendors, the Hadoop distribution vendors or both? Indeed, this dynamic may be a predictor of future acquisitions of the distribution vendors by the legacy players — or perhaps even the reverse.