Report: Understanding the Power of Hadoop as a Service

Understanding the Power of Hadoop as a Service by Paul Miller:
Across a wide range of industries, from health care and financial services to manufacturing and retail, companies are realizing the value of analyzing data with Hadoop. With access to a Hadoop cluster, organizations are able to collect, analyze, and act on data at a scale and price point that earlier data-analysis solutions typically could not match.
While some have the skill, the will, and the need to build, operate, and maintain large Hadoop clusters of their own, a growing number of Hadoop’s prospective users are choosing not to make sustained investments in developing an in-house capability. An almost bewildering range of hosted solutions is now available to them, all described in some quarters as Hadoop as a Service (HaaS). These range from relatively simple cloud-based Hadoop offerings by Infrastructure-as-a-Service (IaaS) cloud providers including Amazon, Microsoft, and Rackspace through to highly customized solutions managed on an ongoing basis by service providers like CSC and CenturyLink. Startups such as Altiscale are completely focused on running Hadoop for their customers. As they do not need to worry about the impact on other applications, they are able to optimize hardware, software, and processes in order to get the best performance from Hadoop.
In this report we explore a number of the ways in which Hadoop can be deployed, and we discuss the choices to be made in selecting the best approach for meeting different sets of requirements.

Hadoop needs a better front-end for business users

Whether you’re running it on premises or in the cloud, Hadoop leaves a lot to be desired in the ease-of-use department. The Hadoop offerings on the three major cloud platforms (Amazon’s Elastic MapReduce, or EMR; Microsoft’s Azure HDInsight; and Google Compute Engine’s Click-to-Deploy Hadoop) have their warts. And the three major on-premises distributions (Cloudera CDH, Hortonworks HDP and MapR) can be formidable adversaries to casual users as well.


The root of Hadoop’s ease-of-use problem, no matter where you run it, is that it’s essentially a command line tool. In the enterprise, people are used to graphical user interfaces (GUIs), be they in desktop applications or in the Web browser, that make things fairly simple to select, configure, and run. To the highly technical people who were Hadoop’s early adopters, the minimalism of the command prompt has a greater purity and “honesty” than a GUI. But, while there’s no reason to demonize people who feel this way, command line tools just won’t fly with business users in the enterprise.
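
To make the command-line point concrete, here is a minimal sketch of what submitting a job looks like. It assumes the stock `hadoop jar` launcher and the word-count example that ships with Hadoop; the jar path and HDFS directories are hypothetical placeholders, and the function only builds the command rather than running it.

```python
import shlex


def build_wordcount_command(examples_jar, input_dir, output_dir):
    """Return the argv list for Hadoop's bundled word-count example.

    Assumption: the `hadoop` launcher is on PATH and `examples_jar` points at
    the MapReduce examples jar from your distribution (its location varies).
    """
    return ["hadoop", "jar", examples_jar, "wordcount", input_dir, output_dir]


if __name__ == "__main__":
    argv = build_wordcount_command(
        "hadoop-mapreduce-examples.jar",  # hypothetical path
        "/data/in",                       # hypothetical HDFS input directory
        "/data/out",                      # hypothetical HDFS output directory
    )
    # Print the command as a user would have to type it at the prompt.
    print(shlex.join(argv))
```

Even this simplest case asks the user to know a jar location, a job name, and two HDFS paths, which is the kind of detail a GUI would normally hide.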

Amazon and Google seem to aim their services at well-initiated Hadoop jocks. And with that premise in place, their offerings are fine. But that premise isn’t a good one, frankly, if mainstream adoption is what these companies are looking for. Microsoft’s HDInsight does at least allow for simplified access to Hadoop data via a Hive GUI. This allows for entry of Hive queries (complete with syntax coloring), monitoring of job progress, and viewing of the query output. If you want to do more than that, you get to visit the magical land of PowerShell, Microsoft’s system scripting environment. Woo. Hoo.

Adjusting the Hue

Amazon now allows you to install “Hue,” an open source, browser-based GUI for Hadoop, on an EMR Hadoop cluster. Hue is probably the most evolved GUI offering out there on Hadoop. However, getting Hue working on EMR involves some security configuration in order to open up Web access to your cluster. And let’s just say that it’s far from straightforward.
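
One common way to reach a cluster's web interface without exposing it publicly is an SSH tunnel; whether that is the route you need depends on your setup. The sketch below only builds the `ssh` command. It assumes Hue listens on port 8888 on the master node and that the cluster accepts logins as the `hadoop` user; the host name and key file are hypothetical placeholders.

```python
def build_hue_tunnel_command(master_dns, key_file, local_port=8888, hue_port=8888):
    """Return an ssh argv that forwards a local port to Hue on the master node.

    Assumptions: Hue listens on `hue_port` on the master node, and the cluster
    accepts SSH logins as the `hadoop` user; verify both for your cluster.
    """
    forward = f"{local_port}:localhost:{hue_port}"
    # -N: run no remote command, -L: forward a local port, -i: private key file
    return ["ssh", "-i", key_file, "-N", "-L", forward, f"hadoop@{master_dns}"]


if __name__ == "__main__":
    cmd = build_hue_tunnel_command("master.example.com", "my-key.pem")  # hypothetical values
    print(" ".join(cmd))
```

With a tunnel like this open, Hue would be reachable in a local browser at `http://localhost:8888`, provided the cluster's security settings allow inbound SSH from your machine.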

Hue is available in the major on-premises Hadoop distributions as well. It provides front-ends for creating and running MapReduce jobs, working with Hive, Pig, HBase, the Solr search interface and more. But Hue’s Pig interface is really just a script editor with a run button, and its HBase interface is just a table browser. Whether on-site or in the cloud, Hadoop needs much more advanced tooling in order to get to groundswell status. Developers need to be able to connect to Hadoop from their integrated development environments (IDEs), and cloud control panels need to make the logistics of working with Hadoop far simpler.

“A” for effort

To its credit, Microsoft is making some important moves here. The preview release of its Visual Studio 2015 IDE includes both Hive integration in its Server Explorer tool and “tooling to create Hive queries and submit them as jobs,” according to a post on the Azure blog. It even now includes a browser for the Azure storage containers that back HDInsight clusters, and a special Hive project template.

Beyond that, the Web-based portal for HDInsight has a “Getting Started Gallery” section that provides task-based options for processing data and analyzing it in Excel. That is key: Hadoop will become truly mainstream only when people can use it as a means to an end. In order for that to happen, the tooling around it needs to be oriented to the workflows of business users (and data professionals), with the appropriate components invoked at the appropriate times. However, the HDInsight Getting Started Gallery items are more guided walkthroughs than they are truly automated tasks.

Hadoop at your service

Beyond the cloud providers’ own Hadoop products lie Hadoop-as-a-Service (yes, that would be HaaS) offerings from companies like Qubole and Altiscale, which let you work as if you had your own cluster at your beck and call. Qubole provides its own component-oriented GUI and API, and in turn deploys clusters for you on AWS, Google Compute Engine and, as of November 18th, Microsoft Azure as well. Altiscale runs its own cloud and provides its own workbench, which is a command line interface that users connect to via SSH.

HaaS offerings will likely grow in popularity. The incumbent cloud providers may need to adopt the as-a-service paradigm in their own Hadoop offerings, in terms of user interface and/or pricing models. Meanwhile, developer tools providers should integrate Hadoop into their IDEs as Microsoft has started to do. And everyone in the Hadoop industry needs to embed Hadoop components in their menu options and features, rather than just expose these components in a pass-through fashion.

Right now, Hadoop tools let you pick ingredients from the pantry, prepare them, and combine them on your own. In the end, with the right knowledge and skill, you do get a meal. But the tooling shouldn’t be just for chefs; it should be for diners too. Business users need to set up a project easily, configure its options, and then let the tools bring to bear the appropriate Hadoop components in the right combination. That’s how Hadoop will have a chance to make it as an enterprise main course.
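
As a toy illustration of that idea, the sketch below lets a user name business-level tasks and leaves the choice of components to the tool. The task names and the mapping are entirely hypothetical; only the component names (Hive, Pig, HBase, Solr) come from the discussion above, and a real product would need far richer logic.

```python
# Hypothetical mapping from business-level tasks to Hadoop components;
# a real product would define its own tasks and selection rules.
TASK_COMPONENTS = {
    "sql_report": ["Hive"],
    "data_prep_script": ["Pig"],
    "record_lookup": ["HBase"],
    "text_search": ["Solr"],
}


def plan_components(tasks):
    """Return the components needed for the requested tasks, without duplicates,
    in the order they are first required."""
    plan = []
    for task in tasks:
        if task not in TASK_COMPONENTS:
            raise ValueError(f"no component mapping for task: {task}")
        for component in TASK_COMPONENTS[task]:
            if component not in plan:
                plan.append(component)
    return plan


if __name__ == "__main__":
    print(plan_components(["data_prep_script", "sql_report"]))  # prints ['Pig', 'Hive']
```

The user states the goal and the tool picks the ingredients, which is the diner's experience rather than the chef's.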