Report: Hadoop in the enterprise: how to start small and grow to success

Hadoop in the enterprise: How to start small and grow to success by Paul Miller:
Hadoop-based solutions are increasingly encroaching on the traditional systems that still dominate the enterprise-IT landscape. While Hadoop has proved its worth, neither its wholesale replacement of existing systems nor the expensive and unconstrained build-out of a parallel and entirely separate IT stack makes good sense for most businesses. Instead, Hadoop should normally be deployed alongside existing IT and within existing processes, workflows, and governance structures. Rather than embark initially on a completely new project whose return on investment may prove difficult to quantify, enterprises will find value in identifying existing IT tasks that Hadoop can demonstrably perform better than the incumbent tools. ELT offload from the traditional enterprise data warehouse (EDW) represents one clear use case in which Hadoop typically delivers quick and measurable value, familiarizing enterprise-IT staff with the tools and their capabilities, persuading management of their demonstrable value, and laying the groundwork for more-ambitious projects to follow. This paper explores the pragmatic steps to be taken in introducing Hadoop into a traditional enterprise-IT environment, considers the best use cases for early experimentation and adoption, and discusses the ways Hadoop can then move toward mainstream deployments as part of a sustainable enterprise-IT stack.
To read the full report, click here.

Report: Understanding the Power of Hadoop as a Service

Understanding the Power of Hadoop as a Service by Paul Miller:
Across a wide range of industries from health care and financial services to manufacturing and retail, companies are realizing the value of analyzing data with Hadoop. With access to a Hadoop cluster, organizations are able to collect, analyze, and act on data at a scale and price point that earlier data-analysis solutions typically cannot match.
While some have the skill, the will, and the need to build, operate, and maintain large Hadoop clusters of their own, a growing number of Hadoop’s prospective users are choosing not to make sustained investments in developing an in-house capability. An almost bewildering range of hosted solutions is now available to them, all described in some quarters as Hadoop as a Service (HaaS). These range from relatively simple cloud-based Hadoop offerings by Infrastructure-as-a-Service (IaaS) cloud providers including Amazon, Microsoft, and Rackspace through to highly customized solutions managed on an ongoing basis by service providers like CSC and CenturyLink. Startups such as Altiscale are completely focused on running Hadoop for their customers. As they do not need to worry about the impact on other applications, they are able to optimize hardware, software, and processes in order to get the best performance from Hadoop.
In this report we explore a number of the ways in which Hadoop can be deployed, and we discuss the choices to be made in selecting the best approach for meeting different sets of requirements.
To read the full report, click here.

Microsoft embraces Python, Linux in new big data tools

Continuing its quest to make Microsoft Azure comfy for the non-Windows world, Microsoft just launched a preview of its Hadoop-based cloud tool (HDInsight) that runs on Linux. It’s also making its Azure ML machine learning service widely available now with new support for Python as well as the already-planned support for the popular R language. Microsoft bought Revolution Analytics, the company behind a commercial version of R, last month.

Azure HDInsight is thus “Microsoft’s first fully Linux-based service for big data,” Joseph Sirosh, Microsoft’s corporate VP of machine learning, said in an interview. Microsoft says 20 percent of all VMs running on Azure run Linux.

Asked if he sees any open-source-oriented developers still wary of using Microsoft’s cloud, Sirosh said the perception of Microsoft as a Windows-only company is fading. “There is a new breed of developers [who want] to leverage features … whether they are Linux- or Windows-based is becoming less important,” he said. With cloud services, “you really don’t have to know a lot about deep inner details to use these services.”

Azure ML’s embrace of Python also shows just how popular that language has become and that Microsoft Azure is building on its promise of language agnosticism. “Python has become the number one language of choice for developers. We can now claim to be the most comprehensive analytics service — no other product lets you integrate SQL, R and Python into one project,” Sirosh said.
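
By way of illustration, the Python story is already concrete: Azure ML Studio workspaces can be reached from plain Python code through the open-source azureml client library. The snippet below is a minimal sketch of that workflow; the workspace ID, token, and dataset name are placeholders, not real values.

    # A minimal sketch of reading an Azure ML Studio dataset from
    # plain Python via the open-source azureml client library
    # (pip install azureml). The workspace ID, authorization token,
    # and dataset name below are placeholders, not real credentials.
    from azureml import Workspace

    ws = Workspace(workspace_id='your-workspace-id',
                   authorization_token='your-auth-token')

    # Datasets appear as a dict-like collection on the workspace
    # and can be pulled straight into a pandas DataFrame.
    frame = ws.datasets['your-dataset-name'].to_dataframe()
    print(frame.head())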

Microsoft CEO Satya Nadella

Microsoft is also making Storm, the open-source stream analytics tool, available for HDInsight with support for both .NET and Java. The company already offered Azure Stream Analytics and will continue to sell, support and upgrade that as well. Storm is another option, Sirosh said.

In the massive public cloud infrastructure arena, Microsoft must contend with Amazon Web Services and Google Cloud Platform, both of which are targeting developers with fancy analytics and other services. I agree with Sirosh that Microsoft has done a good job of embracing open-source frameworks and languages in Azure. But the perception, especially among young startups, of Microsoft as a Windows-and-Office-first monolith dies hard.

I’ll be sure to ask Sirosh more about how Microsoft Azure can win over startups as well as big business accounts when we’re on stage next month at Structure Data.

This story was updated at 10:05 a.m. PST to reflect Microsoft’s assertion that 20 percent of all VMs on Azure run Linux.

Hadoop needs a better front-end for business users

Whether you’re running it on premises or in the cloud, Hadoop leaves a lot to be desired in the ease-of-use department. The Hadoop offerings on the three major cloud platforms (Amazon’s Elastic MapReduce (EMR), Microsoft’s Azure HDInsight, and Google Compute Engine’s Click-to-Deploy Hadoop) have their warts. And the three major on-premises distributions (Cloudera CDH, Hortonworks HDP, and MapR) can be formidable adversaries to casual users as well.


The root of Hadoop’s ease-of-use problem, no matter where you run it, is that it’s essentially a command line tool. In the enterprise, people are used to graphical user interfaces (GUIs), whether in desktop applications or in the Web browser, that make it fairly simple to select, configure, and run things. To the highly technical people who were Hadoop’s early adopters, the minimalism of the command prompt has a greater purity and “honesty” than a GUI. But while there’s no reason to demonize people who feel this way, command line tools just won’t fly with business users in the enterprise.
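
To make that concrete, consider what “hello world” looks like today: even a trivial word count means writing mapper and reducer scripts and launching them from the shell. A minimal Hadoop Streaming mapper might look like the sketch below (the jar path and HDFS paths are illustrative and vary by distribution).

    #!/usr/bin/env python
    # mapper.py -- a minimal Hadoop Streaming mapper that emits
    # (word, 1) pairs as tab-separated lines on stdout.
    # Launched from the command line, along the lines of:
    #   hadoop jar hadoop-streaming.jar \
    #     -input /data/docs -output /data/counts \
    #     -mapper mapper.py -reducer reducer.py
    # (the jar path and HDFS paths here are illustrative).
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print('%s\t%d' % (word, 1))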

Amazon and Google seem to aim their services at well-initiated Hadoop jocks. And with that premise in place, their offerings are fine. But that premise isn’t a good one, frankly, if mainstream adoption is what these companies are looking for. Microsoft’s HDInsight does at least allow for simplified access to Hadoop data via a Hive GUI. This allows for entry of Hive queries (complete with syntax coloring), monitoring of job progress, and viewing of the query output. If you want to do more than that, you get to visit the magical land of PowerShell, Microsoft’s system scripting environment. Woo. Hoo.
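
For comparison, programmatic access to Hive from outside any GUI is not hard for a developer, but it is exactly the sort of thing a business user will never do. Here is a sketch using the open-source PyHive client; that choice is an assumption on my part, since Microsoft’s documented routes for HDInsight are the Hive GUI and PowerShell, and the host and port below are placeholders.

    # A sketch of running a Hive query programmatically with the
    # open-source PyHive client (pip install pyhive). The host and
    # port are placeholders; HDInsight fronts Hive with its own
    # gateway, so real connection details will differ.
    from pyhive import hive

    conn = hive.connect(host='hive-gateway.example.com', port=10000)
    cursor = conn.cursor()
    cursor.execute('SELECT category, COUNT(*) AS n '
                   'FROM events GROUP BY category')
    for category, n in cursor.fetchall():
        print(category, n)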

Adjusting the Hue

Amazon now allows you to install “Hue,” an open source, browser-based GUI for Hadoop, on an EMR Hadoop cluster. Hue is probably the most evolved GUI offering out there on Hadoop. However, getting Hue working on EMR involves some security configuration in order to open up Web access to your cluster. And let’s just say that it’s far from straightforward.
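
To give a flavor of what “some security configuration” means in practice, here is a hedged sketch of one piece of it: opening Hue’s default port (8888) on the cluster’s master security group with boto3. The group ID and CIDR are placeholders, and Amazon’s documentation also describes tunneling over SSH rather than opening the port at all.

    # A sketch of opening Hue's default port (8888) on an EMR
    # master node's security group using boto3 (pip install boto3).
    # The group ID and CIDR below are placeholders; restrict the
    # CIDR to your own network rather than 0.0.0.0/0.
    import boto3

    ec2 = boto3.client('ec2', region_name='us-east-1')
    ec2.authorize_security_group_ingress(
        GroupId='sg-0123456789abcdef0',   # the ElasticMapReduce-master group
        IpProtocol='tcp',
        FromPort=8888,
        ToPort=8888,
        CidrIp='203.0.113.0/24',
    )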

Hue is available in the major on-premises Hadoop distributions as well. It provides front-ends for creating and running MapReduce jobs, working with Hive, Pig, HBase, the Solr search interface and more. But Hue’s Pig interface is really just a script editor with a run button, and its HBase interface is just a table browser. Whether on-site or in the cloud, Hadoop needs much more advanced tooling in order to get to groundswell status. Developers need to be able to connect to Hadoop from their integrated development environments (IDEs), and cloud control panels need to make the logistics of working with Hadoop far simpler.

“A” for effort

To its credit, Microsoft is making some important moves here. The preview release of its Visual Studio 2015 IDE includes both Hive integration in its Server Explorer tool and “tooling to create Hive queries and submit them as jobs,” according to a post on the Azure blog. It now also includes a browser for the Azure storage containers that back HDInsight clusters, and a special Hive project template.

Beyond that, the Web-based portal for HDInsight has a “Getting Started Gallery” section that provides task-based options for processing data and analyzing it in Excel. That is key: Hadoop will become truly mainstream only when people can use it as a means to an end. In order for that to happen, the tooling around it needs to be oriented to the workflows of business users (and data professionals), with the appropriate components invoked at the appropriate times. However, the HDInsight Getting Started Gallery items are more guided walkthroughs than they are truly automated tasks.

Hadoop at your service

Beyond the cloud providers’ own Hadoop products lie Hadoop-as-a-Service (yes, that would be HaaS) offerings from companies like Qubole and Altiscale, which let you work as if you had your own cluster at your beck and call. Qubole provides its own component-oriented GUI and API, and in turn deploys clusters for you on AWS, Google Compute Engine and, as of November 18th, Microsoft Azure as well. Altiscale runs its own cloud and provides its own workbench, which is a command line interface that users connect to via SSH.
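
Qubole’s API, for instance, comes with an open-source Python SDK, so a hosted cluster can be driven without ever touching the underlying machines. A minimal sketch follows; the API token and query are placeholders.

    # A minimal sketch of submitting a Hive query to a Qubole-managed
    # cluster via the open-source qds-sdk (pip install qds-sdk).
    # The API token and query below are placeholders.
    from qds_sdk.qubole import Qubole
    from qds_sdk.commands import HiveCommand

    Qubole.configure(api_token='your-api-token')

    cmd = HiveCommand.create(query='SHOW TABLES')
    print(cmd.id, cmd.status)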

HaaS offerings will likely grow in popularity. The incumbent cloud providers may need to adopt the as-a-service paradigm in their own Hadoop offerings, in terms of user interface and/or pricing models. Meanwhile, developer tools providers should integrate Hadoop into their IDEs as Microsoft has started to do. And everyone in the Hadoop industry needs to embed Hadoop components in their menu options and features, rather than just expose these components in a pass-through fashion.

Right now, Hadoop tools let you pick ingredients from the pantry, prepare them, and combine them on your own. In the end, with the right knowledge and skill, you do get a meal. But the tooling shouldn’t be just for chefs; it should be for diners too. Business users need to set up a project easily, configure its options, and then let the tools bring to bear the appropriate Hadoop components in the right combination. That’s how Hadoop will have a chance to make it as an enterprise main course.
