Hadoop needs a better front-end for business users

Whether you’re running it on premises or in the cloud, Hadoop leaves a lot to be desired in the ease-of-use department. The Hadoop offerings on the three major cloud platforms (Amazon’s Elastic MapReduce, or EMR; Microsoft’s Azure HDInsight; and Google Compute Engine’s Click-to-Deploy Hadoop) have their warts. And the three major on-premises distributions (Cloudera CDH, Hortonworks HDP and MapR) can be formidable adversaries to casual users as well.

The root of Hadoop’s ease-of-use problem, no matter where you run it, is that it’s essentially a command line tool. In the enterprise, people are used to graphical user interfaces (GUIs), be they in desktop applications or in the Web browser, that make things fairly simple to select, configure, and run. To the highly technical people who were Hadoop’s early adopters, the minimalism of the command prompt has a greater purity and “honesty” than a GUI. But, while there’s no reason to demonize people who feel this way, command line tools just won’t fly with business users in the enterprise.

Amazon and Google seem to aim their services at well-initiated Hadoop jocks. And with that premise in place, their offerings are fine. But that premise isn’t a good one, frankly, if mainstream adoption is what these companies are looking for. Microsoft’s HDInsight does at least allow for simplified access to Hadoop data via a Hive GUI. This allows for entry of Hive queries (complete with syntax coloring), monitoring of job progress, and viewing of the query output. If you want to do more than that, you get to visit the magical land of PowerShell, Microsoft’s system scripting environment. Woo. Hoo.
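
For the curious, here is a rough sketch of what scripting against HDInsight can look like once you leave the Hive GUI behind: submitting a query through the cluster’s WebHCat (Templeton) REST endpoint from Python rather than PowerShell. The cluster name, credentials and status directory are placeholders, and the endpoint details are my reading of the WebHCat API, not anything prescribed by this article.

```python
# Sketch: submit a Hive query to an HDInsight cluster via the WebHCat
# (Templeton) REST API, as an alternative to the PowerShell route.
# CLUSTER, USER, PASSWORD and the status directory are placeholders.
import requests

CLUSTER = "mycluster"   # hypothetical HDInsight cluster name
USER = "admin"          # cluster HTTP user
PASSWORD = "..."        # cluster HTTP password

webhcat = f"https://{CLUSTER}.azurehdinsight.net/templeton/v1"

# Submit the query; WebHCat returns a job id we can poll for status.
resp = requests.post(
    f"{webhcat}/hive",
    auth=(USER, PASSWORD),
    data={
        "user.name": USER,
        "execute": "SELECT COUNT(*) FROM hivesampletable;",
        "statusdir": "/example/output",  # storage path for job output/status
    },
)
resp.raise_for_status()
job_id = resp.json()["id"]

# Check on the job; the response includes the job's current state.
status = requests.get(
    f"{webhcat}/jobs/{job_id}",
    auth=(USER, PASSWORD),
    params={"user.name": USER},
).json()
print(job_id, status["status"]["state"])
```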

Adjusting the Hue

Amazon now allows you to install “Hue,” an open source, browser-based GUI for Hadoop, on an EMR cluster. Hue is probably the most evolved GUI offering out there for Hadoop. However, getting Hue working on EMR involves some security configuration to open up Web access to your cluster, and let’s just say it’s far from straightforward.
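
To give a sense of the moving parts, here is a hedged sketch using boto3, the AWS SDK for Python: it launches an EMR cluster with Hue as one of its applications and then opens the port Hue serves its UI on (8888) in the master node’s security group. The release label, security group ID, key pair and CIDR range are all placeholders, and current EMR releases handle some of this differently than the clusters described in this article, so treat it as a sketch rather than a recipe.

```python
# Sketch: launch an EMR cluster with Hue and open Web access to its UI.
# Names, IDs and the CIDR range below are placeholders, not recommendations.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

cluster = emr.run_job_flow(
    Name="hue-demo",
    ReleaseLabel="emr-5.36.0",  # assumed release label
    Applications=[{"Name": "Hadoop"}, {"Name": "Hive"}, {"Name": "Hue"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "my-keypair",  # placeholder key pair
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", cluster["JobFlowId"])

# Hue serves its UI on port 8888 on the master node. Opening that port in
# the master node's security group is the "security configuration" step;
# here it is restricted to a single office IP rather than the whole Internet.
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # master node's security group
    IpProtocol="tcp",
    FromPort=8888,
    ToPort=8888,
    CidrIp="203.0.113.10/32",
)
```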

Hue is available in the major on-premises Hadoop distributions as well. It provides front-ends for creating and running MapReduce jobs and for working with Hive, Pig, HBase, the Solr search interface and more. But Hue’s Pig interface is really just a script editor with a run button, and its HBase interface is just a table browser. Whether on premises or in the cloud, Hadoop needs much more advanced tooling to reach groundswell status. Developers need to be able to connect to Hadoop from their integrated development environments (IDEs), and cloud control panels need to make the logistics of working with Hadoop far simpler.
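
To make the IDE point concrete: the connectivity developers need boils down to an ordinary client connection to HiveServer2 that tooling can wrap. The sketch below uses the PyHive library against a hypothetical cluster host and table; it is illustrative only, not how Hue or any particular IDE actually implements its integration.

```python
# Illustrative only: a bare-bones HiveServer2 connection using PyHive
# (pip install "pyhive[hive]"). Host, port and table names are placeholders.
from pyhive import hive

conn = hive.connect(host="hadoop-master.example.com", port=10000,
                    username="analyst")
cur = conn.cursor()

# The same kind of query a Hue or IDE front-end would build and submit.
cur.execute("""
    SELECT country, COUNT(*) AS visits
    FROM   web_logs
    GROUP  BY country
    ORDER  BY visits DESC
    LIMIT  10
""")
for country, visits in cur.fetchall():
    print(country, visits)

cur.close()
conn.close()
```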

“A” for effort

To its credit, Microsoft is making some important moves here. The preview release of its Visual Studio 2015 IDE includes both Hive integration in its Server Explorer tool and, according to a post on the Azure blog, “tooling to create Hive queries and submit them as jobs.” It now also includes a browser for the Azure storage containers that back HDInsight clusters, and a special Hive project template.

Beyond that, the Web-based portal for HDInsight has a “Getting Started Gallery” section that provides task-based options for processing data and analyzing it in Excel. That is key: Hadoop will become truly mainstream only when people can use it as a means to an end. In order for that to happen, the tooling around it needs to be oriented to the workflows of business users (and data professionals), with the appropriate components invoked at the appropriate times. However, the HDInsight Getting Started Gallery items are more guided walkthroughs than they are truly automated tasks.

Hadoop at your service

Beyond the cloud providers’ own Hadoop products lie Hadoop-as-a-Service (yes, that would be HaaS) offerings from companies like Qubole and Altiscale, which let you work as if you had your own cluster at your beck and call. Qubole provides its own component-oriented GUI and API, and in turn deploys clusters for you on AWS, Google Compute Engine and, as of November 18th, Microsoft Azure as well. Altiscale runs its own cloud and provides its own workbench, which is a command line interface that users connect to via SSH.

HaaS offerings will likely grow in popularity. The incumbent cloud providers may need to adopt the as-a-service paradigm in their own Hadoop offerings, in terms of user interface and/or pricing models. Meanwhile, developer tools providers should integrate Hadoop into their IDEs as Microsoft has started to do. And everyone in the Hadoop industry needs to embed Hadoop components in their menu options and features, rather than just expose these components in a pass-through fashion.

Right now, Hadoop tools let you pick ingredients from the pantry, prepare them, and combine them on your own. In the end, with the right knowledge and skill, you do get a meal. But the tooling shouldn’t be just for chefs; it should be for diners too. Business users need to set up a project easily, configure its options, and then let the tools bring to bear the appropriate Hadoop components in the right combination. That’s how Hadoop will have a chance to make it as an enterprise main course.

Netflix shows off how it does Hadoop in the cloud

Netflix is at it again, this time showing off its homemade architecture for running Hadoop workloads in the Amazon Web Services cloud. It’s all about the flexibility of being able to run, manage and access multiple clusters while eliminating as many barriers as possible.

Pinterest, Flipboard and Yelp tell how to save big bucks in the cloud

At the AWS re:Invent conference, engineers from Pinterest, Flipboard and Yelp detailed some of the strategies their companies employ to keep costs low as computing demand increases. The keys are keeping an eagle eye on usage and using the right types of resources.

Why Amazon thinks big data was made for the cloud

According to Amazon Web Services Chief Data Scientist Matt Wood, big data and cloud computing are nearly a match made in heaven. Limitless, on-demand and inexpensive resources open up new worlds of possibility, and a central platform makes it easy for communities to share huge datasets.

Rackspace versus Amazon: The big data edition

Rackspace is busy building a Hadoop service, giving the company one more avenue to compete with cloud kingpin Amazon Web Services. However, the two services — along with several others on the market — highlight just how different seemingly similar cloud services can be.

All aboard the Hadoop money train

Market research firm IDC released the first legitimate market forecast for Hadoop on Monday, claiming the ecosystem around the de facto big data platform will sell almost $813 million worth of software by 2016. But Hadoop’s actual economic impact is likely much, much larger.

DataSift highlights more limitations in the public cloud

Social data analytics company DataSift left beta this week. One of only two companies licensed to repurpose the “fire hose” of data pouring out of Twitter, DataSift processes over 250 million fresh tweets every day. The company consumes data exclusively from the cloud and provides its services to customers via the cloud, which would seem to make the cloud the natural place to host its operations. But DataSift actually runs its core operation from a physical data center, highlighting areas in which mainstream cloud offerings remain lacking.

Founder Nick Halstead recognizes that public cloud infrastructure is of value both to DataSift as a company and to its customers. DataSift makes extensive use of Amazon’s public cloud for development and testing, and for processing large historical data sets. But for its customer-facing service, and for working in real time with the fire hose itself, Halstead remains convinced that the public cloud is not ready.

He stresses that parts of DataSift’s business require it to invest significantly in its own infrastructure. The biggest choke point Halstead sees is bandwidth. An average of 250 million tweets per day works out to more than 10 million per hour, with peak loads far higher. And each of those tweets carries more data than its 140 visible characters: over 40 elements of metadata about the tweet and the tweeter.

Traditional public cloud services such as those from Amazon and Rackspace, Halstead argues, simply cannot guarantee sufficient network bandwidth for his business to function. He goes so far as to suggest that “If you are trying to sell yourself as a cloud platform, you cannot put yourself on someone else’s cloud platform,” because you will be unable to control your customers’ experience. While companies with significant bandwidth requirements of their own, like Netflix, appear content to rely on the public cloud, there is a broader trend in the real-time social media space toward moving off the cloud and into dedicated data centers. Facebook, of course, has even designed its data centers and servers from scratch.

DataSift receives its data from Twitter in a traditional data center, connected to the Internet through a heavily customized set of dedicated Cisco routers and switches. Redundant network connectivity is meant to ensure that DataSift receives data quickly and reliably and gets it back to customers just as fast. DataSift’s customers are paying for real-time perspectives drawn from the entire Twitter fire hose, and they are not prepared to accept unpredictable delays or data loss because DataSift could not pull data over the network fast enough.

Twitter itself recognizes the issues Halstead raises and has invested heavily in data centers that can cope with the kind of load the site experiences. The company has struggled in the past to handle demand, and it allegedly had difficulty finding data center providers that could meet its needs. A constant flow of the relatively small chunks of data that make up tweets, Facebook wall posts and other social media activity challenges traditional data centers, which were designed for the less time-sensitive movement of larger digital files. New server designs from companies such as SeaMicro are rising to the challenge, but the network still lags behind.

For DataSift’s use case, Halstead may well be correct about the limitations of the public cloud. Because there are few cases in which so much data needs to be transferred so quickly, so often, public cloud providers are unlikely to invest directly in the necessary infrastructure. Bandwidth to the cloud’s data centers will increase, but so will the number of competing demands on it. Most applications can wait an extra second or two for a data transfer to complete, but for DataSift, every second of delay is potentially 3,000 lost tweets and a corresponding hit to the company’s credibility.
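
The back-of-envelope figures above are easy to check; here is a quick sketch, assuming the stated 250 million tweets per day arrive evenly (real traffic is peakier, as noted above):

```python
# Back-of-envelope check on the figures quoted above, assuming the
# 250 million tweets/day arrive evenly (real traffic peaks much higher).
tweets_per_day = 250_000_000

tweets_per_hour = tweets_per_day / 24
tweets_per_second = tweets_per_day / (24 * 60 * 60)

print(f"{tweets_per_hour:,.0f} tweets/hour")      # ~10.4 million per hour
print(f"{tweets_per_second:,.0f} tweets/second")  # ~2,900 per second, i.e.
                                                  # roughly 3,000 tweets lost
                                                  # for each second of delay
```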

Question of the week

Could a company like DataSift run on the public cloud today?

How Etsy handcrafted a big data strategy

E-commerce site Etsy has grown to 25 million unique visitors and 1.1 billion page views per month, and it’s generating the data volumes to match. Using tools such as Hadoop and Splunk, Etsy is turning terabytes of data per day into a better product.