Report: A checklist for stacking up IaaS providers

Our library of 1700 research reports is available only to our subscribers. We occasionally release ones for our larger audience to benefit from. This is one such report. If you would like access to our entire library, please subscribe here. Subscribers will have access to our 2017 editorial calendar, archived reports and video coverage from our 2016 and 2017 events.
A checklist for stacking up IaaS providers by Janakiram MSV:
Infrastructure as a Service (IaaS) is the fastest growing segment of the cloud services market. According to Gigaom Research, the current worldwide cloud market is growing by 126.5 percent year over year, driven by 119 percent growth in SaaS and 122 percent growth in IaaS.
Irrespective of the workload type, the key building blocks of infrastructure are compute, storage, networking, and database. This report focuses on identifying a set of common features for each of the building blocks and comparing them with the equivalent services offered by key players of the IaaS industry. The scope of this report is limited to public clouds and it doesn’t compare private cloud providers offering IaaS.
This report doesn’t attempt to compare the price and performance of the cloud service providers. The IaaS players are dropping their prices so frequently that the captured snapshot would be obsolete by the time this report is published. Since each workload, use case, and scenario differs from customer to customer, providing the technical benchmarking information is not practical. It’s best to pilot your app on several clouds and see what actual performance you get.
To read the full report, click here.

Report: The importance of benchmarking clouds

The importance of benchmarking clouds by Paul Miller:
For most businesses, the debate about whether to embrace the cloud is over. It is now a question of tactics — how, when, and what kind? Cloud computing increasingly forms an integral part of enterprise IT strategy, but the wide variation in enterprise requirements ensures plenty of scope for very different cloud services to coexist.
Today’s enterprise cloud deployments will typically be hybridized, with applications and workloads running in a mix of different cloud environments. The rationale for those deployment decisions is based on a number of different considerations, including geography, certification, service level agreements, price, and performance.
To read the full report, click here.

A massive database now translates news in 65 languages in real time

I have written quite a bit about GDELT (the Global Database of Events, Languages and Tone) over the past year, because I think it’s a great example of the type of ambitious project only made possible by the advent of cloud computing and big data systems. In a nutshell, it’s a database of more than 250 million socioeconomic and geopolitical events and their metadata dating back to 1979, all stored (now) in Google’s cloud and available to analyze for free via Google BigQuery or custom-built applications.

On Thursday, version 2.0 of GDELT was unveiled, complete with a slew of new features — faster updates, sentiment analysis, images, a more-expansive knowledge graph and, most importantly, real-time translation across 65 different languages. That’s 98.4 percent of the non-English content GDELT monitors. Because you can’t really have a global database, or expect to get a full picture of what’s happening around the world, if you’re limited to English language sources or exceedingly long turnaround times for translated content.

For a quick recap of GDELT, you can read the story linked to above, as well as our coverage of project creator Kalev Leetaru’s analyses of the Arab Spring and Ukrainian crisis and the Ebola outbreak. For a deeper understanding of the project and its creator (who also helped measure the “Twitter heartbeat” and uploaded millions of images from the Internet Archive’s digital book collection to Flickr), check out our Structure Show podcast interview with Leetaru from August. He’ll also be presenting on GDELT and his future plans at our Structure Data conference next month.


A time-series analysis of the Arab Spring compared with similar periods since 1979.

Leetaru explains GDELT 2.0’s translation system in some detail in a blog post, but even at a high level the methods it uses to achieve near real-time speed are interesting. It works sort of like buffering does on Netflix:

“GDELT’s translation system must be able to provide at least basic translation of 100% of monitored material every 15 minutes, coping with sudden massive surges in volume without ever requiring more time than the 15 minute window. This ‘streaming’ translation is very similar to streaming compression, in which the system must dynamically modulate the quality of its output to meet time constraints: during periods with relatively little content, maximal translation accuracy can be achieved, with accuracy linearly degraded as needed to cope with increases in volume in order to ensure that translation always finishes within the 15 minute window. In this way GDELT operates more similarly to an interpreter than a translator. This has not been a focal point of current machine translation research and required a highly iterative processing pipeline that breaks the translation process into quality stages and prioritizes the highest quality material, accepting that lower-quality material may have a lower-quality translation to stay within the available time window.”
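
To make the idea of linearly degrading accuracy concrete, here is a minimal sketch of how a system might pick a quality level that still fits the 15-minute window. It is not GDELT’s actual code: the `Batch` fields, the per-document costs and the `quality_level` function are all hypothetical, and the sketch assumes that translation cost rises linearly between a “basic” and a “full” quality setting.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 15 * 60  # the 15-minute window described in the quote


@dataclass
class Batch:
    documents: int                # documents waiting in this window
    seconds_per_doc_full: float   # hypothetical cost at maximum quality
    seconds_per_doc_basic: float  # hypothetical cost of a basic translation


def quality_level(batch: Batch, window: float = WINDOW_SECONDS) -> float:
    """Return a quality setting in [0, 1]: 1.0 is full quality, 0.0 is basic.

    Assumes (hypothetically) that total cost grows linearly with quality:
        cost(q) = basic_cost + q * (full_cost - basic_cost)
    and solves cost(q) = window for q, clamped to [0, 1].
    """
    full_cost = batch.documents * batch.seconds_per_doc_full
    basic_cost = batch.documents * batch.seconds_per_doc_basic
    if full_cost <= window:
        return 1.0  # quiet period: everything fits at maximum quality
    if basic_cost >= window:
        return 0.0  # surge: even basic translation uses the whole window
    return (window - basic_cost) / (full_cost - basic_cost)


# A quiet period, a moderate surge and a massive surge.
quiet = quality_level(Batch(100, 2.0, 1.0))
surge = quality_level(Batch(600, 2.0, 1.0))
flood = quality_level(Batch(5000, 2.0, 1.0))
```

In this toy model a quiet period gets full quality, while a surge of 600 documents is translated at half quality so that the batch still finishes inside the window.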

In addition, Leetaru wrote:

“Machine translation systems . . . do not ordinarily have knowledge of the user or use case their translation is intended for and thus can only produce a single ‘best’ translation that is a reasonable approximation of the source material for general use. . . . Using the equivalent of a dynamic language model, GDELT essentially iterates over all possible translations of a given sentence, weighting them both by traditional linguistic fidelity scores and by a secondary set of scores that evaluate how well each possible translation aligns with the specific language needed by GDELT’s Event and GKG systems.”
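
The two-score weighting Leetaru describes can be illustrated with a small re-ranking sketch. This is an assumption-laden toy, not GDELT’s implementation: the `Candidate` type, the score values and the `domain_weight` parameter are hypothetical, and a real system would score far more candidates than the two shown here.

```python
from typing import NamedTuple, Sequence


class Candidate(NamedTuple):
    text: str
    fidelity: float    # hypothetical linguistic fidelity score, 0..1
    domain_fit: float  # hypothetical fit with the downstream system's vocabulary, 0..1


def best_translation(candidates: Sequence[Candidate],
                     domain_weight: float = 0.5) -> Candidate:
    """Pick the candidate with the highest weighted blend of both scores."""
    if not candidates:
        raise ValueError("need at least one candidate translation")

    def combined(c: Candidate) -> float:
        return (1.0 - domain_weight) * c.fidelity + domain_weight * c.domain_fit

    return max(candidates, key=combined)


candidates = [
    Candidate("the troops moved to the border", fidelity=0.90, domain_fit=0.40),
    Candidate("soldiers deployed to the border", fidelity=0.85, domain_fit=0.90),
]

general_use = best_translation(candidates, domain_weight=0.0)
event_coding = best_translation(candidates, domain_weight=0.5)
```

With `domain_weight=0.0` the sketch behaves like a general-purpose system and returns the highest-fidelity candidate; raising the weight lets a slightly less literal translation win because it better matches the language the downstream system expects.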

It will be interesting to see how and if usage of GDELT picks up with the broader, and richer, scope of content it now covers. With an increasingly complex international situation that runs the gamut from climate change to terrorism, it seems like world leaders, policy experts and even business leaders could use all the information they can get about what’s connected to what, who’s connected to whom and how this all might play out.

Apache Hive creators raise $13M for their Hadoop service, Qubole

Qubole, the Hadoop-as-a-service startup from Ashish Thusoo and Joydeep Sen Sarma, has raised a $13 million series B round of venture capital led by Norwest Ventures. Thusoo and Sen Sarma created the Apache Hive data warehouse framework for Hadoop while at Facebook several years ago, and launched Qubole in mid-2012. The company has now raised $20 million from investors.

Qubole is hosted on the Amazon Web Services cloud, but can also run on Google Compute Engine, and acts like one might expect a cloud-native Hadoop service to act. It has a graphical user interface, connectors to several common data sources (including cloud object stores), and it takes advantage of cloud capabilities such as autoscaling and spot pricing for compute. The company claims it processes 83 petabytes of data per month and that its customers used 4.96 million cloud compute hours in November.

What’s interesting about Qubole is that although it originally boasted optimized versions of Hive and other MapReduce-based tools, the company also lets users analyze data using the Facebook-created Presto SQL-on-Hadoop engine, and is working on a service around the increasingly popular and very fast Apache Spark framework.

Ashish Thusoo at Structure Data 2013.

Qubole’s announcement follows that of a $30 million round for Altiscale on Wednesday and a $3 million round for a newer company called Xplenty in October.

In an interview about Altiscale’s funding, its founder and CEO, Raymie Stata, said his company most often runs up against Qubole and Treasure Data, and occasionally Xplenty, in customer deals. They’re all a little different in terms of capabilities, user experience and probably even target user, but they’re all much more fully featured and user-centric than Amazon Elastic MapReduce, which is the default Hadoop cloud service.

That space could be setting itself up for consolidation as investors keep putting money into it and bigger Hadoop vendors keep trying to bolster their cloud computing stories. Cloudera, Hortonworks, MapR, IBM, Pivotal, Oracle and the list goes on — they all see a future where more workloads will move to the cloud, but they’re all rooted in the software world. At some point they’re going to have to build up their cloud technologies and knowledge, or buy them.

Hadoop needs a better front-end for business users

Whether you’re running it on premises or in the cloud, Hadoop leaves a lot to be desired in the ease-of-use department. The Hadoop offerings on the three major cloud platforms (Amazon’s Elastic MapReduce, or EMR; Microsoft’s Azure HDInsight; and Google Compute Engine’s Click-to-Deploy Hadoop) have their warts. And the three major on-premises distributions (Cloudera CDH, Hortonworks HDP and MapR) can be formidable adversaries to casual users as well.

The root of Hadoop’s ease-of-use problem, no matter where you run it, is that it’s essentially a command line tool. In the enterprise, people are used to graphical user interfaces (GUIs), be they in desktop applications or in the Web browser, that make things fairly simple to select, configure, and run. To the highly technical people who were Hadoop’s early adopters, the minimalism of the command prompt has a greater purity and “honesty” than a GUI. But, while there’s no reason to demonize people who feel this way, command line tools just won’t fly with business users in the enterprise.

Amazon and Google seem to aim their services at well-initiated Hadoop jocks. And with that premise in place, their offerings are fine. But that premise isn’t a good one, frankly, if mainstream adoption is what these companies are looking for. Microsoft’s HDInsight does at least allow for simplified access to Hadoop data via a Hive GUI. This allows for entry of Hive queries (complete with syntax coloring), monitoring of job progress, and viewing of the query output. If you want to do more than that, you get to visit the magical land of PowerShell, Microsoft’s system scripting environment. Woo. Hoo.

Adjusting the Hue

Amazon now allows you to install “Hue,” an open source, browser-based GUI for Hadoop, on an EMR Hadoop cluster. Hue is probably the most evolved GUI offering out there on Hadoop. However, getting Hue working on EMR involves some security configuration in order to open up Web access to your cluster. And let’s just say that it’s far from straightforward.

Hue is available in the major on-premises Hadoop distributions as well. It provides front-ends for creating and running MapReduce jobs, working with Hive, Pig, HBase, the Solr search interface and more. But Hue’s Pig interface is really just a script editor with a run button, and its HBase interface is just a table browser. Whether on-site or in the cloud, Hadoop needs much more advanced tooling in order to get to groundswell status. Developers need to be able to connect to Hadoop from their integrated development environments (IDEs), and cloud control panels need to make the logistics of working with Hadoop far simpler.

“A” for effort

To its credit, Microsoft is making some important moves here. The preview release of its Visual Studio 2015 IDE includes both Hive integration into its Server Explorer tool, and “tooling to create Hive queries and submit them as jobs” according to a post on the Azure blog. It even now includes a browser for the Azure storage containers that back HDInsight clusters, and a special Hive project template.

Beyond that, the Web-based portal for HDInsight has a “Getting Started Gallery” section that provides task-based options for processing data and analyzing it in Excel. That is key: Hadoop will become truly mainstream only when people can use it as a means to an end. In order for that to happen, the tooling around it needs to be oriented to the workflows of business users (and data professionals), with the appropriate components invoked at the appropriate times. However, the HDInsight Getting Started Gallery items are more guided walkthroughs than they are truly automated tasks.

Hadoop at your service

Beyond the cloud providers’ own Hadoop products lie Hadoop-as-a-Service (yes, that would be HaaS) offerings from companies like Qubole and Altiscale, which let you work as if you had your own cluster at your beck and call. Qubole provides its own component-oriented GUI and API, and in turn deploys clusters for you on AWS, Google Compute Engine and, as of November 18th, Microsoft Azure as well. Altiscale runs its own cloud and provides its own workbench, which is a command line interface that users connect to via SSH.

HaaS offerings will likely grow in popularity. The incumbent cloud providers may need to adopt the as-a-service paradigm in their own Hadoop offerings, in terms of user interface and/or pricing models. Meanwhile, developer tools providers should integrate Hadoop into their IDEs as Microsoft has started to do. And everyone in the Hadoop industry needs to embed Hadoop components in their menu options and features, rather than just expose these components in a pass-through fashion.

Right now, Hadoop tools let you pick ingredients from the pantry, prepare them, and combine them on your own. In the end, with the right knowledge and skill, you do get a meal. But the tooling shouldn’t be just for chefs; it should be for diners too. Business users need to set up a project easily, configure its options, and then let the tools bring to bear the appropriate Hadoop components in the right combination. That’s how Hadoop will have a chance to make it as an enterprise main course.