Report: Understanding the Power of Hadoop as a Service

Our library of 1,700 research reports is available only to our subscribers. We occasionally release one for our larger audience to benefit from; this is one such report. If you would like access to our entire library, please subscribe here. Subscribers will have access to our 2017 editorial calendar, archived reports and video coverage from our 2016 and 2017 events.
Understanding the Power of Hadoop as a Service by Paul Miller:
Across a wide range of industries from health care and financial services to manufacturing and retail, companies are realizing the value of analyzing data with Hadoop. With access to a Hadoop cluster, organizations are able to collect, analyze, and act on data at a scale and price point that earlier data-analysis solutions typically cannot match.
While some have the skill, the will, and the need to build, operate, and maintain large Hadoop clusters of their own, a growing number of Hadoop’s prospective users are choosing not to make sustained investments in developing an in-house capability. An almost bewildering range of hosted solutions is now available to them, all described in some quarters as Hadoop as a Service (HaaS). These range from relatively simple cloud-based Hadoop offerings by Infrastructure-as-a-Service (IaaS) cloud providers including Amazon, Microsoft, and Rackspace through to highly customized solutions managed on an ongoing basis by service providers like CSC and CenturyLink. Startups such as Altiscale are completely focused on running Hadoop for their customers. As they do not need to worry about the impact on other applications, they are able to optimize hardware, software, and processes in order to get the best performance from Hadoop.
In this report we explore a number of the ways in which Hadoop can be deployed, and we discuss the choices to be made in selecting the best approach for meeting different sets of requirements.
To read the full report, click here.

AWS suits up more enterprise perks

More AWS perks for business users

Amazon Web Services has beefed up its identity management and access control capabilities so that businesses can more easily apply permissions to users, groups and roles in a consistent way. As explained in a blog post, these identity and access management (IAM) policies are now treated as “first-class AWS objects” so that they can be created, named, and attached to one or more IAM users, groups, or roles.

Since I was unclear about what a first-class AWS object really is, I reached out to someone who knows, who said that these policies get their own unique Amazon Resource Name (ARN). That, in turn, means users can more easily reuse common managed policies without having to write, update and maintain permissions.

These managed policies can also be managed centrally and applied across IAM entities, meaning the aforementioned users, groups, or roles. And customers can subscribe to shared AWS Managed Policies, making it easier for them to apply security and other best practices.
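The mechanics are straightforward: a managed policy is just a named JSON document that, once created, gets its own ARN and can be attached to many principals. Here is a minimal Python sketch; the bucket name, policy name, and the boto3 calls in the comments are illustrative assumptions, not taken from the post.

```python
import json

def make_read_only_s3_policy(bucket_name):
    """Build an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            # Both the bucket itself and its objects need to be listed.
            "Resource": [
                "arn:aws:s3:::" + bucket_name,
                "arn:aws:s3:::" + bucket_name + "/*",
            ],
        }],
    }

policy = make_read_only_s3_policy("example-reports")
print(json.dumps(policy, indent=2))

# Once created via the IAM API, the policy gets its own ARN and can be
# attached to any number of users, groups, or roles, e.g. with boto3:
#   iam.create_policy(PolicyName="ReadReports",
#                     PolicyDocument=json.dumps(policy))
#   iam.attach_user_policy(UserName="alice", PolicyArn=policy_arn)
```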


That news came a few days after [company]Amazon[/company] announced general availability of its AWS Config, a configuration management database (CMDB) tool, first announced in November, that keeps track of the cloud resources used and the connections between them. The goal is that it can then track changes made to those resources and make sure those changes are logged in AWS CloudTrail. The data collected there can then be polled via Amazon’s own APIs.
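To get a feel for what a CMDB-style tool records, here is a small, hypothetical Python sketch of the core idea: diffing two configuration snapshots of a resource to produce a change record. The field names are made up for illustration; AWS Config's actual record format differs.

```python
def diff_configurations(before, after):
    """Return the keys that changed between two resource-configuration
    snapshots, mimicking the change records a CMDB keeps over time."""
    changes = {}
    for key in set(before) | set(after):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = {"previous": old, "current": new}
    return changes

# Hypothetical snapshots of one EC2 instance, before and after a resize:
before = {"instanceType": "m3.large", "securityGroups": ["sg-1"], "state": "running"}
after  = {"instanceType": "m3.xlarge", "securityGroups": ["sg-1"], "state": "running"}

print(diff_configurations(before, after))
# {'instanceType': {'previous': 'm3.large', 'current': 'm3.xlarge'}}
```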

AWS Config and AWS Service Catalog were both announced in preview form at AWS re:Invent in November. Service Catalog is a tool used in enterprise accounts to shop for and manage authorized tools and applications, and it will be tied into IAM. General availability for Service Catalog was promised for early 2015, so stay tuned.

All of these services, promised and delivered, are geared to make AWS more IT friendly in bigger enterprises: to help make sure that users can access only the resources they are authorized for, and that those resources are the most up-to-date versions.

It’s also interesting that AWS, which used to announce new services only when they were ready, is now fully in enterprise software mode, pre-announcing new products weeks and months before they are broadly available.


AWS Re:invent


EMC Cloudscaling aims to bridge OpenStack-AWS divide

If you’re running an OpenStack private cloud and want it to talk to Amazon’s EC2 compute service, you may want to check out a new “drop-in” API created by EMC/Cloudscaling and available from Stackforge.

Randy Bias, co-founder of Cloudscaling and now VP of Technology for [company]EMC[/company], has long maintained that OpenStack needs to work with Amazon. He also pledged similar support for [company]Google[/company] Compute Engine APIs. Asked via email if that’s still the plan, Bias said “yes but it’s a lower priority until we see traction.”

Structure Podcast: The biological roots of deep learning

Deep learning, which enables a computer to learn (or program itself) to solve problems, is a hot topic that Enlitic CEO Jeremy Howard and Senior Data Scientist Ahna Girshick helped explain to mere mortals on this week’s podcast. If you want to know why you don’t necessarily need a ton of data to do good work in deep learning, and how the field is inspired by biology, if not the human brain, check out this show. And to hear more from Girshick on this hot topic, you can also sign up for next month’s Structure Data event.

[soundcloud url=”″ params=”secret_token=s-lutIw&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false” width=”100%” height=”166″ iframe=”true” /]


Hosts: Barb Darrow and Derrick Harris.

Download This Episode

Subscribe in iTunes

The Structure Show RSS Feed


This story was updated at 11:37 a.m. PST February 18 with more detail on what a first-class AWS object is.

You can now automatically recover instances for Amazon EC2

In yet another sign that the big cloud players of Google, Amazon and Microsoft have moved on from storage price cuts and are in the midst of a feature war, Amazon laid out the details of a new auto recovery tool for Amazon EC2 in a blog post on Monday.

The tool claims to make it possible for EC2 instances to automatically spin up when internal system checks discover that something is hampering those instances. These problems could include a “loss of network connectivity, loss of system power, software issues on the physical host, and hardware issues on the physical host,” the blog post states.

Now, when a hardware issue impedes one of your EC2 instances, you can have that instance rebooted automatically, retaining all the necessary configuration details like the instance ID and the IP address. The instance can also be rebooted onto new hardware if the situation warrants.

Amazon auto recovery


The auto recovery feature is currently only available to users running the C3, C4, M3, R3, and T2 instance types in the AWS US East region, but [company]Amazon[/company] plans to “make it available in other regions as quickly as possible.” Users can set parameters and alerts in CloudWatch to enable auto recovery. There is no extra charge for auto recovery but regular charges for CloudWatch apply.
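Concretely, enabling auto recovery amounts to creating a CloudWatch alarm on the instance's system status check with a recover action attached. The Python sketch below builds the alarm parameters; the instance ID is a placeholder, and the exact values are assumptions based on the CloudWatch alarm API, so treat it as illustrative rather than definitive.

```python
def recovery_alarm_params(instance_id, region="us-east-1"):
    """Build CloudWatch PutMetricAlarm parameters that trigger EC2 auto
    recovery when the system status check fails (a sketch, not gospel)."""
    return {
        "AlarmName": "recover-" + instance_id,
        "Namespace": "AWS/EC2",
        "MetricName": "StatusCheckFailed_System",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Minimum",
        "Period": 60,
        "EvaluationPeriods": 2,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # The recover action restarts the instance on healthy hardware
        # while keeping its instance ID and IP address.
        "AlarmActions": ["arn:aws:automate:" + region + ":ec2:recover"],
    }

params = recovery_alarm_params("i-0abc1234")
print(params["AlarmActions"][0])
# In a live account this dict would be passed to
# boto3.client("cloudwatch").put_metric_alarm(**params).
```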

Currently, it doesn’t seem like [company]Microsoft[/company]’s Azure has a similar feature. I asked Microsoft and will update this post if I hear back.

[company]Google[/company] Cloud has Live Migration, which means it doesn’t require auto recovery, according to Scalr founder Sebastian Stadil. However, he sees Amazon’s auto recovery being useful to remedy potential software issues.

“Advanced cloud users have been tying their monitoring to fault recovery for a while now, using either homegrown software or off the shelf software like Scalr/RightScale, and they can now use Amazon as an additional choice,” he wrote in an email.

Barb Darrow contributed to this report.

Update on Friday, Jan 23:
A Microsoft spokesperson sent me this blog post:
[blockquote person=”Microsoft” attribution=”Microsoft”]In addition to platform updates, Microsoft Azure service healing occurs automatically when Microsoft Azure detects problematic nodes and moves these virtual machines (VMs) to new nodes. When this occurs, you lose connectivity to VM during the service healing process and after the service healing process is completed, when you connect to VM, you will likely find an event log entry indicating VM restart/shutdown (either gracefully or unexpected).[/blockquote]

Big, new AWS C4 instances are here — for real

Amazon Web Services pre-announced big, new C4 compute instances in November at AWS Re:invent, apparently re-announced them on January 2 in a disappearing blog post, and on Monday announced availability officially (with the same blog post). Talk about recycling.

In November, [company]Intel[/company] SVP and GM Diane Bryant helped [company]Amazon[/company] CTO Werner Vogels introduce the instances, which are based on Intel’s Haswell chip with a base speed of 2.9 GHz and achievable clock speeds of up to 3.5 GHz.

The latest additions to AWS’s EC2 menu, the instances could be a boon for compute-intensive workloads in online gaming, risk analysis or graphics rendering.

This is the latest shot fired in the compute instance arms race. Last week, [company]Microsoft[/company] announced availability of its big G4 VMs (Microsoft speak for instances).

In case you’ve forgotten, here are the various now-available instances with pricing for U.S. East and West (Oregon) regions. (The instances are also available in other AWS regions, but prices may vary.)

aws c4 instance pricing

Why AWS Lambda is a Masterstroke from Amazon

Amazon launched AWS Lambda at its re:Invent conference in November 2014, and though there were over half a dozen other cloud services also announced, Lambda stood out as the most innovative and unique. The service runs snippets of JavaScript code in response to events generated by data services like Amazon S3, Amazon Kinesis, and Amazon DynamoDB. Think of it as a sandwich service that sits in between data sources and the compute layer, abstracting microservices at a higher level than virtualization and containerization. Gigaom Research’s recent CloudTracker report named Lambda as the Disruptive Cloud Technology of the fourth quarter of 2014.

The timing of Lambda’s launch could not be better. AWS stepped ahead of the curve with the product when the entire industry was agog over container technology, its impact on public cloud providers, and the increased competition from Microsoft and Google on core compute, storage, and database services. And Lambda might initially appear to be yet another cloud service exposing compute, but as the following sections illustrate, it is definitely much more than that.

AWS Lambda Functional Microservices Abstraction Layer


Source: Gigaom Research

AWS Lambda offers the perfect middle ground between IaaS and PaaS. It also effectively counters the growing threat of containers to its business by simplifying the task of running code in the cloud. It’s Amazon’s way of delivering a microservices framework far ahead of its competitors.

The Architecture of AWS Lambda 

Lambda is the latest addition to Amazon’s compute service. A simple invocation of the os.platform() and os.release() methods within a Lambda function shows that it runs on the Amazon Linux AMI (version 3.14.26-24.46.amzn1.x86_64, to be precise). It is powered by Node.js running the V8 JavaScript engine. Each JavaScript snippet is associated with a specific identity, defined in IAM, with permissions to invoke it. Another role grants access to the event source. Associating these two roles with a Lambda function completes the execution loop. Developers can define the function’s maximum timeout, which ranges from 1 second to 60 seconds. Memory can be allocated in increments of 64MB, anywhere between 128MB and 1GB. When a data source raises an event, the details are passed to the Lambda function as a parameter. This opens up many interesting opportunities for developers to perform a variety of tasks, including using packages and running Node.js modules.
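At launch Lambda functions are written in JavaScript, but the handler shape is easy to illustrate: the event raised by the data source arrives as a parameter. This Python sketch mimics that shape with a simplified, hypothetical S3-style event, not the exact payload Amazon delivers.

```python
def handler(event, context=None):
    """A Lambda-style handler: receive the event raised by a data source
    as a parameter and act on it (here, extract the S3 bucket and key)."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    return "s3://" + bucket + "/" + key

# A simplified stand-in for the event an S3 put would raise:
sample_event = {
    "Records": [{
        "s3": {"bucket": {"name": "uploads"}, "object": {"key": "photo.jpg"}}
    }]
}

print(handler(sample_event))  # s3://uploads/photo.jpg
```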

Lambda functions are stateless, allowing them to scale rapidly. Depending on the speed at which events are raised, the runtime can decide to run multiple copies of a Lambda function concurrently. Another important aspect of Lambda is that functions cannot be directly exposed to the outside world: other than through the supported sources, it is impossible to invoke them, which means the code cannot be exposed as a REST endpoint directly. A Lambda function can, however, invoke other services by making outbound calls. This makes it fundamentally different from PaaS and container environments.

Currently, there are three other ways of running code in the AWS cloud: Amazon EC2, Amazon ECS, and AWS Elastic Beanstalk. EC2 is a full-blown IaaS, while ECS is the hosted container environment. Finally, Elastic Beanstalk is a PaaS layer. AWS Lambda forms the fourth service with the capability to execute code in the cloud, but it is unique in the sense that it sits at the intersection of EC2, ECS, and Elastic Beanstalk.

AWS Lambda with EC2, EC2 Container Services, and Elastic Beanstalk


While it is clear that the Lambda execution environment runs on Amazon EC2, AWS does not disclose how snippets are isolated from each other. To achieve massive scale and strong isolation, AWS could be running containers to host Lambda functions. If so, it is a microservices environment at the highest level of abstraction.

Let’s compare Lambda with other AWS compute services:

Lambda versus EC2

With Amazon EC2, developers need to spin up VMs and install the right software stack before uploading and running code. This involves dealing with provisioning, configuration, monitoring, managing, and maintaining VMs throughout the application lifecycle. There are many ongoing DevOps-related tasks involved from the time the VM is provisioned. However, this gives maximum control to administrators, since they are in the driver’s seat. AWS Lambda doesn’t deal with VMs at all: it’s just plain code, uploaded or written inline within the browser-based editor. DevOps can never SSH into the Linux VM running Lambda; it can only monitor logs and timeout exceptions raised at runtime.

Lambda versus the EC2 Container Service

With the Amazon EC2 Container Service, the focus shifts from VMs to containers. Developers and operators need to move code to appropriate containers provisioned on EC2. They need to create container images locally and upload them to the hub, from which ECS provisions them at a later point. Scheduling and orchestrating the containers is left to the ECS service, but DevOps teams still need to manage the containers’ lifecycle. Effectively, DevOps owns the container and the code, and ECS only provides runtime execution services. Though Lambda may be running inside containers, DevOps never has to deal with them: aside from capturing logs, there is no container maintenance required for Lambda. The containers responsible for hosting Lambda are not accessible to the outside world; AWS manages the runtime dynamically, creating and terminating containers on the fly.

Lambda versus Elastic Beanstalk

Since Amazon Elastic Beanstalk is a PaaS layer, developers push code along with metadata. The metadata contains the details of the AMI, language, framework, and runtime requirements, along with connection information for databases or other dependencies. Based on the metadata, AWS Elastic Beanstalk launches an appropriate AMI and configures it to run the code. Similar to other PaaS offerings, developers push the code and configuration to Elastic Beanstalk; the configuration or metadata can be simple or comprehensive depending on the application architecture. Lambda is much simpler than PaaS: it expects only the code and its association with a set of IAM roles. Allocating RAM and defining the timeout may be considered configuration and metadata, but they are much simpler and consistent across any Lambda function. Most code running within PaaS is exposed to the outside world as a web page or REST endpoint, but Lambda functions are inaccessible from the public internet; they can be invoked only through the supported data sources.
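To underline how small Lambda's configuration surface is compared with PaaS metadata, this sketch encodes just the two knobs described above, memory and timeout, with the bounds the report states. The function and parameter names are hypothetical.

```python
def validate_lambda_config(memory_mb, timeout_s):
    """Check a Lambda-style configuration: memory in 64MB increments
    between 128MB and 1GB, and a timeout between 1 and 60 seconds."""
    if not (128 <= memory_mb <= 1024 and memory_mb % 64 == 0):
        raise ValueError("memory must be 128-1024 MB in 64 MB increments")
    if not (1 <= timeout_s <= 60):
        raise ValueError("timeout must be 1-60 seconds")
    return {"MemorySize": memory_mb, "Timeout": timeout_s}

print(validate_lambda_config(256, 30))
# {'MemorySize': 256, 'Timeout': 30}
```

That one dict is essentially the whole deployment description, versus the AMI, framework, and dependency metadata a Beanstalk application carries.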

Thanks to Docker and containers, microservices are becoming popular, and AWS Lambda is one of the first microservices environments on the public cloud. Its innovative pricing model, based on the number of requests, execution time, and allocated memory, makes it very attractive for moving parts of web-scale applications. When AWS adds additional languages like Ruby, Python, and Java and brings support for EC2, CloudTrail, RDS and other custom event sources, Lambda’s power will grow exponentially. It has the potential to become the focal point of the AWS cloud.
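A quick back-of-the-envelope sketch shows how the request, duration, and memory dimensions of that pricing model combine. The rates below are assumptions for illustration (they approximate Lambda's launch pricing but should not be taken as authoritative).

```python
# Assumed, illustrative rates -- not Amazon's official price list:
PER_MILLION_REQUESTS = 0.20   # USD per 1M invocations
PER_GB_SECOND = 0.00001667    # USD per GB-second of execution

def estimate_cost(requests, avg_duration_s, memory_mb):
    """Rough Lambda-style bill: a per-request charge plus a charge for
    execution time weighted by allocated memory (GB-seconds)."""
    gb_seconds = requests * avg_duration_s * (memory_mb / 1024)
    return (requests / 1_000_000) * PER_MILLION_REQUESTS + gb_seconds * PER_GB_SECOND

# 10M invocations averaging 200 ms each, at 128 MB of allocated memory:
cost = estimate_cost(10_000_000, 0.2, 128)
print(round(cost, 2))
```

The point is that an idle function costs nothing, which is exactly what makes the model attractive for the spiky, event-driven parts of web-scale applications.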

Looks like AWS is ready to roll out spiffy new Intel C4 instances

New, muscular Intel Haswell-based C4 instances are now available to the masses, or will be soon, depending on whether an Amazon Web Services blog post that went up, then came down, on Friday morning is accurate.

From the post:

The new C4 instances are based on the Intel Xeon E5-2666 v3 (code name Haswell) processor. This custom processor, optimized for EC2, runs at a base speed of 2.9 GHz, and can achieve clock speeds as high as 3.5 GHz with Turbo Boost. These instances are designed to deliver the highest level of processor performance on EC2.

The new EC2 lineup was “pre-announced” by Intel GM and VP Diane Bryant at AWS Re:invent in November. At that time, however, neither Bryant nor Amazon CTO Werner Vogels provided much detail about the C4 instances, saying only that [company]Intel[/company] and [company]Amazon[/company] collaborated on design to make sure the chip was optimized for AWS use. I’ve asked AWS for comment and will update this as needed. The pricing disclosed on the blog post is as follows:


aws c4 instance pricing

The fact that [company]Amazon[/company], which used to announce things only when they were ready to roll, has taken to pre-announcing products, shows just how much the company has evolved into an IT provider in the mold of [company]Microsoft[/company], which was famous for introducing products early. Of course, the lag time in cloud is a fraction of what it was in the client-server era, so an announcement in mid-November for delivery in early January — if the disappearing post is accurate — really isn’t so bad.