Report: Bringing Hadoop to the mainframe

Our library of 1700 research reports is available only to our subscribers. We occasionally release reports to benefit our larger audience. This is one such report. If you would like access to our entire library, please subscribe here. Subscribers will have access to our 2017 editorial calendar, archived reports and video coverage from our 2016 and 2017 events.
Bringing Hadoop to the mainframe by Paul Miller:
According to market leader IBM, there is still plenty of work for mainframe computers to do. Indeed, the company frequently cites figures indicating that 60 percent or more of global enterprise transactions are currently undertaken on mainframes built by IBM and remaining competitors such as Bull, Fujitsu, Hitachi, and Unisys. The figures suggest that a wealth of data is stored and processed on these machines, but as businesses around the world increasingly turn to clusters of commodity servers running Hadoop to analyze the bulk of their data, the cost and time typically involved in extracting data from mainframe-based applications becomes a cause for concern.
By finding more-effective ways to bring mainframe-hosted data and Hadoop-powered analysis closer together, the mainframe-using enterprise stands to benefit from both its existing investment in mainframe infrastructure and the speed and cost-effectiveness of modern data analytics, without necessarily resorting to relatively slow and resource-expensive extract transform load (ETL) processes to endlessly move data back and forth between discrete systems.
To read the full report, click here.

Report: Understanding the Power of Hadoop as a Service

Understanding the Power of Hadoop as a Service by Paul Miller:
Across a wide range of industries from health care and financial services to manufacturing and retail, companies are realizing the value of analyzing data with Hadoop. With access to a Hadoop cluster, organizations are able to collect, analyze, and act on data at a scale and price point that earlier data-analysis solutions typically cannot match.
While some have the skill, the will, and the need to build, operate, and maintain large Hadoop clusters of their own, a growing number of Hadoop’s prospective users are choosing not to make sustained investments in developing an in-house capability. An almost bewildering range of hosted solutions is now available to them, all described in some quarters as Hadoop as a Service (HaaS). These range from relatively simple cloud-based Hadoop offerings by Infrastructure-as-a-Service (IaaS) cloud providers including Amazon, Microsoft, and Rackspace through to highly customized solutions managed on an ongoing basis by service providers like CSC and CenturyLink. Startups such as Altiscale are completely focused on running Hadoop for their customers. As they do not need to worry about the impact on other applications, they are able to optimize hardware, software, and processes in order to get the best performance from Hadoop.
In this report we explore a number of the ways in which Hadoop can be deployed, and we discuss the choices to be made in selecting the best approach for meeting different sets of requirements.
To read the full report, click here.

Report: Extending Hadoop Towards the Data Lake

Extending Hadoop Towards the Data Lake by Paul Miller:
The data lake has increasingly become an aspect of Hadoop’s appeal. Referred to in some contexts as an “enterprise data hub,” it now garners interest not only from Hadoop’s existing adopters but also from a far broader set of potential beneficiaries. It is the vision of a single, comprehensive pool of data, managed by Hadoop and accessed as required by diverse applications such as Spark, Storm, and Hive, that offers opportunities to reduce duplication of data, increase efficiency, and create an environment in which data from very different sources can meaningfully be analyzed together.
Fully embracing the opportunity promised by a comprehensive data lake requires a shift in attitude and careful integration with the existing systems and workflows that Hadoop often augments rather than replaces. Existing enterprise concerns about governance and security will certainly not disappear, so suitable workflows must be developed to safeguard data while making it available for newly feasible forms of analysis.
Early adopters in a range of industries are already finding ways to exploit the potential of their data lakes, operationalizing internal analytic processes and integrating rich real-time analyses with more established batch processing tasks. They are integrating Hadoop into existing organizational workflows and addressing challenges around the completeness, cleanliness, validity, and protection of their data.
In this report, we explore a number of the key issues frequently identified as significant in these successful implementations of a data lake.
To read the full report, click here.

Meet Myriad, a new project for running Hadoop on Mesos

Hadoop vendor MapR and data center automation startup Mesosphere have created an open source technology called Myriad, which is supposed to make it easier to run Hadoop workloads on top of the popular Mesos cluster-management software. More specifically, Myriad allows the YARN resource scheduler — the linchpin of Hadoop 2.0 that lets the platform run processing frameworks other than MapReduce — to run on top of Mesos, effectively creating an auto-scaling Hadoop cluster that’s relatively future-proof.

“Before, you had to make a choice, and now you can just run YARN on Mesos,” explained Mesosphere founder and CEO Florian Leibert. “… I think the goal here is to have more workloads in a shared environment.”

What he means is that companies will no longer have to run Hadoop on one set of resources, while running the web servers, Spark and any other number of workloads on other resources managed by Mesos. Essentially, all of these things will now be available as data center services residing on the same set of machines. Mesos has always supported Hadoop as a workload type — and companies including Twitter and Airbnb have taken advantage of this — but YARN has appeal as the default resource manager for newer distributions of Hadoop because it’s designed specifically for that platform and, well, is one of the foundations of those newer distributions.

The old static partition.

With Myriad, YARN can still manage the resource allocation to Hadoop jobs, while Mesos handles other tasks as well as the task of scaling out the YARN cluster itself. So instead of the current state of affairs, where YARN clusters are statically defined and new nodes must be manually configured, Mesos can spin up new YARN nodes automatically based on the policies in place and the available resources of the cluster.
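The scaling logic Myriad enables can be sketched in rough terms. The sketch below is illustrative only, not Myriad's actual code or API: it decides how many new YARN NodeManagers to launch based on pending YARN demand and the Mesos resource offers on hand, with all names and thresholds being assumptions.

```python
# Hypothetical sketch of the kind of scale-up policy Myriad enables:
# Mesos presents resource offers, and the framework launches additional
# YARN NodeManagers when pending YARN containers exceed current capacity.
# Function and parameter names are illustrative, not Myriad's real API.

def plan_scale_up(pending_containers, running_nodes, offers,
                  containers_per_node=4, max_nodes=20):
    """Return how many new YARN NodeManagers to launch on Mesos offers."""
    capacity = running_nodes * containers_per_node
    shortfall = max(0, pending_containers - capacity)
    needed = -(-shortfall // containers_per_node)  # ceiling division
    # Bounded by cluster policy and by the Mesos offers actually available.
    return min(needed, max_nodes - running_nodes, len(offers))

# One node (capacity 4) facing 10 pending containers, with 3 offers in hand:
print(plan_scale_up(pending_containers=10, running_nodes=1,
                    offers=["offer-1", "offer-2", "offer-3"]))  # → 2
```

The point of the sketch is the division of labor the article describes: YARN still schedules the containers themselves, while the Mesos side only decides when to grow or shrink the pool of YARN nodes.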

Mesosphere engineer Adam Bordelon said Myriad works now and that eBay and Twitter have been testing it out. eBay actually contributed quite a lot to the first version of the code. However, he noted, Myriad is still early in its development and needs quite a few more features, particularly around security.

“I imagine within a month or two,” he said, “it should be in production somewhere.”

Despite the fact that two commercial companies are driving Myriad at this point, Bordelon said the goal is definitely to build a community around the project. It’s currently hosted in the Mesosphere GitHub repository, but the team is working on a proposal to make it an Apache Incubator project.

“It is definitely a community effort,” he said.

The new YARN-on-Mesos architecture.

Jim Scott, MapR’s director of enterprise strategy and architecture, said that Hadoop was pitched in part as a tool for eliminating data silos. However, he added, “As we start to see those data silo walls come down, we’re starting to see other walls come up.” One of those walls is the relegation of Hadoop to its own dedicated cluster far away, logically at least, from everything else.

“This is the enabling function, in my mind,” he said, “that makes it so people can tear that wall down.”

MapR CEO John Schroeder will be among many speakers talking about the evolution of Hadoop and big data architectures at our Structure Data conference in New York next month. Others include Cloudera CEO Tom Reilly, Hortonworks CEO Rob Bearden, Google VP of Infrastructure Eric Brewer, Databricks CEO Ion Stoica and Amazon Web Services GM of Data Science Matt Wood.

And for more on Mesos, Mesosphere and why they have some engineers so excited, check out our May 2014 Structure Show podcast interview with Mesosphere CEO Leibert.


Datameer 5, ExtraHop 4: start your (choice of) engines

I’ve written much in this space, and on the Gigaom blog, about YARN and its impact on how Hadoop will be used. Not very long ago, Hadoop was synonymous with the MapReduce algorithm and non-interactive batch processing. With YARN, Hadoop has morphed into a general purpose distributed processing and distributed storage system. To that, we may say “bully for Hadoop!” But what about tools built on top of it?

One such tool, Datameer, from the company of the same name, implements a full-fledged data discovery environment on top of Hadoop. Gigaom Research’s Sector Roadmap on Data Discovery tools, published in March of this year and authored by me, featured Datameer as the highest-scoring of the six products evaluated. Datameer 4.0 was dependent on MapReduce, and that made its strong showing somewhat counterintuitive. But the disruption vectors in the report supported the outcome.

Get Smart
My own hope was that a future version of Datameer would embrace Hadoop 2.0 and YARN as strongly as the then-current version embraced Hadoop 1.0 and MapReduce. Today, Datameer is announcing its 5.0 release and, along with it, a new feature called Smart Execution that exploits YARN’s ability to bring multiple execution engines to bear on the same data.

For those who want a few gory details, Smart Execution is directed acyclic graph (DAG)-based, selecting one of a few execution environments for each section of an analysis request. Sometimes, it will perform its work in-memory on a single compute node, to avoid network bottlenecks. At other times it will parallelize, using a combination of Tez and YARN in certain cases, and good old MapReduce (albeit on YARN) in others. I call these gory details because as a user, you needn’t worry about them. You just give Datameer your data and your analysis tasks, and it will pick the right engine for the job.
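As an illustration of the idea (not Datameer's actual implementation), per-node engine selection over an analysis DAG might look like the following sketch, where the thresholds, field names, and engine labels are all assumptions:

```python
# Illustrative sketch of DAG-based engine selection: each node of an
# analysis DAG gets an execution engine based on a simple heuristic.
# Thresholds and names are hypothetical, not Datameer's real logic.

SMALL = 1 << 30  # assumed 1 GB cutoff for doing the work in memory

def choose_engine(node):
    """Pick an engine for one DAG node; rules here are invented."""
    if node["input_bytes"] < SMALL:
        return "single-node-in-memory"   # avoid network shuffle entirely
    if node.get("interactive"):
        return "tez-on-yarn"             # lower-latency DAG engine
    return "mapreduce-on-yarn"           # batch fallback, still on YARN

plan = [choose_engine(n) for n in [
    {"input_bytes": 10_000, "op": "filter"},
    {"input_bytes": 5 << 30, "op": "join", "interactive": True},
    {"input_bytes": 8 << 30, "op": "aggregate"},
]]
print(plan)
```

The user-facing point survives even in this toy version: the caller hands over data and tasks, and the engine choice per stage is made for them.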

Datameer’s future…and Hadoop’s
Datameer says the architecture it has used for Smart Execution makes it open and pluggable, such that other execution engines can be integrated in the future. It calls out Apache Spark by name as a candidate for such future integration, and it also allows for the possibility of integrating engines that don’t yet exist. By taking this approach, Datameer is playing to Hadoop and YARN’s strengths. By abstracting this technology, and relieving users of the burden of understanding engine selection, Datameer is utilizing Hadoop as the embedded engine it should be.

At this point, it seems that most Hadoop-related engines are being reengineered to run on top of YARN. One such engine is Apache Storm, designed to handle processing of streaming data. Storm’s growth in popularity has been on a huge incline of late, and yet it has remained in incubation status at the Apache Software Foundation (ASF). That changed on Monday, when the ASF announced that Storm had graduated to top-level project status. Perhaps Datameer v.Next will take on streaming data, and perhaps it will add Apache Storm to its arsenal of YARN-compatible execution engines.

As is usually the case in this industry and others, standards are the most effective force for change. Hadoop is a standard, and so now is YARN. Through the efforts of a team at Yahoo, Storm now runs on the YARN standard. That virtuous cycle is likely to continue.

JSON takes an ExtraHop
Another standard that has effected massive change in the computing industry is that of JavaScript Object Notation (JSON). JSON is the cornerstone of document store databases like MongoDB and CouchDB. It’s also a standard for encoding data that is at this point second nature to most Web developers.
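For context, this is the kind of schema-flexible JSON document such stores hold natively, round-tripped here with Python's standard json module (the document contents are invented for illustration):

```python
import json

# A typical JSON document of the kind MongoDB or CouchDB stores natively:
# nested objects and arrays, with no fixed schema imposed up front.
doc = {
    "user": "alice",
    "events": [
        {"type": "login", "ts": "2014-09-30T12:00:00Z"},
        {"type": "purchase", "amount": 19.99},
    ],
}

encoded = json.dumps(doc)              # serialize to a JSON string
decoded = json.loads(encoded)          # parse it back into Python objects
print(decoded["events"][1]["amount"])  # → 19.99
```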

JSON is becoming second nature to a growing number of companies, too. The folks at ExtraHop have a very cool (eponymous) streaming data platform that can extract data directly from network data packets and run analytics on it. Up until now, however, that data would get stored in a proprietary data store that was limited to a 30-day look-back period. Analytics on raw data off the wire is extremely valuable, but less so when non-trivial constraints are placed on storage of that data.

But ExtraHop 4.0 was released on Tuesday and, with it, those constraints have disappeared. First, ExtraHop’s native data store has been enhanced to store much larger volumes of data. Second, ExtraHop 4.0 now implements an Open Data Stream that makes all of its data available in JSON format, so it can readily be stored in other databases, including MongoDB.
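A minimal sketch of consuming such a JSON feed, assuming a newline-delimited format and invented field names (the actual Open Data Stream record shape is whatever ExtraHop emits):

```python
import json

# Hypothetical consumer for a JSON feed like ExtraHop's Open Data Stream.
# The metric names and record fields below are invented for illustration;
# once records are plain JSON, any JSON-aware store or tool can take them.
raw_stream = "\n".join([
    '{"metric": "http.rsp_time_ms", "value": 42, "host": "web-1"}',
    '{"metric": "http.rsp_time_ms", "value": 58, "host": "web-2"}',
])

records = [json.loads(line) for line in raw_stream.splitlines()]
avg = sum(r["value"] for r in records) / len(records)
print(avg)  # mean response time across hosts → 50.0
```

In practice the parsed records would be handed to a database driver (MongoDB's, for instance) rather than averaged in place; the point is that JSON removes the proprietary-format barrier.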

What about Oracle?
JSON compatibility isn’t limited to NoSQL databases like MongoDB. In fact, Oracle 12c can handle JSON natively now. This and other facets of Oracle’s data stack have been proudly trumpeted at Oracle OpenWorld, which has been going on this week in San Francisco. Gigaom Research’s own George Gilbert published an OpenWorld pre-game post on Friday. Expect another post from George shortly on Oracle’s JSON chops, what it means for the juggernaut relational database and what it means for its namesake company, now under new management.

Hadoop gets more versatile, but data is still king

Hadoop, the open source big-data framework, has gradually evolved from being a shiny object in the laboratory to an increasingly serious tool finding its place in the enterprise. At Gigaom, we’ve covered Hadoop’s increasing maturity and completeness as an enterprise technology, because that’s been the story. But now the story is changing.

The change emanates from the release of Hadoop 2.0 and the rapid standardization on that release’s new resource management engine, YARN. YARN moves Hadoop out of the world of batch processing and into the interactive world. We’ve covered that too. But while the change in the story builds on that facet of YARN, it pivots rather dramatically.

It’s a YARN world, we’re just computing in it

Hadoop used to be a self-contained product. Now it’s a platform – a stage that showcases a number of other products. Whether it’s Spark or Storm or Cascading or Drill or whatever the next big thing in open source project land is, it’s going to run on YARN, it’s going to use HDFS storage and it’s probably going to gather momentum and broad acceptance pretty quickly.

That’s a pretty radical change. It moves Hadoop down the stack. It converts Hadoop from being an exposed technology that’s a developer destination to an embedded technology that will become a developer service. Suddenly, Hadoop’s role is quite similar to that of a runtime, or even of an operating system. And once that happens, Hadoop needn’t be a Big Data tool exclusively. Instead, it can just become a general-purpose platform for distributed computing and distributed storage. Hadoop can be a workflow engine, it can be a service bus, it can be an object broker, it can be a content management system.

From BD to SI?

If Hadoop does become all those things, what will that mean for Hadoop distribution vendors, like Cloudera, Hortonworks and MapR? Will they still be big data companies, or will they become OS vendors that double as systems integrators for distributed computing implementations?

I suppose you could argue that in fact these companies already are integrators: a big part of their job is picking and choosing what Hadoop-affiliated components to include in their distros and making sure they all work together. Since all of these components are open source, customers could do this work themselves. But if the Hadoop vendors do it for them, then they provide quite a valuable integration service.

Identity crisis

So if Hadoop won’t be a big data tool anymore and Hadoop vendors won’t be big data companies, then what’s really going on? Where’s the equilibrium? Where’s the center of gravity in the Hadoop ecosystem? Is Hadoop really just about an abstraction layer for distributed processing, and a bunch of grease monkey work to hook it all together? Is the Hadoop space careening toward a bland and agnostic world of general purpose computing?

It’s true that a big part of Hadoop’s success is that it essentially provides for supercomputing and robust storage on commodity hardware, rather than proprietary appliances requiring military-level spending. So the lure of Hadoop as a general-purpose computing medium is strong, because it disrupts a pretty expensive status quo in enterprise infrastructure.

Take it back home

But for Hadoop, data is still where it’s at. Data is Hadoop’s center of gravity. Data is the thread that ties the motley crew of open source projects together. Data is still the motivator, and data is still the prize. Because when you have a general-purpose compute and storage framework with compelling economics, then you also have a place where data accumulates, even when it’s merely a byproduct of what looks like a non-data-focused application.

Hadoop is a data collecting point, and it’s also a place where code that analyzes all of that collected data can run really well. That’s critical, because data is the lifeblood of all business software. In fact, as even Hadoop 1.0 proved, data is also the lifeblood of business itself.

When you have a compute and storage framework that embraces data’s primacy – that celebrates it, obsesses over it, exploits it, shares it and facilitates a virtuous cycle around it – then you have a winner. YARN has facilitated that win. That’s the new story.

MapR now supports YARN, puts HP Vertica on top of Hadoop

MapR is continuing along its path to Hadoop glory with new support for the YARN resource manager and a direct integration with the HP Vertica analytic database. In such a competitive space, every little edge matters.