Report: Apache Hadoop: Is one cluster enough?

Apache Hadoop: Is one cluster enough? by Paul Miller:
The open-source Apache Hadoop project continues to evolve rapidly and is now capable of far more than its traditional use case: running a single MapReduce job against a single large volume of data. Projects like Apache YARN expand the types of workloads for which Hadoop is a viable and compelling solution, leading practitioners to think more creatively about the ways data is stored, processed, and made available for analysis.
Enthusiasm is growing in some quarters for the concept of a “data lake” — a single repository of data stored in the Hadoop Distributed File System (HDFS) and accessed by a number of applications for different purposes. Most of the prominent Hadoop vendors provide persuasive examples of this model at work, but, unsurprisingly, real-world deployments do not always fit neatly into the idealized model of a single (huge) cluster working against a single (huge) data lake.
In this report we discuss some of the circumstances that give rise to more complex requirements, and explore the emerging set of solutions that address them.
To read the full report, click here.

Dell to build in-memory Hadoop appliances with Cloudera and Intel

Dell, Cloudera and Intel are working together on an appliance designed to speed the performance of Hadoop environments by moving far more data into a shared memory space. Key to the performance improvement is Apache Spark, the in-memory data-processing framework that’s now included in Cloudera’s Hadoop distribution. At this point, it seems like Hadoop vendors are going to sell their wares regardless of where they run, so a deal like this really helps Dell make the case that hardware matters in big data environments. The companies claim it’s the first in a family of “Dell Engineered Systems for Cloudera Enterprise.”

Spark is a really big deal for big data, and Cloudera gets it

Cloudera has partnered with a startup called Databricks to integrate and support the Apache Spark data-processing platform within Cloudera’s Hadoop software. Spark, which is designed for speed and usability, is one of several technologies pushing Hadoop beyond MapReduce.
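To make Spark’s “beyond MapReduce” appeal concrete, the sketch below uses a toy, in-memory stand-in for Spark’s RDD API to show the chained-transformation style (`flatMap`, `map`, `reduceByKey`) that replaces a rigid single map-then-reduce pass. The class and method names deliberately mirror Spark’s, but this is plain Python for illustration, not the real Spark API.

```python
# Illustrative sketch: a minimal in-memory stand-in for Spark-style RDD
# transformations. Real Spark distributes these operations across a cluster
# and keeps intermediate results in memory; here everything is a plain list.

class ToyRDD:
    def __init__(self, items):
        self.items = list(items)  # held in memory, like a cached RDD

    def flatMap(self, f):
        # Apply f to each item and flatten the results into one dataset.
        return ToyRDD(x for item in self.items for x in f(item))

    def map(self, f):
        # Transform each item independently.
        return ToyRDD(f(item) for item in self.items)

    def reduceByKey(self, f):
        # Combine values sharing the same key using f.
        acc = {}
        for k, v in self.items:
            acc[k] = f(acc[k], v) if k in acc else v
        return ToyRDD(acc.items())

    def collect(self):
        return self.items

# The classic word count, written as a chain of transformations:
lines = ToyRDD(["hadoop spark hadoop", "spark spark yarn"])
counts = (lines
          .flatMap(str.split)               # split lines into words
          .map(lambda w: (w, 1))            # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b)  # sum counts per word
          .collect())
print(sorted(counts))  # [('hadoop', 2), ('spark', 3), ('yarn', 1)]
```

In real Spark the same pipeline reads almost identically, which is a large part of the usability argument: multi-stage computations are expressed as one fluent chain rather than as separate MapReduce jobs writing intermediate results to disk.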

Today in Cloud

Over on The New York Times’ Bits blog, Steve Lohr writes about a topic frequently covered here on GigaOM Pro: big data. He goes on to connect the processing of big data with the need for speed, and writes about the rise of in-memory processing solutions. IBM’s John Kelly is quoted as saying that in-memory processing will “break all of the technology we have,” leading to sweeping innovations in the way that computers need to be built. How far will this shift — if it occurs — go? Whilst big data-crunching appliances may well change, what about the computer on your desk… or the smartphone in your pocket?