Report: Apache Hadoop: Is one cluster enough?

Our library of 1700 research reports is available only to our subscribers. We occasionally release reports for our larger audience to benefit from, and this is one such report. If you would like access to our entire library, please subscribe here. Subscribers will have access to our 2017 editorial calendar, archived reports, and video coverage from our 2016 and 2017 events.
Apache Hadoop: Is one cluster enough? by Paul Miller:
The open-source Apache Hadoop project continues its rapid evolution and is now capable of far more than its traditional use case of running a single MapReduce job on a single large volume of data. Projects like Apache YARN expand the types of workloads for which Hadoop is a viable and compelling solution, leading practitioners to think more creatively about the ways data is stored, processed, and made available for analysis.
Enthusiasm is growing in some quarters for the concept of a “data lake” — a single repository of data stored in the Hadoop Distributed File System (HDFS) and accessed by a number of applications for different purposes. Most of the prominent Hadoop vendors provide persuasive examples of this model at work but, unsurprisingly, the complexities of real-world deployment do not always neatly fit the idealized model of a single (huge) cluster working with a single (huge) data lake.
In this report we discuss some of the circumstances in which more complex requirements may exist, and explore a set of solutions emerging to address them.

Quantcast releases bigger, faster, stronger Hadoop file system

It’s not for everyone, but if you’re storing petabytes of data in Hadoop, Quantcast thinks it has the cure to your woes. Its newly open-sourced Quantcast File System promises smaller clusters and better performance, and it has proven itself over exabytes of data inside Quantcast.

How 0xdata wants to help everyone become data scientists

Although it’s still a work in progress, 0xdata thinks it has the answer to the problem of doing advanced statistical analysis at scale: Build on HDFS for scale, use the widely known R programming language and hide it all under a simple interface.

Because Hadoop isn’t perfect: 8 ways to replace HDFS

Hadoop is on its way to becoming the de facto platform for the next generation of data-based applications, but it’s not without some flaws. Ironically, one of Hadoop’s biggest shortcomings right now is also one of its biggest strengths going forward — the Hadoop Distributed File System.

How Facebook keeps 100 petabytes of Hadoop data online

It’s no secret that Facebook stores a lot of data in Hadoop, but how it keeps that data available whenever it needs it isn’t necessarily common knowledge. Today at the Hadoop Summit, Facebook engineer Andrew Ryan highlighted that solution, which Facebook calls AvatarNode.

What it really means when someone says ‘Hadoop’

Hadoop features front and center in the discussion of how to implement a big data strategy, one of the biggest trends in IT. There’s just one problem that keeps cropping up: many people don’t seem to know exactly what it means when somebody says “Hadoop.”

NetApp does network-attached Hadoop

Targeting customers demanding more reliable and efficient Hadoop clusters to power their big data efforts, NetApp has partnered with Cloudera to deliver a Hadoop storage system called the NetApp Open Solution for Hadoop. It combines Cloudera’s Hadoop distribution and management software with a NetApp-built RAID architecture.