Beyond Hadoop: Next-Generation Big Data Architectures

After 25 years of dominance, relational databases and SQL have in recent years come under fire from the growing “NoSQL movement.” A key element of this movement is Hadoop, the open-source clone of Google’s internal MapReduce system. Whether it’s interpreted as “No SQL” or “Not Only SQL,” the message has been clear: If you have big data challenges, then your programming tool of choice should be Hadoop.

The only problem with this story is that the people who really do have cutting-edge performance and scalability requirements today have already moved on from the Hadoop model. A few have moved back to SQL. The much more significant trend, though, is that companies that have come to understand the capabilities and limitations of MapReduce and Hadoop are now developing a whole raft of radically new post-Hadoop architectures that are, in most cases, orders of magnitude faster at scale than Hadoop. We are now at the start of a new “NoHadoop” era, as companies increasingly realize that big data requires “Not Only Hadoop.”

Simple batch processing tools like MapReduce and Hadoop are just not powerful enough in any of the dimensions of the big data space that really matter. Sure, Hadoop is great for simple batch processing tasks that are “embarrassingly parallel,” but most of the difficult big data tasks confronting companies today are much more complex than that. They can involve complex joins, ACID requirements, real-time requirements, supercomputing algorithms, graph computing, interactive analysis, or the need for continuous incremental updates. In each case, Hadoop is unable to provide anything close to the levels of performance required. Fortunately, however, in each case there now exist next-generation big data architectures that can provide the required scale and performance. Over the next couple of years, these architectures will break out into the mainstream.

Here is a brief overview of the current NoHadoop or post-Hadoop space. In each case, the next-gen architecture beats MapReduce/Hadoop by anything from 10x to 10,000x in terms of performance at scale.

SQL. It’s a bit weird to call a language that has been around for 25 years next-gen, but it is! There’s currently a tremendous amount of innovation going on around SQL from companies like VoltDB, Clustrix and others. If you need to handle complex joins, or need ACID guarantees, SQL is still the way to go. Applications: Complex business queries, online transaction processing.

Cloudscale. [McColl is the CEO of Cloudscale. See his bio below.] For real-time analytics on big data, it’s essential to break free from the constraints of batch processing. For example, if you’re looking to continuously analyze a stream of events at a rate of one million events per second per server, and deliver results with a maximum latency of five seconds between data in and analytics out, then you need a real-time data flow architecture. The Cloudscale architecture provides this kind of real-time big data analytics, with latency up to 10,000x lower than that of batch processing systems such as Hadoop. Applications: Algorithmic trading, fraud detection, mobile advertising, location services, marketing intelligence.
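To make the contrast with batch processing concrete, here is a minimal Python sketch of the general streaming idea: each event is folded into a running result the moment it arrives, so an up-to-date answer is always available without re-scanning stored data. This is not Cloudscale's implementation; the class name SlidingWindowSum, its parameters, and the choice of a windowed sum are assumptions made purely for illustration.

```python
from collections import deque


class SlidingWindowSum:
    """Running sum over the most recent `window_seconds` of events.

    Hypothetical helper for illustration only; not any vendor's API.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0

    def add(self, timestamp, value):
        # Fold the new event into the running total as soon as it arrives.
        self.events.append((timestamp, value))
        self.total += value
        # Evict events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total
```

Each call to add does a small amount of work per event and returns the current answer immediately, whereas a batch job would have to wait for the next scheduled pass over the accumulated data.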

MPI and BSP. Many supercomputing applications require complex algorithms on big data, in which processors communicate directly at very high speed in order to deliver performance at scale. Parallel programming tools such as MPI and BSP are necessary for this kind of high performance supercomputing. Applications: Modelling and simulation, fluid dynamics.

Pregel. Need to analyse a complex social graph? Need to analyse the web? It’s not just big data, it’s big graphs! We’re rapidly moving to a world where the ability to analyse very-large-scale dynamic graphs (billions of nodes, trillions of edges) is becoming critical for some important applications. Google’s Pregel architecture uses a BSP model to enable highly efficient graph computing at enormous scale. Applications: Web algorithms, social graph algorithms, location graphs, learning and discovery, network optimisation, internet of things.
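For readers unfamiliar with the BSP style of graph computing, the following Python sketch shows its general shape on a single machine: computation proceeds in supersteps, each vertex reads the messages sent to it in the previous superstep, updates its own value, and sends messages to its neighbours, with a barrier between supersteps. The example propagates the largest value in the graph to every vertex. The function name propagate_max and the dictionary-based graph representation are assumptions for illustration; this is not Google's code or API, and a real system would distribute the vertices across many machines.

```python
def propagate_max(graph, values):
    """Spread the maximum vertex value through a graph, BSP style.

    graph:  dict mapping each vertex to a list of its out-neighbours
            (every neighbour must also appear as a key).
    values: dict mapping each vertex to its initial number.
    Both names are hypothetical and exist only for this sketch.
    """
    values = dict(values)  # work on a copy

    # Superstep 0: every vertex sends its value to its neighbours.
    inbox = {v: [] for v in graph}
    for v, neighbours in graph.items():
        for n in neighbours:
            inbox[n].append(values[v])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, messages in inbox.items():
            # A vertex does work only if it learned a larger value.
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in graph[v]:
                    outbox[n].append(values[v])
        # Barrier: messages sent now are only visible in the next superstep.
        inbox = outbox

    return values
```

Note that vertices with nothing new to report stay silent, so the work in each superstep shrinks as the computation converges.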

Dremel. Need to interact with web-scale data sets? Google’s Dremel architecture is designed to support interactive, ad hoc queries over trillion-row tables in seconds! It executes queries natively without translating them into MapReduce jobs. Dremel has been in production since 2006 and has thousands of users within Google. Applications: Data exploration, customer support, data center monitoring.

Percolator (Caffeine). If you need to incrementally update the analytics on a massive data set continuously, as Google now has to do on its index of the web, then an architecture like Percolator (Caffeine) beats Hadoop easily; Google Instant just wouldn’t be possible without it. “Because the index can be updated incrementally, the median document moves through Caffeine over 100 times faster than it moved through the company’s old MapReduce setup.” Applications: Real-time search.
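The advantage of incremental updating can be illustrated with a toy inverted index in Python. When one document changes, only the index entries for the words that were added or removed are touched, rather than rebuilding the whole index in a batch job. This is a single-machine sketch of the general idea only; it does not reproduce how Percolator itself works, and the class name IncrementalIndex and its methods are hypothetical.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index that is updated one document at a time.

    Hypothetical class for illustration; not Percolator's design or API.
    """

    def __init__(self):
        self.docs = {}                    # doc_id -> set of words
        self.postings = defaultdict(set)  # word -> set of doc_ids

    def upsert(self, doc_id, text):
        """Add or replace one document, touching only the changed words."""
        new_words = set(text.lower().split())
        old_words = self.docs.get(doc_id, set())
        for word in old_words - new_words:
            self.postings[word].discard(doc_id)
        for word in new_words - old_words:
            self.postings[word].add(doc_id)
        self.docs[doc_id] = new_words

    def search(self, word):
        """Return the sorted ids of documents containing the word."""
        return sorted(self.postings.get(word.lower(), set()))
```

The cost of an update here depends on the size of the change to one document, not on the size of the whole collection, which is the property that makes continuous index updates practical.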

The fact that Hadoop is freely available to everyone means it will remain an important entry point to the world of big data for many people. However, as the performance demands on big data apps continue to increase, these new, more powerful forms of big data architecture will be required in many cases.

Bill McColl is the founder and CEO of Cloudscale Inc. and a former professor of Computer Science, Head of the Parallel Computing Research Center, and Chairman of the Computer Science Faculty at Oxford University.
