It turns out that “big data” isn’t just a buzzword, but a legitimate concern for companies across the board. Their interest in tools to capitalize on the data-analysis opportunity has sparked a land grab among software vendors centered on Hadoop.
During an afternoon panel entitled “The Many Faces of MapReduce — Hadoop and Beyond,” moderator Gary Orenstein compared the two primary Hadoop components — MapReduce and the Hadoop Distributed File System — to the meat and bread of a sandwich.
Cloud application-platform provider Appistry has teamed with Accenture to develop its Cloud MapReduce product. Cloud MapReduce is focused on real-time analysis of streaming data, and it complements Appistry’s distributed file system to form a Hadoop alternative for certain applications.
On the Structure blog today, I covered Yahoo’s decision to open source its S4 project, which the company calls “real-time MapReduce,” ostensibly because it relies upon Hadoop clusters for processing. I would argue this is actually quite a big deal considering the focus that vendors like IBM, SAP, Teradata and Oracle have been placing on real-time processing lately. So, here’s my question: If someone were to incorporate S4 into its Hadoop distribution or development platform, would it have a meaningful impact on the types of jobs organizations are thinking about when contemplating Hadoop?
Yahoo has open-sourced its S4 project for developing real-time MapReduce applications. As we’ve seen with Google’s new Caffeine infrastructure for its Instant Search features, there is a growing trend of unchaining large-scale data analysis from its batch-processing roots.
Aster Data, a San Carlos, Calif.-based company, is offering a free version of its MapReduce development environment for download, which will allow developers to build data-analysis apps on it. MapReduce is a technology that was first used by Google for parallel processing of big data sets.
Hadoop, as a pivotal piece of the data mining renaissance, offers the ability to tackle large data sets in ways that weren’t previously feasible due to time and dollar constraints. But Hadoop can’t do everything quite yet, especially when it comes to real-time workflows. Fortunately, a couple of innovative efforts within the Hadoop ecosystem, such as Hypertable and HBase, are filling the gaps, while at the same time providing a glimpse as to where Hadoop’s full capabilities might be headed.
A study released today by a team of leading database experts, among them Structure 09 speaker Michael Stonebraker, has been generating buzz for its assertion that clustered SQL database management systems (DBMS) actually perform significantly better for most tasks than does cloud golden child MapReduce. But how shocked should we be, really? After all, choosing a parallel data strategy is not an all-or-nothing proposition.
We’re now entering what I call the “Industrial Revolution of Data,” where the majority of data will be stamped out by machines: software logs, cameras, microphones, RFID readers, wireless sensor networks and so on. These machines generate data a lot faster than people can, and their production rates will grow exponentially with Moore’s Law. Storing this data is cheap, and it can be mined for valuable information.
In this context, there is some good news for parallel programming. Data analysis software parallelizes fairly naturally. In fact, software written in SQL has been running in parallel for more than 20 years. But with “Big Data” now becoming a reality, more programmers are interested in building programs on the parallel model — and they often find SQL an unfamiliar and restrictive way to wrangle data and write code. The biggest game-changer to come along is MapReduce, the parallel programming framework that has gained prominence thanks to its use at web search companies.
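The model itself is simple enough to sketch: a map step that turns each input record into key-value pairs, and a reduce step that merges the values for each key. Here is a minimal, single-machine illustration in Python of the classic word-count job (this is just the shape of the computation, not the Hadoop API; the real framework distributes both phases across a cluster and shuffles keys between them):

```python
from collections import Counter
from functools import reduce

# Map phase: each document emits per-word counts, i.e. (word, n) pairs.
def map_doc(doc):
    return Counter(doc.lower().split())

# Reduce phase: merge the per-document counts into one global tally.
def merge_counts(total, partial):
    total.update(partial)
    return total

docs = ["the quick brown fox", "the lazy dog", "the fox"]
total = reduce(merge_counts, map(map_doc, docs), Counter())
print(total["the"])  # 3
```

Because each `map_doc` call depends only on its own input and the merge step is associative, the map phase can run on as many machines as there are documents — which is exactly the property that makes the model attractive for big data.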
Things change fast in computer science, but odds are that they will change especially fast in the next few years. Much of this change centers on the shift toward parallel computing. In the short term, parallelism will take hold in the analysis of massive datasets, but longer term, the shift to parallelism will impact all software, because most existing systems are ill-equipped to handle this new reality.
Like many changes in computer science, the rapid shift toward parallel computing is a function of technology trends in hardware. Most technology watchers are familiar with Moore’s Law, and the more general notion that computing performance doubles about every 18-24 months. This continues to hold for disk and RAM storage sizes, but a very different story has unfolded for CPUs in recent years, and it is changing the balance of power in computing — probably for good.