Just a few weeks ago, Apache Hadoop 2.0 was declared generally available, a huge milestone for the Hadoop market because it opens up new ways of interacting with stored data. Hadoop remains the typical underpinning technology of “Big Data,” but how does it fit into the current landscape of databases and data warehouses that are already in use? And are there typical usage patterns that can distill some of the inherent complexity and give us all a common language?
Common patterns of Hadoop use
Hadoop was originally conceived to solve the problem of storing huge quantities of data at a very low cost for companies like Yahoo, Google, Facebook and others. Now, it is increasingly being introduced into enterprise environments to handle new classes of data. Machine-generated data, sensor data, social data, web logs and other such types are growing exponentially, and they are often (but not always) unstructured in nature. It is this type of data that is turning the conversation from “data analytics” to “big data analytics,” because so much insight can be gleaned from it for business advantage.
Analytic applications come in all shapes and sizes and, most importantly, are oriented around addressing a particular vertical need. At first glance, they can seem to have little relation to each other across industries and verticals. But when observed at the infrastructure level, some very clear patterns emerge: they fit into one of the following three.
Pattern 1: Data refinery
The “Data Refinery” pattern of Hadoop usage is about enabling organizations to incorporate these new data sources into their commonly used BI or analytic applications. For example, I might have an application that provides me a view of my customer based on all the data about them in my ERP and CRM systems, but how can I incorporate data from their web sessions on my website to see what they are interested in? This is the question customers typically look to the “Data Refinery” usage pattern to answer.
The key concept here is that Hadoop is used to distill large quantities of data into something more manageable. The resulting data is then loaded into the existing data systems, where traditional tools can access it as part of a much richer data set. In some respects, this is the simplest of all the use cases, in that it provides a clear path to value for Hadoop with very little disruption to the traditional approach.
No matter the vertical, the refinery concept applies. In financial services, we see organizations refine trade data to better understand markets or to analyze and value complex portfolios. Energy companies use big data to analyze consumption over geography to better predict production levels. Retail firms (and virtually any consumer-facing organization) often use the refinery to gain insight into online sentiment. Telecoms use the refinery to extract details from call detail records to optimize billing. Finally, in any vertical where we find expensive, mission-critical equipment, we often find Hadoop being used for predictive analytics and proactive failure identification. In communications, this may be a network of cell towers; a restaurant franchise may monitor refrigerator data.
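To make the refinery idea concrete, here is a minimal sketch in plain Python. It stands in for a Hadoop job rather than using any Hadoop API, and the record layout, the sample data and the names `RAW_SESSIONS` and `refine` are all hypothetical. It distills raw web-session records into one compact row per customer, the kind of row that could then be loaded next to existing ERP and CRM data.

```python
from collections import Counter, defaultdict

# Hypothetical raw web-session records: (customer_id, page_category).
# In a real deployment these would be large log files stored in Hadoop.
RAW_SESSIONS = [
    ("c1", "laptops"),
    ("c1", "laptops"),
    ("c1", "phones"),
    ("c2", "cameras"),
    ("c2", "cameras"),
    ("c3", "phones"),
]


def refine(sessions):
    """Distill raw page views into one summary row per customer."""
    views = defaultdict(Counter)
    for customer_id, category in sessions:
        views[customer_id][category] += 1

    rows = []
    for customer_id in sorted(views):
        counts = views[customer_id]
        # Most-viewed category; ties are broken alphabetically.
        top_interest = min(counts, key=lambda c: (-counts[c], c))
        rows.append({
            "customer_id": customer_id,
            "page_views": sum(counts.values()),
            "top_interest": top_interest,
        })
    return rows


# The refined rows are small enough to load into an existing warehouse table.
REFINED_ROWS = refine(RAW_SESSIONS)
```

The point of the sketch is the shape of the flow: many raw records go in, a handful of summary rows come out, and only those rows need to reach the traditional tools.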
Pattern 2: Data exploration with Apache Hadoop
The second most common use case is one we call “Data Exploration.” In this case, organizations capture and store a large quantity of this new data (sometimes referred to as a data lake) in Hadoop and then explore that data directly. Rather than using Hadoop as a staging area for processing before the data moves into the enterprise data warehouse, as in the Refinery use case, the data stays in Hadoop and is analyzed there.
With the Data Exploration use case, enterprises often start by capturing data that was previously being discarded (exhaust data such as web logs, social media data, etc.) and then build entirely new analytic applications that use that data directly. Nearly every vertical can take advantage of the exploration use case. In financial services, we find organizations using exploration to perform forensics or to identify fraud. A professional sports team will use data science to analyze trades and its annual draft, as we saw in the movie Moneyball. Ultimately, data science and exploration are used to identify net new business opportunities or net new insight in a way that was impossible before Hadoop.
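As a rough illustration of exploring data where it sits, the sketch below keeps raw event lines untouched and applies an ad hoc question only when the data is read. It is plain Python rather than a Hadoop query engine, and the event fields, the sample lines and the `explore` helper are assumptions made up for this example.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (one JSON document per line).
RAW_EVENTS = [
    '{"user": "u1", "action": "login", "country": "US"}',
    '{"user": "u1", "action": "transfer", "amount": 9500, "country": "RO"}',
    'corrupted line',
    '{"user": "u2", "action": "transfer", "amount": 120, "country": "US"}',
]


def explore(raw_lines, predicate):
    """Parse raw lines at read time and yield the events that match a question."""
    for line in raw_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # raw data is messy; skip lines that cannot be parsed
        if isinstance(event, dict) and predicate(event):
            yield event


# An ad hoc forensic question: which transfers exceed 1000?
large_transfers = list(
    explore(
        RAW_EVENTS,
        lambda e: e.get("action") == "transfer" and e.get("amount", 0) > 1000,
    )
)
```

Because nothing is reshaped ahead of time, a different question next week only needs a different predicate, not a new load into the warehouse.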
Pattern 3: Application enrichment
The third and final use case is “Application Enrichment.” In this scenario, data stored in Hadoop is used to direct an application’s behavior. For example, by storing all web session data (that is, the session histories of every user of a website), we can customize the experience for a customer when they return to the site. By keeping this session history in Hadoop, we are able to generate real value from it, for example by providing a timely offer based on a customer’s web history.
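A minimal sketch of that idea follows, assuming the session history has already been collected and is available to the application as a simple lookup. The names `SESSION_HISTORY`, `OFFERS`, `DEFAULT_OFFER` and `choose_offer`, along with the sample data, are hypothetical and not part of any Hadoop API.

```python
from collections import Counter

# Hypothetical session history per customer, as it might be derived from data kept in Hadoop.
SESSION_HISTORY = {
    "c1": ["laptops", "laptops", "phones"],
    "c2": ["cameras"],
}

# Hypothetical offers keyed by product category.
OFFERS = {
    "laptops": "10% off laptop bags",
    "cameras": "Free memory card",
}
DEFAULT_OFFER = "Welcome back"


def choose_offer(customer_id, history=SESSION_HISTORY, offers=OFFERS):
    """Pick an offer for a returning customer based on their most-viewed category."""
    visits = history.get(customer_id, [])
    if not visits:
        return DEFAULT_OFFER  # no history yet: fall back to a generic greeting
    counts = Counter(visits)
    # Most-viewed category; ties are broken alphabetically.
    top_category = min(counts, key=lambda c: (-counts[c], c))
    return offers.get(top_category, DEFAULT_OFFER)
```

Here the stored history changes what the application does at request time, which is what separates enrichment from the two analytic patterns.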
For many of the large web properties in the world, such as Yahoo and Facebook, this use case is foundational to their business. By customizing the user experience, they are able to differentiate themselves significantly from their competitors. This was the second use case for Hadoop at Yahoo, which realized that Hadoop could help improve ad placement. The concept translates beyond the large web properties and is being used by more traditional enterprises to improve sales. Some brick-and-mortar organizations are even using these concepts to implement dynamic pricing in their retail outlets.
As one might expect, this is typically the last use case to be adopted, generally once organizations have become familiar with refining and exploring data in Hadoop. At the same time, it hints at how Hadoop usage can and will evolve to serve an ever greater number of the applications that traditional databases serve today.
There is certainly complexity involved when any new platform technology makes its way into a corporate IT environment, and Hadoop is no exception. Whether you’re using Hadoop to refine, explore or enrich your data, interoperability with existing IT infrastructure will be key. That’s why we’re currently seeing immense growth within the Hadoop ecosystem and integration between different vendor solutions. Hadoop has the potential to have a profound impact on the enterprise data landscape, and by understanding the common patterns of use, you can greatly reduce that complexity.