When data lakes become landfills: how to avoid drowning in surplus information

We hear about “big data” everywhere these days and, for companies, it’s assumed to be a good thing: more knowledge, more analytics, more scale. But, in reality, big data pools can become an expensive, incoherent mess that no one is sure what to do with.

These “data landfills” are a problem, especially for companies that are dipping their toes into big data for the first time. At GigaOM’s Structure:Europe event on Thursday, four data infrastructure experts described the problem and how to avoid it.

Raymie Stata, a former Yahoo CTO who is now CEO of Altiscale, explained that Hadoop lacks good metadata tools that can identify data lakes and flag which ones are useful for various parts of a business.

Stata said that too often it’s only a firm’s operations team that sees the raw data, depriving product teams of a chance to see rich signals they could use to respond to customers. Meanwhile, the data continues to pile up, creating costs, and leaving firms to ask questions like:

“Where did this come from? What was it derived from? What’s the retention policy? Who can get rid of it?”
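To make those four questions concrete, here is a minimal sketch of the kind of metadata record a catalog might keep for each dataset. It is not a tool any of the panelists described: the `DatasetRecord` class, its field names, and the `landfill_candidates` helper are all hypothetical, chosen only to show how origin, lineage, retention, and ownership could be written down and checked.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class DatasetRecord:
    """Hypothetical catalog entry; every field name here is illustrative."""
    name: str
    source: str                                   # Where did this come from?
    created: date
    derived_from: List[str] = field(default_factory=list)  # What was it derived from?
    retention_days: Optional[int] = None          # Retention policy (None = undefined)
    owner: Optional[str] = None                   # Who can get rid of it?

    def is_expired(self, today: date) -> bool:
        """True once the dataset has outlived its stated retention period."""
        if self.retention_days is None:
            return False
        return today > self.created + timedelta(days=self.retention_days)


def landfill_candidates(records: List[DatasetRecord], today: date) -> List[str]:
    """Names of datasets with no owner, no retention policy, or past retention."""
    return [
        r.name
        for r in records
        if r.owner is None or r.retention_days is None or r.is_expired(today)
    ]


if __name__ == "__main__":
    # Made-up example records, not real data.
    records = [
        DatasetRecord("clickstream_raw", "web logs", date(2013, 1, 1),
                      retention_days=30, owner="ops"),
        DatasetRecord("sessions", "nightly job", date(2013, 9, 1),
                      derived_from=["clickstream_raw"],
                      retention_days=365, owner="product"),
        DatasetRecord("tmp_export", "unknown", date(2013, 6, 1)),
    ]
    print(landfill_candidates(records, today=date(2013, 9, 19)))
    # prints ['clickstream_raw', 'tmp_export']
```

The point of the sketch is that a dataset nobody owns, or one with no retention policy, gets flagged just like one that has outlived its policy. Those are the piles that tend to turn into landfill.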

According to Adam Fuchs, the CTO at Sqrrl Data, the problem stems in part from the fact that most big data tools are built for the developer community, not for the customer-facing parts of a business. This means that sales and other units in a firm don’t have a practical way to understand the data their company controls or how to turn it into business opportunities.

The good news is that the tide may be turning. Ron Bodkin, the CEO of Think Big Analytics, said the siloed view of data is waning:

“The traditional approach of analytics and warehousing — those days are numbered,” said Bodkin, adding that it’s not acceptable anymore for developers to keep data in a “glass house” and only hand it over once it’s been perfectly curated.

The better approach is to treat data in a more agile manner, and create teams that consist of both an IT person and a business executive. Together they can scan the data for opportunities that will yield a quick and concrete success — like producing a sale, or discovering and purging a malicious botnet. This type of success can, in turn, change the conversation about data and inspire more teams to start brainstorming.

Bodkin added that, historically, IT departments have not regarded themselves as obligated to add value to a firm but that, today, they should be as committed to experimenting and finding revenue opportunities as everyone else.

Some companies, especially those where big data is a core competency, already know this. One example is LinkedIn:

“Within the LinkedIn eco-system, we have done an innovative thing for sales people and marketing,” said Bhaskar Ghosh, Senior Director of Engineering and Data Infrastructure, explaining that the company’s engineers supply “interest graphs” that make it easier for the salesforce to identify and sell to customers.

The panel was moderated by Tim Moreton, the CTO of Acunu, who, like Bodkin, stressed the importance of getting “quick wins” with data as a way to ensure the data lakes don’t turn into landfills.

Check out the rest of our Structure:Europe 2013 coverage here. Video of the session is available at:

http://new.livestream.com/accounts/74987/events/2361507/videos/30396427/player?autoPlay=false&height=360&mute=false&width=640

A transcription of the video follows on the next page