Tap11 Tries to Tame the Twitter Data Firehose

Braxton Woodham, Tap11, at Structure Big Data 2011

When it comes to social data, one of the biggest firehoses around is the one that comes from Twitter, which recently announced that more than 140 million tweets are sent every day. Not surprisingly, trying to make sense of that data in real time is a significant challenge, according to Braxton Woodham, co-founder and chief technology officer of Tap11, one of a number of startups that are trying to sift through social media to find out what people are saying about companies and their products. Woodham talked about his company’s approach at the Structure: Big Data conference in New York City today.

One of the problems with social data and the real-time web, Woodham said, is that most of the data is “in a very fragmented state” because it is located inside different companies, many of which have different privacy models and data-collection rules. That makes creating what Tap11 calls the “Omniture of the real-time web” (that is, an analytical platform like the one that most websites use) a difficult proposition. The Tap11 co-founder said that the analogy he likes to use for the real-time web is the oil industry:

“You’ve got this new resource, which is the data, and Facebook is sort of like OPEC, and Twitter is like the Russian oilfields. First you have to build a pipeline, which is what our infrastructure project was all about, then you have to figure out a way to refine the data so it’s useful for your customers, and then you have to distribute it.”

Although the company plans to expand to other social platforms, Woodham says that it started with Twitter because it is the most open network of that scale. When Tap11 first started pulling in the firehose, he says, it was 90 million tweets a day; now it is up to 140 million, or about 1,600 tweets per second.

The Tap11 co-founder says that most companies want to do three things with the data the company pulls in: they want to listen to conversations about their brands or products and find the most influential people; they want insights into the performance of tweets — retweets, replies, and so on — and then they want details on how many people are responding to their updates and when.

To accomplish that, Woodham says, the company built a system that starts with a simple first-match filter, which acts like a bouncer at a nightclub and removes any tweets that aren’t about a specific subject, followed by a queuing system that holds the data so that it can be analyzed. After trying a number of different solutions, he says, Tap11 eventually settled on Kestrel, which was developed by Twitter. The company then moves all of that data into an archive built on Hadoop, where it can be analyzed.
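
To make the shape of that pipeline concrete, here is a minimal sketch in Python. It is not Tap11's code: the rule list, the function names, and the sample tweets are all hypothetical, and an in-memory deque and list stand in for Kestrel and the Hadoop archive, which a real deployment would reach over the network. It only illustrates the idea of a first-match filter feeding a queue that is later drained into an archive.

```python
from collections import deque

# Hypothetical rules: (topic, keywords). Order matters, because the
# filter stops at the first rule that matches.
RULES = [
    ("acme", ("acme", "@acme")),
    ("widgets", ("widget",)),
]


def first_match(text, rules=RULES):
    """Return the first topic whose keyword appears in the text, else None."""
    lowered = text.lower()
    for topic, keywords in rules:
        if any(keyword in lowered for keyword in keywords):
            return topic
    return None


def run_pipeline(tweets, rules=RULES):
    """Filter tweets, queue the survivors, then drain the queue into an archive."""
    queue = deque()  # in-memory stand-in for a message queue such as Kestrel
    archive = []     # in-memory stand-in for a Hadoop-backed archive

    for text in tweets:
        topic = first_match(text, rules)
        if topic is not None:  # the "bouncer": off-topic tweets are dropped
            queue.append((topic, text))

    while queue:  # a separate consumer would normally do this step
        archive.append(queue.popleft())
    return archive


if __name__ == "__main__":
    sample = ["I love my Acme widget", "lunch time", "New widget launch"]
    for topic, text in run_pipeline(sample):
        print(topic, "|", text)
```

In this sketch a tweet that mentions both "Acme" and "widget" is tagged only with the first matching topic, which is what makes the filter cheap: it never evaluates more rules than it needs to. Whether Tap11's filter resolves overlapping subjects the same way is not something Woodham described.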