The week in big data on Twitter, visualized

I decided to play around a little more with ScraperWiki this week to see what people were talking about when they talked about big data on Twitter. The idea was to kill two birds with one stone: (1) demonstrate once again what’s possible in the realm of data visualization and mining even for novices using free online tools, and (2) give a little taste of what got people excited in the past seven days.
There are scientific studies and then there are collections of numbers, words and charts that purport to say something. This is decidedly the latter, but I really just wanted to see what types of stuff I could do with the data. If it’s at all interesting or useful, let me know. Maybe it can become a weekly thing.
Without further ado, here are some highlights of what I found, based on a sample of just over 33,000 tweets mentioning “big data.”
Here are the general stats from ScraperWiki, showing the number of tweets collected, the most-mentioned users, screen names (read “tweeters”) and hashtags, and other info. (Click on any of the images to see a larger version.)

Only a handful of the tweets were geotagged, but you can still see how popular a topic big data is around the globe.

Next, I uploaded the data to IBM ManyEyes and eliminated obvious or irrelevant terms like “big,” “data” and “RT.” Apparently, people care a lot about analytics.
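If you'd rather do that filtering step yourself, here's a minimal sketch in Python. It assumes the tweets are already loaded as a list of strings. The stop-term list and the `top_terms` helper are my own illustrative choices, not anything ManyEyes exposes.

```python
import re
from collections import Counter

# Terms to drop before counting. This list is an assumption for illustration.
STOP_TERMS = {"big", "data", "rt"}

def top_terms(tweets, stop_terms=STOP_TERMS, n=10):
    """Count the most frequent words across tweets, skipping stop terms."""
    counts = Counter()
    for text in tweets:
        # Lowercase, then keep word runs; an optional leading # or @ lets
        # hashtags and handles survive as their own tokens.
        words = re.findall(r"[#@]?\w+", text.lower())
        counts.update(w for w in words if w not in stop_terms)
    return counts.most_common(n)

# Hypothetical sample tweets, not taken from the real dataset.
sample = [
    "RT @example: Big data analytics is everywhere",
    "Analytics and big data go together",
]
print(top_terms(sample, n=1))
```

A real run would probably also want to drop URLs and common filler words like “and” or “the.”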

This phrase net from ManyEyes is pretty cool, too. It shows the most common three-word phrases where “is” is the middle word. As you can see, people think data is doing a lot of things.
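The counting behind a phrase net like this can be roughly approximated in a few lines of Python. This is only a sketch of the idea, not how ManyEyes actually builds its visualization, and `phrases_around` is a hypothetical helper name.

```python
import re
from collections import Counter

def phrases_around(tweets, connector="is"):
    """Count (word before, word after) pairs around a connector word."""
    pairs = Counter()
    for text in tweets:
        words = re.findall(r"[a-z']+", text.lower())
        # Slide a three-word window and keep the windows whose middle
        # word is the connector.
        for before, middle, after in zip(words, words[1:], words[2:]):
            if middle == connector:
                pairs[(before, after)] += 1
    return pairs

# Hypothetical sample tweets, not taken from the real dataset.
sample = [
    "Big data is changing retail",
    "Data is changing everything",
    "Hadoop is hard",
]
print(phrases_around(sample).most_common(1))
```

Swapping the `connector` argument for “and” gives the same kind of count for a different linking word.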

From ManyEyes again, this time showing the most popular words before and after “is”.

Bill Gates had the most retweets in the dataset: 382 of a single tweet announcing his talk at the Microsoft Research Faculty Summit, which we covered here. Twitter says it has been retweeted a total of 682 times. I went to Tableau Public for this because it can handle a lot of rows and it can show records for everyone mentioned in the data, not just the top few as the ScraperWiki summary does.
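For the curious, a retweet tally like this can be sketched without Tableau. The snippet below assumes retweets show up in the scrape in the old “RT @name: text” style. That's an assumption about the data, and `most_retweeted` is a name I made up for illustration.

```python
import re
from collections import Counter

# Assumes old-style retweets that begin with "RT @name:".
RT_PATTERN = re.compile(r"^RT @(\w+):?\s*(.*)", re.DOTALL)

def most_retweeted(tweets, n=1):
    """Return the n most retweeted (user, text) pairs with their counts."""
    counts = Counter()
    for text in tweets:
        match = RT_PATTERN.match(text)
        if match:
            user, body = match.groups()
            counts[(user.lower(), body.strip())] += 1
    return counts.most_common(n)

# Hypothetical sample tweets, not taken from the real dataset.
sample = [
    "RT @example_user: Talking big data next week",
    "RT @example_user: Talking big data next week",
    "RT @other: Hadoop is fun",
    "Not a retweet",
]
print(most_retweeted(sample))
```

A count like this only covers retweets that landed in the sample, which is one reason it can come in lower than the total Twitter reports.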

Here’s the tweet.

Mashable also had a rather popular post highlighting “5 big data projects that could impact your life.” Just filtering by Twitter accounts will miss mentions that don’t include the source, but a tool like ManyEyes lets you easily search by title (including variations on it).
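Here's one rough way to catch title variations in code, assuming you're happy to match on a few keywords from the headline instead of the exact string. The keyword list and both helper names are hypothetical.

```python
import re

def normalize(text):
    """Lowercase, turn punctuation into spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def mentions_title(tweet, title_keywords):
    """True if every keyword appears as a whole word in the tweet."""
    words = set(normalize(tweet).split())
    return all(keyword in words for keyword in title_keywords)

# Keywords picked by hand from the headline; an illustrative choice.
keywords = ["big", "data", "projects", "impact"]

# Hypothetical sample tweets, not taken from the real dataset.
sample = [
    "5 Big Data Projects That Could Impact Your Life http://example.com",
    "Five big-data projects that might impact you",
    "Big data is hard",
]
matches = [t for t in sample if mentions_title(t, keywords)]
print(len(matches))
```

Keyword matching is forgiving about punctuation and rewording, but it can also pull in unrelated tweets that happen to share the same words.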
Then, I decided to expand the scope to include “Hadoop,” “machine learning” and “data science OR data scientist.” Each of these topics had a much smaller sample size (these scrapes ran for a much shorter time and I assume there’s just a smaller number of people talking about them), but here are some highlights from each.

Hadoop


@BigDataBorat is the undisputed king of Hadoop retweets, and his tweets apparently have no shelf life. Here was the most popular one this week.

I thought this phrase cloud was pretty cool, too. It highlights Hadoop’s place in the big data stack, with the connector being “and”.

Machine learning

A handful of news items really dominated the machine learning discussion, namely: Ayasdi’s funding, Cloudera buying Myrrix, the new BloomReach mobile service and a story about recording the sounds of endangered species using iPods.

Still, people do seem to think highly of machine learning. Here’s a phrase net again with “is” as the connector.

People seemed to like this news from Joe Hellerstein, too.

Data science

After eliminating all the obvious terms, this word cloud from ManyEyes suggests that the University of California, Berkeley’s decision to offer an online graduate degree in data science was a big deal.
As was Unix. Why? Because Tim O’Reilly still loves the Unix command line.

And because GrubHub “data nerd” (his term) Greg Reda wrote a post titled “Useful Unix commands for data science.”
In terms of sheer retweet volume, data mining website KDnuggets came out on top, with its total spread across a handful of tweets. This tweet, unattached to news, a blog post or Tim O’Reilly, just struck a nerve.

And, finally, this one points to a rather insightful post about data science from Jetpac CTO Pete Warden.