Repeat after me: ‘Google is not a proxy for big data’

Another year, another report about how inaccurate the Google Flu Trends predictions turned out to be for the previous year. And more warnings about the dangers of relying on Google, and therefore “big data” and algorithms, for important stuff.

Repeat after me: Google is not a proxy for big data. It also isn’t supposed to replace the Centers for Disease Control. Even Google wouldn’t make that claim.

I made the same argument in more detail when this concern popped up last year, but here it is in a nutshell: “Big data” isn’t the enemy; it’s a friend. So are algorithms. But they must be used correctly.

Google is a great source of data for Google. Twitter is a great source of data for Twitter. Facebook is a great source of data for Facebook. For everyone else, they’re just additional sources of data of varying value depending on what’s being studied.

The latest study to question the accuracy of Google Flu Trends (published in Science under the title “The Parable of Google Flu: Traps in Big Data Analysis”) found that combining Google’s predictions with the CDC’s predictions resulted in the most accurate model. It sounds like Google’s data is actually valuable in this case, so what’s the problem?
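
To see why blending sources can help, here is a minimal Python sketch on synthetic data. Nothing in it comes from the Science study: the `search_signal` and `lagged_report` series, their noise levels and the two-weight least-squares blend are all assumptions made up for illustration, not the researchers' method.

```python
import math
import random

# Synthetic data only: every series and constant below is a made-up assumption.
rng = random.Random(0)
weeks = 104

# Hypothetical "true" weekly flu level with a yearly cycle.
truth = [3.0 + 2.0 * math.sin(2 * math.pi * t / 52) for t in range(weeks)]

# Hypothetical search-based signal: timely, but it runs high and is noisy.
search_signal = [1.4 * v + rng.gauss(0, 0.3) for v in truth]

# Hypothetical official report: accurate, but two weeks behind.
lagged_report = [truth[max(t - 2, 0)] + rng.gauss(0, 0.1) for t in range(weeks)]


def mse(pred, actual):
    """Mean squared error between two equal-length series."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


def fit_two_weights(a, b, y):
    """Least-squares weights (wa, wb) minimizing sum((wa*a + wb*b - y)^2)."""
    saa = sum(x * x for x in a)
    sbb = sum(z * z for z in b)
    sab = sum(x * z for x, z in zip(a, b))
    say = sum(x * t for x, t in zip(a, y))
    sby = sum(z * t for z, t in zip(b, y))
    det = saa * sbb - sab * sab  # zero only if a and b are collinear
    wa = (say * sbb - sab * sby) / det
    wb = (saa * sby - sab * say) / det
    return wa, wb


wa, wb = fit_two_weights(search_signal, lagged_report, truth)
combined = [wa * s + wb * r for s, r in zip(search_signal, lagged_report)]

mse_search = mse(search_signal, truth)
mse_lagged = mse(lagged_report, truth)
mse_combined = mse(combined, truth)

print("search only :", mse_search)
print("lagged only :", mse_lagged)
print("combined    :", mse_combined)
```

Because the weights are fit and scored on the same weeks, the blend cannot score worse than either input here. A fair test would score it on weeks held out from the fit.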

We’ll talk a lot about where big data is headed at our Structure Data conference next week, but I laid out some of the themes in a post on Wednesday. We’re heading toward a world where we can quantify nearly anything and then use that data to feed models that are, in theory, more accurate than anything we’ve yet been able to achieve. Organizations able to take advantage of all these new data sources could reap serious rewards.

But old caveats still apply. Correlation doesn’t mean causation, Twitter is still a very small sample of the U.S. population, and a bunch of paranoid parents researching flu vaccines don’t portend an epidemic.

Feature image courtesy of Shutterstock user Creativa.