In the name of accuracy, Google retools its Flu Trends model

Google has responded to criticisms of its Flu Trends tool and has reworked its predictive model to also account for data from the Centers for Disease Control and Prevention. It’s still not a replacement for actual scientific research, but it should be more accurate.

Big Data catches the Google flu

Science magazine last week published an analysis of the errors and limitations in Google Flu Trends (GFT), the flu trend analysis that Google generates from its search engine traffic, and news stories this week, from The New York Times to the Financial Times, have offered their own spins on the findings. The New York Times story was an update to a blog critique from a year earlier of reports that GFT was significantly missing the mark, particularly in comparison with the analysis from the Centers for Disease Control and Prevention, which has a three-week data lag.
The limits of many big data sources
The article in Science highlights the limitations of many search- and social media-based data sources. The algorithms may not be transparent (e.g., Google has never released the 45 search terms used for GFT), and they may be tweaked regularly (e.g., 86 reported changes to Google search in June and July 2012 alone), which makes initial analysis, and especially attempts at replication, fraught. There is also the risk of ‘red team’ attacks, in which research subjects attempt to influence their own data (e.g., political campaigns and companies endeavoring to hit Twitter trending targets). ‘Big data hubris’ reflects a tendency to think larger data sets can replace more controlled, traditional data collection and analysis, despite frequent issues with measurement, construct validity and reliability, and dependencies among the data.
Lessons for moving forward
The Science article offers the following conclusions:

  • Transparency and replicability are central to science. Opaque and ever-changing search algorithms sharply limit the traditional scientific process of replication and of follow-on studies building on earlier findings.
  • Use big data to understand the unknown. GFT data can be combined with CDC data to marginally improve CDC findings. However, the potential for GFT to provide more localized data than is practical for the CDC is likely of greater complementary value.
  • Study the algorithm. What is being asked and how it is being asked is, of course, central to results—and to worthwhile subsequent analysis.
  • It’s not just about the size of the data. Big data offers great potential for new and more expansive research and analysis, but the Internet is also improving the potential for traditional data collection and analysis. Both methods are undergoing a revolution of sorts.
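The complementary use described above can be sketched as a simple blend of a lagged CDC figure with a same-week search-based estimate. This is purely illustrative; the blending weight and all numbers below are assumptions, not the methods of Google or the Science authors:

```python
# Hypothetical sketch: blending a three-week-old CDC ILI figure with a
# current search-based estimate, reflecting the idea that GFT-style
# signals complement, rather than replace, CDC data.

def nowcast(cdc_lagged, search_estimate, weight=0.7):
    """Blend the most recent (lagged) CDC value with a same-week
    search-based estimate; `weight` favors the CDC figure."""
    return weight * cdc_lagged + (1 - weight) * search_estimate

# Illustrative values: CDC reported a 2.1% ILI rate three weeks ago,
# while the search signal suggests 2.9% this week.
estimate = nowcast(2.1, 2.9)
print(round(estimate, 2))  # 2.34
```

A real combination would of course be fit against historical data rather than using a fixed weight, but the structure, a lagged authoritative series plus a timely proxy, is the point.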

A further critique of found data
The FT article goes much further than the New York Times piece, broadening the critique to ‘found data’ generally, which includes much corporate big data as well as Internet sources. The FT piece addresses four common big data claims it calls ‘at best optimistic oversimplifications’:

  • Data produces uncannily accurate results;
  • Every single data point can be captured, making old statistical sampling techniques obsolete;
  • It is passé to fret over what causes what, because statistical correlation tells us all we need to know; and,
  • With enough data, scientific and statistical models aren’t needed.

Indeed, the FT quotes one professor as terming the claims, “complete bollocks. Absolute nonsense.”
More big data concerns
None of the researchers in these studies and articles finds big data to be without significant commercial and societal value; they simply offer cautions and caveats. But the speed with which big data collection and analysis is permeating society can hardly be overstated. The AP has a story this week on farmers raising concerns that agribusiness data collection at their operations, including real-time, GPS-informed data feeds, could be problematic. Seed companies such as Monsanto may be taking the lead on this, but entities from government agencies to commodity market traders see uses for the data as well.
Among the conclusions to draw from these analyses are the following:

  • New norms for the accuracy and transparent communication of big data sources, and new societal standards for its privacy permissions and reach, need to be developed.
  • Enterprises can minimize the adverse consequences of early use of the technology by preemptively establishing suitably conservative standards of their own.
  • The adoption of big data, though still in the early stages, is already pervasive.
  • With new uses found daily, we are all in for a bumpy, but fast and sometimes thrilling ride.

Repeat after me: ‘Google is not a proxy for big data’

Another study reports on the inaccuracy of the Google Flu Trends project, which predicts seasonal flu rates from search data. However, Google’s algorithms don’t constitute the “big data” approach to this issue; they’re just one piece of a smart big data approach.
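As a rough illustration of the search-based piece of such an approach, the sketch below fits a linear model mapping a search-term frequency to an observed flu rate and applies it to a new week. The query frequencies and flu rates are synthetic, and the single-term model is far simpler than Google’s actual 45-term approach:

```python
# Hypothetical sketch of the core GFT idea: learn a mapping from search
# activity to flu incidence, then nowcast from fresh search data.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Synthetic training data: weekly flu-related search frequency (per 1,000
# queries) paired with the CDC-reported flu rate for the same week.
freq = [0.8, 1.0, 1.5, 2.0, 2.6]
rate = [1.1, 1.3, 1.9, 2.4, 3.1]

a, b = fit_line(freq, rate)
predicted = a * 1.8 + b  # nowcast for a new week with frequency 1.8
```

The fragility the critics describe enters exactly here: if the search algorithm changes or users start searching differently, the learned coefficients silently stop describing reality.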