Data journalism could use a jolt of data science, too

Data journalism is supposed to change the way we think about journalism, so it might want to start by changing the way we think about data.
Alberto Cairo wrote a thoughtful piece on the Nieman Journalism Lab blog Wednesday explaining how data journalism efforts such as FiveThirtyEight, Vox and the Upshot have over-promised and under-delivered on the quality of their content. I agree with much of what he wrote, but would add one more suggestion: Data journalism needs to embrace data science.
The ideal data scientist, it’s often said, should have foundational skills in statistics and math, processing unstructured data, querying data using SQL, and programming. These are requirements born of the web, of course, where data often takes different shapes than just numbers in a table, and so takes some legwork before it can be analyzed using traditional methods. Lacking express or firsthand data about the users or behaviors they wish to analyze, data scientists have become adept at combining disparate data points to build better user models or infer certain traits.
These methods aren’t always conducive to the often fact-based world of journalism, but there’s a place for them. For example, I don’t need to see seven different articles or blog posts every month breaking down the same jobs-report numbers using different types of charts. There are plenty of data sources that might provide a different angle on the economy, including social media, real estate websites, and large public datasets from cities, countries and government agencies.

A collection of recent data on food prices from Premise Data.


One startup I really admire, Premise Data, decided traditional reports on the global economy were too slow and often lacked a view into what’s happening on the ground in many areas, so it decided to start generating its own forecasts. A network of citizens in various cities around the world takes photos of specific things at specific times (e.g., the milk shelf at a local market) and Premise mines those photos for information on prices, supply and other things.
The point is, there’s a whole web of data out there for journalists willing to find it and do something creative with it. There are large geosocial datasets such as GDELT and Yahoo’s Flickr corpus for images. There are APIs from various sites, social media platforms and even music specialist the Echo Nest (which is now part of Spotify). There are untold numbers of web pages, posts and other text content, as well as hundreds of millions or even billions of photos, all waiting to be scraped and analyzed.
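As a rough illustration of what analyzing scraped text might look like at its simplest, here is a minimal Python sketch that counts the most frequent terms across a handful of posts. The sample posts, the short stopword list and the function name top_terms are all made up for this example; a real project would pull text from an actual source and use more careful language processing.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are",
             "for", "on", "at", "this", "my"}

def top_terms(posts, n=3):
    """Return the n most common non-stopword terms across a list of texts."""
    counts = Counter()
    for post in posts:
        # Lowercase the text and keep only runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", post.lower())
        counts.update(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

# Made-up sample posts standing in for scraped text.
sample_posts = [
    "Rent is up again in my neighborhood",
    "Milk prices at the market are up this week",
    "Rent and milk prices keep climbing",
]

print(top_terms(sample_posts, n=4))
```

Term counts like these are only a starting point, but even a crude tally can hint at which themes deserve a closer look.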
If there’s nothing good readily available, it’s not inconceivable that news organizations could create their own data stockpiles, as Premise does, or as this entomologist did in order to get quality data about the sounds of insect wings and the bugs’ activity cycles.

Google Correlate and other Google tools aren’t scientific, but they’re an easy source of data about what’s on people’s minds.

Despite the cliche that numbers don’t lie, they often do. Or, as Cairo points out in his Nieman Journalism Lab post, they’re at least open to interpretation and mischaracterization. So why not strive for fuller analysis by looking beyond the official numbers and those in widely publicized studies, and start thinking about what dots can be connected using social media, what text can be analyzed for themes and sentiment and, generally, what additional data can be pulled in to build a stronger argument or more accurate prediction?
Quantifying the topics that matter and trying to enlighten readers are noble goals, but they’re hard to achieve using the same data that has been around forever and hasn’t yet accomplished them. I say get creative or risk reproducing the same old results in a newer, prettier package.
Feature image courtesy of Shutterstock user ramcreations.