Steering clear of the iceberg: three ways we can fix the data-credibility crisis in science

As I detailed yesterday, science has a data-credibility problem. There’s been a rash of experiments that no one can reproduce and studies that have to be retracted, all of which threatens to undermine the health and integrity of a fundamental driver of medical and economic progress. For the sake of the researchers, their funders and the public, we need to boost the power of the science community to self-correct and confirm its results.

In the eight years since John Ioannidis dropped the bomb that “most published research findings are false,” pockets of activist scientists from both academia and industry have been forming to address this problem, and this year some of those efforts finally seem to be bearing fruit.

The research auditors

One interesting development is that a group of scientists is threatening to topple the impact factor, a metric that scores journals by their average citation counts and, by extension, judges studies by where they are published. This filter for quality research is based on journal prestige, but some scientists and startups are beginning to use alternative metrics in an effort to refocus attention on the science itself rather than the publishing venue.

Taking a cue from the internet, they are citing the number of clicks, downloads, and page views that the research gets as better measures of “impact.” One group leading that charge is the Reproducibility Initiative, an alliance that includes an open-access journal (the Public Library of Science’s PLOS ONE) and three startups (data repository Figshare, experiment marketplace Science Exchange, and reference manager Mendeley). The Initiative isn’t trying to solve fraud, says Mendeley’s head of academic outreach William Gunn. Rather, it wants to address the rest of the dodgy data iceberg: the selective reporting of data, the vague methods for performing experiments, and the culture that contributes to so many scientific studies being irreproducible.

The Initiative will leverage Science Exchange’s network of outside labs and contract research organizations to do what its name says: try to reproduce published scientific studies. Fifty studies are lined up for the first batch. The authors of these studies have opted in to the additional scrutiny, so there is a good chance much of their research will turn out to be solid.

Whatever the outcome, though, the Initiative wants to use this first test batch to show the scientific community and funders that this kind of exercise adds value despite the costs, estimated at $20,000 per study (roughly 10% of the original research price tag, depending on the study).

Gunn likens the process to a tax audit: not all studies can or should be tested for reproducibility, but the likely offenders may be among those that have high “impact factors,” much like high-income earners with many deductions warrant suspicion.

A stumbling block may be the researchers themselves, who like many successful people have egos to protect; no one wants to be branded “irreproducible.” The Initiative stresses that the replication effort is about setting a standard for what counts as a good method, and finding predictors of research quality that supersede journal, institution or individual.

The plumbers and librarians of big data

While the Reproducibility Initiative is trying to accelerate science’s natural self-correction process, another nascent group is working on improving the plumbing that serves data. The Research Data Alliance (RDA), which is partially funded by the National Science Foundation, is barely a few months old, but it is already uniting global researchers who are passionate about improving infrastructure for data-driven innovation. “The superwoman of supercomputing” Francine Berman, a professor at Rensselaer Polytechnic Institute, heads up the U.S. division of RDA.

The RDA is structured like the World Wide Web Consortium, with working groups that produce code, policies for data interoperability, and data infrastructure solutions. As yet there is no working group for data integrity, but it falls within the RDA’s scope, says Berman. While the effort is still in its infancy, the broad goals would be to make the data behind a study accessible to more people, and to ensure it doesn’t simply disappear at some point because of, say, storage failures. Berman likens the moment to the Industrial Revolution, when a new social contract had to be created to guide how we do research and commerce.

The men who stare at data

You can build places for data to live and spot-check it once it’s published, but there are also things researchers can do earlier, while they’re “interrogating” the data. After all, says Berman, you’re careful around strangers in real life, so why jump into bed with your data before you’re familiar with it?

Visualization is one of the most effective ways of inspecting the quality of your data, and getting different views of its potential. Automated processing is fast, but it can also produce spurious results if you don’t sanity-check your data first with visual and statistical techniques.

Stanford University computer scientist Jeff Heer, who also co-founded the data munging startup Trifacta, says visualization can help spot errors or extreme values. It can also test the user’s domain expertise (do you know what you’re doing, and can you tell what a complete or faulty data set looks like?) and prior hypotheses about the data. “Skilled people are at the heart of the process of making sense of data,” says Heer. Someone with domain expertise who brings their memories and skills to the data can spot new insights. Context, in the form of metadata, is rich and omnipresent, Heer argues, as long as we’ve collected the right data the right way; it can aid interpretation and combat the determinism of blindly collected and reported data sets.
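As a minimal sketch of the kind of first-pass inspection described above (this example is illustrative, not drawn from any tool mentioned in this article), a researcher might flag extreme values with a simple interquartile-range test before handing a data set to an automated pipeline:

```python
# Flag extreme values with an interquartile-range (IQR) test,
# a common first-pass sanity check before automated analysis.
def flag_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered[n // 4]          # rough lower quartile
    q3 = ordered[(3 * n) // 4]    # rough upper quartile
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# A hypothetical measurement series with one suspicious entry.
readings = [9.8, 10.1, 9.9, 10.0, 10.2, 9.7, 10.3, 98.0]
print(flag_outliers(readings))  # [98.0]
```

A statistical check like this complements, rather than replaces, a visual pass such as a histogram or box plot: the numbers tell you something is off, and the picture often tells you why.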

The three-pronged approach — better auditing, preservation and visualization — will help steer science away from the iceberg of unreliable data.