OA Week 2017: Transparency and Reproducibility

By John Borghi and Daniella Lowenberg

Yesterday we talked about why researchers may have to make their data open; today let’s start talking about why they may want to.

Though some communities have been historically hesitant to do so, researchers appear to be increasingly willing to share their data. Open data even seems to be associated with a citation advantage, meaning that as datasets are accessed and reused, the researchers involved in the original work continue to receive credit. But open data is about more than just complying with mandates and increasing citation counts; it’s also about researchers showing their work.

From discussions about publication decisions to declarations that “most published research findings are false”, concerns about the integrity of the research process go back decades. Nowadays, it is not uncommon to see the term “reproducibility” applied to any effort aimed at addressing the misalignment between good research practices, namely those emphasizing transparency and methodological rigor, and academic reward systems, which generally emphasize the push to publish only the most positive and novel results. Addressing reproducibility means addressing a range of issues related to how research is conducted, published, and ultimately evaluated. But, while the path to reproducibility is a long one, open data represents a crucial step forward.

“While the path to reproducibility is a long one, open data represents a crucial step forward.”

One of the most popular targets of reproducibility-related efforts is p-hacking, a term that refers to the practice of applying different methodological and statistical techniques until non-significant results become significant. The practice of p-hacking is not always intentional, but appears to be quite common. Even putting aside some truly astonishing headlines, p-hacking has been cited as a major contributor to the reproducibility crisis in fields such as psychology and medicine.
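To make the mechanics concrete, here is a minimal sketch (not from the original post) of one common form of p-hacking, optional stopping: repeatedly checking for significance as data accumulate and stopping as soon as the test crosses p < .05. The simulation below draws both groups from the same distribution, so every “significant” result is a false positive; the numbers and sample sizes are illustrative assumptions.

```python
import math
import random

random.seed(42)

def z_stat(a, b):
    """Two-sample z statistic, assuming known unit variance."""
    n = len(a)
    return (sum(a) / n - sum(b) / n) / math.sqrt(2.0 / n)

def run_study(p_hack):
    """Return True if the study (falsely) reports a significant effect.

    Both groups are drawn from the SAME distribution, so there is no
    real effect to find. The honest analyst looks once, at n = 100 per
    group; the p-hacker peeks after every 20 participants and stops as
    soon as the test is significant.
    """
    a, b = [], []
    looks = [20, 40, 60, 80, 100] if p_hack else [100]
    for n in looks:
        while len(a) < n:
            a.append(random.gauss(0, 1))
            b.append(random.gauss(0, 1))
        if abs(z_stat(a, b)) > 1.96:  # nominal p < .05
            return True
    return False

n_sims = 2000
honest = sum(run_study(False) for _ in range(n_sims)) / n_sims
hacked = sum(run_study(True) for _ in range(n_sims)) / n_sims
print(f"false-positive rate, single planned analysis: {honest:.3f}")
print(f"false-positive rate, with repeated peeking:   {hacked:.3f}")
```

Under these assumptions the single planned analysis produces false positives at roughly the nominal 5% rate, while peeking five times pushes the rate well above it, even though each individual look applies the same “p < .05” rule.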

One application of open data is sharing the datasets, documentation, and other materials needed to reproduce the results described in a journal article, thus allowing other researchers (including peer reviewers) to check for errors and ensure that the conclusions discussed in the paper are supported by the underlying data and methods. This type of validation doesn’t necessarily prevent p-hacking, but it does increase the degree to which researchers are accountable for explaining marginally significant results.

But the impact of open data on reproducibility goes far beyond just combatting p-hacking. Publication biases also distort the literature. The file drawer problem refers to the tendency of researchers to publish papers describing studies that yielded positive results while relegating studies with negative or nonconfirmatory results to the proverbial file drawer. Along with problems related to small sample sizes, this tendency substantially skews the effects described in the scientific literature. Open data provides a means for opening the file drawer, allowing researchers to share all of their results, even those that are negative or nonconfirmatory.

“Open data provides a means for opening the file drawer, allowing researchers to share all of their results, even those that are negative or nonconfirmatory.”

Open data is about researchers showing their work, being transparent about how they reach their conclusions, and providing their data for others to use and evaluate. This allows for validation and helps combat common but questionable research practices like p-hacking. But open data also helps advance reproducibility efforts in a way that is less confrontational, by allowing researchers to open the file drawer and share (and get credit for) all of their work.