Finding markers of non-reproducible findings by the analysis of p-values (part 1)

How machine learning-based prediction of strong or weak significant p-values can help us improve current metrics or identify subfields that might need higher publication standards. Part 1: introduction and exploratory analysis

Daniel Cañueto
9 min read · Jan 10, 2019

Introduction

Reproducibility of study findings is one of the biggest challenges in scientific research. In a 2016 Nature survey, 90% of researchers agreed that science faces a reproducibility problem (Baker 2016). According to the same survey, more than 70% of scientists have failed to reproduce the results of another researcher's experiments, and more than half have failed to reproduce some of their own.

Post-peer-review mechanisms (e.g., citation count, retraction, replication) help correct non-reproducible findings. But this process can take years, and a retraction receives far less public attention than the original publication. Meanwhile, an enormous amount of public funds is wasted on flawed hypotheses (Macleod et al. 2014). High-profile retractions have also damaged the public reputation of science, and the resulting concerns about research integrity give room to non-scientific narratives.

To respond to these issues, a new field called meta-research aims to study how scientific research is performed (Ioannidis et al. 2015). In addition, manifestos and projects have been developed to promote good research practices and improve reproducibility (Munafò et al. 2017, Open Science Collaboration 2015, Benjamin et al. 2018).

Unfortunately, “publish or perish” incentives challenge the success of these initiatives (Liebowitz 2015). Novel findings maximise the odds of publication in a high-impact-factor journal, so career incentives push researchers to produce as many novel findings as possible. This output can be maximised through practices such as data dredging or the misuse of hypothesis testing, practices that generate significant p-values associated with non-reproducible findings.

This generation does not need to be conscious. During a study, researchers might try several accepted approaches to improve the quality of the data or of the hypothesis testing, and this evaluation of alternatives can generate significant p-values by chance. Multiple-testing corrections help minimise such random significant p-values, but they are applied only to the final set of reported p-values; they cannot account for every test evaluated along the study workflow.
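To make this limitation concrete, here is a minimal sketch in R (the language of the tidypvals package used later): a standard correction such as Benjamini–Hochberg only adjusts the p-values that reach the final report, not the tests that were tried and discarded along the way. The numbers are hypothetical.

```r
# Hypothetical final p-values reported in a study
reported_p <- c(0.004, 0.021, 0.038, 0.047)

# A multiple-testing correction (here Benjamini-Hochberg) only adjusts
# this final, reported set
p.adjust(reported_p, method = "BH")

# Any exploratory tests run and discarded earlier in the workflow never
# enter this call, so their contribution to false positives stays uncorrected.
```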

Different stages where non-reproducible significant results can be generated. From Munafò 2015.

To reduce the generation of non-reproducible findings, it is necessary to create sticks that complement the current carrots. Promising approaches are already being developed and applied: for example, the analysis of inconsistent statistics helped identify repeated fraud from a top Cornell food research group. However, these approaches identify individual cases rather than general tendencies that a field should correct. For example, in some research fields the data has higher inherent variability; intuitively, these fields should apply stricter protocol standards or lower p-value cutoffs to achieve the same reproducibility as other fields.

Exploring the potential reproducibility of findings by analysing the strength of their p-values

This blog post introduces a new approach explored in a paper in progress that will be public very soon. The approach analyses the strength of significant p-values to enhance the reproducibility of research. Previous work has shown that the strength of the p-value is one of the best predictors of whether a finding can be reproduced (Open Science Collaboration 2015): the weaker a significant p-value, the more likely the significant result is non-reproducible:

Source: http://statisticsbyjim.com/hypothesis-testing/reproducibility-p-values/

In a previous blog post, I explained how it is possible to collect a dataset of millions of p-values. Now, imagine that we can also collect information from the articles that are the source of those p-values. Then we should be able to identify factors associated with a higher generation of weak significant p-values (and, hence, with lower reproducibility).
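As a reference, the kind of collection described in that post can be loaded with Jeff Leek's tidypvals R package (see References). A minimal sketch, assuming the combined dataset is exposed as `allp` with a numeric `pvalue` column:

```r
# install.packages("devtools"); devtools::install_github("jtleek/tidypvals")
library(tidypvals)

data(allp)             # combined table of published p-values (assumed name)
nrow(allp)             # on the order of millions of rows
summary(allp$pvalue)   # quick sanity check of the p-value column
```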

To better understand this concept, it is useful to first look at the distribution of this collection of millions of p-values. The distribution has this shape:

The density plot shows several peaks corresponding to the typical p-value roundings (e.g., 0.0001, 0.001, 0.01, 0.05), with the highest peak at the p<0.05 cutoff. Intuitively, incentives motivate choosing the rounding that shows the greatest p-value strength (if your p-value is 0.008, you round to p<0.01, not to p<0.05), so most p-values reported as p<0.05 will lie between 0.01 and 0.05. This assumption lets us classify p-values by their strength and, as a result, analyze which factors promote a higher proportion of weak (i.e., between 0.01 and 0.05) significant p-values.
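A minimal sketch of this classification, continuing from the tidypvals sketch above (dplyr and ggplot2 are assumed, as is the `pvalue` column name):

```r
library(dplyr)
library(ggplot2)

# Keep significant p-values and label them by strength, following the
# rounding heuristic described above
sig <- allp %>%
  filter(pvalue > 0, pvalue <= 0.05) %>%
  mutate(strength = ifelse(pvalue >= 0.01, "weak (0.01-0.05)", "strong (<0.01)"))

# Overall proportion of weak significant p-values
mean(sig$strength == "weak (0.01-0.05)")

# Density of the full distribution on a log scale; the rounding peaks
# (0.0001, 0.001, 0.01, 0.05) appear as spikes
ggplot(filter(allp, pvalue > 0), aes(x = pvalue)) +
  geom_density() +
  scale_x_log10()
```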

Next, to further strengthen this heuristic:

lower proportion of weak significant p-values ↔ lower proportion of non-reproducible findings

we will analyze the association between the citation count of articles and the proportion of weak significant p-values.

The citation count of an article is a metric that tries to capture its research quality. Intuitively, maximizing the research quality of a study means maximizing the number of reproducible significant findings, so citation count should be correlated with the strength of p-values. For example, here are, for several -omics fields, the density plots of the p-value distribution when grouping articles by quartiles of citation count:

The x-axis shows the p-value, and the highest peak corresponds to the 0.05 cutoff. The association between a higher proportion of 0.01–0.05 p-values and a lower citation quartile is evident: in every -omics field, the lowest quartile (red trace) has the highest proportion of 0.01–0.05 p-values, while the highest quartile (purple trace) has the highest proportion of stronger p-values (e.g., <0.01 or <0.001).
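A sketch of how such a comparison can be built, assuming a hypothetical `citations` column has already been joined onto the p-value table from the collected article metadata (citation counts are not part of tidypvals):

```r
# Split articles into citation-count quartiles and compare the p-value
# distributions of their significant results
sig_cited <- sig %>%
  filter(!is.na(citations)) %>%
  mutate(citation_quartile = factor(ntile(citations, 4)))

ggplot(sig_cited, aes(x = pvalue, colour = citation_quartile)) +
  geom_density() +
  scale_x_log10() +
  labs(colour = "citation quartile")

# Proportion of weak significant p-values per quartile
sig_cited %>%
  group_by(citation_quartile) %>%
  summarise(prop_weak = mean(strength == "weak (0.01-0.05)"))
```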

Now that we have further intuition about how the strength of p-values matches previous heuristics, we can start analyzing how some factors influence the proportion of weak significant p-values. For example, this is the density plot of the distribution of p-values depending on the -omics field:

The proportion of p<0.05 (i.e., weak, 0.01–0.05) p-values follows this ascending order:

Genomics < Transcriptomics < Metabolomics < Proteomics
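A sketch of this by-field comparison, assuming a `field` column holding the -omics label of each p-value (in my case this label comes from the collected article metadata); the small helper is reused below for the species/kingdom comparison:

```r
# Proportion of weak significant p-values per level of a grouping factor
prop_weak_by <- function(df, group_col) {
  df %>%
    group_by({{ group_col }}) %>%
    summarise(prop_weak = mean(strength == "weak (0.01-0.05)"), n = n()) %>%
    arrange(prop_weak)
}

prop_weak_by(sig, field)   # expected order: genomics < ... < proteomics

ggplot(sig, aes(x = pvalue, colour = field)) +
  geom_density() +
  scale_x_log10()
```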

Readers with a bioscience background will probably have guessed the most likely reason for this order. If not, I’d bet it is in this figure:

Source: http://biotech.nature.com

Generally, the further a data type is from the original material (the DNA), the more variability the phenotype adds to the data and the more likely this variability produces non-reproducible significant results. The maturity of each field and the number of researchers working in it may also play a role: the more established a field and the more practitioners it has, the more developed its standards. Emerging fields such as metabolomics are still developing those standards, so the accidental generation of significant p-values by suboptimal protocols should be higher there.

The correlation between less established standards and a higher proportion of p<0.05 p-values might also explain the next figure, which shows the effect of the species/kingdom analyzed in -omics studies:

Humans are the most studied species; they therefore have the most developed standards, and it makes sense that they show the lowest proportion of weak significant p-values. Other animals and bacteria are also well studied, so they too show a reasonable generation of weak significant p-values. In contrast, studies with plants (purple trace) show an excessive generation of p<0.05 p-values. I think this result is largely explained by the lack of plant-based models for the study of humans: most animal and bacterial studies are performed in models of human biology (such as rats or cell lines), with deeply established protocol standards.
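The same helper applies to the species/kingdom comparison, assuming a hypothetical `kingdom` column extracted from the article metadata (not part of tidypvals):

```r
prop_weak_by(sig, kingdom)

ggplot(sig, aes(x = pvalue, colour = kingdom)) +
  geom_density() +
  scale_x_log10() +
  labs(colour = "species / kingdom")
```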

Discussion and next steps

I have shown that the proportion of weak (i.e., 0.01–0.05) significant p-values can vary depending on a factor (e.g., the research field, the species/kingdom analyzed). What are the implications of this insight?

First, should we have the same p-value cutoff for all studies when some fields are associated with a higher chance of generating non-reproducible insights? Should we not enforce a lower p-value cutoff for these studies until approaches such as stricter protocols remove this association?

This path is where our analysis seems most promising. We might obtain a new metric that helps us assess reproducibility in a research field or subfield, focus our efforts on those subfields where improvement is most needed, and finally have a stick that motivates improving reproducibility and complements the current carrots.

Second, if we identify factors associated with the strength of the p-value, these factors might enrich the information about research quality that citation count provides. For example, we might build new metrics that combine several features; to test their quality, we could check whether they improve the prediction of weak significant p-values.

In part 2, I’ll show how several factors (the species/kingdom analyzed, the article’s year of publication, the authors’ country of affiliation, the -omics field studied) can improve the classification of a significant p-value as strong or weak, using a machine learning (ML) approach. I’ll also use explainable ML techniques (e.g., ALE plots, feature removal, feature importance) to understand how each factor influences the prediction. Finally, you have probably spotted some of the current limitations of the approach (e.g., potential confounders, differences in sample size, p-value corrections, p-value roundings); these will be discussed along with some proposals to turn this exploratory analysis into a more implementable one.
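As a preview of that setup, here is a minimal sketch of the classification task using a plain logistic regression as a stand-in (part 2 uses its own model, features and explainability analysis); `kingdom`, `country` and `year` are assumed metadata columns:

```r
# Predict whether a significant p-value is weak from article-level factors
model_data <- sig %>%
  mutate(is_weak = as.integer(strength == "weak (0.01-0.05)"))

fit <- glm(is_weak ~ field + kingdom + country + year,
           data = model_data, family = binomial())
summary(fit)
```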

References

Baker, Monya. 2016. “1,500 Scientists Lift the Lid on Reproducibility.” Nature 533 (7604): 452–54. https://www.nature.com/news/1-500-scientists-lift-the-lid-on-reproducibility-1.19970

Benjamin, Daniel J., James O. Berger, Magnus Johannesson, Brian A. Nosek, E-J Wagenmakers, Richard Berk, Kenneth A. Bollen, et al. 2018. “Redefine Statistical Significance.” Nature Human Behaviour 2 (1): 6–10. https://www.nature.com/articles/s41562-017-0189-z

Ioannidis, John P. A., Daniele Fanelli, Debbie Drake Dunne, and Steven N. Goodman. 2015. “Meta-Research: Evaluation and Improvement of Research Methods and Practices.” PLoS Biology 13 (10): e1002264. https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002264

Leek, Jeff. tidypvals: This is a package with published p-values from the medical literature in tidied form. R package version 0.1.0. https://github.com/jtleek/tidypvals

Liebowitz, Jay. 2015. A Guide to Publishing for Academics: Inside the Publish or Perish Phenomenon. CRC Press. https://www.crcpress.com/A-Guide-to-Publishing-for-Academics-Inside-the-Publish-or-Perish-Phenomenon/Liebowitz/p/book/9781482256260

Macleod, Malcolm R., Susan Michie, Ian Roberts, Ulrich Dirnagl, Iain Chalmers, John P. A. Ioannidis, Rustam Al-Shahi Salman, An-Wen Chan, and Paul Glasziou. 2014. “Biomedical Research: Increasing Value, Reducing Waste.” The Lancet 383 (9912): 101–4. https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(13)62329-6/fulltext

Munafò, Marcus R., Brian A. Nosek, Dorothy V. M. Bishop, Katherine S. Button, Christopher D. Chambers, Nathalie Percie du Sert, Uri Simonsohn, Eric-Jan Wagenmakers, Jennifer J. Ware, and John P. A. Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (1). Nature Publishing Group: s41562–016–0021. https://www.nature.com/articles/s41562-016-0021

Open Science Collaboration. 2015. “PSYCHOLOGY. Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716. http://science.sciencemag.org/content/349/6251/aac4716
