Analysis of p-hacking in metabolomics data analysis

Daniel Cañueto
10 min read · Aug 10, 2017


(Thanks to @BiswapriyaMisra for his contributions, suggestions and English grammar corrections. And to @metaboknight for his massacres ;) )

Introduction

Replication or reproduction of findings remains one of the biggest challenges in science. A 2016 survey by Nature (http://www.nature.com/news/reality-check-on-reproducibility-1.19961) showed that two-thirds of researchers are concerned about the reproducibility of science. A lack of replication can damage a field’s reputation and integrity. In cognitive psychology, for example, failures to replicate led to the creation of a consortium to validate previously published findings that had not been replicated in later studies (Open Science Collaboration 2015). There is currently a very interesting discussion about lowering the p-value threshold to 0.005 to improve the robustness of findings (Benjamin et al. 2017), at the expense of more false negatives, especially when the larger sample sizes required are not achievable by low-budget research groups.

Career incentives push researchers to produce novel results that end up published in high-impact-factor journals. To fulfil this objective, the data in a project are iteratively massaged and reanalysed until the most interesting set of p-values is reached. Once this set is achieved, the hypotheses and methods of the original project may be modified to align with these p-values. This practice, called p-hacking (Head et al. 2015), conceals all the statistical tests performed along the way.

The problem with p-hacking is that it tends to shape the data in whatever way produces the largest number of false positives, exploiting extraneous factors that will not reappear in future studies. Although p-value adjustment methods exist to minimize false positives caused by multiple testing, they are applied only to the final data, so the adjustment cannot account for all the statistical tests applied beforehand. Several approaches have been proposed to mitigate this problem, e.g. the pre-registration of studies to avoid post-hoc hypothesis modification. Pre-registration does not mean renouncing the creation of new hypotheses; it means acknowledging the non-significant initial hypothesis so that it can be taken into account in later statistical tests.
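As a toy illustration (simulated numbers, not data from any real study), a correction such as Benjamini-Hochberg only ever sees the p-values it is handed, so any exploratory tests discarded along the way are invisible to it:

```r
# Toy sketch with simulated values: p.adjust() only corrects the p-values
# it is given, not the exploratory tests run (and discarded) beforehand.
set.seed(1)
reported_p <- runif(10, 0, 0.05)          # the p-values that reach the paper
adjusted_p <- p.adjust(reported_p, method = "BH")
round(adjusted_p, 3)
# Any earlier, unreported tests never enter this correction, so the
# false-positive control is computed on an incomplete record.
```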

Metabolomics, the study of small molecules in cells, tissues, organs and biofluids from organisms or environmental samples, is an emerging field for studying biological processes. Its emergence, however, means there is still little standardisation across studies and dataset-processing workflows. This lack of standardisation is further complicated by the tight connection of the metabolome to the phenotype. This connection can result in higher inter- and intra-individual concentration variability than in other -omics. Furthermore, the reliability of the main analytical techniques for metabolite quantification, i.e. mass spectrometry (MS) and nuclear magnetic resonance (NMR) spectroscopy, is challenged by limitations linked to this connection (e.g. signal misalignment caused by pH variability in NMR spectra; matrix effects and contamination in MS data). The resulting variability offers more opportunities to obtain false positives when the data are steered in that direction. Finally, the limited use and standardisation of workflows in several metabolomics subfields makes it possible to perform the data massaging needed to maximize p-value production without raising concerns (Simmons, Nelson, and Simonsohn 2011).

Methods

An interesting way to evaluate how much these bad practices contribute to false significant results is to analyse the distribution of the p-values themselves. Modifying a dataset until it yields the most significant set of p-values implies that most of the resulting false positives have small effect sizes, so they tend to sit close to the 0.05 limit. Recently, the R package ‘tidypvals’ was developed and made available to the scientific community: it gathers around 2.5 million p-values extracted from the literature across 25 different fields in a tidy format suited to cross-study analysis (Leek 2017).
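For readers who want to reproduce this kind of plot, here is a minimal sketch of loading the package. The installation route and the dataset/column names (‘allp’, ‘pvalue’, ‘field’) come from my reading of the package documentation, so treat them as assumptions rather than gospel:

```r
# Minimal sketch: load the tidypvals data and look at the p-value
# distribution per field. Dataset and column names are assumed from the
# package documentation and may differ between versions.
# devtools::install_github("jtleek/tidypvals")
library(tidypvals)
library(ggplot2)

data(allp)                                   # ~2.5 million published p-values
ggplot(allp, aes(x = pvalue)) +
  geom_histogram(binwidth = 0.005, boundary = 0) +
  coord_cartesian(xlim = c(0, 0.1)) +
  facet_wrap(~ field, scales = "free_y") +
  labs(x = "Reported p-value", y = "Count")
```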

Ideally, the distribution of p-values should be unimodal, as in a typical p-curve (Simonsohn, Nelson, and Simmons 2014; Bruns and Ioannidis 2016), like the one seen in the economics results. Instead, a bimodal distribution of “weak” and “strong” p-values is observed for most fields, with one of the modes just below 0.05. The proportion of values just below 0.05 gives a good intuition of the extent of the p-hacking problem.
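One crude way to quantify that intuition (again assuming the ‘pvalue’ and ‘field’ columns mentioned above) is the share of sub-0.05 p-values that sit in the 0.04–0.05 band:

```r
# Rough indicator per field: among p-values below 0.05, which fraction
# sits just under the threshold (0.04-0.05)? Column names are assumed
# from the tidypvals documentation.
library(dplyr)

allp %>%
  filter(pvalue > 0, pvalue < 0.05) %>%
  group_by(field) %>%
  summarise(share_just_below_05 = mean(pvalue > 0.04), n = n()) %>%
  arrange(desc(share_just_below_05))
```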

The ‘tidypvals’ dataset links each p-value to the PubMed identifier (PMID) of the article it was extracted from. This PMID can be used to link the dataset to metabolomics studies and to evaluate relevant factors in those studies (such as biofluid or platform). The results of a PubMed search for “metabolomics + keyword” (sorry for not trying “metabonomics” or other alternatives; life is too short and my to-do list is too long) can then be downloaded. The downloaded record of each study contains its PMID, which can be used to filter the ‘tidypvals’ dataset by the factor of interest. Density plots of these filtered p-values (corrected for the p-value sample size) can then be used to observe the possible influence of these factors on metabolomics statistical analysis.
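A minimal sketch of this linking step follows. It assumes the ‘rentrez’ package for the PubMed query and a ‘pmid’ column in the tidypvals data; the search term is only an example, not the exact query behind the figures in this post:

```r
# Sketch of the PMID linking step. Search term and column names are
# illustrative; adapt them to the factor of interest (biofluid, platform...).
library(rentrez)
library(dplyr)
library(ggplot2)

search <- entrez_search(db = "pubmed",
                        term = "metabolomics AND urine",
                        retmax = 5000)
urine_pmids <- search$ids

metab_urine <- allp %>% filter(pmid %in% urine_pmids)

ggplot(metab_urine, aes(x = pvalue)) +
  geom_density() +
  coord_cartesian(xlim = c(0, 0.1)) +
  labs(x = "Reported p-value", y = "Density")
```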

Now, before beginning the analysis, a quick glance at the influence of the kind of operator (‘less than’, ‘equals’, ‘greater than’) used to report the statistical test:

Well, using an operator other than ‘equals’ can be conceptually correct, but it is usually only applied when necessary. If you see a sudden change of operator in a ‘Results’ section, beware.
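If you want to check this yourself, a one-liner over the assumed ‘operator’ column gives the counts:

```r
# Quick tabulation of reporting operators ("<", "=", ">"), assuming the
# operator column described in the tidypvals documentation.
table(allp$operator, useNA = "ifany")
```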

Results

First of all: how does metabolomics compare with the other -omics?

Genomics seems clearly less affected by p-hacking, possibly because the phenotype introduces little variability there. As @BiswapriyaMisra points out, it is strange that transcriptomics shows much more p-hacking than genomics when the “raw material” is the same (nucleotides differentiated by four possible nitrogenous bases). I would guess that the reversed trend of the two peaks between genomics and transcriptomics hints at the influence the phenotype already exerts on the selectivity of DNA transcription, and at the sheer dynamic and functional nature of the transcriptome.

Metabolomics shows a p-value distribution similar to those of transcriptomics and proteomics. Nevertheless, in all these -omics the peak of weak p-values is much higher than the peak of strong p-values. This is a very worrisome trend, as it would place all these -omics among the most “p-hacking suspicious” fields in the ‘tidypvals’ plot shown above.

And now, on to the analysis of metabolomics-related factors:

  • By year: there may be a worrying shift in the density of weak p-values between articles published before and after 2013. A possible explanation is that the “low-hanging fruit”, the phenomena most likely to show significant effects in metabolomic studies, was mostly tackled before 2013. Newer studies look for more indirect influences on the metabolome, which increases the share of small or false effects found.
  • By species: interestingly, ‘human’ metabolomics articles show the lowest presence of weak p-values. In contrast, the other subsets show a higher peak of weak p-values than of strong p-values, especially in plants.

An intuitive explanation for this difference is that the ‘animal’, ‘plant’ and ‘bacteria’ categories actually pool studies across a large number of species. The existing literature on study workflows for each of these species is smaller than the one for humans. Consequently, workflow standardisation for these species is much lower and there is much more room for manoeuvre during the analysis. As a result, it becomes easier to introduce variability that causes false positives, or to process the dataset in a way that helps find significant p-values without raising concerns.

@BiswapriyaMisra adds that metabolomics analysis in plants is far more complex (Aksenov et al. 2017). This higher complexity makes it even harder to standardize procedures.

  • By biofluid: there seems to be a higher presence of strong p-values in urine. This may make sense, as urine has properties such as high dilution and metabolite concentration variability that can favour the skewed distributions which produce strong p-values.
  • By chromatography: liquid chromatography seems to yield more weak p-values than gas chromatography. @BiswapriyaMisra suggests a possible influence of ESI/EI differences and of the smaller magnitude of column and matrix effects.
  • By platform: speaking of MS vs NMR… Interestingly, subsetting by platform gives the most similar density plots. Sorry, platform fanatics: no new input to reinforce your prejudices.
  • By “approach” type: targeted approaches give a less clearly bimodal distribution. Intuitively, this makes sense: the lower number of features analysed gives fewer chances to find strong effects. Nonetheless, weak p-values are still more common than strong ones in targeted approaches.

Bonus track

I have also calculated the average number of reported p-values per study, broken down by each of the factors above:
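A sketch of how this summary can be computed, assuming the ‘pmid’ and ‘operator’ columns and a metabolomics PMID vector obtained as in the Methods section (the grouping factor here is just an example):

```r
# Average number of reported p-values per study (PMID). 'metabolomics_pmids'
# stands for the PMID vector from the broader PubMed query described in the
# Methods; grouping by operator is just one example of a factor.
library(dplyr)

metab <- allp %>% filter(pmid %in% metabolomics_pmids)

metab %>%
  group_by(operator, pmid) %>%
  summarise(n_pvalues = n(), .groups = "drop") %>%
  group_by(operator) %>%
  summarise(mean_pvalues_per_study = mean(n_pvalues),
            n_studies = n())
```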

Some quick notes:

  • I find it fascinating that far fewer p-values are reported with the ‘greater than’ operator than with the ‘less than’ operator. Any plausible theories?
  • Older studies reported fewer p-values than newer ones. This is a plausible indicator of progress in metabolite quantification, which provides more features in which to look for significant differences.
  • Plasma studies report more p-values than urine studies, even though urine offers more features to analyse! (Admittedly, I am biased by my experience with urine.) A possible reason is that the high concentration variability and analytical complexity of urine are a source of variance, and of a thinner existing literature, that hampers both quantification and the statistical analysis of differences between groups.
  • Plant studies yield fewer p-values per study than animal ones, despite their much higher complexity. My initial intuition is that there is less knowledge of how to properly annotate and quantify the compounds present. @BiswapriyaMisra adds that generic metabolome extraction protocols may not capture plant-specific compounds, so extraction may need further specialization.
  • It is counterintuitive that targeted approaches report a similar number of p-values to untargeted ones. Maybe the published targeted studies derive from exploratory untargeted ones, so the metabolites quantified in targeted approaches are “safe” shots.

Conclusion

A possible increase in non-reproducible or non-replicable results poses a threat to the emergence of metabolomics. Nevertheless, minimizing p-hacking will not come from researchers’ goodwill alone: for most of us, the stakes are sometimes too high to sacrifice. Preventive measures (such as the pre-registration of studies, workflow standardization, the promotion of study dataset repositories, or a reevaluation of what “significant” means; Colquhoun 2017) are required to keep us from falling into temptation. To promote these measures, however, it is first necessary to provide an intuition of the scope of the problem.

I hope this post contributes to a more substantive evaluation of the p-hacking problem in the metabolomics field as well. The density plots show a clearly bimodal distribution in which the peak of “weak” p-values is higher than the peak of “strong” p-values. Comparing the metabolomics p-value distribution with those of other scientific fields reveals a worrying behaviour that only grows in newer studies. This post also points to some areas of metabolomics where extra care should be taken when performing or evaluating studies. Personally, I would give less weight to any unadjusted p-value that falls within the weak p-value distribution.

In the future, I would like to study other factors, such as the number of citations or the impact factor of the journal where the article was published. I would also like to expand the analysis to other biofluids or techniques, or to the effect of “hype”: how the lack of standardisation in new applications of metabolomics affects the p-values obtained. Nonetheless, a proper sample size and exploratory analysis would be needed to avoid confounders. In any case, I gratefully accept suggestions and contributions to expand the insights and correct the inaccuracies of this analysis.

Conflicts of Interest

I am against p-hacking.

Further reading

Aksenov, Alexander A., Ricardo da Silva, Rob Knight, Norberto P. Lopes, and Pieter C. Dorrestein. 2017. “Global Chemical Analysis of Biology by Mass Spectrometry.” Nature Reviews Chemistry 1 (7): 0054.

Benjamin, Daniel J., James Berger, Magnus Johannesson, Brian A. Nosek, Eric-Jan Wagenmakers, Richard Berk, Kenneth Bollen, et al. 2017. “Redefine Statistical Significance.” PsyArXiv. July 22.

Bruns, Stephan B., and John P. A. Ioannidis. 2016. “P-Curve and P-Hacking in Observational Research.” PloS One 11 (2): e0149144.

Colquhoun, David. 2017. “The Reproducibility Of Research And The Misinterpretation Of P Values.” doi:10.1101/144337.

Head, Megan L., Luke Holman, Rob Lanfear, Andrew T. Kahn, and Michael D. Jennions. 2015. “The Extent and Consequences of P-Hacking in Science.” PLoS Biology 13 (3): e1002106.

Leek, Jeff. 2017. tidypvals: Published P-Values from the Medical Literature in Tidied Form. R package version 0.1.0.

Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” Science 349 (6251): aac4716.

Simmons, Joseph P., Leif D. Nelson, and Uri Simonsohn. 2011. “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” Psychological Science 22 (11): 1359–66.

Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons. 2014. “P-Curve: A Key to the File-Drawer.” Journal of Experimental Psychology. General 143 (2): 534–47.
