Demystifying the p-value

Dr. Dennis Robert MBBS, MMST
Published in The Startup
17 min read · Jun 26, 2020

Does everything have to be decided by a p-value? Why have scientists been so vocal about the misuse of the p-value?

Abstract

Incorrect interpretation of the p-value, and an absolutely dichotomous reliance on a single threshold (usually 0.05) for arriving at major clinical and epidemiological decisions, is a dangerous practice with significant repercussions such as irreproducible research and weak evidence. This in turn puts patients' lives at risk. It should be remembered that statistical significance of a treatment or intervention effect does not always translate to a clinically meaningful effect. The misuse of p-values can lead to a lack of reproducibility of clinically meaningful treatment effects in the real world (actual clinical practice). The debate against this 'dichotomania' has probably reached an inflection point⁷. This has the potential to bring a paradigm shift in conducting and interpreting hypothesis testing, with significant changes to the way the biomedical research ecosystem reports and interprets quantitative research. In this article I explore the background and present status of the debates on the p-value, while also trying to demystify some common misunderstandings.

Credits: xkcd

Introduction

The most widely practised methodology for determining statistical significance during null hypothesis significance testing (NHST) is the p-value. For the uninitiated, NHST helps scientists extrapolate findings from sample data to the population (you always need to check whether your sample results are in fact generalizable!). The last step in a typical NHST is the determination of statistical significance: the idea is to determine whether your sample results are statistically significant, which in turn supports extrapolating them to the population level. Statistical significance is quantitatively determined by the p-value, which is why it is treated as a holy grail in research. Reporting of p-values is very common in the scientific literature. In an analysis¹ of PubMed papers published between 1990 and 2015, the authors reported that the use of p-values has increased over the last 25 years and that about 78.4% of papers in clinical journals reported p-values², higher than in most other disciplines. Note that reporting p-values in research publications is not a bad practice by any means; it is the misuse of the p-value for arriving at categorical or dichotomous conclusions that must be curtailed. Using the p-value alone to conclude whether there is an effect is a bad way to interpret results, and yet the practice is widely prevalent. One analysis of 791 research papers published in 5 different journals found that about half interpreted p-values incorrectly by assuming that non-significance means no effect³ (absence of evidence is not evidence of absence). Lack of understanding of what exactly the p-value is remains rampant in the medical fraternity⁴.

In an attempt to start putting an end to the dangerous practice of misusing p-values, drastic measures have been adopted by some. In February 2015, the journal 'Basic and Applied Social Psychology' released an editorial saying it would ban the use of null hypothesis significance testing procedures, including p-values⁵, but whether such an outright ban will improve the veracity of scientific and research output is very much open to debate. It cannot be overemphasized that it is the misinterpretation of p-values which is the crux of the whole problem, not p-values per se. In 2016 the American Statistical Association (ASA) released a 'Statement on Statistical Significance and P-values' intended to send a strong message against the misuse of p-values⁶. This was a landmark event in the field of hypothesis testing and paved the way for widespread research aimed at formulating robust guidelines for interpreting statistical significance, including giving more importance to measures beyond the p-value, such as effect sizes.

In March 2019, more than 800 scientists (statisticians, clinical researchers, biologists and psychologists) from across the globe unambiguously called for the entire concept of statistical significance to be abandoned⁷. The same month, the ASA published a special edition of The American Statistician titled 'Statistical inference in the 21st century: a world beyond p < 0.05', which includes more than 40 papers exploring methodologies and strategies for moving beyond p-values⁸. It could be argued that, after years of debate, we have now reached an inflection point that will likely pave the way for more robust interpretations of null hypothesis testing results.

So what exactly is this p-value?

Statistical definition of p-value: the p-value is the probability of obtaining a result at least as extreme as the observed result, given that the null hypothesis is true.

It is important to recognize that the notion of a threshold for the p-value, such as 0.05, is related to the probability of rejecting a null hypothesis when it is actually true (type 1 error). This probability is known as the level of significance, is usually denoted by the Greek letter α (alpha), and is always set before the study begins. The type 1 error rate is usually controlled at 5% (0.05), which is why a threshold of 0.05 is usually set for the p-value. If the p-value is less than the pre-set type 1 error rate (5%), the null hypothesis is said to be rejected at the 5% level of significance. Note that there are situations where you need to be more stringent about this threshold to reduce false positive rates. That is not the point of discussion here, but two keywords for those interested in reading more are 'multiple comparisons problem' and 'Bonferroni correction'.
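As a quick illustration of the Bonferroni idea just mentioned, here is a minimal sketch; the number of tests is made up for illustration:

```python
# A minimal sketch of Bonferroni correction (illustrative values).
alpha = 0.05   # overall (family-wise) type 1 error rate
n_tests = 10   # assumed number of hypothesis tests performed

# Bonferroni: each individual test uses a stricter threshold so that
# the chance of at least one false positive stays near alpha overall.
alpha_per_test = alpha / n_tests
print(f"Per-test significance threshold: {alpha_per_test}")  # 0.005
```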

Alright, back to the p-value. The exact details of how researchers compute a p-value during hypothesis testing differ depending on the statistical test and outcome measurements. But at a high level the steps are more or less the same, in that all null hypothesis significance testing effectively reduces the statistical comparison to a statistic: a single numerical value computed from the sample data. In NHST this is referred to as the test-statistic. Depending on the statistical method/test used, the name of the test-statistic varies: in the famous chi-square test it is the chi-square statistic; for the t-test it is the t-statistic. Nevertheless, the concept remains the same across all statistical tests: it is a mathematical quantity computed from your sample. Under the null hypothesis, the test-statistic follows a known sampling distribution (probability distribution) such as the normal (Gaussian) distribution, Student's t-distribution or a chi-square distribution. For the uninitiated, the reason we need to 'assume' that a particular test-statistic follows a particular sampling distribution is that this is what lets us make inferences about the population from the sample.

The whole procedure can be summarized in the following four high-level steps (a minimal code sketch follows the list):

Step 1: Set the level of significance (acceptable type 1 error rate)

Step 2: Formulate the null and alternate hypothesis

Step 3: Compute test-statistic from sample data by statistical tests

Step 4: Compute the p-value from the sampling probability distribution of test-statistic
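Here is a minimal sketch of the four steps using a one-sample t-test in Python. The data, the hypothesized mean of 100 and the alpha of 0.05 are all assumptions for illustration, not anything from a real study:

```python
import numpy as np
from scipy import stats

# Step 1: set the level of significance (acceptable type 1 error rate)
alpha = 0.05

# Step 2: formulate the hypotheses (statements, not code)
# H0: the population mean equals 100
# H1: the population mean differs from 100

# Hypothetical sample data, assumed for illustration
rng = np.random.default_rng(42)
sample = rng.normal(loc=103, scale=10, size=50)

# Step 3: compute the test-statistic (here, a one-sample t-statistic)
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

# Step 4: the p-value comes from the sampling distribution of the
# t-statistic under H0 (Student's t with n-1 degrees of freedom)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```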

The p-value is a cumulative probability, and it is straightforward to compute cumulative probabilities from a known probability distribution. As mentioned in the statistical definition, the 'obtained result' equates to a test-statistic, and the probability of obtaining a result at least as extreme as the obtained result is nothing but the conditional cumulative probability below:

p-value = P( |x| ≥ |test-statistic| | H₀ is true )

H₀ is the null hypothesis, |·| is the absolute value (modulus) function, and x is a representative variable for a hypothetical value of the test-statistic. The expression translates to 'the probability that |x| is greater than or equal to the absolute value of the obtained test-statistic, given that the null hypothesis is true', which is equivalent to the probability of obtaining a result at least as extreme as your obtained result.

The above probability can be easily computed if you know the distribution of your test-statistic, as discussed earlier. Depending on the test, this is a chi-square, normal or other known distribution. For example, if the test-statistic is assumed to follow a standard normal distribution, the level of significance (α) is set at 5% and the obtained test-statistic value is 2.5, then the resultant p-value is 0.0124, as shown in figure 1.

Figure 1: the standard normal distribution (mean 0, standard deviation 1). The possible values of the test-statistic are on the x-axis and the corresponding probability density is on the y-axis. The obtained test-statistic is denoted by Zo. Z(alpha/2) represents the threshold for the type 1 error rate: since alpha is 0.05 (the level of significance) and we are considering a two-sided hypothesis test, we take alpha/2 on each side so that the total equals alpha. The corresponding value of Z(alpha/2) in a standard normal distribution is 1.96, and the regions to the right of 1.96 and to the left of -1.96, shaded in pink, together form the critical region. The p-value is the total area to the left of -Zo plus the area to the right of Zo. If the test-statistic falls within the pink critical region, the p-value is less than the type 1 error rate, i.e., less than 0.05. Picture taken from an online hypothesis testing simulator⁹.
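The p-value of 0.0124 in figure 1 can be reproduced in a couple of lines. This is a minimal sketch assuming the standard normal distribution from the figure:

```python
from scipy.stats import norm

alpha = 0.05
z_obs = 2.5  # the obtained test-statistic from figure 1

# Two-sided critical value Z(alpha/2) for the standard normal
z_crit = norm.ppf(1 - alpha / 2)   # ~1.96

# p-value = area to the left of -|z_obs| plus area to the right of |z_obs|
p_value = 2 * (1 - norm.cdf(abs(z_obs)))
# equivalently: 2 * norm.sf(abs(z_obs))

print(f"critical value = {z_crit:.2f}")  # 1.96
print(f"p-value = {p_value:.4f}")        # 0.0124
```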

The critical region, shaded pink in the figure above, is the set of values that leads to rejection of the null hypothesis, and you can see that it depends directly on the pre-defined level of significance (Z(alpha/2)). The cumulative probability within the critical region equals the level of significance. Thus, if the test-statistic falls within the critical region, the probability of a value at least as extreme as the obtained value, which is nothing but the p-value, will be less than or equal to the level of significance (0.05 in the example). As the magnitude of the test-statistic increases, the p-value tends to decrease, as is evident from the figure. The important point is that all of this assumes the null hypothesis is true, as per the statistical definition of the p-value; that is why the p-value is a conditional cumulative probability. If your test-statistic does not fall within the critical region, it indicates that, given the current sample data, it is not possible to reject the null hypothesis at the pre-set type 1 error rate (level of significance). The chance of rejecting a null hypothesis increases as you increase the level of significance, but this leads to an increased type 1 error rate (false positives). Also, if there truly is a large effect and you have designed your study with utmost care to avoid large biases, then your test-statistic will tend to be large in magnitude, and the tail area beyond it, which is the p-value, will be correspondingly small.

Intuition behind the concept of p-value

Think about this intuitively. Let's say we have a drug named 'Drug A' that is in clinical trials evaluating its efficacy against the existing standard of care (SoC) for a hypothetical disease. Imagine that 'Drug A' is actually better than the existing standard of care, and that the null and alternate hypotheses are:

H₀: Drug A is no better or worse than SoC

H1: Drug A is better than SoC

Since we imagined a scenario in which Drug A is actually better, our clinical trial analysis should reject the null. Keeping this in mind, let's say the test-statistic we got from the trial analysis is 'x'. Intuitively, since we 'know' that Drug A actually works, the absolute value of this test-statistic should be on the larger side, landing deep inside the pink shaded region of the figure above. Without using any p-value, you might intuitively say that Drug A works because the test-statistic is quite large. But you would also want to know how likely it is that you got this test-statistic by random misfortune when the null is true. If the null were indeed true (i.e., Drug A is no better or worse than SoC), you would not want to reject it; doing so would be a type 1 error (a false positive result). So you want to know the probability of obtaining a result at least as extreme as x. Since 'x' is a rather large value, the tail area beyond it is pretty narrow, meaning this probability is small. This probability is nothing but the p-value. Let's say it is 0.0001. What this means is that even if your drug were not working (the null being true), the probability of obtaining a treatment effect as large as 'x' due to random misfortune (or God, as some may call it) is only 0.0001. Intuitively, then, it is highly unlikely that the effect you obtained by analysing your clinical trial data is due to random misfortune alone. Do note that interpreting the p-value purely in terms of this misfortune (random chance) is not good practice, although I find it acceptable for building intuition. I discuss this 'chance' a bit more later.
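The same intuition can be demonstrated by brute force: simulate a world where the null is true (Drug A and SoC have identical effects) and count how often chance alone produces a difference at least as extreme as the one observed. This is a minimal sketch with entirely made-up numbers; the group size, means, standard deviation and observed difference are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

n_per_arm = 100
observed_diff = 5.0   # assumed observed mean difference (made up)
n_sims = 20_000

# Under H0, both arms are draws from the same distribution
drug = rng.normal(loc=50, scale=15, size=(n_sims, n_per_arm))
soc = rng.normal(loc=50, scale=15, size=(n_sims, n_per_arm))
null_diffs = drug.mean(axis=1) - soc.mean(axis=1)

# Fraction of null-world trials at least as extreme as what we saw:
# this is (approximately) the two-sided p-value
p_sim = np.mean(np.abs(null_diffs) >= observed_diff)
print(f"simulated p-value ~ {p_sim:.4f}")
```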

Effect Size

It is also important to understand the term 'effect size' in this context. Jacob Cohen once famously said that 'the primary product of a research inquiry is one or more measures of effect size, not p-values'. In the context of an epidemiological study comparing the effectiveness of an intervention (e.g., treatment vs placebo), effect size is the magnitude of the treatment effect between groups. Absolute effect size is simply the absolute difference in effect between the groups and does not consider variance or sample size; hence, for valid comparisons, effect sizes are typically standardized by taking variance and sample size into account during calculation. Like the test-statistic, an effect size is a single numerical quantity, and a larger effect size means a larger difference between groups. Some commonly used effect size indices in epidemiology are Cohen's d, the odds ratio and the relative risk ratio¹⁰.

A larger effect size tends to yield smaller p-values, but this depends on sample size: with large samples it is possible to get p-values below 0.05 even when the effect size is small. A very important point here is that a small effect size does not usually translate to a clinically meaningful difference. This is one of the major reasons some of the blatant misuse of p-values has occurred. A widely cited case is the famed Physicians' Health Study, whose objective was to examine the effect of aspirin in preventing myocardial infarction (MI)¹⁰. The study¹¹ had about 22,000 subjects, and aspirin was associated with a reduction in MI over 5 years of use with a highly significant p-value of less than 0.00001. However, the risk difference, one of the commonly used effect sizes, was quite low (~0.77%), and such a small difference means there was arguably no clinically meaningful effect, while many people were put at risk of adverse events from aspirin.
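This can be mimicked with a standard two-proportion z-test. The counts below are not the actual study data, only hypothetical numbers chosen to be in the same ballpark (about a 0.77% risk difference in roughly 22,000 subjects):

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical counts, assumed for illustration: a tiny absolute
# risk difference, but a very large sample.
mi_aspirin, n_aspirin = 140, 11_000
mi_placebo, n_placebo = 225, 11_000

p1 = mi_aspirin / n_aspirin   # ~1.27% MI risk
p2 = mi_placebo / n_placebo   # ~2.05% MI risk
risk_difference = p2 - p1     # ~0.77% (the effect size)

# Standard two-proportion z-test using the pooled proportion
p_pool = (mi_aspirin + mi_placebo) / (n_aspirin + n_placebo)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_aspirin + 1 / n_placebo))
z = (p2 - p1) / se
p_value = 2 * norm.sf(abs(z))

print(f"risk difference = {risk_difference:.4f}")  # tiny effect
print(f"p-value = {p_value:.2e}")                  # yet p << 0.05
```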

Most of the 'low-hanging fruit' has already been picked, and it is becoming more and more difficult to discover and develop drugs with sizeable treatment effects. This is the classic 'better than the Beatles' problem (check out Eroom's law).

Now that some of the mathematical/statistical nuances of the p-value have been detailed, it makes sense to list a few important statements about it.

1. p-value is not the probability that your test result occurred by chance alone.

The p-value is a cumulative probability under the assumption that the null hypothesis is true. It only indicates the probability of obtaining a result at least as extreme as yours when the null hypothesis is true. That said, I feel it doesn't hurt much, for practical purposes, to think of the p-value as the probability that your test result occurred by chance alone, as long as you remember that this is not the precise scientific truth. As someone said, the devil is always in the details.

2. p-value does not mean that the chance of your test result being true is (1-p)%.

The p-value does not indicate anything about the test result being true (or false, for that matter). The p-value is defined only under the assumption that the null hypothesis is true, and all its interpretations are valid only within that assumption. Statements 1 and 2 are both captured by the following expression:

P(observation from data | hypothesis) ≠ P(hypothesis | observation from data)

This figure is taken from Wikipedia. Credit: User:Repapetilto & User:Chen-Pan Liao, File:P value.png, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=36661887
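A toy numeric illustration of why the two sides of this expression differ, with all probabilities assumed purely for illustration:

```python
# Toy numbers (all assumed) showing that P(data | hypothesis)
# differs from P(hypothesis | data).
p_h0 = 0.5                # assumed prior probability that H0 is true
p_data_given_h0 = 0.01    # how likely the data is if H0 is true
p_data_given_h1 = 0.30    # how likely the data is if H0 is false

# Bayes' theorem gives the reverse conditional probability
p_data = p_data_given_h0 * p_h0 + p_data_given_h1 * (1 - p_h0)
p_h0_given_data = p_data_given_h0 * p_h0 / p_data

print(f"P(data | H0) = {p_data_given_h0}")        # 0.01
print(f"P(H0 | data) = {p_h0_given_data:.3f}")    # ~0.032, not 0.01
```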

3. p-value of less than 0.05 is not indicative of clinically meaningful benefit or effect.

Again, the p-value is only a tool to guide rejection of the null hypothesis. It is not a measure of how strong the alternate hypothesis is, and it is certainly not indicative of clinically meaningful benefit or effect, which depends more on effect sizes. Rejecting the null hypothesis on the basis of a p-value is still possible even when there is no clinically meaningful effect.

4. p-value alone should not be used for arriving at conclusions.

Conclusions from hypothesis testing are only possible when there is transparency in the research methodology as well as the data. Transparency and clarity allow a holistic approach to analyzing study results by examining multiple factors such as bias, robustness of the statistical method, confounding, effect sizes, confidence intervals and p-values. Post-hoc sub-group analyses, if done, should be properly¹² reported¹³.

Two major suggestions proposed by the scientific community to tackle the misuse of p-values

I now briefly touch upon two suggestions put forward by the scientific community to prevent the misuse of p-values. This is by no means a comprehensive list; there are also suggestions such as using credible intervals (Bayesian statistics). Those interested in alternatives and/or 'add-ons' to p-values can find many pointers in the ASA statement⁶.

1. Reporting of effect sizes

Reporting effect sizes is probably the best-received suggestion for curtailing the misuse of p-values. The use of effect sizes in planning, analyzing, reporting and understanding research studies has long been advocated by prominent scientists such as Jacob Cohen¹⁰. An effect size indicates the size of the effect and, when used together with p-values, provides more comprehensive information about the research output to readers as well as policy makers¹⁴. Proposals of guidelines¹⁵ for calculating and reporting effect sizes in biomedical research have gained momentum in the last few years¹⁶.
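As a concrete example, Cohen's d for two independent groups is simply the difference in means divided by the pooled standard deviation. A minimal sketch with hypothetical data:

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d for two independent groups (pooled-SD version)."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = np.var(group1, ddof=1), np.var(group2, ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (np.mean(group1) - np.mean(group2)) / pooled_sd

# Hypothetical data, assumed for illustration
rng = np.random.default_rng(1)
treatment = rng.normal(loc=52, scale=10, size=80)
control = rng.normal(loc=48, scale=10, size=80)

# By Cohen's rough benchmarks, 0.2 is small, 0.5 medium, 0.8 large
print(f"Cohen's d = {cohens_d(treatment, control):.2f}")
```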

2. Redefine statistical significance for new discoveries with more stringent thresholds

Most biomedical fields accept a statistical significance threshold of 0.05. This has been shown to result in a lack of reproducibility due to false positive findings, and there is a proposal to redefine statistical significance as 0.005 for new discoveries, which could improve reproducibility¹⁷. Some other fields, such as particle physics, have already adopted much more stringent thresholds: one notable example is the discovery of the Higgs boson, where scientists used a very low threshold of about 0.0000003 (roughly 1 in 3.5 million) for ascertaining statistical significance¹⁸. Note that this redefinition can increase false negatives even as it improves the false positive rate; to balance that, sample sizes would have to be significantly increased, which may not be feasible. Also, this redefinition is not an alternative to NHST.
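To get a feel for these thresholds, the corresponding critical values and tail probabilities can be computed from the standard normal distribution (a minimal sketch):

```python
from scipy.stats import norm

# How the critical value moves as the threshold tightens
for alpha in (0.05, 0.005):
    z = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha}: two-sided critical value z = {z:.2f}")
# alpha = 0.05  -> z = 1.96
# alpha = 0.005 -> z = 2.81

# The 'five sigma' convention in particle physics corresponds to a
# one-sided tail probability of about 3e-7 (~1 in 3.5 million)
print(f"P(Z > 5) = {norm.sf(5):.2e}")   # ~2.87e-07
```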

Conclusion

We should not put all our hopes on the p-value, thinking that all results ultimately depend only on it. Intuitively, too, it does not make sense for significant decisions involving life and death (clinical trials, for example) to be based on a single number.

Misuse of the p-value is being scrutinized now more than ever. One of the major reasons for this misuse is the lack of proper understanding of the p-value. Epidemiologists, physician scientists and policy makers need to be cognizant of the important statistical concepts in null hypothesis significance testing and should never rely solely on p-values crossing a mere threshold to make scientific judgements. This matters all the more in an era of big data, which makes it much easier than ever to conduct very large observational studies and thus expands the scope for p-hacking (reference 13 is a great read about p-hacking).

Another reason for this obsession/dichotomania about the p-value lies in the attitude that prevails in our scientific community itself, something I have not discussed until now. I might be turning a bit judgmental here, as this is an opinion I arrived at myself through anecdotal personal experience. I have to admit that I myself have been guilty of many relentless, sometimes desperate pursuits of statistical significance, and I am absolutely sure I was not alone. I used to believe that not reaching statistical significance meant there was no hope for my research or project and that it was a lost cause. I now believe this attitude has to change. Of course, for this to change, the 'pressure to produce significant results' must also change; that is a whole different topic altogether. I must also admit that I am no longer involved in active academic research, so my 'frontline' experience is not substantial enough for conclusive remarks about the attitudes prevalent in academia. That being said, I believe it is very important that everyone learns to gracefully accept the possibility of uncertainty, failure and variation in scientific research. The recent rise against the dichotomous use of p-values will hopefully help spread this awareness.

References

1. Ioannidis, J. P. A. What Have We (Not) Learnt from Millions of Scientific Papers with P Values? Am. Stat. 73, 20–25 (2019).

2. Chavalarias, D., Wallach, J. D., Li, A. H. T. & Ioannidis, J. P. A. Evolution of Reporting P Values in the Biomedical Literature, 1990–2015. JAMA 315, 1141–1148 (2016).

3. Amrhein, V., Korner-Nievergelt, F. & Roth, T. The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research. PeerJ 5, e3544 (2017).

4. Badenes-Ribera, L., Frías-Navarro, D., Monterde-I-Bort, H. & Pascual-Soler, M. Interpretation of the p value: A national survey study in academic psychologists from Spain. Psicothema 27, 290–295 (2015).

5. Trafimow, D. & Marks, M. Editorial. Basic Appl. Soc. Psych. 37, 1–2 (2015).

6. Wasserstein, R. L. & Lazar, N. A. The ASA’s Statement on p-Values: Context, Process, and Purpose. Am. Stat. 70, 129–133 (2016).

7. Amrhein, V. Scientists rise up against statistical significance. Nature (2019). Available at: https://www.nature.com/articles/d41586-019-00857-9. (Accessed: 31st March 2019)

8. Statistical Inference in the 21st Century: A World Beyond p < 0.05. The American Statistician Volume 73 (2019). Available at: https://amstat.tandfonline.com/toc/utas20/current. (Accessed: 31st March 2019)

9. Hypothesis Test Simulator for Standard Normal Distribution. Available at: http://tananyag.geomatech.hu/m/32282. (Accessed: 31st March 2019)

10. Sullivan, G. M. & Feinn, R. Using Effect Size-or Why the P Value Is Not Enough. J. Grad. Med. Educ. 4, 279–282 (2012).

11. Bartolucci, A. A., Tendera, M. & Howard, G. Meta-Analysis of Multiple Primary Prevention Trials of Cardiovascular Events Using Aspirin. Am. J. Cardiol. 107, 1796–1801 (2011).

12. Wang, R., Lagakos, S. W., Ware, J. H., Hunter, D. J. & Drazen, J. M. Statistics in Medicine — Reporting of Subgroup Analyses in Clinical Trials. N. Engl. J. Med. 357, 2189–2194 (2007).

13. Head, M. L., Holman, L., Lanfear, R., Kahn, A. T. & Jennions, M. D. The extent and consequences of p-hacking in science. PLoS Biol. 13, e1002106–e1002106 (2015).

14. Lakens, D. Calculating and reporting effect sizes to facilitate cumulative science: a practical primer for t-tests and ANOVAs. Front. Psychol. 4, 863 (2013).

15. Kim, H.-Y. Statistical notes for clinical researchers: effect size. Restor. Dent. Endod. 40, 328–331 (2015).

16. McGough, J. J. & Faraone, S. V. Estimating the size of treatment effects: moving beyond p values. Psychiatry (Edgmont). 6, 21–29 (2009).

17. Benjamin, D. J. et al. Redefine statistical significance. Nat. Hum. Behav. 2, 6–10 (2018).

18. van Dyk, D. A. The Role of Statistics in the Discovery of a Higgs Boson. Annu. Rev. Stat. Its Appl. 1, 41–59 (2014).

Dr. Dennis Robert MBBS, MMST
Healthcare Data Science Professional, Physician (currently not practising). Alumnus of IIT Kharagpur & Medical College Kottayam. Khorana Scholar, AIPMT Top 150