After all, hypothesis testing is all about testing observed data.
Scientists have long relied on the p-value to support the academic arguments they make. Whether a medicine has a therapeutic effect, or whether a new education system improves the quality of education, has been inferred from statistical analyses that involve calculating a p-value. However, at least until recently, the p-value was widely misunderstood as a criterion that determines whether a hypothesis is true. This misinterpretation made both the analysis and the interpretation of study results easier. To publish a paper, all scientists had to know about statistics was how to use a statistical software package and a rule of thumb: a p-value less than 0.05 indicates a significant effect (of the medicine, or of the new education system), whereas a p-value greater than 0.05 indicates no significant effect. The p-value has recently drawn renewed attention because of the replication crisis in psychology. Brian Nosek and his colleagues (Open Science Collaboration, 2015) repeated 100 published experiments to see whether their results could be replicated. Of the 100 experiments, 97 of which had reported statistically significant results, only about one-third were replicated.
What is the p-value, and what does it have to do with the replication crisis? The p-value is a statistic that tells us the extent to which the observed data match a hypothesis. When we perform a statistical analysis, we formulate hypotheses, called test hypotheses, that make predictions about samples or populations.
Let’s say that we want to test whether medicine A is a better cure for the swine flu than medicine B. There could be two hypotheses in this case. The first test hypothesis is “The effects of medicines A and B are not different,” which is the null hypothesis, meaning that it predicts no difference or effect. One can also state the hypothesis “The effects of medicines A and B are different,” which is an alternative hypothesis. Usually, the null hypothesis is the one tested in a statistical analysis, because its prediction is more straightforward (is there an effect or not?) than that of the alternative hypothesis (Which medicine has the larger effect? How large is the effect?) (“Null Hypothesis,” n.d.). Returning to the p-value, our test hypothesis and its prediction would be “There will be no difference between medicines A and B.” The p-value then tells us the extent to which the observed data match the hypothetical data we would expect if the prediction of the test hypothesis were true. More precisely, it is the probability, computed under the test hypothesis, of obtaining data at least as extreme as the data actually observed. If the observed data barely match the hypothetical data, the p-value will be small, which indicates that the observed pattern differs from the pattern we would typically see if there were no difference. Therefore, the p-value is not an absolute criterion that separates a wrong hypothesis from a right one. Rather, it is a statistic that tells us how similar the observed data pattern is to the hypothesized data pattern that we would like to reject.
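To make this idea concrete, here is a minimal sketch of one way to compute a p-value for the medicine example: a permutation test, which builds the "hypothetical data" by repeatedly shuffling the group labels as if there were no difference between the medicines. This is only an illustration under stated assumptions and not the method used in any of the cited studies. The recovery times are made-up numbers, and the function name permutation_p_value and its parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    Under the null hypothesis (no difference between the groups), the
    group labels are interchangeable, so shuffling them shows which
    differences would typically arise by chance alone.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1

    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical recovery times in days (invented for illustration only).
medicine_a = [5.1, 4.8, 6.0, 5.5, 4.9, 5.2, 5.8, 4.7]
medicine_b = [6.3, 6.9, 5.9, 7.1, 6.5, 6.8, 6.1, 7.0]

p = permutation_p_value(medicine_a, medicine_b)
print(f"p-value: {p:.4f}")
```

A small p-value here means only that a difference this large rarely appears when the labels are shuffled. It does not give the probability that the null hypothesis is true, which is exactly the misreading discussed above.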
If you remember last week's reading (James et al., 2013), the models identified through regression analysis were fitted to the observed data to test whether the model was accurate. Calculating the p-value works in the opposite direction: the observed data are compared against the test hypothesis to see whether the data match the prediction of the hypothesis, not the other way around. Therefore, the p-value calculation can be considered a data-testing process.
The problem with scientists simplifying the meaning of the p-value was that it led them to rely on the p-value and to believe the myth surrounding it. As suggested by Greenland et al. (2016), the myth of the p-value might have helped scientists apply complex statistical analyses at little cognitive cost. However, the myth also led many scientists to inaccurate conclusions. Moreover, many scientists seemed to stop attending to the basic assumptions of statistical analyses, which need to be satisfied in advance. As a result, the conclusions and results of many studies became less likely to be replicated, because they rested on unverified assumptions and an inaccurate definition of the p-value. Unfortunately, scientists are not going to give up the p-value any time soon, according to Gelman (2016). Greenland et al. (2016) suggest several principles for preventing potential problems with the p-value. These include accurate interpretation and examination of the p-value, appropriate use of confidence intervals to examine hypotheses, and a thorough check of the assumptions behind the analysis techniques. Altogether, these principles point to the fact that appropriate statistical practice starts with scientists’ understanding of the analysis techniques and their openness to criticism, as Gelman (2016) also points out.
Null Hypothesis. (n.d.). Retrieved from http://psc.dss.ucdavis.edu/sommerb/sommerdemo/stat_inf/null.htm
Gelman, A. (2016, September 21). What has happened down here is the winds have changed [Web log post]. Statistical Modeling, Causal Inference, and Social Science. Retrieved March 8, 2017, from http://andrewgelman.com/2016/09/21/whathashappeneddownhereisthewindshavechanged/
Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). New York: Springer.
Open Science Collaboration. (2015). Estimating the reproducibility of psychological science. Science, 349(6251), aac4716.