Redefining Statistical Significance: Can a .005 alpha level cure Significance Testing?
In a new paper currently awaiting peer review and publication, 50+ researchers from institutions across the globe argue that manipulating the alpha level would not suffice to solve the real issue with significance testing. The authors make a case that regardless of the threshold, depending on p-values to reject the null hypothesis is “deleterious for the finding of new discoveries and the progress of cumulative science”.
Here is a summary of the key points with some links to supplementary information:
About the paper: Manipulating the Alpha Level Cannot Cure Significance Testing
The paper is in the pre-publication phase: it is available for early communication or feedback before peer review (Trafimow, D., Amrhein, V., Areshenkoff, C. N., Barrera-Causil, C., Beh, E. J., Bilgiç, Y., … & Chaigneau, S. E., 2017).
The team effort, led by David Trafimow, is supported by a group of researchers based all around the world, with expertise ranging from Medical and Surgical Sciences through Psychology and Cognitive Science to Math and Statistics. The list of authors spans Australia, Bulgaria, Canada, Chile, Colombia, Germany, France, Italy, USA, New Zealand, Russia, Japan, and many more (including 2 of the best professors at my home university).
Why change the alpha level from .05 to .005?
The Trafimow paper is a direct response to a proposal by Daniel Benjamin to redefine statistical significance (Benjamin et al. 2017). In a nutshell, Benjamin proposes a change to a .005 threshold because, according to him:
- A two-sided p value of .005 corresponds to Bayes factors between approximately 14 and 26 in favour of H1: a range indicating substantial to strong evidence, according to Bayesian classifications
- There is evidence that low statistical power and α = .05 combine to produce high false positive rates. A more conservative standard of .005 would thus reduce the false positive rate to reasonable levels.
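To make the second point concrete, here is a minimal sketch of how the share of false positives among significant results depends on alpha and power. The prior odds of 1:10 in favour of H1 and the three power values are assumptions chosen for illustration, and the function name is hypothetical; none of them come from this summary.

```python
def false_positive_rate(alpha, power, prior_h1):
    """Share of significant results for which the null is actually true."""
    false_positives = alpha * (1 - prior_h1)  # true nulls that reach significance
    true_positives = power * prior_h1         # real effects that reach significance
    return false_positives / (false_positives + true_positives)


PRIOR_H1 = 1 / 11  # assumption: prior odds of 1:10 that a tested effect is real

for power in (0.2, 0.5, 0.8):  # assumption: illustrative power levels
    for alpha in (0.05, 0.005):
        rate = false_positive_rate(alpha, power, PRIOR_H1)
        print(f"power={power:.1f}  alpha={alpha:<5}  false positive rate={rate:.3f}")
```

Under these assumptions the rate falls sharply when alpha moves from .05 to .005, and it is highest when power is low, which is the pattern the proposal relies on.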
To begin with, it is true that by Frequentist logic the probability of Type 1 error is indeed lower at p=.005 and, all things being equal, by Bayesian logic the null hypothesis is less likely if p=.005. However, there are deeper issues regarding tests of significance based on rejecting the null hypothesis.
Why is Significance Testing not a solution?
Critics of the null hypothesis significance testing procedure (NHST) argue that p does not provide useful information on the probability of the null or the alternative hypothesis, a valid way to disconfirm chance observations, or a valid index of effect size, of the generalizability of the findings, or of their probability of replication (Trafimow and Earp, 2017). The p value is, instead, better suited to controlling Type 1 errors, which is still an important goal of statistical research.
Regression and Reliability (Trafimow et al. 2017)
As the authors point out, p-values have a sampling distribution, and whether the p-value obtained in any given experiment passes the alpha level is a “matter of luck”: only with large effect and sample sizes does the p-value reliably come out small.
Without large effect and sample sizes, a p-value below the significance threshold is unlikely to be resampled upon replication. The regression to the mean phenomenon suggests that if a variable is extreme upon first measurement, it will tend to get closer to the average upon subsequent measurements (and vice versa).
By calculating the correlation between the p-values of a cohort of original studies and those of the corresponding replication cohort, Trafimow obtained a value of .004*. The prediction here is that in replication experiments the p-value is much closer to the mean of the p-value distribution than to the p-value obtained in the original experiment. In studies with low power, large sample effect sizes are still possible, and since the ones that reach significance are overestimates, using a lower threshold of .005 might guarantee even larger overestimates.
What is more, the correlation of .004 between the p-values of original studies and their replications is by itself an indication that p-values are a poor foundation for binary decisions and are incapable of providing a valid measurement of evidence strength.
*Take a look at the data yourself at The Reproducibility Project (RP:P)
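The two claims in this section, that a significant p-value is unlikely to reappear upon replication in low-power studies and that significant estimates are overestimates, can be explored with a small simulation. This is only a sketch under assumptions that are mine, not the paper's: a single fixed true effect of 0.3 standard deviations, 30 observations per study, a known standard deviation of 1, and a two-sided z-test. It does not use the RP:P data.

```python
import math
import random

TRUE_EFFECT = 0.3  # assumption: standardized population effect size
N = 30             # assumption: observations per study
STUDIES = 20_000   # number of simulated original/replication pairs


def two_sided_p(sample_mean, n):
    """Two-sided z-test p-value for H0: mean = 0, with a known SD of 1."""
    z = sample_mean * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(alpha, seed=1):
    """Return (mean estimate among significant originals, replication rate)."""
    rng = random.Random(seed)
    se = 1 / math.sqrt(N)  # standard error of the sample mean
    estimates = []
    replicated = 0
    for _ in range(STUDIES):
        # Draw each study's sample mean directly from its sampling distribution.
        original = rng.gauss(TRUE_EFFECT, se)
        replication = rng.gauss(TRUE_EFFECT, se)
        if two_sided_p(original, N) < alpha:
            estimates.append(original)
            if two_sided_p(replication, N) < alpha:
                replicated += 1
    return sum(estimates) / len(estimates), replicated / len(estimates)


for alpha in (0.05, 0.005):
    mean_estimate, replication_rate = simulate(alpha)
    print(f"alpha={alpha:<5}  true effect={TRUE_EFFECT}  "
          f"mean significant estimate={mean_estimate:.2f}  "
          f"replications also significant={replication_rate:.1%}")
```

Because the original and the replication are independent draws around the same true effect, the replication rate here is simply the power of the test, which is why it drops when the threshold is lowered without increasing the sample size. The average significant estimate sits above the true effect at both thresholds and further above it at .005.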
Other Issues covered in the paper:
- The relative importance of Type I and II Errors differs across studies, areas and researchers, so choosing a blanket level across multiple areas is inadvisable
- Defining criteria for successful replicability
- Assumptions about random sampling from a population and independence are rarely true
- The Population Effect Size plays a key role in obtaining statistical significance in the original study (larger population effect size => larger sample effect size) and in successful replication. Choosing a .005 threshold would not lessen the importance of the population effect size and would instead increase it, unless sample sizes increase substantially from the ones typically in use
- Replication depends to a large extent on sample size, but large sample sizes are not always possible due to the associated costs, underappreciation of how much the sample size matters, and publication incentives that favour novel results over reliably replicable ones.
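To get a feel for how much sample sizes would have to grow under a .005 threshold, here is a rough calculation using the standard normal approximation for a two-sided, two-sample comparison. The effect size of 0.5, the target power of 80% and the function name are assumptions made for this example only.

```python
import math
from statistics import NormalDist


def n_per_group(effect_size, alpha, power=0.80):
    """Approximate per-group n for a two-sided, two-sample z-test."""
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = standard_normal.inv_cdf(power)          # quantile for the target power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


EFFECT_SIZE = 0.5  # assumption: standardized mean difference (Cohen's d)

n_05 = n_per_group(EFFECT_SIZE, alpha=0.05)
n_005 = n_per_group(EFFECT_SIZE, alpha=0.005)
print(f"alpha=.05 : {n_05} per group")
print(f"alpha=.005: {n_005} per group ({n_005 / n_05 - 1:.0%} more)")
```

With these inputs the required sample grows by roughly 70%. The approximation ignores the small correction a t-test would need, so treat the numbers as ballpark figures.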
Alternatives to Significance Testing
- Focus on sample size as an a priori procedure
- Focusing on other tools from the “Statistical toolbox” (the authors suggest confidence intervals, equivalence tests, Bayesian methods or information criteria, etc.), while keeping in mind that no alternative can provide ready-to-use, clear-cut answers
- Relying on cumulative evidence from multiple independent studies instead of single studies and replications from the same lab
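One way to read the first alternative, treating sample size as something decided before data collection, is to ask how many observations are needed for the sample mean to land within a chosen fraction of a standard deviation of the population mean with a chosen probability. The sketch below assumes a normally distributed sample mean. The function name and the example values are mine, and this is not necessarily the exact procedure the authors have in mind.

```python
import math
from statistics import NormalDist


def a_priori_n(precision, confidence=0.95):
    """Smallest n so that the sample mean falls within `precision` standard
    deviations of the population mean with probability `confidence`."""
    z = NormalDist().inv_cdf((1 + confidence) / 2)  # two-sided normal quantile
    return math.ceil((z / precision) ** 2)


for precision in (0.1, 0.2, 0.5):  # assumption: illustrative precision targets
    print(f"within {precision} SD with 95% probability: n >= {a_priori_n(precision)}")
```

Nothing here involves a p-value or a null hypothesis: the sample size is set by how precise the estimate should be, before any data are seen.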