The Poisoned Well of Psychology

I went back and forth all weekend debating whether to write a blog post about “tone” (#tone) or about p values. After a spirited conversation with my wife yesterday, I decided to go with….
p values!
There has been a lot of hubbub over p values lately, mostly due to a recent paper by Daniel Benjamin et al. (and I really mean et al.…it has 72 authors! Here is the preprint: https://osf.io/preprints/psyarxiv/mky9j).
In this paper, a lot of really smart people suggest revising the significance threshold from p = 0.05 to p = 0.005. They argue that this would bring the false positive rate down to something closer to an actual 5%, which would reduce false-positive findings in the scientific literature and help resolve the replication crisis.

First, I want to say that I really like a lot of the authors on this paper. E.-J. Wagenmakers’ response to the Daryl Bem “Feeling the Future” paper is what led me into the Open Science movement to begin with. Brian Nosek and Simine Vazire helped lead and organize the amazing Society for the Improvement of Psychological Science (SIPS) conference last week and are working very, very hard to improve psychology. I cannot overstate the respect and admiration I have for a lot of these authors (and unfortunately for me, I don’t know all of them, but if I did, I’m sure I’d respect and admire them too!).
That said, I cannot support this suggestion to revise the p value threshold to p = 0.005. Here’s my reason why.
My primary problem with the suggestion is the underlying notion that the strength of evidence can be measured with p values. The sub-header from the paper is literally “strength of evidence from p values”. P values do not measure the strength of evidence for the alternative hypothesis. I’ll repeat that and make it bold:
P values do not measure the strength of evidence for the alternative hypothesis!
The p value simply tells you the probability of observing data as extreme as, or more extreme than, your collected data if the null hypothesis is true. Here’s an example: You have a null hypothesis: “There is no height difference between adult men and adult women”. You test this by measuring the heights of 50 adult men and 50 adult women. You wind up with two distributions of height, one for each gender, and compute an independent-samples t test: t = (m1 - m2) / SEd, the difference between the two group means divided by the standard error of that difference. You calculate a t value of 4 with (n1 + n2) - 2 = 98 degrees of freedom. You look at a t table, see that your calculated t value of 4 is greater than the critical value of t, and reject the null hypothesis.
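If you want to see the mechanics, here is a minimal sketch in Python (the heights below are simulated, so the specific numbers are illustrative, not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated heights (cm) for 50 adult men and 50 adult women.
# The means and SDs here are made up purely for illustration.
men = rng.normal(loc=175, scale=7, size=50)
women = rng.normal(loc=162, scale=6, size=50)

# Independent-samples t test: t = (m1 - m2) / SEd,
# with (n1 + n2) - 2 = 98 degrees of freedom.
t_stat, p_value = stats.ttest_ind(men, women)

print(f"t(98) = {t_stat:.2f}, p = {p_value:.2g}")
# The p value is the probability of a t this extreme or more extreme
# *if* the null hypothesis (no height difference) were true --
# it is not the probability that the alternative hypothesis is true.
```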
What does this mean? It means that a height difference as large as the one you observed would be unlikely if your null hypothesis, “there is no height difference between adult men and adult women”, were true.
That leaves two options: either the null hypothesis is false, or the data is erroneous. Typically, scientists say the null hypothesis is rejected and the alternative hypothesis, “there is a height difference between adult men and adult women”, must be the case. Since the p value tells you nothing about this alternative hypothesis, it takes a jump of logic to get to “this data supports the alternative hypothesis”. The “data is erroneous” bit sometimes gets mentioned in conversations about validity and sampling bias, but for the most part, researchers assume their collected data is representative. This is not always a great assumption to make…
Over time this has blown up into scientists finding a significant p value and treating it as support for whatever their alternative hypothesis happens to be, typically without erring on the side of caution. This misinterpretation of what a p value means (this behavior by scientists) is a huge problem in science right now and is one major reason for the rise in false-positive findings in psychology.
So the new paper suggests reducing the p value threshold for significance from 0.05 to 0.005. In terms of reducing false-positive findings, it would work, and I support that goal.
However, the authors use Bayes’ theorem to support their recommendation. They argue that since a typical alternative hypothesis in psychology has low prior odds of being true (which they model as 1:5, 1:10, and 1:40), those prior odds should be taken into account when making claims about hypotheses. This I cannot agree with, mainly because Bayesian approaches to data analysis already exist and can be used if investigators choose them. Inverse, subjective probability does not need to be the foundation of every analysis, and especially not of frequentist approaches.
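To make the prior-odds argument concrete, here is a rough back-of-the-envelope sketch of the Bayesian bookkeeping (my own illustrative numbers; the Bayes factor of about 3 for a just-significant result is a commonly cited ballpark, not a figure taken from the paper):

```python
# Posterior odds = Bayes factor x prior odds.
prior_odds = 1 / 10    # one of the paper's modeled priors: H1 vs. H0 at 1:10
bayes_factor = 3       # assumed evidence from a result right around p = 0.05

posterior_odds = bayes_factor * prior_odds
posterior_prob_h1 = posterior_odds / (1 + posterior_odds)

print(f"posterior odds for H1: {posterior_odds:.2f}")           # 0.30
print(f"posterior probability of H1: {posterior_prob_h1:.2f}")  # 0.23
# Under these assumptions, most "significant" findings would still be
# false positives -- which is the intuition driving the stricter threshold.
```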
Second, these prior odds are just suggested models of what the prior odds for an alternative hypothesis could be. Basing your prior odds on previous literature can be tricky because of publication bias (which makes an effect look more likely than it may actually be). Most Bayesian approaches use uninformative priors and a Bayes factor cutoff of 3, which corresponds to a p value near 0.05. If the authors want to reduce the p value cutoff to 0.005, they should also recommend increasing the Bayes factor cutoff to something near 27 (a recommendation missing from their paper).
The problem with p values and false positives in the literature is not the statistics used, but the interpretations being made by investigators. As my wife Mariana (the smartest psychologist I know) put it this weekend, Psychology is a well of poisoned water. The argument over p value thresholds is like arguing over the color of the bucket used to draw the water.
Many psychologists, and I can probably count myself in this group, like to use math and statistics to legitimize our research. The more quantifiable our work, the more real it is…right? This has led many to focus on data analysis more than on data collection and the overall setup and methodology of their work. You cannot correct garbage science with a multi-level model, or a mediated-moderation analysis, or a Bayesian graphical model.
Good science starts at conception and continues with every decision the researcher makes. Statistical analysis is just one step in a large, multi-step process. At each point, the quality of the work can erode unless the investigator is careful and thorough from beginning to end. The way forward is to target researcher behavior, not p value thresholds.

