Changing the default p-value threshold for statistical significance ought not be done, and is the least of our problems.

Shockingly, scientists’ claims largely fail to replicate, including claims with expensive social, economic and health consequences. This article is about how we know we are supposed to do science, and about a new proposal to shift our target. First, what we know would work if we did it:

Fisherian Science

Last century, it was realized that experiments seldom yield unambiguous outcomes. Instead, results are probabilistic. Typically, researchers express this in terms of a known baseline: how their outcome measures would have varied if the treatment had no effect at all. The great R.A. Fisher, responsible for much of this ground-breaking work, suggested that if a result had a probability of less than 5% of occurring under this “null hypothesis” (the iconic p < .05), the researcher should feel encouraged to pursue the finding in more depth (and if not, they, and others, should feel encouraged to abandon the idea).

Academic science has seen this scenario go wildly wrong. Fisher and others told us that if we make multiple tests, we must adopt much more stringent criteria to control the true critical p value at .05.

Fisher of course also assumed that scientists had not inspected the data and then reported only the measures that “worked” (a practice known as p-hacking).

It was also assumed that we would share our failures: roads down which others should not walk. Fisher invented a version of meta-analysis to combine these results into a system-wide estimate of support. When researchers hide these data on their computer disks (producing what is known as publication bias), that assumption breaks.
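As a rough illustration of how such a combination works, here is a minimal sketch of Fisher's combined probability test. It assumes the studies are independent and that every study, including the failures, is in the list. The p-values are made up for the example, and the function names are hypothetical.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # Sum of (x/2)^i / i! for i = 0 .. df/2 - 1
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combined_p(p_values):
    """Fisher's method: X2 = -2 * sum(ln p_i), referred to chi-square with 2k df."""
    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, 2 * k)

# Hypothetical p-values from four independent studies of the same effect,
# including the ones that "did not work".
hypothetical_p = [0.08, 0.20, 0.04, 0.35]
statistic, combined = fisher_combined_p(hypothetical_p)
print(f"X2 = {statistic:.2f}, combined p = {combined:.4f}")
```

If the non-significant studies are left out of the list, the combined p-value shrinks, which is exactly how publication bias distorts the system-wide estimate.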

https://en.wikipedia.org/wiki/File:Fisher-stainedglass-gonville-caius.jpg

Finally, Fisher assumed studies would be run with adequate power. Fisher began from commercial agriculture, where seeds grown in the half-dozen fields typically tested would, given his invention of randomization, reliably reveal as significantly superior any variety or treatment with a commercial benefit. We know, though, that most true effects and typical study sizes in science are small enough that power is an issue: much of functional neuroimaging, for instance, is grossly underpowered.

The answer to these problems has seemed clear to many for decades now: we need to follow all of what Fisher said, not just practice a .049 cargo cult.

1. Most importantly, we must adjust the critical p-value to an effective .05. Given human nature, we likely should adjust for all possible tests of all variables measured. GWAS studies do this, setting their p-value for a significant genetic effect at 5×10⁻⁸ (p = 0.00000005)! These Bonferroni-type adjustments are well known in social science, but under-used. Note: Geneticists didn’t move the critical value, they simply controlled it at a true .05.

2. We must require a declaration that p-hacking and related practices have not occurred.

3. We must write-up and share all our results, not just the ones that work, and reputable journals must publish them prominently, independent of the result.

4. We must require adequate (80%+) power.
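The arithmetic behind the first point can be sketched in a few lines. This is a minimal illustration, assuming independent tests. The figure of one million tests is the conventional approximation used to motivate the genome-wide threshold and is not stated in this article; the function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests

def familywise_error_rate(alpha_per_test, n_tests):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests

# Twenty uncorrected tests at .05 (an illustrative number of outcomes and covariates).
print(f"Uncorrected, 20 tests: {familywise_error_rate(0.05, 20):.2f}")

# Same twenty tests, each held to the Bonferroni threshold.
per_test = bonferroni_threshold(0.05, 20)
print(f"Corrected,   20 tests: {familywise_error_rate(per_test, 20):.3f}")

# Assumed one million independent tests across the genome.
print(f"Genome-wide threshold: {bonferroni_threshold(0.05, 1_000_000):.1e}")
```

The corrected threshold looks far stricter than .05, but the familywise error rate it delivers is still roughly .05: the critical value is being controlled, not moved.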

Critically, a second strand of work, associated with likelihood, has focussed much more on what we should be testing, concluding that the relevant comparison for a result is hardly ever the null hypothesis. Rather, models should be compared against their leading competitor’s predictions. This leads to likelihood tests where simply being better than chance is almost never sufficient for a model to emerge as the preferred model.
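To make the contrast with null-hypothesis testing concrete, here is a minimal sketch of a likelihood comparison on invented data. The data (60 successes in 100 trials) and the three candidate models are purely illustrative assumptions, and the function names are hypothetical.

```python
import math

def binomial_log_likelihood(successes, trials, p):
    """Log-likelihood of observing `successes` out of `trials` when the success rate is p."""
    log_choose = (math.lgamma(trials + 1)
                  - math.lgamma(successes + 1)
                  - math.lgamma(trials - successes + 1))
    return (log_choose
            + successes * math.log(p)
            + (trials - successes) * math.log(1.0 - p))

def log_likelihood_ratio(successes, trials, p_a, p_b):
    """Positive values favour model A over model B."""
    return (binomial_log_likelihood(successes, trials, p_a)
            - binomial_log_likelihood(successes, trials, p_b))

# Invented data and models for illustration only.
successes, trials = 60, 100
chance, theory_a, theory_b = 0.50, 0.55, 0.62

# Theory A is better than chance ...
print("A vs chance:", round(log_likelihood_ratio(successes, trials, theory_a, chance), 3))
# ... but it still loses to its leading competitor.
print("A vs B:     ", round(log_likelihood_ratio(successes, trials, theory_a, theory_b), 3))
```

In this toy case theory A would clear a "better than chance" bar and still not be the preferred model, which is the point of comparing against the strongest rival rather than the null.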

In 2017, these well-understood, well-validated, and powerful practices remain largely un-implemented, to the great detriment of science.

TLDR: We know what to do…

The new idea

Recently a new suggestion (well, an old solution with a lot more support) was circulated. The proposal is expressed succinctly: New discoveries would be expected to meet a standard of 𝑃 < 0.005.

That’s it.

Here I raise some responses to this suggestion.

1. Lack of a barrier to false claims. It is proposed that results not meeting the new threshold be called “suggestive.” This poses no barrier to publication; perhaps not a single word need change. Researchers can publish four p-hacked studies, each switching outcomes, processing data differently, varying the choice of covariates and statistical test (who hasn’t seen this brazenly defended in reviewing manuscripts?). Business as usual, just delete the word “significant”: this is already SOP (standard operating practice) for researchers with NS results in their preferred direction.

2. Failure to define “new”. The recommendation doesn’t define what a “new” finding is. Authors with results yielding p greater than .005 cannot be prevented from simply framing their new breakthrough as incremental, building on a body of existing (likely p-hacked & publication-biased) literature. Authors will write to editors “My finding is not merely ‘new’, it is instead ‘important’, so I adopt the p = .05 standard”. The paper title (all that most people read) remains as grandiose as is deemed permissible. Indeed, the problem with the present research environment is not that we have too many novel findings: Perhaps we have a hundred fabulous new claims each year in each field. Most are false, and we simply need funders to direct resources to independent replication to reveal these. Alongside those few sky-rockets, however, we have tens-of-thousands of me-too apparent replications of invalid findings, and very very few hard tests of existing influential-but-false theories. It is this latter mix that is so toxic: throwing gasoline on the false theory’s already bright fire, while denying entry to the firefighters. Separate from this, we have a problem of not generating remotely enough genuine new discoveries. Lowering the p-hurdle harms true discoveries, and does little to prevent false reports.

3. The new value is undefined and asymmetric in most use-cases. The recommendation doesn’t set a critical threshold for the myriad uses of p-values other than reporting a new claim. This is perhaps the biggest problem. Where Fisher set a uniform value (able to be moved based on cost-benefit trade-offs agreed on a field-by-field basis), this new proposal sets the value in only one, rare, case. In so doing it creates a raft of problems: a low p-value is needed to use the word “significant”, but researchers can flip this on its head, dropping true effects which they don’t like by saying that removing them didn’t lower model fit to the degree required by the .005 standard. Conversely, in their next paper they can retain an effect that pleases them by using the .05 standard. We need a single symmetrical standard, applied for all claims, whether they be new or incremental, success or failure to replicate, as well as for comparing models and for dropping or retaining paths in models.

4. Barrier to good science. Celebrating .005 as the virtuous p-value gives carte blanche for defenders of failed theories to demand that would-be publishers of replications show they were powered at the .005 level, thus blocking publication of the failure to replicate. At the same time, the same status-quo partisan reviewers and editors can allow “successful” replications through at p < .05. Double standards will multiply our problems.

5. Omits model comparison. The recommendation is restricted to null hypothesis significance tests. A whole field of methods avoids these, preferring likelihood to determine which of two theories (each accounting for much more variance than would a null-model) is a better fit. This is an advanced science, with well understood constructs of relative and absolute fit, and information criteria such as the AIC. Should the standard be applied here? Again only to authors who use the word “new” to describe their finding?

6. Undefined for model reduction. Relatedly, how should model-reduction occur? Can we drop all paths that are >.005? Authors will now start doing this, citing the .005 guide and its impressive author list. But this will lead to catastrophically bad fitting models being published. Do we allow people using structural modeling and model comparison to avoid the .005 criterion? If so, how do we stop people barred from publishing an ANOVA at .05 from simply switching to the equivalent structural equation model?

7. Diametric opposite to baseline practice in genetics. The paper acknowledges that discovery thresholds in genomics are much more strict, but does not explain why these are not suggested here. In part, it is because the genomics approach is almost diametrically opposite to that advocated by the .005 proposal. Rather than arguing against .05 as an inferential tool (as the new proposal does), genomics adopted multiple comparison criteria precisely to preserve this .05 level of evidence. Rather than adopting a single value by fiat, genetics computed the value for the special case that is testing genetic polymorphisms in the human genome, and set that as the objective standard. Geneticists, nearly all of whom are trained in math and statistics, understood and adopted this across the field. One simply cannot publish in Behavior Genetics without a 5×10⁻⁸ p-value and (preferably) a replication for even a single polymorphism! Where multiple candidates are desirable even at the cost of some number being false positives, geneticists adopted the rational “false-discovery rate” control methods. In each case, p = .05 remains the anchor for the rationale. And GWAS findings replicate extremely well.

8. Increases costs relative to benefits. In life, we seek to reduce costs and increase benefits: We set a target of maximizing net “good minus bad”. Thus we drive our better angels to the foreground. Shifting from .05 to .005 will roughly double the cost of studies (thus halving the true discovery rate for a fixed budget), leaving untouched the problems of culture, p-hacking, and all the rest that created the replication crisis. It thus increases the cost of living for honest scientists, with no clear benefit to them, or to the wider world.

It is claimed that money will be saved by researchers not having to perform future studies based on false premises. But this one-sentence claim is not substantiated. As others have noted, p-values cannot be set objectively other than as cost-benefit trade-offs of false discovery versus false negative rates. A cost-benefit analysis is therefore THE key piece of evidence about which we need information. One simple example of such a cost is a massive diversion of already scarce funds away from low-cost replications, which are efficiently clearing the vast blob of false findings from our text books, policy makers’ bed-time reading, and the public mind. This money will be diverted to funding… well, what? The paper does not address any changes to the research infrastructure.

The only sure outcome of raising the cost of running publishable studies is that an even smaller number of anointed researchers will win an even tougher lottery and then run massive studies of hypotheses that are almost certainly wrong.

Nowhere are there studies showing that researchers with reputations built on funded-but-false findings will not simply take this doubling in their grant budgets (relishing the extinction of their competitors) and watch as their post-doc teams run larger studies with equal access to p-hacking and outcome switching, while still failing to steel-man competing theories, failing to include active controls, and continuing to take advantage of white-hat bias and publication bias. They will then happily publish an even larger slice of the greatly diminished literature, and place their students into key tenured positions based on this even less competitive publishing environment. These acolytes will then proceed with p=.05 me-too studies, adding the icing of “incremental validity” to this fake-science cake, while their foes, now de-funded, are unable even to get those papers they can complete published unless they show that any failures to replicate are powered at 95% power to obtain p=.005 for effects a fraction the size of those originally reported.

These are the details that need to be spelled out and demonstrated to be viable before a major, indeed seismic, shift in science practice.
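The size of the cost increase discussed under point 8 can be checked roughly with a textbook approximation. The sketch below assumes a two-sided z-test with a fixed true effect size, so the required sample size scales with the square of the sum of two normal quantiles. Real designs, and costs that do not scale linearly with sample size, will differ. The function name is hypothetical.

```python
from statistics import NormalDist

def relative_sample_size(alpha_old, alpha_new, power=0.80):
    """Ratio of required sample sizes (new / old) for a two-sided z-test,
    holding the effect size and the power fixed (normal approximation)."""
    z = NormalDist().inv_cdf
    z_power = z(power)
    n_old = (z(1.0 - alpha_old / 2.0) + z_power) ** 2
    n_new = (z(1.0 - alpha_new / 2.0) + z_power) ** 2
    return n_new / n_old

# Moving from .05 to .005 at two power levels.
for power in (0.80, 0.95):
    ratio = relative_sample_size(0.05, 0.005, power)
    print(f"power {power:.2f}: sample size multiplied by {ratio:.2f}")
```

Under these assumptions the multiplier at 80% power comes out at about 1.7, somewhat below a full doubling. Either way it is a substantial cost that falls on every study, honest or not.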

Conclusion

The target paper serves as a valuable, visual reminder of how many false positives p = .05 generates, even when studies are well powered and people are honest. Based on reasonable estimates of how often scientists are testing wrong ideas, perhaps as many as 1/3 of studies with p < .05 are false-positives. But Fisher and others knew the remedy to this: Replication and cumulative, meta-analytic science. Feynman knew this too, when he argued that psychology wasn’t being practiced as a science precisely because it was routine to assume too much was true, rather than to replicate before extending.
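Where a figure like one third can come from is easy to sketch. The calculation below assumes honest, well-powered studies and treats the share of tested hypotheses that are actually true as a free parameter. The prior values used are illustrative assumptions, not estimates from this article, and the function name is hypothetical.

```python
def false_positive_share(alpha, power, prior_true):
    """Share of 'significant' results that are false positives, given the
    threshold alpha, the power, and the fraction of tested hypotheses that are true."""
    false_positives = alpha * (1.0 - prior_true)
    true_positives = power * prior_true
    return false_positives / (false_positives + true_positives)

# Illustrative priors: 1 in 11 (odds of 1:10), 1 in 6, and 1 in 3 hypotheses true.
for prior in (1 / 11, 1 / 6, 1 / 3):
    share_05 = false_positive_share(0.05, 0.80, prior)
    share_005 = false_positive_share(0.005, 0.80, prior)
    print(f"prior {prior:.2f}: p<.05 -> {share_05:.0%} false, p<.005 -> {share_005:.0%} false")
```

The share depends heavily on the assumed prior, which is one reason independent replication, rather than any single threshold, does most of the work of sorting true claims from false ones.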

This stoical and historic strategy of seeking independent replications and accumulating evidence across studies is highly robust. Pons and Fleischmann’s cold fusion claims didn’t need .005 to be interesting, and they did need independent (failed) replications to debunk them. And that, a fiery, open debate of ideas based on data, is what we need.

Getting there requires increased boldness of researchers to contest powerful ideas, changes in how we decide who and what to fund, assiduously writing-up failed results, and publishers accepting these and, like effective newspaper retractions, giving these failures the same prestige they lent to the original false-reports. This is how we will get to better.

An argument about null probabilities will, for most readers, simply undermine the very idea of hypothesis testing and is a risky distraction at an important juncture in the history of science.