If not p values, then what?

For about a century, people have been using p values. Usually incorrectly. For almost as long, but especially over the last 20 years, a lot of statisticians (including me) have been railing against p values and their misuse. For a true battle cry against p values, I suggest reading The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives by Deirdre McCloskey and Stephen Ziliak, which came out in 2008.

More recently, the American Statistical Association jumped on the bandwagon, issuing a statement noting that p values and statistical significance are overused and misused. Its six principles are:

  1. P-values can indicate how incompatible the data are with a specified statistical model.
  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
  4. Proper inference requires full reporting and transparency.
  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

It’s much milder than I (or some others) would like, but at least it goes in the right direction. Even more recently, a large group of scientists signed a statement opposing p values and published it in Nature. It is much stronger than the ASA statement and much closer to what I agree with. It called for “the entire concept of statistical significance to be abandoned.”

But if not p values, what? You can read the Nature article for their suggestions, which are well thought out. The rest of this piece presents my own thoughts.

First, we need to think. We should not make decisions based on an arbitrary threshold set up by a man 100 years ago, even if that man is Ronald Fisher. Even if we accept the general notion of statistical significance, always picking 0.05 is ludicrous. We also usually ask for power of 0.8 or 0.9, which means accepting a type II error rate of 0.2 or 0.1. Combined with a type I error rate of 0.05, that says a type I error is four or two times as bad as a type II error. That might be so, but it might not. A type I error means rejecting the null when it is true; a type II error means failing to reject the null when it is false. So, for instance, suppose you have invented a drug that you think cures a previously incurable and terminal illness. A type I error would mean giving something useless to people who are dying, while a type II error would mean letting people die who could have been cured.
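To make the trade-off concrete, here is a minimal simulation sketch (the sample size, effect size, and trial count are arbitrary numbers chosen for illustration): with the conventional 0.05 threshold, the type I error rate lands near 5% by construction, while the type II error rate depends entirely on the sample size and the true effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05           # the conventional type I error threshold
n, effect = 30, 0.5    # per-group sample size and true effect in SD units (illustrative)
trials = 5000

type1_rejections = 0   # null true, but we reject it anyway
type2_misses = 0       # null false, but we fail to reject it
for _ in range(trials):
    # Null is true: both groups drawn from the same distribution
    a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)
    if stats.ttest_ind(a, b).pvalue < alpha:
        type1_rejections += 1
    # Null is false: the second group is genuinely shifted
    a, b = rng.normal(0, 1, n), rng.normal(effect, 1, n)
    if stats.ttest_ind(a, b).pvalue >= alpha:
        type2_misses += 1

type1 = type1_rejections / trials  # close to alpha by construction
type2 = type2_misses / trials      # whatever n and the effect happen to give us
print(f"type I rate: {type1:.3f}, type II rate: {type2:.3f}")
```

Nothing forces these two error rates into any particular ratio; whether the conventional ratio is right depends on the costs of each mistake, as in the drug example above.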

Second, we should recognize that what is most important from typical statistical output is the effect size and an estimate of its precision. Effect sizes are things such as:

  • How many people will be cured?
  • How much money will be made?
  • How big is the difference between two or more groups?

and so on. Whether we are scientists or business people, these are clearly things we are interested in. Today, with big data everywhere, we can often find a tiny difference to be highly statistically significant (i.e. a very low p value). But, even today, there are some areas (like treating very rare diseases) where a very large effect might not be significant. Then there are measures of how good our guess is. “The new ad will make us $5,000,000” is much less impressive if we have to add “or maybe it will lose us $2,000,000, or make us $10,000,000”.
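As a sketch of what reporting an effect size with its precision looks like (the data here are simulated and all the numbers are invented for illustration), here is an estimate-plus-confidence-interval version of the ad example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical data: revenue lift, in thousands of dollars, for 40 stores
# that ran the new ad (simulated here; a real analysis would use real data)
lift = rng.normal(loc=5, scale=12, size=40)

est = lift.mean()                            # the effect size: average lift
se = lift.std(ddof=1) / np.sqrt(len(lift))   # standard error: how good our guess is
t_crit = stats.t.ppf(0.975, df=len(lift) - 1)
lo, hi = est - t_crit * se, est + t_crit * se

# Report the estimate and its precision, not just whether p < 0.05
print(f"estimated lift: ${est:.1f}k (95% CI ${lo:.1f}k to ${hi:.1f}k)")
```

A wide interval that straddles zero tells the business far more than “not significant” does.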

Third, we should use the MAGIC criteria introduced by Robert Abelson in Statistics as Principled Argument (the first two are covered in the preceding paragraph):

  1. M: Magnitude — how big is the effect?
  2. A: Articulation — how precise is our estimate of the effect?
  3. G: Generality — how widely does it apply?
  4. I: Interestingness — how much would the result change what people believe?
  5. C: Credibility — incredible claims require incredible evidence.

Fourth, we should recognize that one of the main goals of statistics is to separate signal from noise. We should test our methods on pure noise to make sure they don’t find signals that are not there, and on data with a known signal to make sure they do find the signals that are there.
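One way to run both checks is to feed a method pure noise and then data with a planted signal. This sketch (all numbers arbitrary) does that with simple correlation tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_vars = 50, 200

# Check 1: pure noise. None of these variables is related to the outcome,
# so a sound method should flag only about 5% of them at the 0.05 level.
y = rng.normal(size=n)
noise = rng.normal(size=(n, n_vars))
pvals = [stats.pearsonr(noise[:, j], y)[1] for j in range(n_vars)]
false_hits = sum(p < 0.05 for p in pvals)
print(f"spurious 'signals' found in pure noise: {false_hits} of {n_vars}")

# Check 2: a known signal. This variable really does drive the outcome,
# so the method should find it.
x = rng.normal(size=n)
y_signal = 2.0 * x + rng.normal(size=n)
p_signal = stats.pearsonr(x, y_signal)[1]
print(f"p value for the planted signal: {p_signal:.1e}")
```

The handful of “hits” in pure noise is also a reminder of why testing many things at 0.05 and reporting only the winners is so dangerous.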

Fifth, we should recognize that any conclusion we come to, from any data and by any method, might be wrong. We should try to estimate the chance that we are wrong and look at the consequences of being wrong in different ways.

All this is a lot harder than saying “p = 0.02! Significance!” But it’s also a lot more meaningful and a lot better.