What “Statistically Significant” Really Means.

Matthew Christopher Bartsh
Published in Nerd For Tech · Jul 14, 2021 · 5 min read

When an academic journal says that a study obtained a statistically significant result, it doesn’t mean what most people think it does. It doesn’t mean the result was important or consequential. Here, “significant” is being used in a jargon sense that is very different from the everyday sense of “important”.

What it actually means is this: even if the study was flawless in every respect (in practice, highly unlikely, so this is a very big “if”), and the standard five percent significance level was used, a result at least as extreme as the one observed could still turn up by chance alone as often as one time in twenty, even if there were no real effect at all.
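If you prefer to see that concretely, here is a minimal Python sketch. The group sizes, the number of simulated studies, and the use of a simple t-test are my own illustrative choices, not taken from any real study; the point is only to show what the five percent level actually guarantees.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulate 10,000 "studies" of a treatment that truly does nothing:
# both groups are drawn from exactly the same distribution.
n_studies, n_per_group = 10_000, 50
false_positives = 0
for _ in range(n_studies):
    treated = rng.normal(size=n_per_group)
    control = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(treated, control)
    false_positives += p < 0.05

print(f"Fraction declared 'significant' despite no real effect: {false_positives / n_studies:.3f}")
# comes out at roughly 0.05: the significance level is just this false-positive rate, nothing more
```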

It says nothing about the size of the effect, or its importance, or even about the odds of the effect being real. Not many people realize this last bit. You see, the odds of the effect being real depend on many things, including the context of the study.

Consider the situation where a pharmaceutical company carries out study after study until it finally gets one that shows what it wants to show (at the standard five percent significance level, this will on average be the twentieth study), then quietly files and forgets the other nineteen studies that failed. The company then publishes only that one study, which is quite legal (see Ben Goldacre’s book Bad Pharma).

Each and every study could be flawless when considered separately, and yet the significant result of that twentieth study of course shows nothing at all, except perhaps that the law of averages is working as usual. Probability is an extremely subtle thing and this makes it very easy for a savvy person or company to make totally misleading statements without actually lying.
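Here is a quick Python sketch of that arithmetic. The only inputs are the ones from the scenario above (a useless drug and the standard five percent level); everything else is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Chance that at least one of 20 flawless studies of a useless drug comes out
# "significant" at the 5% level, purely by luck:
p_at_least_one = 1 - 0.95 ** 20
print(f"P(at least one false positive in 20 studies) = {p_at_least_one:.2f}")  # about 0.64

# How many studies, on average, until the first lucky "significant" one?
# Each study is a 5% chance, so the waiting time is geometric with mean 1/0.05 = 20.
attempts = rng.geometric(p=0.05, size=100_000)
print(f"Average number of studies until one 'works': {attempts.mean():.1f}")  # about 20
```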

The same principle applies when different teams of scientists try to prove something they all believe. Finally, one team gets a significant result, and publishes it, without saying anything about all the failed attempts at getting a significant result.

There is a well-known xkcd cartoon about exactly this phenomenon.

This is just one of an infinite number of ways that you could have a meaningless “significant” result.

Other ways include deliberate bias, accidental bias, an error in the logical foundation of the experiment (introduced perhaps accidentally, or “accidentally on purpose”), an error in the collection of the data, or an error in the analysis of that data.

And anyway, as mentioned at the start, that an effect is “significant” does not mean that it is big enough to be important. That it is “significant” says nothing about the size of the effect.

So suppose the study is flawless and there are no subtle problems at all, such as the existence of other studies looking at the same (or an equivalent) thing. Suppose, in other words, that there is no problem with the use of the word “significant” and it means exactly what statisticians think it means (in this, to the layman, highly misleading sense). There is still a huge problem that completely destroys the real, practical value and meaning of the finding: the effect size could be infinitesimal.

You see, with a big enough sample size, if there is *any* effect present (and there nearly always is — see my article about how everything is correlated to everything), it can be shown at any significance level to be present. So all that “significant” means here is that there is only a one in twenty chance (if the study was flawless and so on) that there is the *opposite* effect.

This is because, roughly speaking, the result is really showing that there is a 95% probability that the effect is in a certain direction (call it “positive”), that is, zero or greater in size. Which is the same as saying there is a 5% chance that the effect is actually negative.

An example will make this clear. If a study shows that taking a certain drug causes weight gain, but the effect size (and the sample size) is not known to us, then even if we know that the study was flawless and had no subtle associated problems, we really have no reason to think the drug causes more than zero weight gain. All we can rationally conclude is that it does not make you lose weight. That’s because we have not been told how much weight was gained by each person, and so we know it could have been infinitesimal, and for practical purposes, zero.

Paradoxically, the larger the study, the less interesting a “significant” result is. A small study cannot detect a weak effect. A large study can. And a large enough study can detect effects so weak (tiny) as to be unimportant. It is unimportant that a cancer drug does not cause weight loss, for example.
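To make this concrete, here is a rough Python sketch. The numbers are entirely made up: a hypothetical drug that adds a mere 20 grams of weight on average, against kilograms of ordinary person-to-person variation. With a large enough study, even that trivial effect comes out “statistically significant”.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Made-up numbers: the drug adds only 0.02 kg (20 grams) of weight on average,
# while ordinary person-to-person variation is about 4 kg.
n = 2_000_000                                         # a very large study
drug    = rng.normal(loc=0.02, scale=4.0, size=n)     # weight change on the drug, in kg
placebo = rng.normal(loc=0.00, scale=4.0, size=n)     # weight change on placebo, in kg

t, p = stats.ttest_ind(drug, placebo)
print(f"observed mean difference: {drug.mean() - placebo.mean():.3f} kg")  # about 0.02 kg
print(f"p-value: {p:.2g}")  # typically far below 0.05: "significant", yet practically zero
```

The p-value is impressive; the 20 grams are not.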

But an academic journal will say, “a large, well-carried out study found that drug A caused significant weight gain”.

There is a movement, which I wholeheartedly support, to banish statistical significance (and the p-value thresholds it rests on) from science (and from pseudoscience masquerading as science). See the Nature article on the subject for details (it’s free).

There seems to be a need for the few who understand what statistical significance really means to share this knowledge, which is why I wrote this article.

A chain is only as strong as its weakest link. Statistical significance is only one link in a very long chain. It’s like saying “the last link in the chain has a one in twenty chance of being useless” while saying nothing at all about the other links. How is that a recommendation of the chain? And yet, that is exactly what is happening when a scientist says, in the jargon of statistics, “the results are significant”.

