Three Things You Need To Know About Multiple Testing

Statistical tests are often used in scientific research. The results typically come in the form of p values, which tell you the probability of getting results at least as extreme as the ones you observed if there were no real difference in the population.

The problem of multiple testing (or multiple comparisons) arises when you do a large number of statistical tests. The more tests you run, the more likely you are to get at least one statistically significant result purely by chance, even when no real effect exists.
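This effect is easy to quantify. For independent tests at significance level alpha, the chance of at least one false positive is 1 − (1 − alpha)^m, where m is the number of tests. A minimal sketch (the function name is my own):

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """Probability of >= 1 false positive among `num_tests` independent
    tests when every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests

# The chance of a fluke climbs quickly with the number of tests.
for m in (1, 5, 20, 100):
    print(m, round(prob_at_least_one_false_positive(m), 3))
```

With 20 independent tests at the conventional alpha = 0.05, the chance of at least one "significant" fluke is already about 64%; with 100 tests it is over 99%.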

If you take these hits and publish them or put them into a press release, you can create the appearance of a finding that just isn’t true.

#1 Anti-science activists often abuse multiple testing

Unscrupulous researchers can trawl through dozens or even hundreds of tests and show you the hits but hide away both the misses and the fact that they did so many tests to begin with.

That way, they can find something that appears to support their position, even though it was probably just a fluke. This has been documented among many different anti-science groups, from anti-GMO activists to proponents of alleged paranormal powers.

#2 Part of the replication struggle

Science is self-correcting. This means that errors and wrong conclusions are critically investigated, challenged and hopefully fixed.

A certain proportion of published scientific findings cannot be replicated, for a variety of reasons. There might be methodological problems, the findings might only apply to some populations and not others, the studies might be testing low-probability hypotheses, there might be statistical errors and so on.

One part of this replication struggle is that a failure to correct for multiple testing lets fluke findings into the literature, where they later fail to replicate.

#3 Here’s how to correct for multiple testing

First, anytime you read a paper that carries out a lot of statistical tests, you should immediately look for attempts made by the researchers to correct or adjust for multiple comparisons. If they did not do this, their findings can be doubted based on this consideration alone.

There are many statistical methods used to correct for multiple comparisons. Two of the most common ones are Bonferroni correction (very strict) and the Benjamini-Hochberg false discovery rate method (less strict).
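Both procedures are simple enough to sketch directly. Bonferroni compares each p value against alpha divided by the number of tests; Benjamini-Hochberg sorts the p values and finds the largest rank k whose p value falls under k/m times alpha. A minimal illustration (function names and the example p values are my own):

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypothesis i if p_i <= alpha / m.
    Controls the family-wise error rate; very strict."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """BH step-up procedure. Sort the p values, find the largest rank k
    with p_(k) <= (k / m) * alpha, and reject the hypotheses with the
    k smallest p values. Controls the false discovery rate; less strict."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

# Eight hypothetical p values from the same study.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bonferroni(pvals))          # strict: few survive the correction
print(benjamini_hochberg(pvals))  # less strict: typically more rejections
```

On these example values, Bonferroni (threshold 0.05/8 = 0.00625) keeps only the smallest p value, while Benjamini-Hochberg also keeps the second, illustrating why it is described as less strict.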

These two methods control different error rates: Bonferroni controls the family-wise error rate (the chance of even one false positive), while Benjamini-Hochberg controls the false discovery rate (the expected proportion of false positives among the findings declared significant). They also rest on different assumptions, so make sure the data fulfills those assumptions before using them.

Hit the heart below if you think that multiple testing issues are important! That way others can more easily find this information.
