Hypothesis Testing: Significance Thresholds and Multiple Hypothesis Tests

Edwing Jimenez · Data 100 · Jun 30, 2024

Ever wondered about the pitfalls of multiple hypothesis testing? I came across xkcd's "Significant" comic strip, which humorously tackles the question of whether jelly beans cause acne.

At first, the scientists find no link, but then they run 20 more hypothesis tests looking for a link between specific colors of jelly beans and acne. Only one of those tests produces a significant p-value, and, as the final panel of the comic shows, it is the only test that gets published. The comic perfectly illustrates why running multiple hypothesis tests and reporting only the significant results is so problematic; the practice has contributed to the reproducibility crisis in statistics. At a 0.05 significance level, roughly 5% of tests on true null hypotheses will produce a false positive, so if we publish only those false positives, we end up with many published results that can't be replicated. This issue is closely related to what is commonly called "p-hacking".

Type I Error and Multiple Testing

A Type I error (false positive) occurs when we reject a null hypothesis that is actually true. The probability of making a Type I error in a single test is denoted by α (commonly set at 0.05).
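
To make the 5% concrete, here is a quick simulation (my own sketch, not code from the original post) that repeatedly runs a two-sample t-test on data where the null hypothesis is true by construction, and counts how often the test rejects at α = 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05
n_experiments = 10_000
false_positives = 0

for _ in range(n_experiments):
    # Both samples come from the same distribution, so the null hypothesis is true.
    a = rng.normal(loc=0, scale=1, size=50)
    b = rng.normal(loc=0, scale=1, size=50)
    _, p_value = stats.ttest_ind(a, b)
    if p_value < alpha:
        false_positives += 1

print(false_positives / n_experiments)  # ≈ 0.05
```

Roughly 5% of these "experiments" reject a null hypothesis that is true, which is exactly what α promises.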

When we perform multiple tests, the overall probability of committing at least one Type I error somewhere among them increases with every additional test, because each test adds another opportunity for a false positive. The graph below shows how quickly this probability grows:

[Figure: Probability of at least one Type I error as the number of tests increases.]
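
If you want to reproduce a curve like the one above, a few lines of matplotlib will do it. This sketch (mine, not from the original post) assumes the tests are independent and uses the 1 − (1 − α)^n formula derived in the next section:

```python
import numpy as np
import matplotlib.pyplot as plt

alpha = 0.05
n_tests = np.arange(1, 101)

# Probability of at least one Type I error across n independent tests.
p_at_least_one = 1 - (1 - alpha) ** n_tests

plt.plot(n_tests, p_at_least_one)
plt.xlabel("Number of tests")
plt.ylabel("P(at least one Type I error)")
plt.title("Probability of Type I error as number of tests increases")
plt.show()
```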

Intuitive Explanation

Imagine you are testing hypotheses independently, with each test having a 5% chance of a Type I error:

  • For one test, the chance of not committing a Type I error is 1−α = 0.95.
  • For two independent tests, the probability of not committing a Type I error in either test is 0.95×0.95 = 0.9025.

The probability of committing at least one Type I error is: 1 − 0.9025 = 0.0975.

As you conduct more tests, this probability continues to increase:

  • For n tests, the probability of not committing a Type I error in any test is (1 − α)^n.
  • Thus, the probability of committing at least one Type I error in n tests is 1 − (1 − α)^n.
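
Wrapping this formula in a small helper (again, my own sketch rather than code from the original post) makes it easy to play with different values of n:

```python
def prob_at_least_one_type_i_error(alpha: float, n: int) -> float:
    """Probability of at least one Type I error across n independent tests."""
    return 1 - (1 - alpha) ** n

print(prob_at_least_one_type_i_error(0.05, 1))   # ≈ 0.05
print(prob_at_least_one_type_i_error(0.05, 2))   # ≈ 0.0975
print(prob_at_least_one_type_i_error(0.05, 20))  # ≈ 0.64
```

Notice that with 20 tests, as in the jelly bean comic, the chance of at least one false positive is already around 64%.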

Example Calculation

Suppose α = 0.05 and you perform 10 tests:

Probability of no Type I error in 10 tests = (1−0.05)^10 = (0.95)^10 ≈ 0.599

Probability of at least one Type I error in 10 tests = 1 − 0.599 = 0.401

So, there’s about a 40.1% chance of committing at least one Type I error across 10 tests. As we increase the number of tests, the probability of at least one Type I error will increase as well.
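
We can also verify this figure by simulation rather than algebra. The sketch below (mine, not from the original post) relies on the fact that, under a true null hypothesis, p-values are uniformly distributed on [0, 1]:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, n_tests, n_experiments = 0.05, 10, 100_000

# Under a true null hypothesis, p-values are Uniform(0, 1),
# so 10 null tests can be simulated by drawing 10 uniform p-values.
p_values = rng.uniform(size=(n_experiments, n_tests))
at_least_one_rejection = (p_values < alpha).any(axis=1)

print(at_least_one_rejection.mean())  # ≈ 0.40, matching 1 - 0.95**10
```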

Adjusting for Multiple Testing

To control the overall Type I error rate, researchers use techniques such as:

  1. Bonferroni Correction: Adjusts the significance level by dividing it by the number of tests (α/n).
  2. False Discovery Rate (FDR): Controls the expected proportion of false positives among the rejected hypotheses.
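
Here is a minimal sketch of both adjustments written from scratch with NumPy (the p-values in the example are hypothetical; in practice you might prefer a library implementation such as statsmodels' multipletests):

```python
import numpy as np

def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for tests whose p-value clears the Bonferroni-adjusted threshold alpha / n."""
    p = np.asarray(p_values)
    return p < alpha / len(p)

def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: controls the FDR at level q."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    sorted_p = p[order]
    # Find the largest rank k (1-indexed) with p_(k) <= (k / m) * q,
    # then reject the k smallest p-values.
    thresholds = np.arange(1, m + 1) / m * q
    passing = np.nonzero(sorted_p <= thresholds)[0]
    reject = np.zeros(m, dtype=bool)
    if passing.size > 0:
        k = passing[-1]  # 0-indexed position of the largest qualifying p-value
        reject[order[:k + 1]] = True
    return reject

# Hypothetical p-values for 20 tests, as in the jelly bean scenario.
p_vals = np.array([0.002, 0.01, 0.04] + [0.2] * 17)
print(bonferroni_reject(p_vals))         # only p < 0.05 / 20 = 0.0025 is rejected
print(benjamini_hochberg_reject(p_vals))
```

Bonferroni is simple and conservative; the Benjamini-Hochberg procedure rejects more hypotheses when many small p-values are present, at the cost of allowing a controlled proportion of false discoveries.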

Practical Implications

In research, increasing the number of tests without adjustment inflates the risk of false positives. This is why multiple comparison procedures are critical in ensuring the reliability of results.
