The Problem with 20th-Century Null Hypothesis Significance Testing, and What to Do About It

Data Ninja
Solving the Human Problem
7 min read · May 20, 2020

A true story: late Friday afternoon, 5:50 PM, I was checking out the week's latest commits when an uninformed manager popped up at my desk and said, “Here is some data. Can you run one of those dope tests only you can do while I play golf over the weekend?” As if the Data Science desk were a “drop at the door” type of delivery on Uber Eats.

I could run tests on the data, but if the manager would not be around to answer questions about it, in particular about how it was collected, I had a problem. It turns out that the way the data was collected matters for traditional Null Hypothesis Significance Testing (NHST for the statisticians in the room, “dope test” for everyone else).

The fact that the results of the analysis depend on the way the data was collected sounds unintuitive, as if the data itself were somehow attached to the method of collection.

In this article we will reach two different conclusions from the same set of data. This is one of the drawbacks of traditional NHST.

The code for this post can be found in this git repo:

This post is a modified version of an example in the book Doing Bayesian Data Analysis by John K. Kruschke.

The Problem and the Data

For simplicity, let’s assume the manager from the beginning of the post suspected that his golf partner’s coin was unfair. He dropped by my desk at 5:50 PM on Friday and said, “There were 34 coin flips. Of those, 11 were HEADS. That coin is BOGUS!” Then he dashed through the door and off to his golf club. Two questions lingered in my mind:

  1. Was the plan to flip the coin 34 times, and 11 heads happened to be the result?
  2. Or did he and his friend plan to flip the coin until they got 11 heads, and that happened on the 34th flip?

As we will see, these two experiments are different enough to yield different results.

Null Hypothesis Significance Testing for the Bias of a Coin.

The idea of NHST is as follows: the coin has some probability of landing heads, which is unknown. In other words, if we call θ (theta) the probability of landing heads, we do not know whether “θ=0.5” is true or false. We say that a coin is biased or unfair if θ≠0.5.

If we were to assume that the coin is fair (i.e., θ=0.5), what is the probability of observing the data we have (34 flips, 11 heads)? We call ‘θ=0.5’ the Null Hypothesis (H₀), and the goal of NHST is to reject or not reject H₀. Note that NHST does NOT accept H₀: it either rejects it given enough evidence, or fails to reject it for lack of evidence.

The last building block of NHST is to decide how often we would be comfortable rejecting H₀ when it is actually true. In other words, say we perform the same experiment 100 times: if in 5 of those 100 runs we concluded that the coin was biased when in fact it is not, would we be comfortable with that? Maybe 10 times out of 100 is good enough, or maybe 1 out of 100 is the certainty we need from our test. In any case, we will stick with 5 out of 100, or 5%, as our significance level for the tests performed in this post. Hence, if the observed data falls inside a region whose total probability is less than our significance level (here, 5%), we will reject H₀. Another way to think about it: let’s give the coin the benefit of the doubt and say it is fair, but if we observe data that would be that unlikely under fairness, then we are forced to say the coin is not fair.

But how do we find the probability of observing the data? Well, that is where the experimental design matters. In the next sections we will see that, with the same data, we will either reject H₀ or fail to reject it depending on how the collection was performed.

When the Number of Flips is Fixed.

Given an experiment (like a coin flip) that has only two possible outcomes {0, 1} (tails and heads), if the probability of outcome 1 is θ ∈ [0,1], then over N trials the probability of observing z occurrences of outcome 1 is given by the binomial distribution with the following form:
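In standard notation, the binomial probability mass function is:

```latex
p(z \mid N, \theta) = \binom{N}{z} \, \theta^{z} (1-\theta)^{N-z}
```

The binomial coefficient counts the number of orderings in which the z heads can appear among the N flips.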

In our case N=34, z=11, and under the null hypothesis (H₀) θ=0.5. Here is an example using scipy in Python:
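The original snippet is not shown here, so this is a minimal sketch of the calculation using `scipy.stats.binom` (variable names are my own):

```python
from scipy.stats import binom

# Probability of exactly z = 11 heads in N = 34 flips of a fair coin
N, z, theta = 34, 11, 0.5
p = binom.pmf(z, N, theta)
print(f"P(z = {z} | N = {N}, theta = {theta}) = {p:.4f}")  # ~0.0167
```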

Note that when calling the scipy distribution binom to calculate the binomial probability, we used the pmf method, which stands for Probability Mass Function. The name makes it clear that the probability function we are using is discrete (rather than continuous).

Looking at the Python code, we see that the probability of flipping exactly 11 heads out of 34 attempts is around 1.7%. A careless reader might be tempted to reject the null hypothesis, since 1.7% is much less than 5%. “What an unlikely event, it must be an unfair coin!” one might say. However, to reject the null hypothesis, the event must instead fall within a region of low total probability. To see why, first note that the probability of getting exactly 5,000 heads out of 10,000 flips is only about 0.8% (you can easily check it with Python), even though 5,000 is the count one would most expect from a fair coin. Second, let’s plot the whole binomial distribution with N=34:
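Since the original figure is an image, here is a sketch that reproduces it, assuming a symmetric two-tailed construction of the critical region (the largest symmetric tails whose total probability stays below 5%):

```python
import numpy as np
from scipy.stats import binom
import matplotlib.pyplot as plt

N, theta = 34, 0.5
z_values = np.arange(N + 1)
pmf = binom.pmf(z_values, N, theta)

# Two-tailed critical region at the 5% significance level: by symmetry,
# take the largest k such that P(X <= k) + P(X >= N - k) < 0.05.
k = max(k for k in range(N // 2) if 2 * binom.cdf(k, N, theta) < 0.05)
critical = (z_values <= k) | (z_values >= N - k)
print(f"Critical region: z <= {k} or z >= {N - k}")  # z <= 10 or z >= 24

plt.bar(z_values[~critical], pmf[~critical], label="fail to reject H0")
plt.bar(z_values[critical], pmf[critical], hatch="//", label="critical region (< 5%)")
plt.axvline(11, linestyle="--", color="k", label="observed z = 11")
plt.xlabel("number of heads z")
plt.ylabel("P(z | N = 34, theta = 0.5)")
plt.legend()
plt.show()
```

The observed z=11 sits just above the lower critical boundary z≤10.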

In the image above, the critical region (hatched with “//”) has total probability less than 5%, and it is constructed in such a way that the remaining region (with probability more than 95%) has the shortest possible width. Because the event z=11 falls just outside the critical region, we cannot reject H₀.

Note the binary decision NHST forces: despite the data observed, it did not decrease our uncertainty about θ. This is yet another problem with traditional hypothesis testing; we will come back to it later. Next, we will fix the number of heads and vary the number of flips.

When the Number of Heads is Fixed.

What if the intent was to flip the coin until there were 11 heads? Well, the probability of getting the z-th head on the N-th flip is given by the negative binomial distribution as follows:
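In the usual parameterization, the last flip must be the z-th head, and the remaining z−1 heads can appear anywhere among the first N−1 flips:

```latex
p(N \mid z, \theta) = \binom{N-1}{z-1} \, \theta^{z} (1-\theta)^{N-z}
```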

Here is a Python example to run this function:
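A minimal sketch with `scipy.stats.nbinom` (again, variable names are my own). Note that scipy parameterizes the negative binomial by the number of failures before the z-th success, so N = 34 flips with z = 11 heads corresponds to 23 tails:

```python
from scipy.stats import nbinom

# scipy's nbinom counts tails (failures) before the z-th head (success),
# so N = 34 flips with z = 11 heads means N - z = 23 tails.
N, z, theta = 34, 11, 0.5
p = nbinom.pmf(N - z, z, theta)
print(f"P(N = {N} | z = {z}, theta = {theta}) = {p:.4f}")  # ~0.0054
```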

And the whole distribution is given by the following plot:

And now we know the drill: we look at the critical region in the image, given by the segments [11, 13] and [32, +∞), and since N=34 falls inside it, we can safely reject H₀ and say that the coin is not fair.

Conclusion

As the reader saw, we were able to reject the null hypothesis when we fixed the number of heads, but we could not reject it when we fixed the number of flips. Although we spent the whole post talking about coins and flips, the implications are more general. Imagine that instead of coins we have patients, and instead of heads and tails we have recovered and not recovered from some condition treated by a trial drug. As we saw, the way the data is collected from the patients could make the difference between the drug being considered effective or not against a disease.

Another, less obvious implication is the soul-searching quest that NHST imposes on the experiment designer. The intentions of the experiment must be crystal clear from the beginning, they must not change (without consequence), and at a later date, when analyzing the data, the designer must know from the bottom of their soul what those intentions were. Moreover, none of this has anything to do with the data itself, yet it still alters the data’s meaning.

Finally, another issue with NHST is its dichotomous nature: one can only reject or fail to reject. When we fail to reject, the extra data is not used to enhance our knowledge of the problem or to decrease our uncertainty about it.

If only there were a way to perform hypothesis tests focusing only on the data itself (and not on the intention behind its collection), and if we could decrease our uncertainty even when no categorical answer is possible. Also, is it only me, or would it be great if we could “accept” the null hypothesis for once? Well, to solve these problems enters Bayesian hypothesis testing. A topic for a future post.
