Types of Errors in Hypothesis Testing
Joey: Hi Chandler, I was thinking about the hypothesis testing we discussed the other day, and a doubt has been bothering me. But before jumping into that, I want you to check whether my understanding of the concept is correct.
Chandler: Sure Joey! Show me what you know.
Joey: Generally, to conduct a hypothesis test, we:
- Formulate the null and the alternative hypothesis: They are two mutually exclusive and collectively exhaustive statements that we make about the population parameter. An important caveat in formulating these statements is that the null hypothesis is a commonly accepted fact (or default value) and the alternative hypothesis is the statement that people want to test.
- Calculate the test statistic for the sample data: The test statistic compares the sample data with the hypothesised value of the population parameter and helps us make a decision. The central limit theorem (CLT) plays a major role in calculating the test statistic. The test statistic in a Z-test indicates the distance between the sample mean and the hypothesized mean (H0), measured in standard errors of the mean. If the Z-score is 0.25, the sample mean is 0.25 standard errors away from the hypothesized mean. A very high or very low (negative) Z-score indicates that the sample mean is very different from the hypothesized mean (H0). In other words, the sample is likely to have come from a distribution other than the null distribution.
- Calculate the p-value: It expresses the Z-score as a probability. Given that the null hypothesis is true, the p-value is the probability of seeing a sample at least as extreme as the one we have. So the higher its value, the more consistent our sample is with the null hypothesis. Note that the p-value is not the probability that the null hypothesis is true.
- Make a decision: To decide whether to reject or fail to reject the null hypothesis, we fix a significance level in advance, usually denoted by α. Typical values for α are 0.1, 0.05, and 0.01, depending on the application. The decision is then made by comparing the p-value with α:
p-value > α (Decision: Fail to reject the null hypothesis)
p-value < α (Decision: Reject the null hypothesis)
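The four steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a two-sided one-sample Z-test with a known population standard deviation; the function name `z_test` and all the numbers are made up for the example.

```python
from math import sqrt
from statistics import NormalDist


def z_test(sample_mean, mu0, sigma, n, alpha=0.05):
    """Two-sided one-sample Z-test, assuming the population sigma is known."""
    standard_error = sigma / sqrt(n)
    # Distance between sample mean and hypothesized mean, in standard errors.
    z = (sample_mean - mu0) / standard_error
    # Probability, under H0, of a Z-score at least this far from zero.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    decision = "reject H0" if p_value < alpha else "fail to reject H0"
    return z, p_value, decision


# Hypothetical numbers, chosen only for illustration.
z, p_value, decision = z_test(sample_mean=52, mu0=50, sigma=10, n=25)
print(f"z = {z:.2f}, p-value = {p_value:.3f}, decision: {decision}")
```

Here the sample mean is one standard error away from the hypothesized mean, which is not unusual under H0, so the test fails to reject.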
Chandler: Brilliant! Even I didn’t understand most of these concepts the first time I read about them, but you did a great job. So what is your doubt then?
Joey: I am not sure how to formulate my question properly, so let me ask it using an example. Assume that I have a coin and I am going to test the hypothesis that it is unbiased (that is, heads and tails are equally likely). Therefore, the null and alternative hypotheses are
H0: P = 0.5
HA: P ≠ 0.5
where P is the probability of getting a head.
I toss the coin 10 times (sample size n = 10) to estimate the probability of getting a head (p). I understand that when we flip a coin 10 times we “expect” 5 heads and 5 tails in an ideal situation (if the coin is unbiased), but we are not surprised if we don’t get exactly this. The reason is that every flip is influenced by many unknown factors, so the outcome varies from flip to flip. This phenomenon is known as random variability. Thanks to the central limit theorem, we can define a theoretical distribution for the estimate that captures this random variability: the estimated probability of getting a head approximately follows a normal distribution (the approximation is rough for only 10 tosses and improves as n grows). The theoretical distribution for an unbiased coin (the null distribution) is shown in the figure below.
To find out whether the coin is biased, all we need to do is calculate an estimate of the probability p from the sample and locate that value in the theoretical distribution. If the estimate is close to the mean, we cannot reject the claim that the coin is unbiased. If, on the other hand, the estimate is far away from the mean, we conclude that the coin is biased (with a probability α of being wrong when the coin is in fact unbiased).
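As a rough sketch of "locating the estimate in the theoretical distribution", the snippet below computes a two-sided p-value for the coin using the normal approximation. The helper name `coin_p_value` and the head counts tried are assumptions made for illustration.

```python
from math import sqrt
from statistics import NormalDist


def coin_p_value(heads, n, p0=0.5):
    """Two-sided p-value for H0: P = p0, using the normal approximation."""
    p_hat = heads / n
    # Null distribution of the estimate: mean p0, standard error sqrt(p0(1-p0)/n).
    null_dist = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))
    distance = abs(p_hat - p0)
    # Probability under H0 of an estimate at least this far from p0, in either tail.
    return 2 * (1 - null_dist.cdf(p0 + distance))


for heads in (5, 7, 9):
    print(heads, "heads out of 10 -> p-value", round(coin_p_value(heads, 10), 4))
```

The further the head count is from 5, the smaller the p-value, which matches the intuition of moving into the tails of the null distribution.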
Am I right about my inferences till now?
Chandler: Yeah, you are right. I kind of guessed your doubt. But still go on.
Joey: Okay. To arrive at a decision, we fix a significance level α, which marks off rejection regions in the tails of the theoretical distribution. For the time being, let α be 0.1. Now if the estimate falls above the 95th percentile or below the 5th percentile of the null distribution, we conclude that the coin is biased. My doubt is this: it is perfectly possible that our coin is unbiased, yet we still get an estimate that is far away from the theoretical mean due to random variability. So it does not seem fair to take a decision just by comparing the p-value with α, since we may have tossed an unbiased coin and still landed in one of those tails.
Chandler: Yeah, let me explain. The significance level α is also the probability of a Type I error. A Type I error is nothing but rejecting the null hypothesis when it is true.
α = P (Rejecting the null hypothesis | when the null hypothesis is true)
Assume that α is 0.05 and that the coin is actually unbiased. If we conduct the experiment 100 times, where each experiment involves tossing the coin 10 times and taking a decision, then on average we incorrectly reject the null hypothesis roughly 5 times. Each of those is a Type I error. We usually can’t avoid this error, but we can make it less likely by choosing a smaller significance level.
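To see how often a fair coin gets rejected, one option is to enumerate every possible head count instead of repeating the experiment. The sketch below assumes the two-sided Z-test with the normal approximation; the function names are hypothetical. Because head counts are discrete, the realised rate only roughly matches α, and for just 10 tosses it comes out below 0.05.

```python
from math import comb, sqrt
from statistics import NormalDist


def rejects(heads, n, alpha, p0=0.5):
    """True if the two-sided Z-test rejects H0: P = p0 for this many heads."""
    z = (heads / n - p0) / sqrt(p0 * (1 - p0) / n)
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(z) > critical


def type_one_error_rate(n, alpha, p0=0.5):
    """Probability of rejecting H0 when the coin really has P = p0."""
    # Sum the binomial probabilities of every head count in the rejection region.
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k)
        for k in range(n + 1)
        if rejects(k, n, alpha, p0)
    )


for n in (10, 100, 1000):
    print(n, "tosses -> Type I error rate", round(type_one_error_rate(n, 0.05), 4))
```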
Joey: Yeah, I have heard about it before. Sometimes, people also talk about another type of error called Type II error, what is it?
Chandler: Let me explain this through a figure.
There are two normal distributions in the above figure. The left one is the null distribution, i.e. the theoretical distribution under the null hypothesis. Since we don’t know whether the coin is biased, we define a second theoretical distribution for the actual case. Let us say, for the sake of argument, that we already know the coin is biased and the probability of getting a head is 0.8. The distribution on the right is then the theoretical distribution in reality. The region shaded in green is the rejection region: if the estimate (p) falls in the green region, we reject the null hypothesis. That is the right decision in this case, since we know that the coin is biased. Now focus on the region shaded in red. If an estimate falls in the red region, we fail to reject the null hypothesis and thus make an incorrect decision. This type of error is called a Type II error. We already know that the null and alternative hypotheses are mutually exclusive and collectively exhaustive statements. Therefore, we can treat the theoretical distribution of the actual scenario as the alternative hypothesis distribution, so the above diagram becomes
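Let me show it in a tabular form.
Decision           | H0 is true      | H0 is false
Reject H0          | Type I error    | Right decision
Fail to reject H0  | Right decision  | Type II error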
P (Type I error) = α = P (Rejecting the null hypothesis | when it is true)
P (Type II error) = β = P (Failing to reject the null hypothesis | when it is false)
Interpretation of the table: If H0 is true and we fail to reject H0, then we make the right decision. But if H0 is false and we fail to reject H0, then we commit a Type II error.
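For the coin that really has a 0.8 probability of heads, β can be worked out by adding up the chance of every head count that lands outside the rejection region. This is a sketch under assumptions: a two-sided Z-test at α = 0.05 with 10 tosses, and a hypothetical helper name `type_two_error_rate`.

```python
from math import comb, sqrt
from statistics import NormalDist


def type_two_error_rate(n, alpha, p_true, p0=0.5):
    """P(fail to reject H0: P = p0) when the coin really has P = p_true."""
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sqrt(p0 * (1 - p0) / n)
    beta = 0.0
    for heads in range(n + 1):
        z = (heads / n - p0) / standard_error
        if abs(z) <= critical:  # estimate falls outside the rejection region
            beta += comb(n, heads) * p_true**heads * (1 - p_true) ** (n - heads)
    return beta


beta = type_two_error_rate(n=10, alpha=0.05, p_true=0.8)
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")
```

With only 10 tosses the test misses this fairly strong bias more often than it catches it; rerunning with a larger n shows β shrinking.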
Joey: So you mean to say that Type II error depends on the actual population mean which is generally unknown in real life?
Chandler: Oh yes! And that’s the reason why the probability of a Type II error (β) generally cannot be computed directly. Unlike the Type I error probability, which we set ourselves as α, β can only be quantified by assuming a value for the actual mean. Type II error depends on three factors: the difference between the null hypothesis mean and the actual mean, the noise in the system (σ) and the sample size. It also depends on the α we choose. It is necessary to balance both α and β for good hypothesis testing. For a given α, the practical way to control β is through the sample size, because we can’t control the other two factors.
Usually, statisticians and researchers use what is known as power analysis to decide on a good sample size for hypothesis testing. The power of a hypothesis test is nothing more than 1 minus the probability of a Type II error. Basically, the power of a test is the probability that we make the right decision when the null hypothesis is not correct.
Power = 1−β = P (Rejecting the null hypothesis| when it is false)
Power is the likelihood that a study will detect an effect when there is an effect in reality. If the power is high, the probability of committing a Type II error is low. Power also helps us calculate the minimum sample size required to detect an effect reliably.
The blue region in the above figure is the power (which is nothing but 1−β). Conventionally, we choose the power to be 0.8, meaning that the researcher accepts at most a 20% chance of failing to reject the null hypothesis when it is false. The higher the power of a research study, the less likely it is to have a Type II error. Greater power also requires a larger sample size. So by increasing the sample size we can reduce the probability of a Type II error while keeping α fixed, although neither type of error can be avoided entirely.
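The link between sample size and power can be sketched numerically. The snippet assumes a one-sided, one-sample Z-test with known σ; the function name and the effect size and noise values are hypothetical and chosen only to show the trend.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(n, delta, sigma, alpha=0.05):
    """Power of a one-sided, one-sample Z-test to detect a true shift of delta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # Under the alternative, the Z statistic is centred at delta * sqrt(n) / sigma.
    return 1 - NormalDist().cdf(z_alpha - delta * sqrt(n) / sigma)


# Hypothetical effect size and noise, for illustration only.
for n in (5, 10, 20, 40, 80):
    print(n, round(power_one_sided_z(n, delta=5, sigma=10), 3))
```

Power rises steadily with n. When the true shift is zero, the same function returns α, because every rejection is then a Type I error.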
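We can see that power is a function of the noise in the system (σ), α (the Type I error probability), the effect size (∆), the alternative hypothesis and the sample size. These five variables and the power are closely related to each other. In particular, the required sample size is a function of σ, α, the effect size (∆) and β (equivalently, the power). It is given by
n = 2 σ² (Zα + Zβ)² / ∆²
where Zα and Zβ are the standard normal critical values for α (one-sided) and β, and n is the sample size for each of the two versions being compared.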
Joey: Interesting. Then what is the right sample size for our experiment to find out whether the colour of the website has any impact on the number of clicks?
Chandler: For now, let’s assume that α = 0.05, β = 0.1 and the noise in the system (σ) is 25. Also assume that the true number of clicks when the colour is blue is 50 and the hypothesized value is 30, so ∆ = 50 − 30 = 20.
Now from the Z table we find Z (0.05) = 1.645 and Z (0.9) = 1.282. We now have all the ingredients for the above formula, so we can substitute and get a value for n:
n = 2 × 25² × (1.645 + 1.282)² / 20² ≈ 26.8
So, we need approximately 27 samples to have 90 percent power to detect a difference of 20 from 30.
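The calculation can be checked with a short sketch. It assumes the common two-group, one-sided formula n = 2σ²(Zα + Zβ)²/∆², which is the form that reproduces the figure of 27 from the values above; the function name `sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size(alpha, beta, sigma, delta):
    """Sample size from n = 2 * sigma^2 * (z_alpha + z_beta)^2 / delta^2 (assumed form)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided critical value for alpha
    z_beta = NormalDist().inv_cdf(1 - beta)    # critical value for power 1 - beta
    n = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2
    # Round up, since a fractional sample is not possible.
    return n, ceil(n)


n_exact, n_needed = sample_size(alpha=0.05, beta=0.1, sigma=25, delta=20)
print(f"n = {n_exact:.2f}, rounded up to {n_needed}")
```

Changing `delta` or `sigma` in the call shows how quickly the required sample size grows when the effect is small relative to the noise.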
The author of this blog is Balaji P, who is pursuing a PhD in reinforcement learning at IIT Madras.