A Comprehensive Guide to Hypothesis Testing 🔬

WENXIN
10 min read · Mar 27, 2022



In the world of data and decision-making, uncertainty is inevitable. Whether you’re a scientist testing a new drug, a marketer evaluating the success of a campaign, or a data analyst trying to draw meaningful conclusions from sample data, hypothesis testing is your guiding light. It’s a powerful statistical tool that helps us make informed decisions, even when faced with uncertainty. But how do we know if we’re making the right call? Are we rejecting the right hypothesis, or are we falling victim to error? In this blog, we’ll dive deep into the essentials of hypothesis testing, from p-values to error types, and arm you with the knowledge to interpret results confidently. Whether you’re a beginner or a seasoned data scientist, this guide is designed to sharpen your statistical intuition and elevate your decision-making game.

Hypothesis Testing

  • Definition: Hypothesis testing is a statistical method used to assess the plausibility of a hypothesis using sample data. The process involves four key steps:
  1. Model Setup: We assume a parametric model where parameters lie in a set Θ ⊆ R (the set of real numbers).
  2. Null and Alternative Hypotheses: We define two non-empty sets, Θ₀ (the null hypothesis) and Θ₁ (the alternative hypothesis), such that Θ = Θ₀ ∪ Θ₁ and Θ₀ ∩ Θ₁ = ∅.
  3. Test Statistic Selection: A statistic T, called the test statistic, is chosen to measure the evidence against the null hypothesis.
  4. Rejection Region: We define a region R ⊆ ℝ, called the rejection region. The test rejects the null hypothesis if T ∈ R, and fails to reject it if T ∉ R. A minimal end-to-end sketch of these four steps is given below.
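As a concrete illustration of the four steps, here is a minimal sketch. The coin-flipping scenario and every number in it are hypothetical (not taken from the post): we test whether a coin is fair using the standardized number of heads as the test statistic.

```python
import numpy as np
from scipy import stats

# Step 1: model setup -- X_1, ..., X_n iid Bernoulli(theta), theta in (0, 1)
n = 100
data = np.random.binomial(1, 0.6, size=n)   # hypothetical sample of coin flips

# Step 2: null hypothesis theta = 0.5 vs. alternative theta != 0.5
theta_0 = 0.5

# Step 3: test statistic -- standardized number of heads (approximately N(0, 1) under H0)
T = (data.sum() - n * theta_0) / np.sqrt(n * theta_0 * (1 - theta_0))

# Step 4: rejection region for a 5% two-sided test: |T| >= 1.96
reject = abs(T) >= stats.norm.ppf(0.975)
print(f"T = {T:.2f}, reject H0: {reject}")
```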

P-value

  • The p-value measures how compatible the observed data are with the null hypothesis: it is the probability, computed under the null hypothesis, of obtaining a test statistic at least as extreme as the one actually observed. Instead of fixing a single rejection region, we compute the p-value, which lets us evaluate the result against any significance level. A lower p-value indicates stronger evidence against the null hypothesis.
  • Key Concept: The p-value is the smallest significance level at which the null hypothesis would be rejected. A short sketch of this computation follows below.
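To make the definition concrete, here is a minimal sketch (the observed statistic is a hypothetical number) that turns a z-statistic into a two-sided p-value and compares it against a few significance levels:

```python
from scipy import stats

z_obs = 2.1  # hypothetical observed test statistic

# two-sided p-value: probability, under H0, of a statistic at least this extreme
p_value = 2 * (1 - stats.norm.cdf(abs(z_obs)))

for alpha in (0.10, 0.05, 0.01):
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: p = {p_value:.4f} -> {decision}")
```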

Significance Level

  • A test procedure has a significance level (α) when the probability of rejecting the null hypothesis is at most α under the null hypothesis. Typically, α is set at 0.05 (5%), meaning that there is a 5% risk of a Type I error (false positive).

Type I and Type II Errors

Type I Error (α): False positive

  • This means rejecting the null hypothesis when it is actually true. It is often the primary concern in hypothesis testing.
  • The risk of committing this error equals the significance level (alpha, α) you choose: a value fixed at the beginning of the study, against which the observed p-value is compared. The significance level is usually set at 5%, which means that if the null hypothesis is true, there is at most a 5% chance of (incorrectly) rejecting it.
Type I Error Rate (alpha)

Type II Error (β): False negative.

  • This means failing to reject a null hypothesis that is actually false. This is not quite the same as “accepting” the null hypothesis, because hypothesis testing can only tell you whether or not to reject it.
  • Instead, a Type II error means failing to conclude there was an effect when there actually was one. In practice, your study may not have had enough statistical power to detect an effect of a given size. Power is the probability that a test correctly detects a real effect when there is one; a power of 80% or higher is usually considered acceptable. The risk of a Type II error is inversely related to the statistical power of a study: the higher the power, the lower the probability of making a Type II error.
Type II Error Rate (beta)

Tradeoff between Type I and Type II errors

  • The Type I and Type II error rates influence each other. That’s because the significance level (the Type I error rate) affects statistical power, which is inversely related to the Type II error rate.
  • Setting a lower significance level decreases the Type I error risk but increases the Type II error risk. Increasing the power of a test (at a fixed sample size) decreases the Type II error risk but increases the Type I error risk.
  • This trade-off is visualized in the graph below. It shows two curves:

🔹 The null hypothesis distribution shows all possible results you’d obtain if the null hypothesis is true. The correct conclusion for any point on this distribution means not rejecting the null hypothesis.

🔹 The alternative hypothesis distribution shows all possible results you’d obtain if the alternative hypothesis is true. The correct conclusion for any point on this distribution means rejecting the null hypothesis.

  • Type I and Type II errors occur where these two distributions overlap. The blue shaded area represents alpha, the Type I error rate, and the green shaded area represents beta, the Type II error rate.
  • The plot shows that using a higher significance threshold (a larger α) means a higher Type I error rate but a smaller chance of missing a real difference. This tradeoff can also be checked numerically, as in the simulation sketch below.
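The following sketch estimates both error rates of a simple z-test by simulation. The effect size, sample size, and number of simulations are arbitrary choices for illustration only:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, true_effect, n_sim = 30, 0.4, 10_000   # hypothetical settings

def error_rates(alpha):
    z_crit = stats.norm.ppf(1 - alpha / 2)
    # Type I rate: data generated under H0 (mean 0, sd 1), H0 wrongly rejected
    type1 = np.mean([abs(rng.normal(0.0, 1.0, n).mean()) * np.sqrt(n) >= z_crit
                     for _ in range(n_sim)])
    # Type II rate: data generated under H1 (mean = true_effect), H0 not rejected
    type2 = np.mean([abs(rng.normal(true_effect, 1.0, n).mean()) * np.sqrt(n) < z_crit
                     for _ in range(n_sim)])
    return type1, type2

for alpha in (0.05, 0.01):
    t1, t2 = error_rates(alpha)
    print(f"alpha = {alpha:.2f}: Type I rate ~ {t1:.3f}, Type II rate ~ {t2:.3f}")
```

Lowering α from 0.05 to 0.01 pushes the estimated Type I rate down and the estimated Type II rate up, which is exactly the tradeoff described above.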

Power

  • The complement of the Type II error rate is better known as power.
  • Power is the probability of detecting a real difference, i.e., of rejecting the null hypothesis when there really is a difference: Power = 1 − β, where β is the Type II error rate (the probability of failing to reject the null when there really is a difference).
  • Power is typically parameterized by δ, the minimum difference of practical interest. Mathematically, assuming the desired confidence level is 95%, Power(δ) = P(|T| ≥ 1.96 | true difference is δ).
  • The industry standard is to achieve at least 80% power. It is therefore standard practice to conduct a power analysis before starting an experiment to decide how many samples are needed to achieve sufficient power.
  • Assuming Treatment and Control are of equal size, the number of samples needed to achieve 80% power can be derived from the power formula above (σ²: the sample variance; δ: the difference between control and treatment). A sketch of this calculation follows below.
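The original formula image is not reproduced here, but a standard normal-approximation version of the calculation can be sketched as follows. Assuming equal group sizes, a two-sided test at α = 0.05, and that σ is the common standard deviation of the metric, the required sample size per group is roughly 2(z₁₋α/₂ + z_power)² σ²/δ², which reduces to the familiar ≈ 16σ²/δ² rule of thumb. The numbers plugged in below are hypothetical:

```python
from scipy import stats

def samples_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sample test of means."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_power = stats.norm.ppf(power)           # 0.84 for 80% power
    return 2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2

# hypothetical inputs: sigma = 10, minimum detectable difference delta = 2
n_per_group = samples_per_group(sigma=10, delta=2)
print(f"~{n_per_group:.0f} samples per group (total ~{2 * n_per_group:.0f})")
```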

Power Function, Size

  • The power function π of a test is defined by π(θ) = P(T ∈ R; θ) for θ ∈ Θ. That is, for each parameter θ, the power function gives the probability of rejecting the null hypothesis. The size s of a test is defined by s = sup{π(θ) : θ ∈ Θ₀}.
  • An equivalent way to think of the size is as the smallest significance level satisfied by the test.
Power Function for N=100 VS N=500 for Similar Size Tests
  • Note that when θ = 1/30, the value suggested by the engineers in the lecture-notes example, we have π(1/30) = 0.15. In other words, there is only a 15% chance of rejection when θ = 1/30.
  • If we grow the rejection region, we can increase this probability, but at the cost of a larger chance of Type I error. The only way to both increase the chance of rejecting when θ = 1/30 and keep the Type I error low is to collect more data. For example, if we instead had 500 data points, we could maintain the significance level and increase the threshold to 16, resulting in a much better power function, as in the sketch below.
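The full lecture-notes example behind these numbers is not reproduced in the post, but its structure (a count statistic with a one-sided rejection region) can be illustrated with a minimal sketch. The binomial model and the N = 100 threshold below are hypothetical stand-ins; only the N = 500 threshold of 16 comes from the text above:

```python
import numpy as np
from scipy import stats

def power_function(theta, N, c):
    """pi(theta) = P(reject H0; theta) = P(T >= c) when T ~ Binomial(N, theta)."""
    return 1.0 - stats.binom.cdf(c - 1, N, theta)

# threshold 16 for N = 500 is taken from the text; the N = 100 threshold is a
# hypothetical stand-in chosen only to illustrate the comparison
for N, c in [(100, 6), (500, 16)]:
    theta_grid = np.linspace(0.01, 0.08, 8)
    values = ", ".join(f"{power_function(t, N, c):.2f}" for t in theta_grid)
    print(f"N = {N:3d}, threshold = {c:2d}: pi(theta) on [0.01, 0.08] = [{values}]")
    print(f"                    pi(1/30) = {power_function(1/30, N, c):.2f}")
```

With the larger sample, the power at the same parameter values is substantially higher while the rejection probability under the null stays small.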

1. Two-sided test for the mean of a Normal Distribution when the variance is unknown

  • Consider the blood pressure example from the first lecture. Suppose that we have n iid measurements X1, …, Xn ∼ N(μ, σ²), where μ and σ are unknown. We want to see if the data support the claim that our true systolic blood pressure μ is different from 120.
  • The null hypothesis is that μ = 120 and σ > 0:

Θ₀ = {(μ, σ) : μ = 120, σ > 0}

  • The alternative hypothesis is that μ ≠ 120 and σ > 0:

Θ₁ = {(μ, σ) : μ ≠ 120, σ > 0}

  • We will choose the standardized mean (which has expectation 0 under the null hypothesis) as our test statistic: T = (X̄n − 120) / (Sn/√n), where X̄n is the sample mean and Sn is the sample standard deviation.
  • Assuming the null hypothesis is true, T has a t-distribution with n − 1 degrees of freedom. If our true blood pressure is much higher than 120, we would expect T to be positive. Analogously, if our true blood pressure is much lower than 120, we would expect T to be negative. Thus we choose the rejection region R to account for either of these possibilities (this is called a two-sided test).
  • If we want a test of size 0.05 (i.e., a 5% chance of Type I error), we choose R = (−∞, F^(-1)(0.025)) ∪ (F^(-1)(0.975), ∞), where F^(-1) is the inverse CDF (also called the quantile function) of the t-distribution with n − 1 degrees of freedom. A code sketch of this test follows below.
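A minimal sketch of this test in Python, where the blood-pressure readings are made up for illustration (`scipy.stats.ttest_1samp` carries out the same computation):

```python
import numpy as np
from scipy import stats

x = np.array([118.2, 124.5, 121.9, 130.1, 119.4, 127.3, 122.8, 125.0])  # hypothetical readings
n = len(x)

# test statistic: standardized mean under H0: mu = 120
T = (x.mean() - 120) / (x.std(ddof=1) / np.sqrt(n))

# two-sided rejection region at size 0.05 from the t(n-1) quantile function
t_crit = stats.t.ppf(0.975, df=n - 1)
p_value = 2 * (1 - stats.t.cdf(abs(T), df=n - 1))
print(f"T = {T:.3f}, reject H0: {abs(T) > t_crit}, p-value = {p_value:.3f}")

# same result with the built-in routine
print(stats.ttest_1samp(x, popmean=120))
```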

2. Two-sided test for the mean of a normal when the variance is known or the sample is large

  • In the preceding example, we considered the t-test, where both the mean μ and the variance σ² were unknown. If we instead assume the variance σ² is known, then we can use the simpler z-test with the test statistic Z = (X̄n − 120) / (σ/√n).
  • Our size 0.05 rejection region would be given by: R = (−∞, Φ^(-1)(0.025)) ∪ (Φ^(-1)(0.975), ∞) = (−∞, −1.96) ∪ (1.96, ∞), where Φ is the CDF of a standard normal distribution.
  • For large n, even if our data are not normally distributed, the central limit theorem still ensures that the standardized mean is approximately N(0, 1). We can then perform an approximate z-test (there is no need for a t-test, since for large n the t-distribution is almost identical to the standard normal). A short sketch follows below.
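A corresponding sketch of the z-test, assuming the standard deviation is known (both the readings and the assumed σ are hypothetical):

```python
import numpy as np
from scipy import stats

x = np.array([118.2, 124.5, 121.9, 130.1, 119.4, 127.3, 122.8, 125.0])  # hypothetical readings
sigma = 5.0                                   # assumed known standard deviation

Z = (x.mean() - 120) / (sigma / np.sqrt(len(x)))
p_value = 2 * (1 - stats.norm.cdf(abs(Z)))
print(f"Z = {Z:.3f}, reject H0 at 5%: {abs(Z) > 1.96}, p-value = {p_value:.3f}")
```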

3. One-sided test for the mean of a normal distribution
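The body of this example is not reproduced in the post. The key difference from the two-sided case is that the alternative points in a single direction (for example μ > 120), so the entire significance level is placed in one tail of the rejection region. A minimal sketch under that assumption, reusing the hypothetical blood-pressure readings from above:

```python
import numpy as np
from scipy import stats

x = np.array([118.2, 124.5, 121.9, 130.1, 119.4, 127.3, 122.8, 125.0])  # hypothetical readings
n = len(x)
T = (x.mean() - 120) / (x.std(ddof=1) / np.sqrt(n))

# one-sided test of H0: mu <= 120 vs H1: mu > 120 -- reject only for large positive T
t_crit = stats.t.ppf(0.95, df=n - 1)      # the entire 5% sits in the upper tail
p_value = 1 - stats.t.cdf(T, df=n - 1)
print(f"T = {T:.3f}, reject H0: {T > t_crit}, one-sided p-value = {p_value:.3f}")
```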

4. Two-sided test for two samples from different Bernoullis

  • Suppose we have two basketball players shooting 3-point shots, and we want to know if one player is better. Let X1, …, Xn1 ∼ Bernoulli(p1) denote our model for the first player’s shots, and let Y1, …, Yn2 ∼ Bernoulli(p2) denote our model for the second player’s shots. Because we have two independent samples (from the two players), we will perform a two-sample test. We assume n1 and n2 are both large, but we allow for the case where n1 ≠ n2.
  • The null hypothesis is p1 = p2 (the two players are equally good).
  • The alternative hypothesis is p1 ≠ p2.
  • The test statistic is T = (X̄n1 − Ȳn2) / √( p̂(1 − p̂)(1/n1 + 1/n2) ), where X̄n1 and Ȳn2 are the two sample proportions and p̂ is the proportion of made shots in the pooled data.
  • The complicated denominator above estimates the standard deviation of X̄n1 − Ȳn2, assuming p1 = p2 (i.e., assuming the null hypothesis). This allows us to pool all of our data together to estimate the common value p1 = p2.
  • We could instead estimate p1 and p2 separately from each sample and obtain the slightly different unpooled standard error √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ). Since n1 and n2 are large, we assume that T is approximately N(0, 1) distributed and perform a two-sided z-test. A code sketch follows below.
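A minimal sketch of the pooled two-sample proportion test; the shot counts are hypothetical numbers chosen only for illustration:

```python
import numpy as np
from scipy import stats

# hypothetical made/attempted 3-point counts for the two players
made1, n1 = 82, 200
made2, n2 = 64, 180
p1_hat, p2_hat = made1 / n1, made2 / n2

# pooled estimate of the common success probability under H0: p1 = p2
p_pool = (made1 + made2) / (n1 + n2)
se_pooled = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
T = (p1_hat - p2_hat) / se_pooled
p_value = 2 * (1 - stats.norm.cdf(abs(T)))
print(f"pooled T = {T:.3f}, two-sided p-value = {p_value:.3f}")

# unpooled standard error (estimating p1 and p2 separately)
se_unpooled = np.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
print(f"unpooled T = {(p1_hat - p2_hat) / se_unpooled:.3f}")
```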

5. Unpaired two-sample t-test

  • Suppose we have two samples with all data independent, and μ1, μ2, σ1, and σ2 are all unknown. Our goal is to test whether μ1 ≠ μ2 (we can also do a one-sided test, e.g., for μ1 > μ2). We assume that n1 and n2 are fairly small here, as otherwise we could simply use a normal approximation. We will consider two cases.

(1) If it is known that σ1 ≈ σ2, we can compute the pooled estimate of the variance S²pooled = ((n1 − 1)S1² + (n2 − 1)S2²) / (n1 + n2 − 2); the −2 in the denominator is connected to the fact that we subtract two different sample means in the numerator. The test statistic we use is thus T = (X̄ − Ȳ) / (Spooled · √(1/n1 + 1/n2)), which under the null hypothesis has a t-distribution with n1 + n2 − 2 degrees of freedom.

(2) If it is unknown whether σ1 ≈ σ2, or it is known that they differ, we use the unpooled estimates of the standard deviations and obtain the test statistic T = (X̄ − Ȳ) / √(S1²/n1 + S2²/n2) (often called Welch’s t-test). A code sketch of both versions follows below.
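A minimal sketch of both versions; the two samples are hypothetical, and `scipy.stats.ttest_ind` with `equal_var=True/False` performs the pooled and Welch variants respectively:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(5.0, 1.0, size=12)    # hypothetical sample 1
y = rng.normal(5.8, 1.5, size=15)    # hypothetical sample 2

# (1) pooled version, appropriate when sigma1 is roughly equal to sigma2
print(stats.ttest_ind(x, y, equal_var=True))

# (2) Welch (unpooled) version, when the variances may differ
print(stats.ttest_ind(x, y, equal_var=False))

# the unpooled test statistic written out explicitly
se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
print(f"T (unpooled) = {(x.mean() - y.mean()) / se:.3f}")
```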

Reference

  • Lecture notes of Professor Carlos Fernandez-Granda for DSGA-1002.
