I2Pr Hypothesis Testing

Cao Jianzhen
Published in AI Skunks
Mar 9, 2023

Authors

Jianzhen Cao, Nik Bear Brown

Jerzy Neyman and Egon Sharpe Pearson, statisticians who co-developed the theory of statistical hypothesis testing

What is hypothesis testing?

There are countless opinions in the world. Some are right, some are wrong, and many are not easy to tell apart. We need a way to decide which opinions to accept, and statistics offers one: hypothesis testing.
In statistics, an opinion is called a hypothesis. We can choose to accept or reject a hypothesis based on evidence. So briefly speaking, hypothesis testing is the process of collecting evidence and deciding whether to accept a hypothesis.

Let’s look at a concrete example. Assume a company has introduced a new product and wants to know whether it is more effective than its previous product. The company has collected sales data on both products over a period of time and wants to perform a hypothesis test to see if the new product is significantly better than the old one.

Now there are two hypotheses:

(a) the new product is better

(b) the new product is not better

The company hopes to show (a), so it collects and analyzes data using statistical methods to see whether it can reject (b); rejecting (b) means (a) holds. How do we perform the test? Let’s get started.

Presumption of innocence

Now assume we have a hypothesis, denoted 𝐻0. How do we decide whether to accept or reject it?
Hypothesis testing takes a cautious approach, much like the presumption of innocence. We trust the hypothesis at first, unless we observe something so weird that it is not supposed to happen when 𝐻0 is valid. That is, if we find evidence that really contradicts 𝐻0, we reject 𝐻0; otherwise we cannot reject it, so we accept it.
In statistics these weird things are called small probability events. The basic idea of hypothesis testing is the principle of small probability events: an event with a small probability will basically not happen in a single experiment.

Basic steps

There are many testing methods, each suited to different conditions. However, the whole testing process can be summarized in 5 basic steps.

  1. Put forward the null hypothesis 𝐻0 and the alternative hypothesis 𝐻1, which is exactly opposite to 𝐻0.
  2. Construct the test statistic.
  3. Set significance level α, usually equal to 0.05.
  4. Based on the sample data, calculate the value of the test statistic.
  5. According to the value of the test statistic, check the relevant table (like the t table) to get the p value. If the p value is less than the significance level α, reject 𝐻0; otherwise accept 𝐻0.
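
The 5 steps above can be sketched end to end in Python with scipy. This is a minimal illustration only: the hypothesized mean of 50 and the synthetic sample are made-up numbers, not data from this article.

```python
import numpy as np
from scipy import stats

# Step 1: H0: population mean = 50, H1: population mean != 50 (two-sided)
popmean = 50

# A synthetic sample, centered near 52 purely for illustration
rng = np.random.default_rng(0)
sample = rng.normal(loc=52, scale=5, size=25)

# Step 3: significance level
alpha = 0.05

# Steps 2 & 4: scipy constructs and evaluates the t statistic for us
t_stat, p_value = stats.ttest_1samp(sample, popmean=popmean)

# Step 5: compare the p value with alpha
decision = "reject H0" if p_value < alpha else "cannot reject H0"
print(t_stat, p_value, decision)
```

Every test in this tutorial follows this same shape; only the statistic in steps 2 and 4 changes.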

Why?

Why does the process work? You may have some doubts by now, such as

  1. What does α mean?
  2. How does α work?
  3. Why choose α = 0.05?
  4. What about the p value?
  5. ……

There is an understanding gap between the conditions we have and the conclusion we draw, and we need further research to answer these questions. But before that, let’s work through a hypothesis test as an example.

Example 1

Here is an example from an environmental problem. Experts report that since the start of the 21st century, the average daily solid waste weight in Ballarat has been 69900 kg, and we have samples from 2000 to 2015. We want to know whether the experts’ figure is true. In particular, we want to check whether the waste is actually greater!

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url = "https://raw.githubusercontent.com/rrRIOoo/data_cache/main/Daily%20Solid%20Waste%20Dataset/open_source_ballarat_daily_waste_2000_jul_2015_mar.csv"
data = pd.read_csv(url)
print(data.head())
sns.displot(data['net_weight_kg'])
plt.show()
  ticket_date  net_weight_kg
0  2000-07-03          52380
1  2000-07-04          62940
2  2000-07-05          48260
3  2000-07-06          55580
4  2000-07-07          57800

As we can see, the weight distribution is close to a normal distribution. We can use a t-test after cleaning the data, since there are noise points near 0.

The t-test, in short, is one of the hypothesis testing methods and follows the basic steps above; it is used to compare two means and determine whether they are statistically different from each other. The t-test is especially useful when the sample size is small (typically less than 30) and the population standard deviation is unknown.

There are some conditions for using the t-test:

  1. The observed variable is continuous
  2. The observations are independent
  3. There are no significant outliers
  4. The observed variable approximately follows a normal distribution.

In this example we don’t know the population standard deviation and the data meets all the conditions, so we use the t-test. Below we will walk through the t-test step by step.
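
Condition 4 (normality) can itself be checked with a test such as Shapiro-Wilk, whose null hypothesis is that the data come from a normal distribution. The sketch below runs it on synthetic numbers standing in for the weight column (all values are made up):

```python
import numpy as np
from scipy import stats

# A synthetic stand-in for the daily-weight column (values are made up)
rng = np.random.default_rng(1)
weights = rng.normal(loc=69900, scale=8000, size=200)

# Shapiro-Wilk tests H0: "the data come from a normal distribution"
stat, p = stats.shapiro(weights)
print(f"W = {stat:.3f}, p = {p:.3f}")
```

A p value above α here means we cannot reject normality, so the t-test’s normality condition looks acceptable.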

data = data.drop(data[data['net_weight_kg'] < 2e4].index)
sns.displot(data['net_weight_kg'])
print("sample daily weight: ", data['net_weight_kg'].mean())
plt.show()
sample daily weight:  69958.84840564558

We begin with step 1 by putting forward 𝐻0 and 𝐻1.

𝐻0: The sample mean is equal to the population mean

𝐻1: The sample mean is not equal to the population mean

In step 2 we construct the test statistic T

T = (w̅ − μ) / (s / √n)

where w̅ is the sample mean, μ is 69900, s is the sample standard deviation and n is the sample size.
This statistic is specific to the t-test. Other test methods, like the F-test, have their own formulas, but they are not the focus of this tutorial.
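
To see where the number comes from, we can compute T = (w̅ − μ) / (s / √n) by hand, where s is the sample standard deviation and n the sample size, and compare it with scipy's ttest_1samp on the same data. The sample here is synthetic (made-up numbers):

```python
import numpy as np
from scipy import stats

# Synthetic sample standing in for the daily weights (made-up numbers)
rng = np.random.default_rng(2)
sample = rng.normal(loc=70000, scale=9000, size=100)

mu = 69900                       # hypothesized population mean
n = len(sample)
w_bar = sample.mean()            # sample mean
s = sample.std(ddof=1)           # sample standard deviation

# The t statistic computed by hand
t_manual = (w_bar - mu) / (s / np.sqrt(n))

# scipy computes the same statistic
t_scipy, _ = stats.ttest_1samp(sample, popmean=mu)
print(t_manual, t_scipy)
```

The two values agree, so ttest_1samp is doing exactly the step-2 and step-4 arithmetic for us.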

Next in step 3 we set α = 0.05.

Then in step 4 we calculate T statistic in these samples.

from scipy import stats

test = stats.ttest_1samp(a=data['net_weight_kg'], popmean=69900, alternative='greater')
print(test)
Ttest_1sampResult(statistic=0.2784293471243657, pvalue=0.39034892211056227)

In the last step, we look at the variable test: pvalue=0.39034892211056227, much larger than α = 0.05, so we cannot reject 𝐻0; we accept 𝐻0.
OK, we have completed a hypothesis test following the 5 basic steps! Next we will try to resolve the doubts above.

One & Two-Sided

Let’s start with another question.
Look back at the code in step 4, where you can see a parameter called alternative. This parameter has 3 possible values: less, greater, two-sided. What does it mean? Why did we set it to greater?
Our goal is to test whether the population mean is equal to the sample mean.
Or, generally, we want to test whether a population parameter is equal to a given value.
Or, more generally, we want to test whether a population parameter satisfies certain conditions.
So we can write 𝐻0 and 𝐻1 as

𝐻0:θ∈Θ0

𝐻1:θ∈Θ1

where θ is a parameter of population and

Θ0∪Θ1=Θ

Θ0∩Θ1=∅

Obviously there are two types of null hypothesis

  1. simple null: Θ0 contains only 1 point, that is, 𝐻0:θ=θ0
  2. composite null: Θ0 contains many points, that is, 𝐻0:θ≤θ0 or 𝐻0:θ≥θ0

Correspondingly, alternative hypothesis also has two formats

  1. two-sided: 𝐻1:θ≠θ0
  2. one-sided: 𝐻1:θ<θ0 or 𝐻1:θ>θ0

Now you can see what alternative means. In our waste-weight problem it might seem we should set alternative to two-sided. But actually we only want to know whether the waste is more than the experts’ figure. We don’t care about less waste, do we? If it’s less, what a pleasant surprise! So we can modify our 𝐻0 as

𝐻0: The sample mean is equal to or less than the population mean

and correspondingly,

𝐻1: The sample mean is greater than the population mean
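
A quick sketch on synthetic data (made-up numbers) shows how the alternative parameter changes the p value. For the same data and the same t statistic, the two one-sided p values sum to 1, and the smaller of them is half the two-sided p value:

```python
import numpy as np
from scipy import stats

# Synthetic sample, illustrative only
rng = np.random.default_rng(3)
sample = rng.normal(loc=70500, scale=9000, size=50)
mu = 69900

# Same data, three choices of the alternative hypothesis
two = stats.ttest_1samp(sample, popmean=mu, alternative='two-sided')
greater = stats.ttest_1samp(sample, popmean=mu, alternative='greater')
less = stats.ttest_1samp(sample, popmean=mu, alternative='less')

print(two.pvalue, greater.pvalue, less.pvalue)
```

Choosing the right alternative therefore matters: a one-sided test can be twice as sensitive in the direction you care about.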

P value

With one- and two-sided tests in mind, we can now see the p value clearly.
Recall the basic idea of hypothesis testing: how can we reject 𝐻0? By collecting evidence, the weird things that are not supposed to happen when 𝐻0 is true, the small probability events.
What we have now is the sample mean, 69958.84840564558. But is it weird enough to help us reject 𝐻0? Is its probability small enough?
We want the probability P(sample mean ≥ 69958.84840564558).
If this probability is very small, what does it mean? It means that under 𝐻0, the event "sample mean ≥ 69958.84840564558" is very unlikely to happen. But it really happened! So what can we do? Suspect the truth of 𝐻0.

This is what the p value means: the probability of the weird thing (sample mean = 69958.84840564558) together with even weirder things (sample mean > 69958.84840564558), computed assuming 𝐻0 is true. We can get the p value by calculating the T statistic and checking the t table.

In the t table, df equals the number of samples minus 1. After locating df (first column) and the T statistic (main body of the table), we can read off the p value (first row).
When the p value reaches a really small threshold, or falls below it, we say there is adequate evidence to reject 𝐻0. This threshold is the significance level, α.
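
Instead of checking a printed t table, we can ask scipy for the p value directly from the t distribution. The statistic 0.2784 below is the one from our example; the df of 5000 is an assumption for illustration, since the exact sample size is not shown here:

```python
from scipy import stats

# Illustrative numbers: t = 0.2784 as in the example; df = n - 1 is assumed
t_value = 0.2784
df = 5000

# One-sided (greater) p value = survival function of the t distribution
p_one_sided = stats.t.sf(t_value, df)
print(round(p_one_sided, 4))
```

With df this large, the result lands very close to the pvalue scipy reported for the waste example.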

Significance level

How should we understand the word significance?
In my opinion, significance here can be seen as notable or remarkable. When the p value drops to α, we are basically unambiguous about 𝐻0 being false, because the collected evidence looks very clear. If we accepted 𝐻0, something very weird would have happened, which is obviously unreasonable. So we choose to reject 𝐻0 when the p value is less than α, even though it is still possible that we are making a mistake.

Type I & II Error

There are two types of error in summary.

Type I Error: 𝐻0 is true, but rejected

Type II Error: 𝐻0 is false, but accepted

Problems may come from the sample. Imagine a situation like this: 𝐻0 holds true, but by coincidence we draw many deviating samples. For example, suppose blue points far outnumber red points in the population, yet the sample happens to contain only two blue points, fewer than the red ones. Such a sample may mislead us into rejecting 𝐻0, which is a type I error. Type II error is similar.

In scientific research, the type I error matters more. Say you are a researcher with doubts about an existing conclusion. What should you do to demonstrate, with the help of hypothesis testing, that the existing conclusion is false? Take it as 𝐻0, calculate the statistic, compare the p value with α, and choose to reject or accept 𝐻0, right?
If you find you cannot reject 𝐻0, just continue to adopt 𝐻0, or design more tests if you are still skeptical. So even if you make a type II error, it is only your own business: research further or stop, either is okay.
However, if you reject 𝐻0, you must make sure your rejection is convincing, because what you reject is usually prevalent or authoritative; it may even be a challenge to a famous professor! Your attitude should be cautious and your evidence so strong that others are willing to believe you over common sense or authority, which requires a very low probability of a type I error.

So in hypothesis testing we need to keep the probability of a type I error under a certain level (and then make the probability of a type II error as low as possible); that level is again the significance level, α. The word significance here also suggests important, because it is the key factor that makes your conclusion convincing.
In short, the significance level works as a limit for both the p value and the probability of a type I error.
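
We can watch α act as the type I error rate in a small simulation: generate many samples from a population where 𝐻0 really is true and count how often the test rejects it. The population parameters here are made up:

```python
import numpy as np
from scipy import stats

# Simulate many experiments in which H0 is TRUE: the population mean
# really is 100 (all numbers here are made up for illustration).
rng = np.random.default_rng(4)
alpha = 0.05
n_experiments = 2000
rejections = 0
for _ in range(n_experiments):
    sample = rng.normal(loc=100, scale=15, size=30)
    _, p = stats.ttest_1samp(sample, popmean=100)
    if p < alpha:
        rejections += 1   # a type I error: H0 is true but rejected

rate = rejections / n_experiments
print(rate)
```

The rejection rate should land near α = 0.05, which is exactly the "limit on the type I error" described above.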

Methods to do hypothesis testing

We have introduced T-test above, and there are still many other methods to do hypothesis testing. Here we summarize some of them for reference.

Z-test: The Z-test is a method used to test whether the mean of a sample is significantly different from a known population mean. Its conditions of use include:

  1. Population distribution: The population should be normally or approximately normally distributed.
  2. Sample size: The sample size should be large enough, and the sample size is generally required to be greater than or equal to 30.
  3. Population Variance: The population variance is known.

When performing the Z-test, first clarify the problem, then calculate the Z value and compare it with the significance level, so as to conclude whether there is a significant difference between the sample mean and the population mean. Note that if the population variance is unknown, the t-test should be used instead of the Z-test.
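
scipy itself does not ship a one-sample z-test (statsmodels does), but the computation is simple enough to sketch by hand. The data, σ = 15 and μ0 = 100 below are made-up values for illustration:

```python
import numpy as np
from scipy import stats

sigma = 15        # population standard deviation, assumed KNOWN
mu0 = 100         # hypothesized population mean

# Synthetic sample with n >= 30, as the conditions require
rng = np.random.default_rng(5)
sample = rng.normal(loc=104, scale=sigma, size=40)

# The Z statistic uses sigma, not the sample standard deviation
z = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))

# Two-sided p value from the standard normal distribution
p = 2 * stats.norm.sf(abs(z))
print(z, p)
```

The only difference from the t-test formula is that the known σ replaces the sample s, and the standard normal replaces the t distribution.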

F-test: F-test is a method used to compare whether two or more sample variances are significantly different, and its use conditions include:

  1. Sample independence: Each sample should be independent, that is, there is no correlation between each sample.
  2. Normality: The data for each sample should come from a population that is normally or approximately normally distributed.
  3. Homogeneity of variances: The variances of the samples in each group should be equal.
  4. Sample size: The sample size of each group should be large enough, and it is generally required that the sample size of each group is greater than or equal to 30.

When performing the F-test, it is necessary to first determine the problem, calculate the F-value and compare the significance level, so as to draw a conclusion whether there is a significant difference between two or more sample variances. It should be noted that if the data do not meet the above conditions, it may be necessary to use non-parametric tests instead of F-test. In addition, the F-test can also be used in analysis of variance to compare the size of variance within and between groups.
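
A minimal sketch of an F-test comparing two sample variances, on synthetic data (all values made up); the p value comes from scipy's F distribution:

```python
import numpy as np
from scipy import stats

# Two synthetic samples whose variances we compare
rng = np.random.default_rng(6)
a = rng.normal(0, 1.0, size=40)
b = rng.normal(0, 1.5, size=40)

# F statistic: ratio of sample variances, larger over smaller
var_a, var_b = a.var(ddof=1), b.var(ddof=1)
f = max(var_a, var_b) / min(var_a, var_b)
df1 = df2 = 40 - 1

# One-sided p value from the F distribution
p = stats.f.sf(f, df1, df2)
print(f, p)
```

For data that may violate normality, a more robust alternative is Levene's test (scipy.stats.levene).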

ANOVA: Analysis of variance (ANOVA) is a method used to compare whether two or more sample means are significantly different, using conditions including:

  1. Sample independence: Each sample should be independent, that is, there is no correlation between each sample.
  2. Normality: The data for each sample should come from a population that is normally or approximately normally distributed.
  3. Homogeneity of variances: The variances of the samples in each group should be equal.
  4. Sample size: The sample size of each group should be large enough, and it is generally required that the sample size of each group is greater than or equal to 30.

When performing analysis of variance, first determine the problem, select the appropriate type of ANOVA (one-way, two-way, etc.), calculate the F value and compare it with the significance level, so as to conclude whether two or more sample means are significantly different. If the data do not meet the above criteria, it may be necessary to use nonparametric tests instead of ANOVA.
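
scipy provides one-way ANOVA as f_oneway. A sketch on three synthetic groups (the group means and sizes are made up):

```python
import numpy as np
from scipy import stats

# Three synthetic groups with slightly different means
rng = np.random.default_rng(7)
g1 = rng.normal(10, 2, size=30)
g2 = rng.normal(11, 2, size=30)
g3 = rng.normal(12, 2, size=30)

# One-way ANOVA: H0 says all group means are equal
f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f_stat, p_value)
```

A small p value here only says that at least one group mean differs; it does not say which one.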

Correlation test: correlation test is a method used to test whether there is a linear correlation between two variables, and its use conditions include:

  1. Variable type: Both variables should be continuous variables.
  2. Data Type: Data should be paired, i.e. each observation contains measurements of both variables.
  3. Normality: The data for each variable should come from a population that is normally or approximately normally distributed.
  4. Independence: Each observation should be independent, i.e. there should be no correlation between each observation.

In correlation test, Pearson correlation coefficient or Spearman rank correlation coefficient can be used. The Pearson correlation coefficient is suitable for populations where both variables are from a normal distribution, while the Spearman rank correlation coefficient is suitable for data from non-normal distribution populations or ordered categorical variables. It should be noted that the correlation test can only test the linear correlation between two variables. For nonlinear relationships or other types of relationships, other methods need to be used for analysis.
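
Both coefficients are available in scipy. A sketch on synthetic paired data, where the linear relation y = 2x plus noise is made up for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic paired observations: y depends linearly on x, plus noise
rng = np.random.default_rng(8)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

# H0 for both tests: the two variables are uncorrelated
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_r, spearman_p = stats.spearmanr(x, y)
print(pearson_r, pearson_p)
print(spearman_r, spearman_p)
```

Because the relation really is linear, both coefficients come out strongly positive with tiny p values; for a monotone but nonlinear relation, Spearman would stay high while Pearson dropped.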

Example 2

OK, since you have understood the basic idea and process of hypothesis testing, let’s end this tutorial with another example.
This is an example about cars. Assume we want to buy a car with low fuel consumption, and a salesman recommends a CADILLAC. He claims the average fuel consumption of CADILLAC cars is less than 12.7 liters of petrol per 100 kilometers. Should we trust him? We need some samples to test it.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url = "https://raw.githubusercontent.com/rrRIOoo/data_cache/main/Fuel_Consumption_2000-2022.csv"
data = pd.read_csv(url)
print(data.head())
data = data.drop(data[data['MAKE'] != 'CADILLAC'].index)  # keep only the CADILLAC rows, since the claim is about CADILLAC
print(data['FUEL CONSUMPTION'].mean())
sns.displot(data['FUEL CONSUMPTION'])
plt.show()
   YEAR   MAKE    MODEL VEHICLE CLASS  ENGINE SIZE  CYLINDERS TRANSMISSION  \
0  2000  ACURA    1.6EL       COMPACT          1.6          4           A4
1  2000  ACURA    1.6EL       COMPACT          1.6          4           M5
2  2000  ACURA    3.2TL      MID-SIZE          3.2          6          AS5
3  2000  ACURA    3.5RL      MID-SIZE          3.5          6           A4
4  2000  ACURA  INTEGRA    SUBCOMPACT          1.8          4           A4

  FUEL  FUEL CONSUMPTION  HWY (L/100 km)  COMB (L/100 km)  COMB (mpg)  \
0    X               9.2             6.7              8.1          35
1    X               8.5             6.5              7.6          37
2    Z              12.2             7.4             10.0          28
3    Z              13.4             9.2             11.5          25
4    X              10.0             7.0              8.6          33

   EMISSIONS
0        186
1        175
2        230
3        264
4        198
12.744606387764282

We decide to use the t-test again. Our null hypothesis and alternative hypothesis:

𝐻0: average fuel consumption is equal to or less than 12.7 liters of petroleum per 100 kilometers

𝐻1: average fuel consumption is greater than 12.7 liters of petroleum per 100 kilometers

And we construct the test statistic T

T = (w̅ − μ) / (s / √n)

We set α = 0.05 and calculate T.

from scipy import stats

test = stats.ttest_1samp(a=data['FUEL CONSUMPTION'], popmean=12.7, alternative='greater')
print(test)
Ttest_1sampResult(statistic=1.8957032876238749, pvalue=0.029006124165054846)

pvalue = 0.029006124165054846 < 0.05, so we can reject 𝐻0. Should we trust the salesman? Of course not!

Quiz

  1. When we want to test a hypothesis, we should assume it is()
    A.true B.false
  2. What are 𝐻0 and 𝐻1?
  3. Assume 𝐻1:𝑎>1, what is 𝐻0?
  4. What is formula of T statistic?
  5. What is α?
  6. What does p value mean?
  7. If α = 0.01, pvalue = 0.02, can we reject 𝐻0?
  8. What is type I error?
  9. What is type II error?
  10. What is the basic idea of hypothesis testing?

Answer

  1. A
  2. null hypothesis and alternative hypothesis
  3. 𝐻0:𝑎≤1
  4. T = (w̅ − μ) / (s / √n)
  5. significance level, usually set 0.05
  6. probability of weird thing and even more weird thing based 𝐻0
  7. no
  8. reject 𝐻0 when 𝐻0 is true
  9. accept 𝐻0 when 𝐻0 is false
  10. principle of small probability events

Exercise

Melbourne Water and the Melbourne Airport weather station report that daily energy consumption is under 275000. Please do a hypothesis test and choose to reject or accept this claim.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

url = "https://raw.githubusercontent.com/rrRIOoo/data_cache/main/Data-Melbourne_F_fixed.csv"
data = pd.read_csv(url)
print(data.head())
sns.displot(data['Energy Consumption'])
plt.show()
   Unnamed: 0  Average Outflow  Average Inflow  Energy Consumption  Ammonia  \
0           0            2.941           2.589            175856.0     27.0
1           1            2.936           2.961            181624.0     25.0
2           2            2.928           3.225            202016.0     42.0
3           3            2.928           3.354            207547.0     36.0
4           4            2.917           3.794            202824.0     46.0

   Biological Oxygen Demand  Chemical Oxygen Demand  Total Nitrogen  \
0                     365.0                   730.0          60.378
1                     370.0                   740.0          60.026
2                     418.0                   836.0          64.522
3                     430.0                   850.0          63.000
4                     508.0                  1016.0          65.590

   Average Temperature  Maximum temperature  Minimum temperature  \
0                 19.3                 25.1                 12.6
1                 17.1                 23.6                 12.3
2                 16.8                 27.2                  8.8
3                 14.6                 19.9                 11.1
4                 13.4                 19.1                  8.0

   Atmospheric pressure  Average humidity  Total rainfall  Average visibility  \
0                   0.0              56.0            1.52                10.0
1                   0.0              63.0            0.00                10.0
2                   0.0              47.0            0.25                10.0
3                   0.0              49.0            0.00                10.0
4                   0.0              65.0            0.00                10.0

   Average wind speed  Maximum wind speed    Year  Month  Day
0                26.9                53.5  2014.0    1.0  1.0
1                14.4                27.8  2014.0    1.0  2.0
2                31.9                61.1  2014.0    1.0  5.0
3                27.0                38.9  2014.0    1.0  6.0
4                20.6                35.2  2014.0    1.0  7.0
