A practical guide to Hypothesis testing and ANOVA

Gaurang Mehra
6 min read · Nov 24, 2022


An exhaustive guide to hypothesis testing and ANOVA

What is a hypothesis test?

A hypothesis test is used to check whether the data are consistent with a particular claim. We generally infer a metric of the population (a parameter) from the corresponding metric of a sample (a statistic).

We formulate two hypotheses, the null hypothesis and the alternate hypothesis, and compute a test statistic. We then find the probability of seeing a value as extreme as the test statistic under the condition that the null hypothesis is true. If this probability is very low, then what we are observing is likely a real effect and not sampling variability.

This probability is known as the p-value, and we formally choose to reject the null hypothesis if it falls below a threshold known as alpha. Alpha is defined before the test is run; if it is not given, it is typically assumed to be 5% (0.05).
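A minimal sketch of this decision rule in Python (the p-value here is a made-up number for illustration):

# Alpha is fixed before the test; the p-value comes from the test itself
alpha = 0.05
p_value = 0.012  # hypothetical value for illustration

if p_value < alpha:
    print('Reject the null hypothesis')
else:
    print('Fail to reject the null hypothesis')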

Types of hypothesis tests

We will look at three broad categories of hypothesis tests:

  1. Test of a single sample against a hypothesized value, e.g. the average salary is $100K.
    - H0: Average Salary = $100K
    - H1: Average Salary ≠ $100K
    - This is a two-tailed test because the alternate hypothesis allows deviations in either direction.
  2. Test of 2 samples (difference of means), e.g. packages that arrive late (Wl) tend to be heavier than packages that arrive on time (Wot).
    - H0: (Wl-Wot) = 0 (no difference in the average weight)
    - H1: (Wl-Wot) > 0 (packages that arrive late tend to be heavier)
  3. Test of multiple samples (variance across multiple categories, ANOVA), e.g. job satisfaction has an effect on compensation. Typically job satisfaction will have multiple categories.
    - H0: Job satisfaction has no effect on compensation.
    - H1: Job satisfaction has an effect on compensation.

The Datasets

Now let's run these hypothesis tests. We use two datasets for this write-up.

  • A Stack Overflow dataset with survey responses from 2,261 software developers, including their level of job satisfaction, their compensation, and other features (ideal for testing multiple hypotheses).
  • A dataset of packages with their weight, freight category (air, ground, etc.), and whether they arrived late. This dataset too is ideal for running hypothesis tests.

Test of a single sample against a hypothesized value

Using the Stack Overflow dataset for this example, we hypothesize that the average annual salary, converted to dollars, is $100K.

H0: Average Salary = $100K
H1: Average Salary ≠ $100K
This is a two-tailed test because the alternate hypothesis allows deviations in either direction.

We first load the dataset and calculate the sample statistic.

import pandas as pd
import numpy as np

stack = pd.read_feather('https://assets.datacamp.com/production/repositories/5982/datasets/c59033b93930652f402e30db77c3b8ef713dd701/stack_overflow.feather')

# Test statistic: sample mean of annual compensation in dollars
sample_mean = stack['converted_comp'].mean()

We then estimate the standard error using bootstrapping: resampling with replacement from the sample and calculating the statistic of interest, in this case the mean salary, for each resample. This gives us a bootstrap distribution of the average salary.

# Determine the sampling variability and standard error using bootstrapping
rep = np.empty(1000)
for i in range(1000):
    # Pull a sample with replacement from the original sample (resampling)
    sample = stack.sample(frac=1, replace=True)
    rep[i] = sample['converted_comp'].mean()

Plotting the bootstrap distribution gives us a visual sense of its shape and spread.

import matplotlib.pyplot as plt

# Visually plot the bootstrap distribution of average salary
# (apply the style before creating the figure so it takes effect)
plt.style.use('ggplot')
fig, ax = plt.subplots(figsize=(10, 6))
ax.hist(rep, bins=20)
ax.set_xlabel('Compensation Annualized $', fontsize=14)
ax.set_ylabel('Counts', fontsize=14)
plt.show()
Fig 1.1: Bootstrap distribution of average salary

We can already see that the bootstrap distribution lies roughly between $105K and $140K, and that $100K is at the extreme left of the distribution. We can anticipate that we will reject the null hypothesis. Now let's formally calculate the z-score and the p-value.

# Find the standard error: the standard deviation of the bootstrap distribution
std_error = np.std(rep)
print(std_error)

# z-score: how many standard errors the sample mean lies from $100K
z_score = (sample_mean - 100000) / std_error
print(z_score)

# Since it's a two-tailed test the rejection region is split between both tails
from scipy.stats import norm
p_val = 2 * (1 - norm.cdf(z_score, loc=0, scale=1))
print(p_val)

Once we run this we get a z-score of 3.6 and a p-value of 0.003. This is less than 0.05, so we reject the null hypothesis that the average salary is $100K.
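As a sanity check, scipy's built-in one-sample t-test reaches the same conclusion. Note that it estimates the standard error from the sample instead of from the bootstrap distribution, so the numbers will differ slightly:

from scipy.stats import ttest_1samp

# Two-sided one-sample t-test against the hypothesized mean of $100K
t_stat, p_val = ttest_1samp(stack['converted_comp'].dropna(), 100000)
print(t_stat, p_val)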

Test of 2 samples (Difference of means test)

Using the shipments dataset for this example, we hypothesize that the average weight of packages that arrive late (Wl) is greater than that of packages that arrive on time (Wot). This is formally stated as a null and alternate hypothesis below.

H0: (Wl-Wot)=0 (No difference in the average weight)
H1: (Wl-Wot)>0 (Packages that arrive late tend to be heavier)

In this case, instead of bootstrapping, which can be computationally intensive, we estimate the standard error directly from the sample. Because the standard error is estimated rather than simulated, we use the t-distribution and a t-score rather than a z-score.

The t-distribution has fatter tails than the normal distribution, which provides a margin of safety when we estimate the standard error from the sample rather than computing it via the bootstrap.
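To see the fatter tails concretely, compare the two-tailed 95% critical values of the two distributions. A quick sketch; the t critical value is larger at low degrees of freedom and converges to the normal value as the sample grows:

from scipy.stats import norm, t

print(norm.ppf(0.975))           # approx 1.96
for df in [5, 30, 1000]:
    print(df, t.ppf(0.975, df))  # approx 2.57, 2.04, 1.96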

Load the data and check the average weight for late and on-time packages.

late = pd.read_feather('https://assets.datacamp.com/production/repositories/5982/datasets/887ec4bc2bcfd4195e7d3ad113168555f36d3afa/late_shipments.feather')

# Average weight by late status ('Yes'/'No')
means = late.groupby('late')['weight_kilograms'].mean()
print(means)
Fig 1.2: Average weight of late and on-time packages

We can see that late packages are heavier. Now let's formally test this hypothesis.

The estimated standard error is given by the formula below, where s_yes and s_no are the standard deviations of weight for late and on-time packages, and n_yes and n_no are the corresponding counts:

SE = √(s_yes² / n_yes + s_no² / n_no)
# Get the counts of late ('Yes') and on-time ('No') packages
counts = late['late'].value_counts()
nno = counts['No']
nyes = counts['Yes']

# Standard deviations of weight for the late and on-time packages
# (label-based indexing avoids mixing up the 'Yes' and 'No' groups)
std = late.groupby('late')['weight_kilograms'].std()
sno = std['No']
syes = std['Yes']

# Now we can calculate the estimated standard error
std_error = np.sqrt(syes**2 / nyes + sno**2 / nno)
print(std_error)

The estimated standard error is approximately 412. Now let's calculate the t-score and the p-value.

# Difference in sample means (late minus on-time)
diff_means = means['Yes'] - means['No']

# Calculate the t statistic
t_score = (diff_means - 0) / std_error
print(t_score)

# Calculate the p-value (one-tailed, right tail)
from scipy.stats import t
p_val = 1 - t.cdf(t_score, df=nyes + nno - 2)
print(p_val)

The p-value is 0.002, much lower than the alpha of 0.05 (5%), which corresponds to a 95% confidence level. We reject the null hypothesis that late and on-time packages weigh the same on average.
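As a cross-check, scipy's two-sample t-test gives a similar result. With equal_var=False it runs the Welch version, which does not assume equal variances (the alternative argument requires scipy 1.6 or later):

from scipy.stats import ttest_ind

# Assumes the 'late' column holds 'Yes'/'No' labels, as above
late_w = late.loc[late['late'] == 'Yes', 'weight_kilograms']
ontime_w = late.loc[late['late'] == 'No', 'weight_kilograms']

# One-sided Welch t-test: are late packages heavier on average?
t_stat, p_val = ttest_ind(late_w, ontime_w, equal_var=False, alternative='greater')
print(t_stat, p_val)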

Test of Multiple Samples across categories (ANOVA)

ANOVA is typically used to compare and test means across multiple categories. Here we use the Stack Overflow dataset to determine whether job satisfaction (5 levels) has a statistically significant effect on compensation. Instead of running multiple pairwise tests by hand, we will use the pingouin package.

H0: Job satisfaction has no effect on compensation.
H1: Job satisfaction has an effect on compensation

# Look at the data to see any differences
print(stack.groupby('job_sat')['converted_comp'].mean())

Fig 1.4: Differences in mean compensation across job satisfaction categories

# Visualize the difference in compensation across job satisfaction categories
import seaborn as sns

fig, ax = plt.subplots(figsize=(14, 8))
sns.boxplot(x='job_sat', y='converted_comp', data=stack, ax=ax)
ax.set_xlabel('Job satisfaction level', fontsize=14)
ax.set_ylabel('Compensation $', fontsize=14)
plt.show()

Fig 1.5: Differences in compensation by job satisfaction levels

Both the box plot and the summary statistics show some difference in average compensation across job satisfaction levels. Now let's test this for statistical significance using the ANOVA function in the pingouin package.

import pingouin

# One-way ANOVA: compensation across job satisfaction levels
print(pingouin.anova(dv='converted_comp', between='job_sat', data=stack))

Fig 1.6: ANOVA output

The p-unc column is the (uncorrected) p-value. Since it is 0.001, much lower than 0.05 (corresponding to a 95% confidence level), we can reject the null hypothesis and conclude that job satisfaction has a statistically significant effect on compensation.

Let's run a more detailed test to see which pairwise relationships are statistically significant across these categories. We apply a Bonferroni correction (padjust='bonf') because running many pairwise comparisons inflates the chance of a false positive; a sketch of how this adjustment works follows the results below.

# Pairwise tests with a Bonferroni correction for multiple comparisons
print(pingouin.pairwise_tests(data=stack, dv='converted_comp', between='job_sat', padjust='bonf'))

Fig 1.7: Pairwise tests for statistical significance

We can see here that mean compensation differs in a statistically significant way between slightly satisfied and very satisfied, and between slightly dissatisfied and very satisfied.
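For intuition, the Bonferroni adjustment applied by padjust='bonf' simply multiplies each raw p-value by the number of comparisons, capping the result at 1. A minimal sketch with hypothetical raw p-values:

import numpy as np

def bonferroni(p_values):
    # Multiply each raw p-value by the number of comparisons, cap at 1
    m = len(p_values)
    return np.minimum(np.array(p_values) * m, 1.0)

# Hypothetical raw p-values (5 satisfaction levels give 5*4/2 = 10 comparisons)
raw = [0.0004, 0.002, 0.03, 0.09, 0.12, 0.21, 0.34, 0.48, 0.62, 0.91]
print(bonferroni(raw))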

GitHub link: https://github.com/gmehra123/data_science_projs/blob/main/Hypothesis_testing_article.ipynb

GitHub Pages link: https://gmehra123.github.io/data_science_projs/
