Testing a Difference in Population Proportions in Python

Satya Pattnaik
Analytics Vidhya
Published in
4 min readJul 19, 2020

To statistically test if population proportions of two groups are significantly different.

Photo by Evgeny Smirnov on Unsplash

We get the data on which we will be testing the hypothesis. The data being considered here is the famous Titanic data-set which can be found on Kaggle.

Importing the libraries:

import numpy as np
import pandas as pd
import scipy.stats.distributions as dist

Parameter of Interest

We read the data and select only two of the relevant columns from the data set. The column ‘Survived’(1 if the individual survived the titanic disaster, else 0) and the variable ‘Sex’(indicates the gender of the individual). We set the parameter of interest as the difference in proportions of the individuals who survived the Titanic disaster, based on their gender. Formulation of the parameter of interest is :

Parameter of Interest
data = pd.read_csv('train.csv')
data['Survived'] = data['Survived'].map({1:'Survived',0:'Not Survived'}) #Encode the values
data = data.dropna() #Drop the nan values
contingency_table = pd.crosstab(data.Survived,data.Sex) #Contingency Table
contingency_table

Output:

Contingency Table for the data

The above table is called as a contingency table and gives us a cross-tab view of the counts pertaining to each category. Let us convert the above numbers into proportions, to get an idea!

pd.crosstab(data.Survived,data.Sex).apply(lambda r:r/r.sum(),axis=0)

Output:

Contingency Table(Proportions)

We can very well judge(informally) that there is a significant difference between proportions of survived females to that of survived males. Let us prove that statistically !

Hypothesis

The hypothesis we test is if the above difference of proportions will be zero or not equal to zero(a two tailed test will be required). Mathematically, we represent it as:

Hypothesis(Null Vs Alternative)

Setting up the Significance Level

We preset the significance level to be 0.10(a standard for two tailed tests)

Assumptions

  1. Assumption of independent random samples.
  2. Optimal size of samples(making sure that there are at least 10 data points from each class with respect to the total proportion)

We have to make sure that the sample size is large enough, i.e we have at least 10 survivals and 10 casualties in each category. Putting it mathematically,

The corresponding code for it:

total_proportion_survived = (data.Survived == "Survived").mean()num_female = data[data.Sex=="female"].shape[0]
num_male = data[data.Sex=="male"].shape[0]
assert num_female*total_proportion_survived>10, "Assumptions not met"
assert num_male*total_proportion_survived>10, "Assumptions not met"
assert num_female*(1-total_proportion_survived)>10, "Assumptions not met"
assert num_male*(1-total_proportion_survived)>10, "Assumptions not met"

Standard Error for Population Proportions

The sample standard error for the difference in population is(the code follows):

Standard error for difference in Population Proportions.
#This table helps us to calculate the SE.
prop = data.groupby("Sex")["Survived"].agg([lambda z: np.mean(z=="Survived"), "size"])
prop.columns = ['proportions_survived','total_counts']
prop.head()
#Calculating standard error

variance = total_proportion_survived * (1 - total_proportion_survived)
standard_error = np.sqrt(variance * (1 / prop.total_counts.female + 1 / prop.total_counts.male))
print("Sample Standard Error",standard_error)
Computed Sample Standard Error

Getting the Test Statistic

After we check that the assumptions hold true, its the time to calculate the test statistic for this hypothesis test.

Test Statistic

The hypothesized estimate is 0(null hypothesis) and we have the proportions of survived males and females to get the best estimate(the difference).

# Calculate the test statistic 
best_estimate = (prop.proportions_survived.female - prop.proportions_survived.male)
print("The best estimate is",best_estimate)
hypothesized_estimate = 0
test_stat = (best_estimate-hypothesized_estimate) / standard_error
print("Computed Test Statistic is",test_stat)
Best Estimate and Test Statistic Computation

This implicates that our sample proportion difference estimate is ~16.2 standard errors above our hypothesized estimate !

Now we need to convert it a p-value and test our hypothesis based on significance level(set by us).

P-Value Computation

We get the p-value computed based on the standard normal distribution(zero mean and unit standard deviation) often called a z-test. We get the p values from both the tails of the distribution(depends on alternative hypothesis).

# Calculate the  p-value
pvalue = 2*dist.norm.cdf(-np.abs(test_stat)) # Multiplied by two indicates a two tailed testing.
print("Computed P-value is", pvalue)
P-value from Standard Normal Distribution

Inference

We can clearly see that p-value is way lesser than the significance level of 0.10. So we can safely reject the null hypothesis in favour of the alternative hypothesis. We infer that the difference in proportions between that of survived females to that of survived males is significant and certainly not equal to zero(females might had had better access to lifeboats).

References

  1. Khan Academy Hypothesis Testing of Proportions
  2. Statistical Hypothesis Testing Wiki

3. P-value calculation for z-test(Python)

--

--