Testing a Difference in Population Proportions in Python
To statistically test if population proportions of two groups are significantly different.
We get the data on which we will be testing the hypothesis. The data being considered here is the famous Titanic data-set which can be found on Kaggle.
Importing the libraries:
import numpy as np
import pandas as pd
import scipy.stats.distributions as dist
Parameter of Interest
We read the data and select only two of the relevant columns from the data set. The column ‘Survived’(1 if the individual survived the titanic disaster, else 0) and the variable ‘Sex’(indicates the gender of the individual). We set the parameter of interest as the difference in proportions of the individuals who survived the Titanic disaster, based on their gender. Formulation of the parameter of interest is :
data = pd.read_csv('train.csv')
data['Survived'] = data['Survived'].map({1:'Survived',0:'Not Survived'}) #Encode the values
data = data.dropna() #Drop the nan valuescontingency_table = pd.crosstab(data.Survived,data.Sex) #Contingency Table
contingency_table
Output:
The above table is called as a contingency table and gives us a cross-tab view of the counts pertaining to each category. Let us convert the above numbers into proportions, to get an idea!
pd.crosstab(data.Survived,data.Sex).apply(lambda r:r/r.sum(),axis=0)
Output:
We can very well judge(informally) that there is a significant difference between proportions of survived females to that of survived males. Let us prove that statistically !
Hypothesis
The hypothesis we test is if the above difference of proportions will be zero or not equal to zero(a two tailed test will be required). Mathematically, we represent it as:
Setting up the Significance Level
We preset the significance level to be 0.10(a standard for two tailed tests)
Assumptions
- Assumption of independent random samples.
- Optimal size of samples(making sure that there are at least 10 data points from each class with respect to the total proportion)
We have to make sure that the sample size is large enough, i.e we have at least 10 survivals and 10 casualties in each category. Putting it mathematically,
The corresponding code for it:
total_proportion_survived = (data.Survived == "Survived").mean()num_female = data[data.Sex=="female"].shape[0]
num_male = data[data.Sex=="male"].shape[0]assert num_female*total_proportion_survived>10, "Assumptions not met"
assert num_male*total_proportion_survived>10, "Assumptions not met"
assert num_female*(1-total_proportion_survived)>10, "Assumptions not met"
assert num_male*(1-total_proportion_survived)>10, "Assumptions not met"
Standard Error for Population Proportions
The sample standard error for the difference in population is(the code follows):
#This table helps us to calculate the SE.
prop = data.groupby("Sex")["Survived"].agg([lambda z: np.mean(z=="Survived"), "size"])
prop.columns = ['proportions_survived','total_counts']
prop.head()
#Calculating standard error
variance = total_proportion_survived * (1 - total_proportion_survived)
standard_error = np.sqrt(variance * (1 / prop.total_counts.female + 1 / prop.total_counts.male))
print("Sample Standard Error",standard_error)
Getting the Test Statistic
After we check that the assumptions hold true, its the time to calculate the test statistic for this hypothesis test.
The hypothesized estimate is 0(null hypothesis) and we have the proportions of survived males and females to get the best estimate(the difference).
# Calculate the test statistic
best_estimate = (prop.proportions_survived.female - prop.proportions_survived.male)
print("The best estimate is",best_estimate)
hypothesized_estimate = 0
test_stat = (best_estimate-hypothesized_estimate) / standard_error
print("Computed Test Statistic is",test_stat)
This implicates that our sample proportion difference estimate is ~16.2 standard errors above our hypothesized estimate !
Now we need to convert it a p-value and test our hypothesis based on significance level(set by us).
We get the p-value computed based on the standard normal distribution(zero mean and unit standard deviation) often called a z-test. We get the p values from both the tails of the distribution(depends on alternative hypothesis).
# Calculate the p-value
pvalue = 2*dist.norm.cdf(-np.abs(test_stat)) # Multiplied by two indicates a two tailed testing.
print("Computed P-value is", pvalue)
Inference
We can clearly see that p-value is way lesser than the significance level of 0.10. So we can safely reject the null hypothesis in favour of the alternative hypothesis. We infer that the difference in proportions between that of survived females to that of survived males is significant and certainly not equal to zero(females might had had better access to lifeboats).
References