Hypothesis Testing and Confidence Intervals using Python

Gaurang Mehra
Published in Gaurang Portfolio
Apr 6, 2022

(A simulation based approach)

Outline

  1. Given sample data, how do you do exploratory data analysis to get an intuitive understanding of the relationship between variables?
  2. Given sample data, generate confidence intervals using bootstrapping/simulation.
  3. Formulate a hypothesis and test it at a given significance level (alpha). Calculate p-values.

Datasets

The dataset used here is a sample dataset containing median house values for California districts, along with key characteristics such as the median income of the district, proximity to the ocean, median house age, etc.

Project objective

  1. Explore the data and understand which factor is most strongly related to the median house value in a district
  2. Given this sample, calculate the 95% CI for the correlation between that factor and the median house value using a simulation/bootstrap approach
  3. Construct a hypothesis test to get the associated p-value for this correlation using the simulation/bootstrap approach

Step 1-: Import the data and understand the correlations with the median house value.

  • This is fairly simple if we use pandas to import the data and a seaborn heatmap to visualize the correlations, as below
# Importing the necessary modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Importing the dataset
house=pd.read_csv('housing.csv')
# Running the inbuilt corr method for pandas on the numeric columns only
# (text columns such as the ocean proximity category cannot be correlated)
matrix=house.select_dtypes(include='number').corr()
plt.figure(figsize=(14,9))
sns.heatmap(matrix,cmap='Reds',annot=True)
plt.show()
Fig 1.1- Visualizing correlations in the data

Insight-: We find that the median house value is most strongly linearly correlated with the median income, as we would expect (Pearson correlation coefficient of 0.69).
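To read the strongest relationship off the matrix without scanning the heatmap, the column for the target variable can be sorted by absolute correlation. The sketch below uses a small synthetic DataFrame as a stand-in for housing.csv: the values are made up for illustration, and only the column names mirror the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for housing.csv (values are made up for illustration)
rng = np.random.default_rng(0)
n = 500
income = rng.normal(4.0, 1.5, n)
age = rng.uniform(1, 52, n)
demo = pd.DataFrame({
    "median_income": income,
    "housing_median_age": age,
    # House value driven mostly by income, plus noise
    "median_house_value": 40000 * income + 500 * age + rng.normal(0, 50000, n),
})

# Correlation of every numeric column with the target, strongest first
target = "median_house_value"
ranked = (demo.select_dtypes(include="number").corr()[target]
          .drop(target)
          .sort_values(key=np.abs, ascending=False))
print(ranked)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a plain descending sort would push to the bottom.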

Step 2-: Construct and visualize a 95% confidence interval around this estimated correlation

  • We want to understand the range of plausible values for this correlation rather than rely on a single point estimate. If 95% of the resampled values fell in a wide range, say 0.2 to 0.69, the estimate would be imprecise and the high observed value could owe a lot to chance. If instead the values are tightly clustered, say between 0.6 and 0.7, we can say with a high degree of confidence that the correlation is strong and clearly different from zero.
  • To create this range of correlations, we randomly pick rows with replacement, so that each house value stays paired with its income, and calculate the correlation. Each time, depending on the rows picked, the correlation coefficient will be slightly different. Repeating this n times gives n values, from which we can calculate the confidence interval.
  • The process above is akin to going out, collecting another random sample of data and calculating the correlation again. On each iteration we pick a different set of paired median house values and median incomes.
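The key detail in the bullets above is that rows are resampled as pairs. A minimal sketch on five made-up (income, value) pairs shows the idea: one set of row positions is drawn with replacement and applied to both arrays, so every resampled pair is still a pair from the original data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Five (income, value) pairs; the numbers are made up for illustration
income = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
value = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Draw row positions with replacement, then apply the SAME positions to both arrays
idx = rng.integers(0, len(income), size=len(income))
income_star, value_star = income[idx], value[idx]

# Every resampled pair is still a pair from the original data
pairs = list(zip(income_star, value_star))
print(pairs)
```

Drawing separate positions for each array would break the pairing and push the correlation towards zero, which is not what a bootstrap of the correlation is meant to do.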

Creating the function to resample data

# Creating the function to resample and calculate the correlation coefficient n times with the default value being 1000
def replicate(data1,data2,sims=1000,stat=np.corrcoef):
    rep=np.empty(sims)
    for i in range(sims):
        inds=np.random.choice(a=data1.index,size=len(data1),
                              replace=True)
        rep[i]=stat(data1.loc[inds],data2.loc[inds])[0,1]
    return rep

The above function can be generalized to any 2 variables and, by default, returns the resampled correlations from the underlying data.
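For repeatable results, a seeded variant can be useful. The sketch below is an assumption-laden alternative, not the function used in this article: the name bootstrap_corr and its seed parameter are hypothetical, it uses NumPy's Generator API, and it indexes by row position rather than by index label, so it does not depend on the pandas index being unique.

```python
import numpy as np

def bootstrap_corr(x, y, sims=1000, seed=None):
    """Bootstrap replicates of the Pearson correlation between x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    reps = np.empty(sims)
    for i in range(sims):
        # Resample row positions with replacement; same positions for both arrays
        idx = rng.integers(0, len(x), size=len(x))
        reps[i] = np.corrcoef(x[idx], y[idx])[0, 1]
    return reps

# Synthetic data with a known positive relationship (illustrative only)
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.7 * x + rng.normal(scale=0.7, size=300)

reps = bootstrap_corr(x, y, sims=1000, seed=123)
low, high = np.percentile(reps, [2.5, 97.5])
print(round(low, 3), round(high, 3))
```

Calling the function twice with the same seed returns identical replicates, which makes the reported interval reproducible.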

Calculating the 95% Confidence Interval

# Keyword arguments keep sims and stat from being swapped
corr_resamp=replicate(house.median_income,house.median_house_value,sims=1000,stat=np.corrcoef)
np.percentile(corr_resamp,[2.5,97.5]).round(3)

Result-: The 95% CI for the correlation coefficient is (0.679, 0.697). The interval is narrow and far from zero, which points to a strong positive correlation.

Visualizing the Confidence interval

  • The best way to visualize the result is as a histogram which shows the distribution of the values of the 1,000 simulations/resamples that we just ran
# Creating the figure size
plt.figure(figsize=(12,9))
# Plotting the histogram
_=plt.hist(corr_resamp,color='#FF8E33')
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.title('Distribution of possible correlations',fontsize=16)
# Creating the annotation for the figure
plt.annotate('Low 95 CI',(0.679,0),xytext=(0.679,100),arrowprops=dict(facecolor='black'),fontsize=12)
plt.annotate('High 95 CI',(0.697,0),xytext=(0.697,100),arrowprops=dict(facecolor='black'),fontsize=12)
# Exporting the image
plt.tight_layout()
plt.savefig('Histogram.png')
Fig 1.2- Distribution of correlation coefficients (n_sims=1,000)

As you can see, the resampled correlations are tightly clustered around the observed value of about 0.69. We can repeat this with the number of simulations set to 10,000.
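Increasing the number of simulations mainly reduces the Monte Carlo noise in the interval endpoints. It does not make the interval itself much narrower, because the width is governed by the size of the original sample. The sketch below compares 1,000 and 10,000 resamples on synthetic data. The helper boot_ci is hypothetical and uses a vectorized resampling step instead of a loop, so it is an illustration rather than the function used in this article.

```python
import numpy as np

def boot_ci(x, y, sims, seed):
    """95% percentile bootstrap CI for the Pearson correlation (vectorized)."""
    rng = np.random.default_rng(seed)
    # One row of resampled positions per simulation
    idx = rng.integers(0, len(x), size=(sims, len(x)))
    xs, ys = x[idx], y[idx]
    # Row-wise Pearson correlation
    xs = xs - xs.mean(axis=1, keepdims=True)
    ys = ys - ys.mean(axis=1, keepdims=True)
    r = (xs * ys).sum(axis=1) / np.sqrt((xs**2).sum(axis=1) * (ys**2).sum(axis=1))
    return np.percentile(r, [2.5, 97.5])

# Synthetic data (illustrative only)
rng = np.random.default_rng(2)
x = rng.normal(size=200)
y = 0.7 * x + rng.normal(scale=0.7, size=200)

ci_1k = boot_ci(x, y, sims=1_000, seed=10)
ci_10k = boot_ci(x, y, sims=10_000, seed=11)
print("1,000 sims: ", ci_1k.round(3))
print("10,000 sims:", ci_10k.round(3))
```

The two intervals should be close to each other, with the 10,000-simulation endpoints being the more stable of the two across repeated runs.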

Step 3-: Construct a formal hypothesis test and calculate a p value

  • The null hypothesis is that median house value and median income are not correlated.
  • The test statistic is the observed Pearson correlation coefficient, in this case 0.69.
  • Find the probability of getting a correlation at least as large as the sample correlation under the null hypothesis. That is, what is the probability of getting a correlation of 0.69 or higher if house value and income are not related?
  • If this probability is below the alpha of 5%, we reject the null hypothesis and say that the correlation is statistically significant at the 5% level.
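The test in the bullets above can be written compactly. The notation is introduced here for illustration: \rho is the population correlation, r_obs is the observed sample correlation (0.69), r^* is a correlation computed under the null hypothesis, and \alpha is the significance level. The alternative is taken to be one-sided (a positive correlation), which matches counting the simulated correlations that are at or above the observed value.

```latex
\[
H_0:\ \rho = 0 \qquad \text{vs.} \qquad H_1:\ \rho > 0
\]
\[
p = P\left(r^{*} \ge r_{\text{obs}} \mid H_0\right),
\qquad \text{reject } H_0 \text{ if } p < \alpha = 0.05
\]
```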

This is very similar to the approach in the previous step, but in this case we first permute or “scramble” one of the 2 variable arrays, which breaks the pairing between house values and incomes.

This creates the distribution under which we test the null hypothesis (that is, the 2 variables are no longer related because we have “scrambled” the pairing).
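The effect of scrambling is easy to check on synthetic data. In the sketch below (illustrative values only), two strongly related arrays are generated, one of them is shuffled, and the correlation is computed before and after. Shuffling only one array is enough, since it leaves the values of each variable unchanged and only destroys the pairing.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic, strongly related pair (illustrative values only)
x = rng.normal(size=2000)
y = 0.7 * x + rng.normal(scale=0.7, size=2000)

r_original = np.corrcoef(x, y)[0, 1]

# Shuffle x only: each x value is now paired with an unrelated y value
x_scrambled = rng.permutation(x)
r_scrambled = np.corrcoef(x_scrambled, y)[0, 1]

print(f"original: {r_original:.3f}, scrambled: {r_scrambled:.3f}")
```

The scrambled correlation should land close to zero, while the original one stays high.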

We can then run the simulation 10K times, calculate the correlation coefficient each time, and check how many of the 10K simulated values are at least as large as the test statistic (correlation coefficient=0.69).

# Function to permute one variable array and simulate n times
def hyp_test(data1,data2,sims=1000,stat=np.corrcoef):
    tester=np.empty(sims)
    for i in range(sims):
        x=np.random.permutation(data1)
        tester[i]=stat(x,data2)[0,1]
    return tester
# Calling the function and storing the 10K values of the correlation
# coefficient in an array called tester
tester=hyp_test(house.median_house_value,house.median_income,10000)
# Observed correlation (the test statistic)
obs_corr=np.corrcoef(house.median_house_value,house.median_income)[0,1]
# Calculating the p value as the fraction of simulated correlations
# that are at least as large as the observed one
p_value=np.mean(tester>=obs_corr)

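In this case the estimated p value is 0, since none of the 10K simulations produced a correlation value greater than or equal to 0.69. With 10K simulations, this tells us that the true p value is likely smaller than about 1 in 10,000.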
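A finite number of permutations cannot really establish a p value of exactly 0. A common convention, which is not used in the code above, is to add one to both the count and the number of simulations, in effect counting the observed arrangement as one of the permutations. The sketch below shows this adjustment. The function name permutation_p_value is hypothetical and the null distribution is a synthetic stand-in.

```python
import numpy as np

def permutation_p_value(null_stats, observed):
    """One-sided permutation p-value with the add-one adjustment."""
    null_stats = np.asarray(null_stats)
    # Number of simulated statistics at least as large as the observed one
    exceed = np.sum(null_stats >= observed)
    return (exceed + 1) / (len(null_stats) + 1)

# Stand-in null distribution: 10,000 draws centered on 0 (illustrative only)
rng = np.random.default_rng(3)
null_stats = rng.normal(0.0, 0.01, size=10_000)

p = permutation_p_value(null_stats, observed=0.69)
print(p)
```

With no simulated value reaching the observed one, the adjusted estimate is 1/10,001 rather than 0, which is a more honest statement of what 10K simulations can resolve.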

When we visualize the results below, we see that there is a vanishingly small chance that we could have got a correlation as high as 0.69 by chance (that is, under the null hypothesis that the median house value and median income are not correlated). In other words, the correlation is statistically significant at the 5% significance level.

Fig 1.3- Distribution of correlation coefficient under the null hypothesis (n_sims=10,000)
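A figure like this can be produced along the following lines. This is a sketch, not the original plotting code: it uses a synthetic stand-in for the tester array so that it runs on its own, and it marks the observed correlation with a vertical line.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Stand-in for the `tester` array (illustrative values, not the article's results)
rng = np.random.default_rng(5)
tester_demo = rng.normal(0.0, 0.007, size=10_000)
observed = 0.69  # observed correlation reported above

fig, ax = plt.subplots(figsize=(12, 9))
ax.hist(tester_demo, bins=50, color="#FF8E33")
# Vertical line marking the observed correlation
ax.axvline(observed, color="black", linestyle="--", label="Observed correlation")
ax.set_title("Distribution of correlation coefficient under the null hypothesis", fontsize=16)
ax.set_xlabel("Correlation coefficient")
ax.legend()
fig.tight_layout()
# Use fig.savefig(...) to export the image, or plt.show() with an interactive backend
```

Replacing tester_demo with the tester array from the permutation run would plot the actual null distribution.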
