Bootstrap Confidence Intervals in Python

Wenjun
Nov 5 · 2 min read

Using bootstrapping to construct a confidence interval for the mean difference in Python


Bootstrapping means random sampling with replacement. It draws many resamples, with replacement, from an observed sample and computes the effect size of interest on each one. The spread of those resampled values approximates the sampling distribution of the effect size, so we can use bootstrapping to estimate a confidence interval for the difference in means between two samples.
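To make the idea concrete, here is a minimal NumPy sketch of a percentile bootstrap for a difference in means. The function name, the seed handling, and the data are made up for illustration. It also resamples each group separately, which is one common choice and not necessarily the one used in the function later in this post.

```python
import numpy as np


def percentile_bootstrap_mean_diff(a, b, repetitions=1000, alpha=0.05, seed=None):
    """Percentile bootstrap CI for mean(b) - mean(a).

    Hypothetical helper: each group is resampled on its own,
    with replacement, at its original size.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)

    diffs = np.empty(repetitions)
    for i in range(repetitions):
        resample_a = rng.choice(a, size=a.size, replace=True)
        resample_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = resample_b.mean() - resample_a.mean()

    # The interval is read off the percentiles of the resampled differences
    lower, upper = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper


# Made-up illustrative data: two groups with slightly different means
data_rng = np.random.default_rng(0)
group_a = data_rng.normal(loc=10.0, scale=2.0, size=200)
group_b = data_rng.normal(loc=11.0, scale=2.0, size=200)

lower, upper = percentile_bootstrap_mean_diff(group_a, group_b, seed=1)
observed_diff = group_b.mean() - group_a.mean()
print("observed difference:", round(observed_diff, 2))
print("95% bootstrap CI:", (round(lower, 2), round(upper, 2)))
```

The interval comes straight from the empirical percentiles of the resampled differences, so no distributional formula is involved.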

You may wonder why we don’t use a t-test for this task. One of the assumptions of the t-test is that the data are normally distributed, and sometimes we cannot or don’t want to make an assumption about the distribution.


I wrote a Python function that constructs a CI for the mean difference with bootstrapping. Here it is:

import numpy as np


def bootstrap_ci(df, variable, classes, repetitions=1000, alpha=0.05, random_state=None):
    # keep only the observation column and the group column
    df = df[[variable, classes]]
    bootstrap_sample_size = len(df)
    # seed one generator up front: passing the same integer seed to every
    # df.sample call would return the identical resample each time
    rng = np.random.RandomState(random_state)

    mean_diffs = []
    for i in range(repetitions):
        bootstrap_sample = df.sample(n=bootstrap_sample_size, replace=True, random_state=rng)
        group_means = bootstrap_sample.groupby(classes)[variable].mean()
        # groups come back in sorted label order: second group minus first group
        mean_diffs.append(group_means.iloc[1] - group_means.iloc[0])

    # confidence interval
    left = np.percentile(mean_diffs, alpha / 2 * 100)
    right = np.percentile(mean_diffs, 100 - alpha / 2 * 100)

    # point estimate
    group_means = df.groupby(classes)[variable].mean()
    point_est = group_means.iloc[1] - group_means.iloc[0]
    print('Point estimate of difference between means:', round(point_est, 2))
    print((1 - alpha) * 100, '%', 'confidence interval for the difference between means:', (round(left, 2), round(right, 2)))
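A note on seeding, sketched with a made-up one-column frame: `DataFrame.sample` given an integer `random_state` returns the same rows on every call, so a resampling loop needs a single generator whose state advances between draws. The column name and sizes below are only for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with 100 distinct values
df = pd.DataFrame({"value": np.arange(100)})

# Same integer seed on every call: both draws pick exactly the same rows
draw_1 = df.sample(n=len(df), replace=True, random_state=42)
draw_2 = df.sample(n=len(df), replace=True, random_state=42)
same_seed_identical = draw_1["value"].tolist() == draw_2["value"].tolist()

# One shared RandomState: its internal state advances, so the draws differ
rng = np.random.RandomState(42)
draw_3 = df.sample(n=len(df), replace=True, random_state=rng)
draw_4 = df.sample(n=len(df), replace=True, random_state=rng)
shared_rng_identical = draw_3["value"].tolist() == draw_4["value"].tolist()

print("identical with repeated integer seed:", same_seed_identical)
print("identical with one shared generator:", shared_rng_identical)
```

If every resample were identical, all the bootstrapped differences would be equal and the interval would collapse to a single point.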

Input

df: a data frame that includes the observations of the two samples
variable: the name of the column that holds the observations
classes: the name of the column that holds the group assignment (this column should contain exactly two different group names)

These are some secondary parameters.
repetitions: number of bootstrap resamples to draw. The default is 1000. (I set it to 1000 to get the result faster; you probably want to increase this when you run it.)
alpha: the significance level. The function reports a (1 - alpha) * 100% confidence interval, so the default of 0.05 gives a 95% interval.
random_state: seed for the random number generator, so that results can be reproduced. The default is None.
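The function subtracts the first group's mean from the second group's mean, so the sign of the result depends on how pandas orders the group labels. Here is a small sketch of the expected input layout and of that ordering. The column names and values are made up, and the commented call at the end assumes `bootstrap_ci` from above has been defined.

```python
import pandas as pd

# Made-up data in long format: one row per observation,
# one column of values and one column of group labels
df = pd.DataFrame({
    "score": [4.0, 5.0, 6.0, 7.0, 8.0, 9.0],
    "group": ["treatment", "treatment", "treatment", "control", "control", "control"],
})

# groupby sorts the labels by default, so "control" comes before "treatment"
group_means = df.groupby("group")["score"].mean()
first_group, second_group = group_means.index[0], group_means.index[1]

# The reported difference is (second group) - (first group) in that sorted order
point_est = group_means.iloc[1] - group_means.iloc[0]
print(f"{second_group} - {first_group} = {point_est}")

# With the function from this post defined, the call would look like:
# bootstrap_ci(df, "score", "group", repetitions=5000, alpha=0.05, random_state=0)
```

Renaming the groups can therefore flip the sign of both the point estimate and the interval, which is worth checking before interpreting the output.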

A .py file and a notebook associated with this function can also be found on my GitHub.
