Introduction
Bootstrap can estimate the accuracy of a statistic. One use case is estimating a confidence interval for a mean. When no convenient formula exists for the confidence interval of a more complicated statistic, say the median or the 75th percentile, bootstrap can help as well.
Bootstrap can also estimate the accuracy of a model. One use case is estimating the standard errors of the coefficients of a linear regression. When no convenient formula exists for the standard errors of a more complicated model, say polynomial regression, bootstrap can help too.
In this post, we focus on the first use case, estimating a confidence interval for a mean, and use a concrete Python example to show the flexibility that bootstrap provides.
What is a bootstrap?
Bootstrap is a statistical tool for assessing the accuracy of an estimate or a model. It does so by repeatedly sampling data from the original dataset with replacement. Keep in mind that each bootstrap sample is the same size as the original dataset. The keyword is replacement: the same observation, say “B”, can be chosen multiple times from the original dataset and added to the bootstrap sample, as illustrated in the image above (“B”, “B”, “B”).
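To make sampling with replacement concrete, here is a minimal sketch on a toy dataset. The five labels and the fixed random_state are assumptions made for illustration only; they are not part of the insurance example that follows.

```python
import pandas as pd

# a toy dataset of five labelled observations (hypothetical, for illustration only)
original = pd.Series(["A", "B", "C", "D", "E"])

# draw one bootstrap sample: same size as the original, sampled with replacement
# random_state is fixed only so that the draw is reproducible
bootstrap_sample = original.sample(n=len(original), replace=True, random_state=0)

# the drawn labels; some may repeat and some may be missing
print(bootstrap_sample.tolist())

# how many times each original observation was drawn
print(bootstrap_sample.value_counts())
```

Because the draw is with replacement, a label can show up more than once while another does not show up at all, yet the bootstrap sample always has as many observations as the original.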
Bootstrap a 95% confidence interval
Import libraries
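# scientific computing
import pandas as pd
import numpy as np
# plotting (plt is used for the confidence interval lines later on)
import matplotlib.pyplot as plt
# generalized linear models
import statsmodels.api as sm
from statsmodels.formula.api import glm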
Load the insurance dataset
# load the insurance dataset
# (index_col = 0 treats the first column of the CSV as the row index)
df = pd.read_csv('insurance.csv', index_col = 0)
# include only "age" as the explanatory variable and "charges" as the response variable
df = df[['age', 'charges']]
# take a look at the first five rows
df.head()
This is what the data looks like
Create a bootstrap sample by repeatedly sampling data from the original dataset with replacement.
# use the same number of observations as the original dataset
sample_size = len(df)
# create a bootstrap sample of sample_size with replacement
df_bootstrap_sample = df['charges'].sample(n = sample_size, replace = True)
Calculate the sample statistic. Here we choose the mean as the sample statistic. Feel free to calculate another statistic of your choice, such as the median or the 75th percentile, by tweaking the code below.
# calculate the bootstrap sample mean
sample_mean = df_bootstrap_sample.mean()
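As a hedged illustration of that tweak, the sketch below computes the median and the 75th percentile of a single bootstrap sample. So that it runs on its own, it uses a synthetic, right-skewed series as a stand-in for df['charges']; the lognormal parameters and the names charges, sample_median and sample_p75 are assumptions for this example, not values from the insurance data.

```python
import numpy as np
import pandas as pd

# synthetic right-skewed stand-in for df['charges'] (made-up numbers)
rng = np.random.default_rng(42)
charges = pd.Series(rng.lognormal(mean=9.0, sigma=1.0, size=500))

# one bootstrap sample, same size as the data, drawn with replacement
bootstrap_sample = charges.sample(n=len(charges), replace=True, random_state=42)

# swap the mean for other statistics
sample_median = bootstrap_sample.median()
sample_p75 = bootstrap_sample.quantile(q=0.75)

print(sample_median, sample_p75)
```

Only the last step changes: the resampling itself is identical whichever statistic is computed.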
Repeat the above two steps many times, 1000 in this example, to get the distribution of the sample statistic.
def create_bootstrap_samples(sample_size = len(df), n_samples = 1000):
    # create a list for sample means
    sample_means = []
    # loop n_samples times
    for i in range(n_samples):
        # create a bootstrap sample of sample_size with replacement
        df_bootstrap_sample = df['charges'].sample(n = sample_size, replace = True)
        # calculate the bootstrap sample mean
        sample_mean = df_bootstrap_sample.mean()
        # add this sample mean to the sample means list
        sample_means.append(sample_mean)
    return pd.Series(sample_means)
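The same loop can be generalized so that the statistic is passed in as a function. The version below is a sketch, not the function used in this post: bootstrap_statistic, its parameters and the synthetic exponential data are hypothetical names and values chosen for illustration, and it uses NumPy's random generator rather than pandas' sample.

```python
import numpy as np

def bootstrap_statistic(data, statistic=np.mean, n_samples=1000, seed=None):
    """Return the statistic computed on n_samples bootstrap samples of data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    estimates = np.empty(n_samples)
    for i in range(n_samples):
        # resample n observations with replacement
        sample = rng.choice(data, size=n, replace=True)
        # compute the chosen statistic on this bootstrap sample
        estimates[i] = statistic(sample)
    return estimates

# usage on synthetic data (made-up numbers): bootstrap distribution of the median
data = np.random.default_rng(0).exponential(scale=10.0, size=200)
medians = bootstrap_statistic(data, statistic=np.median, n_samples=500, seed=1)
print(medians.mean(), medians.std())
```

Passing the statistic as an argument keeps the resampling logic in one place, so switching from the mean to the median or a percentile does not require a new function.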
Plot the distribution of the bootstrap sample means.
# create bootstrap samples
sample_means = create_bootstrap_samples()
# plot the distribution
sample_means.plot(kind = 'hist', bins = 20, title = 'Distribution of the Bootstrap Sample Means')
Get the 95% confidence interval from the distribution.
# get the lower bound of the confidence interval
ci_lower = sample_means.quantile(q = 0.025)
# get the upper bound of the confidence interval
ci_higher = sample_means.quantile(q = 0.975)
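For the mean, a textbook formula does exist, which gives a handy sanity check. The sketch below compares the percentile bootstrap interval with the normal-approximation interval (mean ± 1.96 standard errors). It runs on synthetic data, and the lognormal parameters, sample size and number of resamples are assumptions for illustration rather than properties of the insurance dataset.

```python
import numpy as np

# synthetic right-skewed data standing in for the charges (made-up numbers)
rng = np.random.default_rng(0)
data = rng.lognormal(mean=9.0, sigma=1.0, size=1000)

# percentile bootstrap: 2.5th and 97.5th percentiles of the bootstrap means
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(2000)
])
boot_lower, boot_upper = np.quantile(boot_means, [0.025, 0.975])

# normal-approximation interval: mean +/- 1.96 * standard error
se = data.std(ddof=1) / np.sqrt(len(data))
formula_lower = data.mean() - 1.96 * se
formula_upper = data.mean() + 1.96 * se

print("bootstrap:", boot_lower, boot_upper)
print("formula:  ", formula_lower, formula_upper)
```

With a sample of this size the two intervals should land close to each other. The bootstrap version, however, carries over unchanged to statistics such as the median, for which no equally simple formula is available.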
Plot the confidence interval.
# plot the distribution
sample_means.plot(kind = 'hist', bins = 20, title = 'Confidence Interval of the Sample Means')
# add the lower bound and upper bound of the confidence interval
plt.axvline(ci_lower, color = 'red', ls = '--')
plt.axvline(ci_higher, color = 'red', ls = '--')
Nice! Now we have a 95% confidence interval for the mean of the insurance charges. We can make this claim: we are 95% confident that the true mean insurance charge is between $12,603 and $13,970.
References
- An Introduction to Statistical Learning with Applications in R
- Bootstrapping Main Ideas!!!: https://youtu.be/Xz0x-8-cgaQ
- Using Bootstrapping to Calculate p-values!!!: https://youtu.be/N4ZQQqyIf6k