Bootstrapping in Statistics
The concept explained in layperson's terms
Say that you want to know the average height of all people living in Chicago. Assuming that you don’t have the time to survey every single Chicago resident, you can randomly sample a group of Chicago residents (say 500 people) to estimate your desired statistic.
- Population: all Chicago residents, ~2.7 million people
- Desired Statistic: mean height of population (all Chicago residents)
- Sample: 500 randomly sampled Chicago residents
Assume that you calculate the mean height of the sample of 500 Chicago residents to be 5 ft 5 in. How good an estimate is this of the mean height of the entire Chicago population? This is where the bootstrap comes in.
To use bootstrapping, we have to assume that the sample is an adequate model of the original population of interest.
We treat our sample as its own little population. We then re-sample (with replacement) many times from the original sample of 500 residents, each time drawing 500 heights and computing their mean. Then, using the differences between the re-sampled mean heights and the known mean height of the 500 residents, we can infer how accurate an estimate the 5 ft 5 in figure (the mean height of the 500 residents) is of the mean height of all Chicago residents.
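The re-sampling procedure described above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not a definitive implementation: the heights are simulated (a normal distribution with mean 65 in and standard deviation 4 in is an assumption, since no real survey data is given), and the names `sample`, `n_resamples`, and `deviations` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical stand-in for the survey: 500 heights in inches.
# The mean (65 in = 5 ft 5 in) and spread (4 in) are assumptions for illustration.
sample = rng.normal(loc=65.0, scale=4.0, size=500)
sample_mean = sample.mean()

n_resamples = 2000  # assumed value; more re-samples give a smoother picture
deviations = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw 500 heights from the sample itself, with replacement.
    resample = rng.choice(sample, size=sample.size, replace=True)
    # Record how far this re-sampled mean lands from the sample mean.
    deviations[i] = resample.mean() - sample_mean

# The middle 95% of the deviations shows how far a re-sampled mean
# typically strays from the mean of the original 500 residents.
low, high = np.percentile(deviations, [2.5, 97.5])
print(f"sample mean: {sample_mean:.2f} in")
print(f"95% of re-sampled means fall within [{low:+.2f}, {high:+.2f}] in of it")
```

The spread of `deviations` is what the bootstrap uses as a stand-in for how far the sample mean itself might sit from the true population mean.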
To generalize this, bootstrapping is used to provide an estimate of the sampling distribution of the statistic in question. This should not be confused with estimating the statistic or the original population distribution.
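One way to see that the bootstrap approximates a sampling distribution, rather than the statistic itself, is to compare the spread of the bootstrapped means with the textbook standard error of the mean, s/√n. The sketch below does that on simulated heights; the data, the number of re-samples, and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical sample of 500 heights in inches (simulated, not real data).
sample = rng.normal(loc=65.0, scale=4.0, size=500)
n = sample.size

# Build 2000 re-samples at once: each row holds n indices drawn with replacement.
n_resamples = 2000  # assumed value
idx = rng.integers(0, n, size=(n_resamples, n))
boot_means = sample[idx].mean(axis=1)

# Bootstrap estimate of the standard error: the spread of the re-sampled means.
boot_se = boot_means.std(ddof=1)

# Textbook standard error of the mean, for comparison.
analytic_se = sample.std(ddof=1) / np.sqrt(n)

print(f"bootstrap SE: {boot_se:.3f} in")
print(f"analytic SE:  {analytic_se:.3f} in")
```

For the mean, the two numbers should land close together. The bootstrap earns its keep for statistics such as the median, where no equally simple formula is available.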
“Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution.” — Wikipedia