The Central Limit Theorem lets us treat the sampling distribution of a statistic such as the mean as approximately normal, provided we are able to replicate our experiment many times. This is where bootstrapping comes in: by resampling our current dataset a large number of times, we can approximate that sampling distribution without running new experiments.
Below, I will cover the classic nonparametric bootstrap.
First, let’s load our data from the Starcraft dataset.
You can see the head of the Starcraft data above. For the purposes of this post, we will begin by extracting the APM (Actions per Minute) column, and graphing the distribution.
As you can see, the distribution is positively skewed, not resembling a normal distribution at all.
So, what should a bootstrap function look like? It should follow this framework:
for specified number of bootstrap iterations:
    create a bootstrap sample by randomly selecting observations with replacement from your sample (same size as sample)
    calculate the statistic of interest on the bootstrap sample
calculate lower and upper percentile bounds of the bootstrap statistics according to threshold
A bootstrap function essentially takes your current dataset and randomly resamples from the available data, with replacement. It then calculates the statistic of interest on the random sample and saves the result in a list. The procedure is repeated, appending each new result to the list, until the specified number of iterations has been reached. Below is a bootstrap function coded in Python, for 1000 bootstrap iterations.
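As a minimal sketch of the procedure just described (not necessarily the exact function used for this post), the loop can be written with NumPy. The function name `bootstrap`, its parameters, and the synthetic right-skewed data standing in for the APM column are all assumptions made for illustration.

```python
import numpy as np


def bootstrap(data, stat_func=np.mean, n_iterations=1000, seed=None):
    """Return stat_func evaluated on each of n_iterations bootstrap resamples."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    stats = np.empty(n_iterations)
    for i in range(n_iterations):
        # Draw n observations with replacement (same size as the original sample).
        resample = rng.choice(data, size=n, replace=True)
        # Store the statistic of interest for this resample.
        stats[i] = stat_func(resample)
    return stats


# Hypothetical stand-in for the APM column: synthetic, positively skewed values.
rng = np.random.default_rng(0)
apm = rng.lognormal(mean=4.6, sigma=0.4, size=500)

apm_boot = bootstrap(apm, stat_func=np.mean, n_iterations=1000, seed=1)
print(apm_boot.shape)
```

Passing the statistic as `stat_func` keeps the same loop usable for a median or a standard deviation, and the optional `seed` makes a run reproducible.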
Now that we have our bootstrap function, let’s use it to generate bootstrapped means from the APM data and find the 95% confidence interval for the mean APM. Once the bootstrapped statistics are generated (below: stored in apm_boot), we can find the interval bounds with scipy.stats.scoreatpercentile (one percentile at a time) or np.percentile (several percentiles at once). For a 95% interval, the bounds are the 2.5th and 97.5th percentiles.
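The percentile step could look something like the sketch below. It uses synthetic skewed data as a hypothetical stand-in for the APM column and a vectorized resampling shortcut, so the numbers it prints are not the Starcraft results.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in for the APM column: synthetic, positively skewed values.
apm = rng.lognormal(mean=4.6, sigma=0.4, size=500)

# 1000 bootstrap resamples at once: each row holds n indices drawn with replacement.
idx = rng.integers(0, len(apm), size=(1000, len(apm)))
apm_boot = apm[idx].mean(axis=1)  # one bootstrapped mean per resample

# 95% percentile interval: the 2.5th and 97.5th percentiles of the bootstrapped means.
lower, upper = np.percentile(apm_boot, [2.5, 97.5])
print(f"sample mean: {apm.mean():.2f}")
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Passing both percentiles in a single list is what makes np.percentile convenient here: one call returns both bounds.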
Why do we assume a normal distribution when bootstrapping? If you graph the distribution of the bootstrapped statistics, you will find that it generally looks more and more like a normal curve as the number of resamples grows.
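One way to check this is sketched here. It is an assumption about what such a function might look like rather than the original code: the name `bootstrap_correlation` and the synthetic stand-in data are hypothetical.

```python
import numpy as np


def bootstrap_correlation(data, n_iterations=10_000, seed=None):
    """Correlation between two independent resamples, repeated n_iterations times."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    corrs = np.empty(n_iterations)
    for i in range(n_iterations):
        # Two separate samples, each drawn with replacement from the data.
        sample_a = rng.choice(data, size=n, replace=True)
        sample_b = rng.choice(data, size=n, replace=True)
        # Pearson correlation between the two generated samples.
        corrs[i] = np.corrcoef(sample_a, sample_b)[0, 1]
    return corrs


# Hypothetical stand-in for the APM column: synthetic, positively skewed values.
rng = np.random.default_rng(0)
apm = rng.lognormal(mean=4.6, sigma=0.4, size=200)

corr_boot = bootstrap_correlation(apm, n_iterations=10_000, seed=1)

# Bin the results; these counts are what a histogram plot would draw.
counts, bin_edges = np.histogram(corr_boot, bins=30)
print(corr_boot.mean(), corr_boot.std())
```
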
The above function randomly draws two separate samples and stores the correlation between them.
When you graph the distribution of these stored correlations, you find that with a large number of iterations it approaches a normal shape. This shows up when graphing the results from above, where the number of iterations is high, at 10,000.
As you can see, the distribution is fairly normal, which supports our assumption and lets us use bootstrapping in our analysis.