What is Bootstrapping in Machine Learning?

Abhijeet Pokhriyal
Published in Analytics Vidhya
7 min read · Sep 21, 2021


Much of statistics deals with comparing two things and determining whether they are different in reality, or whether we just happened to observe a difference in the sample we gathered while in reality there is none. This is what makes statistics so exciting. For example, is iphone13's performance really better than iphone10's? Or is the machine learning model that I have built really better than the existing system in place?

For the iphone example, how would you go about it? Suppose you measure performance by the number of milliseconds it takes to open the Uber app. You would open the app a bunch of times on both phones and note the time taken on each run.

iphone13 = [120,100,110,90,115,123,114,103,111,107,103,118,113,112,121,110,117,115,119,108,124,111,112,113,113,11,112,110,109,109]

iphone10 = [123,119,115,112,100,121,125,122,125,123,128,122,119,118,112,130,122,128,112,118,112,126,122,123,127,122,123,119,129,122]

Suppose this is what you observed, and you now want to find out whether there is an actual difference in performance.

Naive Approach

One naive approach would be to just take the means and compare them:

iphone13_mean =108.43

iphone10_mean =120.63

You could conclude right here that iphone13 performs better.
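These numbers are easy to reproduce (a minimal sketch using Python's standard library; the lists are the measurements above):

```python
from statistics import mean

# Milliseconds to open the app, as measured above
iphone13 = [120, 100, 110, 90, 115, 123, 114, 103, 111, 107, 103, 118, 113, 112, 121,
            110, 117, 115, 119, 108, 124, 111, 112, 113, 113, 11, 112, 110, 109, 109]
iphone10 = [123, 119, 115, 112, 100, 121, 125, 122, 125, 123, 128, 122, 119, 118, 112,
            130, 122, 128, 112, 118, 112, 126, 122, 123, 127, 122, 123, 119, 129, 122]

print(round(mean(iphone13), 2))  # 108.43
print(round(mean(iphone10), 2))  # 120.63
```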

BUT can you be sure that this difference would show up again if you took another set of measurements? Can you safely go and tell your manager that yes, there is a difference in performance? What if they conduct their own experiment and find that the two models are actually very similar?

How can you be sure?

Traditional Approach

The second approach is a traditional statistical test.

In this approach you:

  1. Compute the t-statistic
  2. Look up the theoretical distribution in a t-table
  3. If the probability of the observed value of t falls in the upper 5% (an admittedly arbitrary threshold) of the theoretical distribution, conclude that the observed difference is not due to random chance
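The steps above can be sketched with a hand-rolled Welch t-statistic (a sketch only; in practice you would use a library routine such as scipy.stats.ttest_ind, and the cutoff of roughly 2.0 below is the approximate two-sided 5% critical value from a t-table):

```python
from math import sqrt
from statistics import mean, variance

iphone13 = [120, 100, 110, 90, 115, 123, 114, 103, 111, 107, 103, 118, 113, 112, 121,
            110, 117, 115, 119, 108, 124, 111, 112, 113, 113, 11, 112, 110, 109, 109]
iphone10 = [123, 119, 115, 112, 100, 121, 125, 122, 125, 123, 128, 122, 119, 118, 112,
            130, 122, 128, 112, 118, 112, 126, 122, 123, 127, 122, 123, 119, 129, 122]

# Step 1: compute the t-statistic (Welch's version, not assuming equal variances)
n1, n2 = len(iphone13), len(iphone10)
se = sqrt(variance(iphone13) / n1 + variance(iphone10) / n2)
t = (mean(iphone13) - mean(iphone10)) / se

# Steps 2-3: compare |t| against the t-table critical value at the 5% level
# (roughly 2.0 for this many degrees of freedom)
print(round(t, 2), "significant" if abs(t) > 2.0 else "not significant")
```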

For people who have just been introduced to statistical tests, these tests can feel arbitrary, and that is because they usually act as black boxes and rely on poorly explained thresholds like 5%.

The question is: can we do better?


Bootstrapping was proposed by Bradley Efron (I guess not related to Zac Efron) in 1979 [EFRON_1979]. He noted that the traditional approaches are parametric and rely on normal distribution theory:

“The most obvious defect of this [traditional] procedure is the use of normal distribution theory to determine the critical value at which the observed becomes “significant.” Nonparametric statistics, mainly developed since 1950, gives an answer that does not depend upon normal theory:”[EFRON_1979]

What did he propose instead?

Well, he said, if we really want to be sure that iphone13 is better than iphone10, let's first pretend they aren't, and that all the measurements we have made are for, let's say, a base iphone model called iphone_base that Apple cunningly re-brands and sells as both iphone10 and iphone13.

So now we have:

iphone_base= [120, 100, 110, 90, 115, 123, 114, 103, 111, 107, 103, 118, 113, 112, 121, 110, 117, 115, 119, 108, 124, 111, 112, 113, 113, 11, 112, 110, 109, 109, 123, 119, 115, 112, 100, 121, 125, 122, 125, 123, 128, 122, 119, 118, 112, 130, 122, 128, 112, 118, 112, 126, 122, 123, 127, 122, 123, 119, 129, 122]

Now if we consider all possible equal splits of this iphone_base (equal because we had the same number of measurements for each phone), then only one out of

N = 60! / (30! 30!)

would correspond to what we actually observed. All the other combinations would mix measurements of iphone13 and iphone10.

In each of those N combinations, we calculate the difference between the mean of split A and the mean of split B. Since we have mixed the measurements, we expect the differences to be close to zero for most splits. If the observed difference falls in the extreme 5% of all N differences calculated, then we can say that the observed difference is statistically significant.

Now the problem is that 60! / (30! 30!) is more than 10^17 combinations, and that for just 60 values!
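You can check exactly how big that number is (Python's math.comb computes the binomial coefficient exactly):

```python
from math import comb

# Number of ways to split 60 pooled measurements into two groups of 30
N = comb(60, 30)  # == 60! / (30! * 30!)
print(N)          # well over 10**17 distinct splits
```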

As Efron put it, "The nonparametric method pays a stiff computational price for its freedom from normal distribution theory." YEAH, let's call it stiff.

BUT he did drive home the point that if we are trying to be certain about our conclusions, we can simply throw compute at the problem instead of making parametric assumptions.

Okay, we get it: here is a cool method that is really not practical, so basically a dud, and we have to go back to those black-box t-tests, right?

NO. This is where things get interesting and stochasticity comes to the rescue.

The idea: what if we don't evaluate all possible combinations, but only some number n of randomly selected combinations, where n ≪ N? Will that work?


The key here is random selection with replacement.

For the data we have, suppose we decide n = 10,000.

Then we sample from our pooled observations with replacement, split the sample into 2 equal parts, and calculate and record the difference between the means of the two parts.
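A minimal sketch of that resampling loop (assuming the pooled iphone_base list from above; the fixed seed and the variable names are my own, added for reproducibility):

```python
import random
from statistics import mean, stdev

# The 60 pooled measurements (iphone13 followed by iphone10)
iphone_base = [120, 100, 110, 90, 115, 123, 114, 103, 111, 107, 103, 118, 113, 112, 121,
               110, 117, 115, 119, 108, 124, 111, 112, 113, 113, 11, 112, 110, 109, 109,
               123, 119, 115, 112, 100, 121, 125, 122, 125, 123, 128, 122, 119, 118, 112,
               130, 122, 128, 112, 118, 112, 126, 122, 123, 127, 122, 123, 119, 129, 122]

random.seed(42)  # any seed works; fixed so the run is repeatable
n = 10_000
diffs = []
for _ in range(n):
    # draw 60 values with replacement from the pooled measurements
    resample = random.choices(iphone_base, k=60)
    # split into two equal halves and record the difference in means
    diffs.append(mean(resample[:30]) - mean(resample[30:]))

print(round(mean(diffs), 2))   # close to 0
print(round(stdev(diffs), 2))  # roughly 4
```

Plotting a histogram of `diffs` gives the bootstrapped distribution discussed below.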

What does this bootstrapped difference look like?

This is what we expected, right? Because we combined the measurements from the two iphones, a random sample will on average include measurements from both phones, so the difference in means should be close to zero.

Now what we had observed earlier was

iphone13_mean =108.43

iphone10_mean =120.63

The difference here is ~12 milliseconds. How likely are we to observe such a difference in the graph above? Quite unlikely, implying that the performances are indeed different. Note also that had the observed difference been ~2 milliseconds, we would not have been able to conclude that the phones were actually different, because a 2-millisecond difference is quite likely according to the distribution above.

Instead of eyeballing the graph, we can assume that the bootstrapped differences are normally distributed with mean 0 and standard deviation 4.04 (calculated from the bootstrapped differences). Using the cumulative distribution function of the normal distribution, we can then find the exact probability of obtaining a difference of ±12 milliseconds.

This value comes to around 0.0025, so there is only a 0.25% chance of having observed this 12-millisecond difference had these observations really come from a single iphone_base model.
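That probability can be reproduced with the normal CDF (a sketch using math.erf from the standard library; 4.04 is the standard deviation quoted above):

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution with mean mu and std sigma."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

observed = 120.63 - 108.43  # ~12 ms difference in the observed means
sigma = 4.04                # std of the bootstrapped differences
# two-sided: probability of a difference at least this extreme in either direction
p = 2 * (1 - normal_cdf(observed, 0, sigma))
print(round(p, 4))          # ~0.0025
```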

Why does bootstrapping work?

Oh boy, you really are curious, aren't you?

Let me try and fail.

Imagine you had access to every iphone13 and iphone10 ever manufactured, and you had taken these measurements for each of those phones right after they were manufactured and ready to ship. You would then have access to every possible measurement of the performance.

Since you have all possible measurements, even the naive approach discussed above will do. Just take all possible differences (if you can collect every iphone, I bet you can do this too; play along), take their mean, and be done with it.

Now let's say you draw random samples from all possible difference calculations. For each random sample you would get a slightly different estimate (mean) of the difference compared to the naive approach. This deviation is called sampling error, and this distribution of the estimate is called the sampling distribution of the estimate (see the left side of the chart below, where theta is the difference in performance of iphone13 and iphone10, and theta¹, theta², etc. are the differences observed in each random sample from the population).

So, to be really sure there is a difference between iphone13 and iphone10 performance, we can look at the sampling distribution and calculate how likely it is to observe the difference we actually observed.

BUT the problem is that NO, we can't collect all possible iphone13s and iphone10s, and NO, we cannot calculate all possible differences, and NO, we can't keep drawing samples from the true population.

THIS is where bootstrapping comes in.

Bootstrapping helps us estimate the standard error by drawing random samples with replacement from the ONE and ONLY sample we usually have in real life (see the right side of the chart below).
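As a quick illustration, the bootstrap estimate of the standard error of iphone10's mean lands close to the textbook formula s/√n (a sketch; the seed and the number of resamples are my own choices):

```python
import random
from statistics import mean, stdev

iphone10 = [123, 119, 115, 112, 100, 121, 125, 122, 125, 123, 128, 122, 119, 118, 112,
            130, 122, 128, 112, 118, 112, 126, 122, 123, 127, 122, 123, 119, 129, 122]

random.seed(0)
# bootstrap: resample the ONE sample with replacement, many times
boot_means = [mean(random.choices(iphone10, k=len(iphone10))) for _ in range(5000)]
boot_se = stdev(boot_means)
# analytic standard error of the mean: sample std / sqrt(n)
analytic_se = stdev(iphone10) / len(iphone10) ** 0.5

print(round(boot_se, 2), round(analytic_se, 2))  # the two agree closely
```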

The theoretical proof is beyond my explanation, as it becomes too mathematical too quickly, but it can be found here for a delightful read ;) [SWANEPOEL]

How bootstrapping works

Have any further questions? Feel free to reach out to me on LinkedIn.






Abhijeet Pokhriyal
School of Data Science @ University of North Carolina — Charlotte