# Bootstrap method (non-distributed)

## Bootstrap method: A practical walk-through

1. First, decide on the number k of bootstrap resamples to use. A reasonable choice is at least a hundred, depending on how the bootstrap estimates will be used.
2. For each bootstrap resample, build a new data set by sampling from the original data set with replacement, so that the new data set has the same number of examples as the reference data set.
3. Compute the metric of interest on this resampled data set.
4. Use the k values of the metric to approximate its sampling distribution, for instance to build a confidence interval or to conclude on the statistical significance of a test.
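The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code; the function and variable names (`bootstrap_metric`, `sample`) are hypothetical.

```python
import random

def bootstrap_metric(data, metric, k=1000, seed=0):
    """Compute `metric` on k bootstrap resamples of `data`."""
    rng = random.Random(seed)
    n = len(data)
    values = []
    for _ in range(k):
        # Step 2: sample n examples with replacement from the original data.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        # Step 3: compute the metric of interest on the resample.
        values.append(metric(resample))
    # Step 4: the k values approximate the metric's sampling distribution.
    return values

# Example: bootstrap distribution of the mean of a small sample.
sample = [1, 2, 2, 3, 5, 8, 13]
means = bootstrap_metric(sample, lambda d: sum(d) / len(d), k=200)
```

From `means` one can then read off quantiles to form a confidence interval for the mean.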

## An illustrated example

*Figure: empirical distribution of the number of buyers for populations A and B, estimated with 1,000 bootstraps (source: synthetic data).*

*Figure: histogram of the difference in the number of buyers between the two populations; color indicates whether each bootstrap difference is positive or negative (source: synthetic data).*

# Bootstrap method (distributed)

Several situations call for a distributed version of the bootstrap:

• The data set is distributed across machines and is too big to fit into a single machine's memory
• The data set is distributed across machines and, even though it could fit into a single machine, it is costly and impractical to collect everything on one machine
• The data set is distributed across machines and even the values of a single observation are spread across machines. This happens, for instance, when the observation of interest is "how much did the user spend across all their purchases in the last 7 days" but the data set is stored at purchase level. To apply the previous methods, we would first need to regroup the purchase-level data set into a user-level data set, which can be costly.
Two sources of approximation error should be kept in mind:

• An approximation error arises because the number of bootstraps is finite (the bootstrap method converges only as the number of resamples goes to infinity)
• An approximation error that biases the quantile estimates arises because the resampling is done only approximately: for a given resample, the sum of the weights is not exactly equal to n (this holds only on average). For further reading, check out this great blog post and the associated scientific paper.
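The approximate resampling described in the last bullet, where the weights sum to n only on average, is consistent with the Poisson bootstrap: each observation receives an independent Poisson(1) weight per resample, so all k resamples can be built in a single streaming pass over the data with no need to collect it on one machine. A minimal single-machine sketch under that assumption follows; the helper names (`poisson1`, `poisson_bootstrap_sums`) are hypothetical.

```python
import math
import random

def poisson1(rng, lam=1.0):
    """Draw from Poisson(lam) using Knuth's algorithm (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1

def poisson_bootstrap_sums(values, k=1000, seed=0):
    """Approximate bootstrap sums with independent Poisson(1) weights.

    Each observation is visited once, so the loop parallelizes naturally
    across machines (each machine computes partial sums, which are then
    added). The weights sum to n only on average, which is the source of
    the quantile bias mentioned above.
    """
    rng = random.Random(seed)
    sums = [0.0] * k
    for v in values:  # one pass over the (possibly distributed) data
        for j in range(k):
            sums[j] += poisson1(rng) * v
    return sums

# Example: 100 observations of value 1.0; each bootstrap sum has mean 100.
sums = poisson_bootstrap_sums([1.0] * 100, k=500)
```

In a real distributed setting the inner loop would run per partition, with partial `sums` vectors combined by elementwise addition.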