Criteo R&D Blog

Photo by Jens Lelie on Unsplash

Why your A/B-test needs confidence intervals

When can you conclude on this A/B-test?

How to conclude on the significance of an A/B-test

Some special cases for additive metrics

The metric contains only zeros or ones

Experimental data (source: synthetic data)
Expected distribution (source: synthetic data)

General case for an additive metric

The metric has no special property

Bootstrap method (non-distributed)

Bootstrap method: A practical walk-through

  1. First, decide on the number k of bootstraps you want to use. At least a hundred is a reasonable starting point; the right number depends on how the bootstraps will be used.
  2. For every bootstrap, build a new data set from the initial one by random sampling with replacement, with the same number of examples as the reference data set.
  3. Compute the metric of interest on this new data set.
  4. Use the k values of the metric to conclude on the statistical significance of the test.
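
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `bootstrap_metric` and `percentile_interval`, the synthetic 0/1 data, the choice of k = 200 and the 95% percentile interval are all hypothetical choices made for the example.

```python
import random

def bootstrap_metric(data, metric, k=200, seed=0):
    """Return k bootstrap values of `metric` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(data)
    values = []
    for _ in range(k):
        # Step 2: resample n observations with replacement.
        resample = rng.choices(data, k=n)
        # Step 3: compute the metric of interest on the resample.
        values.append(metric(resample))
    return values

def percentile_interval(values, alpha=0.05):
    """Empirical percentile interval of the bootstrap values (assumed rule)."""
    ordered = sorted(values)
    low_index = int((alpha / 2) * (len(ordered) - 1))
    high_index = int((1 - alpha / 2) * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]

# Synthetic data: 1 = the user bought, 0 = the user did not.
data_rng = random.Random(42)
data = [1 if data_rng.random() < 0.1 else 0 for _ in range(2000)]

# Step 1: choose k. Step 4: summarise the k values of the metric.
boot = bootstrap_metric(data, sum, k=200)
low, high = percentile_interval(boot)
print(f"observed buyers: {sum(data)}, interval: [{low}, {high}]")
```

The metric here is the number of buyers (`sum`), but any function of a data set can be passed in its place, which is what makes the method usable when the metric has no special property.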

How to use bootstraps in the case of an A/B-test

How to conclude on the significance of the A/B-test?

An illustrated example

Empirical distribution of the number of buyers for populations A and B. 1000 bootstraps were used. (source: synthetic data)
Histogram of the difference in the number of buyers between the two populations. Color indicates whether the bootstrapped difference is positive or negative. (source: synthetic data)
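
A comparable experiment can be reproduced with the sketch below. It is only an illustration: the population sizes, the purchase rates (10% and 12%), the helper name `bootstrap_sums` and the way bootstraps of A and B are paired are assumptions made for the example, not values taken from the figures.

```python
import random

def bootstrap_sums(data, k, rng):
    """k bootstrap values of the number of buyers (hypothetical helper)."""
    n = len(data)
    return [sum(rng.choices(data, k=n)) for _ in range(k)]

rng = random.Random(7)

# Synthetic populations: 1 = buyer, 0 = non-buyer (assumed rates).
pop_a = [1 if rng.random() < 0.10 else 0 for _ in range(2000)]
pop_b = [1 if rng.random() < 0.12 else 0 for _ in range(2000)]

k = 1000
boot_a = bootstrap_sums(pop_a, k, rng)
boot_b = bootstrap_sums(pop_b, k, rng)

# Pair the i-th bootstrap of B with the i-th bootstrap of A.
diffs = [b - a for a, b in zip(boot_a, boot_b)]

# Share of bootstraps in which B has strictly more buyers than A.
share_positive = sum(d > 0 for d in diffs) / k
print(f"share of positive differences: {share_positive:.3f}")
```

One possible reading of the result, assuming a 95% threshold that the experimenter fixes beforehand: if at least 95% of the bootstrapped differences fall on the same side of zero, the difference is treated as significant; otherwise the test is inconclusive.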

Bootstrap method (distributed)

  • The data set is distributed across machines and too big to fit into a single machine's memory
  • The data set is distributed across machines and even though it could fit into a single machine, it is costly and impractical to collect everything on a single machine
  • The data set is distributed across machines, and even the values of a single observation are distributed across machines. For instance, this happens if the observation we are interested in is “how much did the user spend on all their purchases in the last 7 days” and our data set contains data at purchase level. To apply the previous methods, we would first need to regroup our purchase-level data set into a user-level data set, which can be costly.
  • An approximation error is made because the number of bootstraps is finite (the bootstrap method converges only as the number of resamplings goes to infinity)
  • An approximation error that induces a bias in the measure of quantiles is made due to the fact that the resampling is done approximately. This comes from the fact that for a given resampling, the sum of the weights is not exactly equal to n (this is only true on average). For further reading, check out this great blog post and the associated scientific paper.
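
To make the idea of approximate resampling with weights concrete, here is a minimal sketch of one common scheme. It is an assumption, not necessarily the scheme the article relies on: each user receives an independent Poisson(1) weight per bootstrap, derived deterministically from the pair (user, bootstrap index). All names (`poisson1_weight`, `partial_bootstrap_sums`, `merge`) and the toy partitions are hypothetical. Because the weight depends only on the user and the bootstrap index, every purchase row of a given user gets the same weight on whichever machine it lives, so no regrouping to user level is needed.

```python
import math
import random

def poisson1_weight(user_id, bootstrap_index):
    """Deterministic Poisson(1) draw keyed by (user, bootstrap index)."""
    rng = random.Random(f"{user_id}:{bootstrap_index}")
    threshold = math.exp(-1.0)
    # Knuth's multiplication method for a Poisson variate with mean 1.
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def partial_bootstrap_sums(rows, k):
    """Weighted sums for k bootstraps, computed on one partition only."""
    sums = [0] * k
    for user_id, amount_cents in rows:
        for b in range(k):
            sums[b] += poisson1_weight(user_id, b) * amount_cents
    return sums

def merge(partials):
    """Combine per-partition results: the metric is additive."""
    return [sum(values) for values in zip(*partials)]

# Toy purchase-level data; the same user appears on several "machines".
partitions = [
    [("u1", 500), ("u2", 1200), ("u3", 300)],
    [("u1", 700), ("u4", 250)],
    [("u2", 100), ("u3", 900), ("u5", 400)],
]
k = 50
partials = [partial_bootstrap_sums(rows, k) for rows in partitions]
boot_totals = merge(partials)
print(boot_totals[:5])
```

In this sketch the sum of the weights over users equals the number of users only on average, which is the source of the bias on quantiles described in the last bullet above.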


Comparative pros/cons

Source: authors


About the authors


