A/B Testing with Bootstrapping

Enozeren
Getir
Dec 21, 2022

A/B testing is an essential process for evaluating new features, products, or designs at web-based companies. There are various statistical approaches for interpreting the results of an A/B test and reaching a final decision. The well-known ones are Frequentist, Bayesian, and Sequential testing. All three approaches have pros and cons; however, they can be hard to understand and implement because of their statistical requirements. In this article, I'll show you another approach that is easy to understand and implement: bootstrapping for A/B testing!

This article has three parts:

  1. What is bootstrapping and how does it work for an A/B test?
  2. Implementation of the method in a case study
  3. Pros and Cons of the bootstrapping method

1. What is bootstrapping and how does it work for an A/B test?

Bootstrapping is a powerful tool for quantifying the uncertainty of an estimator (James et al.). With this method, we resample a sample to create simulated samples. Using these simulated samples, we can easily estimate statistical quantities such as p-values and confidence intervals instead of relying on traditional (sometimes complex) hypothesis testing formulas.

To perform bootstrapping, you simply sample with replacement. To make it more concrete, let me illustrate the bootstrapping process with the example below.

Picture 1 — Bootstrapping Process Example

In Picture 1, on the left-hand side, we have a sample with 5 data points. To create a “bootstrap sample”, we take a sample with replacement of the same size as our original sample. We repeat this N times and obtain N bootstrap samples. The mean values of these bootstrap samples form a distribution for the mean of our original sample.

In this process, we treat our sample as the population and draw “bootstrap samples” from it. When we take enough bootstrap samples, the distribution of their means approaches a normal distribution (see the Central Limit Theorem).
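To make the process in Picture 1 more concrete, here is a minimal sketch in Python. The five data points and the number of resamples are made up purely for illustration; any numeric sample would work the same way.

import numpy as np

rng = np.random.default_rng(42)
original_sample = np.array([3.0, 4.5, 2.8, 5.1, 3.9])  # a made-up sample with 5 data points, as in Picture 1

n_bootstrap = 10_000
bootstrap_means = []
for _ in range(n_bootstrap):
    # One bootstrap sample: draw with replacement, same size as the original sample
    bootstrap_sample = rng.choice(original_sample, size=len(original_sample), replace=True)
    bootstrap_means.append(bootstrap_sample.mean())

# The bootstrap means approximate the sampling distribution of the original sample's mean
print(np.mean(bootstrap_means), np.std(bootstrap_means))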

This approach is useful for A/B tests. We can derive a distribution for the difference between the means of groups A and B, and then check how often that difference is greater than 0 in the distribution. This method can be called “Bootstrapping for A/B testing”. Let’s make it more concrete in the next part.

2. Implementation of the method in a case study

In this part, let’s make up an A/B test scenario to make the bootstrapping method easier to understand (feel free to follow along with the Python code and visuals from the Streamlit app in my GitHub repo).

Assume we have a mobile game where our clients are exposed to ads every 3 minutes. We want to test whether our clients spend more in-game time (IGT) when we show them ads every 4 minutes instead. We call the clients who see ads every 3 minutes the “Control” group and those who see ads every 4 minutes the “Variant” group. We will compare these groups with the bootstrapping method.

After the experiment, we can see a sample of the data in Table 1 below.

Table 1 — 10 data points from the experiment (IGT is in minutes)

When we analyze our experiment data we see Picture 2.

Picture 2 — Summary of the experiment

So, our clients in the Variant Group have 2.7% more IGT when compared with the Control group. Is this a statistically significant difference? Here, we need to answer this question: “What is the probability of the Variant group being better than the Control group?” If that probability is higher than 95%, we assume the difference is significant.

With the function below, we create the distribution of differences between the means of the Variant and Control groups. Then, we calculate the probability that the Variant is better than the Control.

import numpy as np
import pandas as pd
import streamlit as st
import matplotlib.pyplot as plt


def p_value_with_bootstrapping(control_df: pd.DataFrame, variant_df: pd.DataFrame, col_name: str,
                               number_of_samples=10_000, alpha=0.05, mde=0) -> float:

    control_values = control_df[col_name].values
    variant_values = variant_df[col_name].values

    # Observed difference between the group means
    difference = round(np.mean(variant_values) - np.mean(control_values), 3)

    difference_list = []

    # Let the user override the default number of bootstrap samples from the Streamlit UI
    number_of_samples = st.slider('Number of Bootstrap Samples:', min_value=20, max_value=20_000,
                                  value=number_of_samples)

    for i in range(number_of_samples):
        # Resample each group with replacement and compare the bootstrap means
        first_sample_mean = np.mean(np.random.choice(a=control_values, size=len(control_values), replace=True))
        second_sample_mean = np.mean(np.random.choice(a=variant_values, size=len(variant_values), replace=True))

        difference_list.append(second_sample_mean - first_sample_mean)

    # Plot the distribution of the bootstrap mean differences
    fig, ax = plt.subplots()
    ax.hist(difference_list, bins=50)
    ax.set_xlabel('Avg IGT difference between bootstrap samples')
    ax.set_ylabel('# of observed differences')
    st.pyplot(fig)
    st.caption('The distribution of differences between bootstrap samples')

    st.write(f'Observed Difference (Variant - Control) = {difference}')

    # P-value: share of bootstrap differences that do NOT exceed the minimum detectable effect (mde)
    p_value = round(1 - sum(d > mde for d in difference_list) / number_of_samples, 3)
    st.write(f"P-Value = {p_value}")

    if p_value <= alpha:
        st.write(f"We can say that {col_name} is improved at the level of our MDE since P-Value({p_value}) <= Alpha({alpha})")
    else:
        st.write(f"We can NOT say that {col_name} is improved at the level of our MDE since P-Value({p_value}) > Alpha({alpha})")

    # Percentile confidence interval for the difference in means
    confidence_interval_for_difference = {'Lower Bound': np.quantile(difference_list, q=alpha/2),
                                          'Upper Bound': np.quantile(difference_list, q=1 - alpha/2)}
    st.write(f"The {(1-alpha)*100}% confidence interval for Avg IGT Difference between Variant & Control Groups:")
    st.write(confidence_interval_for_difference)

    return p_value
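As a usage sketch (not part of the original app), the function could be called from the same Streamlit script as shown below. The IGT values are simulated here since the full experiment data is not shown; running the script with streamlit run would render the histogram, p-value, and confidence interval.

# Hypothetical usage with simulated IGT data (for illustration only)
rng = np.random.default_rng(0)
control_df = pd.DataFrame({'igt': rng.normal(loc=37.0, scale=10.0, size=1_000)})
variant_df = pd.DataFrame({'igt': rng.normal(loc=38.0, scale=10.0, size=1_000)})

p_value_with_bootstrapping(control_df, variant_df, col_name='igt')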

When we check the distribution for the mean difference, we see that most of the values are greater than 0.

Picture 3 — Distribution of differences in means

In the “p_value_with_bootstrapping” function above, we calculate the share of bootstrap differences that are not greater than 0; this is our p-value, and it turns out to be 0.7%. In other words, the Variant has a higher mean than the Control with a probability of 99.3% (100% - 0.7%). Therefore, we conclude that showing ads every 4 minutes instead of every 3 minutes increases the in-game time of our clients.

3. Pros and Cons of the bootstrapping method

PROS

  1. We do not make any assumptions about the distribution of our metric
    Since we use a simulation technique, we do not need to know much about the distribution of our data
  2. No required sample size for the test
    There is no required sample size, since the Central Limit Theorem is satisfied even with small sample sizes (around 100)
  3. Easy to implement and understand
    We do not need much statistical background to understand, implement, and interpret the results of this method, since bootstrapping is an easy and intuitive technique.

CONS

  1. With big data, this bootstrap simulation approach has a computational burden
    For companies with huge numbers of customers, such as Netflix or Spotify, millions of customers might be in the experiment. Since the bootstrapping method relies on simulation, resampling millions of data points can be computationally expensive. However, there are more advanced variations of the bootstrapping method that work around this computational burden (a rough sketch of one such idea follows after this list). If you are curious about them, you can check one of Spotify’s articles here.
  2. Peeking at the results multiple times during the test might increase the Type 1 error rate
    Peeking is a common mistake with Frequentist approaches, and it also applies to the bootstrapping method.
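As a rough illustration of one possible workaround for the computational burden (bucketing, which is not necessarily the approach described in the Spotify article), the raw values can be pre-aggregated into a fixed number of equal-size buckets, and the bootstrap can then resample the bucket means instead of millions of raw data points. The group size and bucket count below are made up.

import numpy as np

rng = np.random.default_rng(1)
igt_values = rng.normal(loc=37.0, scale=10.0, size=5_000_000)  # hypothetical group with millions of users

# Pre-aggregate the raw values into 1,000 equal-size buckets
n_buckets = 1_000
shuffled = rng.permutation(igt_values)
usable = len(shuffled) - len(shuffled) % n_buckets  # drop the remainder so the reshape is exact
bucket_means = shuffled[:usable].reshape(n_buckets, -1).mean(axis=1)

# Bootstrap over 1,000 bucket means instead of 5,000,000 raw values
n_bootstrap = 10_000
resamples = rng.choice(bucket_means, size=(n_bootstrap, n_buckets), replace=True)
bootstrap_means = resamples.mean(axis=1)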

I hope this article helps you interpret the results of your A/B tests. If you have any suggestions or questions, feel free to reach out to me on LinkedIn.

Resources:

  1. James, Gareth, et al. An Introduction to Statistical Learning: With Applications in R. Springer Science and Business Media, 2013.

My LinkedIn Profile

GitHub link for the Code
