Introduction to Bayesian A/B testing in Python

Focus on the daily conversion rate.

Victor Cumer
VeepeeTech
Dec 10, 2020

At veepee.com, we develop recommender systems to rank the 200+ sales banners of the website homepage. The Data Science team continuously improves the recommendation algorithms on various perimeters by adding, removing, or refactoring features, or by modifying the model’s architecture and optimization methods. A/B testing is therefore an essential part of the online evaluation process, asserting that every move is made in the right direction. This introductory document summarizes the basic concepts of the Bayesian A/B testing approach and gives an example of the application process.

In certain cases, the Bayesian approach may provide useful results faster than the frequentist method. It may also be relevant for reaching conclusions with small volumes. In addition, even if the theory behind the method is more complex than for the frequentist approach, the main results are easier to understand for the business:

  • Probability of choosing the best variation.
  • Expected loss associated with choosing one variant or the other.

Most of this document’s content comes from the work of @Chris Stucchio, @David Robinson and the summaries that were made by @Blake Arnold and @Michael Frasco. The idea here is to use the approach on a concrete Veepee use case.

Usual A/B testing approach

Frequentist approach

For the A/B tests that we run on algorithm modifications, we most often use a frequentist approach. Regarding the methodology, as for any common A/B test, we follow these basic steps:

  1. Define some different algorithm variants to compare.
  2. Declare a null hypothesis, which always is: there is no difference between the conversion rates (our target KPI for the study).
  3. Use an A/B test engine to assert that we get independent and representative groups of members that are exposed to the same variation for the entire A/B test duration, and with respect to the volume we defined.
  4. Gather the data by computing the sums of daily unique visitors n and daily unique buyers c. For instance, after N days of A/B test, we would get the daily conversion rate for a specific variation from equation 1.
  5. Compute the test statistic value z from Δλ = λB-λA (see equation 2).
    Test statistic choice: z-test. In this study case, we are dealing with a proportion and we have very large populations for both variants.
  6. From z, compute the probability of obtaining a result at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct: p-value. Choose a threshold α (usually 5%) under which you would reject the null hypothesis if p-value < α.
    In other words, if we choose α = 0.05 and we compute a p-value of 0.03, the probability of observing such a delta under the null hypothesis (i.e. both variations behave similarly in terms of conversion rate) is 3%, which is below the threshold → we reject the null hypothesis and assume there is a significant difference between the conversion rates λA and λB.
equation 1: λ = (Σᵢ cᵢ) / (Σᵢ nᵢ), i.e. the sum of the daily unique buyers over the sum of the daily unique visitors after N days
equation 2: z = (λB − λA) / √( λ̂ (1 − λ̂) (1/NA + 1/NB) ), with λ̂ the pooled conversion rate of both variants and NA, NB the cumulated visitor volumes
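
As an illustration of steps 4 to 6, here is a minimal sketch computing z and the p-value with the pooled two-proportion z-test of equation 2 (the cumulated volumes below are purely illustrative, not real data):

```python
import numpy as np
from scipy import stats

# Illustrative cumulated volumes after N days of A/B test (not real data).
n_a, c_a = 120_000, 6_100   # unique visitors / unique buyers, variant A
n_b, c_b = 120_000, 6_350   # unique visitors / unique buyers, variant B

lam_a, lam_b = c_a / n_a, c_b / n_b
pooled = (c_a + c_b) / (n_a + n_b)                  # pooled conversion rate
z = (lam_b - lam_a) / np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
p_value = 2 * stats.norm.sf(abs(z))                 # two-tailed p-value
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```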

Note: The p-value is not the probability that variation A is larger than variation B or B is larger than A. A p-value is the likelihood of a seemingly unlikely event happening in a world governed only by chance.

What about small, non-significant improvements?

This issue has been tackled in @Michael Frasco’s article about the power of Bayesian A/B testing. Here is a summary of his words:

Imagine that a data scientist wants to run an experiment that tests a new version of a model. After observing enough data, we find that the new model is only slightly better than the current model, leading to a p-value of 0.11. Under frequentist methodology, the proper procedure in this scenario is to keep the current model. However, since the new model is making better predictions than the current model, this decision is very unsatisfying and potentially costly […]. Sometimes the costs of implementing the new variation outweigh the small benefits.

In these kinds of scenarios, the Bayesian methodology is appealing because it is more willing to accept variants that provide small improvements. And after a large number of experiments, these marginal gains accumulate on top of each other.

Bayesian approach

The Bayesian approach models each parameter (in our case the conversion rate) as a random variable with some probability distribution. Thus, it is all about finding an accurate probability density function (p.d.f.) of the conversion rate λ of each variant. These density functions enable us to compute the probability for the true conversion rate of a variation to lie in a specific interval. Bayes’ rule makes the link between the prior probability of observing a conversion rate value λ and the posterior probability of observing this λ knowing the number of visitors n and buyers c (i.e. the evidence) we got.

Bayes rule: P(λ | n, c) = P(n, c | λ) · P(λ) / P(n, c)

In the case of conversion rate, a relevant density function to choose is the Beta distribution.

Beta distribution: f(x; a, b) = x^(a−1) (1 − x)^(b−1) / B(a, b), with B(a, b) the Beta function

The Beta distribution f(x, a=1, b=1) can be used as the prior if we do not make any assumption about the conversion rate (it is equivalent to the uniform law over [0;1], which only assumes two prior observations: one conversion and one non-conversion). Then, working with the Beta distribution, the posterior is very easy to compute as we gather the evidence n and c, since the update rule gives us: P(λ|n,c) = f(x, a+c, b+(n−c)). Figure 1 illustrates the uniform prior on the right, and updated posteriors at different stages of an A/B test on the left. We see how uncertain we can be about the real conversion rate after only 28 visitors, and how the density narrows around 0.33 as the number of visitors increases: our certainty about where the true conversion rate lies increases.

figure 1
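
As a minimal sketch of this update rule (assuming scipy's standard Beta parametrization; the visitor and buyer counts below are purely illustrative, not those of figure 1), the posterior can be computed and plotted at different stages of a test:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

def posterior(n, c, a_prior=1, b_prior=1):
    """Beta posterior of the conversion rate after observing
    n unique visitors and c unique buyers (uniform prior by default)."""
    return stats.beta(a_prior + c, b_prior + (n - c))

x = np.linspace(0, 1, 500)
# Illustrative stages of an A/B test (counts are made up for the example).
for n, c in [(0, 0), (28, 9), (300, 100), (3000, 990)]:
    plt.plot(x, posterior(n, c).pdf(x), label=f"n={n}, c={c}")

plt.xlabel("conversion rate λ")
plt.ylabel("density")
plt.legend()
plt.show()
```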

By computing this density function at specific moments of the A/B test, we can then use the joint posterior P(λA, λB) = P(λA | nA, cA).P(λB | nB, cB) to compute quantities such as:

  • The probability that B is better than A (i.e. the probability of making a mistake by choosing A).
  • The expected loss if we choose variant “?”.
  • The magnitude of the error if we choose A.

The expected loss (i.e. E[L]) takes into account both the probability that we are choosing the worse variant, via the joint probability density function P(λA, λB), and the magnitude of the potential mistake, via the loss function L(λA, λB, ?) with “?” = A or B, which, for instance with “?” = A, gives 0 if λA > λB and λB − λA otherwise.
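
Written out, following the description above, the expected loss associated with choosing variant A is this loss averaged over the joint posterior (and symmetrically for B, swapping the roles of λA and λB):

E[L](A) = ∬ max(λB − λA, 0) · P(λA, λB) dλA dλB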

In case of close results in terms of conversion rate, the expected loss may help decide whether choosing A or B would lead to a large decrease in the conversion rate, or whether the associated loss remains under an acceptable threshold.

Bayesian A/B testing process summary

  1. Define some different algorithm variations to compare.
  2. Use an A/B test engine to assert that we get independent and representative groups of members that are exposed to the same variation for the entire A/B test duration, and with respect to the volume we defined.
  3. Gather the data by computing the sums of daily unique visitors n and daily unique buyers c.
  4. Choose a prior and update the joint posterior p.d.f. with c and n.
  5. Since the expected loss for a variation is the average amount by which our metric would decrease if we chose that variant, define ε to be small enough that we are comfortable making a mistake of that magnitude.
  6. Compute:
    A. The probability that λB > λA.
    B. The expected losses.
  7. Knowing ε, the expected losses and P(λB > λA), make a decision about the winning variation.
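
Putting steps 4 to 6 together, here is a minimal sketch using a plain Monte Carlo draw from the two Beta posteriors (the use case below relies on importance sampling instead; the volumes here are illustrative only):

```python
import numpy as np

def bayesian_summary(n_a, c_a, n_b, c_b, n_samples=1_000_000, seed=0):
    """Estimate P(λB > λA) and the expected losses by sampling the two
    independent Beta posteriors obtained with a uniform Beta(1, 1) prior."""
    rng = np.random.default_rng(seed)
    lam_a = rng.beta(1 + c_a, 1 + n_a - c_a, n_samples)
    lam_b = rng.beta(1 + c_b, 1 + n_b - c_b, n_samples)
    prob_b_better = np.mean(lam_b > lam_a)
    expected_loss_a = np.mean(np.maximum(lam_b - lam_a, 0))  # loss if we choose A
    expected_loss_b = np.mean(np.maximum(lam_a - lam_b, 0))  # loss if we choose B
    return prob_b_better, expected_loss_a, expected_loss_b

# Illustrative cumulated volumes (not real data).
print(bayesian_summary(n_a=50_000, c_a=2_500, n_b=50_000, c_b=2_600))
```

Step 7 then reads: pick B if its expected loss is below ε and P(λB > λA) is high enough (and symmetrically for A); otherwise, keep gathering data.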

Use cases

Let’s compare the results of the frequentist and Bayesian approaches on a sandbox A/B test. The daily values of c and n are written directly in the code, in a Python dictionary.

Data used for the studied A/B test
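
The original gist with the actual daily figures is not reproduced here; the dictionary below only illustrates the hypothetical shape such data could take (the values are made up):

```python
# Hypothetical shape of the data dictionary (values are illustrative,
# not the actual figures of the studied A/B test).
data = {
    "A": {"n": [10450, 10321, 10502],   # daily unique visitors
          "c": [522, 498, 531]},        # daily unique buyers
    "B": {"n": [10489, 10275, 10611],
          "c": [538, 512, 547]},
}
```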

Frequentist approach

For this A/B test, the frequentist analysis led to the rejection of the null hypothesis, but only after almost 60 days of A/B testing. We chose a two-tailed z-test and the p-value needs to be below 0.05 (the chosen threshold is 5%). In figure 2, the top-left graph shows that we indeed had a small increment with variant B almost from the beginning, and this small increment remained from day 20 to the end. Now, if you recall equation 2, which gives the formula of z (from which we compute the p-value), the value depends on the increment magnitude (λB − λA), but also on the volumes (NA and NB). That is why, with a stable increment, the p-value finally dropped below the threshold as the volumes increased.

python code to generate figure 2
figure 2: frequentist results synthesis
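
The plotting gist is not reproduced here; a minimal sketch of the day-by-day frequentist analysis behind figure 2, assuming a data dictionary shaped like the hypothetical one above, could look like this:

```python
import numpy as np
from scipy import stats

def running_p_values(data):
    """Two-tailed p-values of the pooled two-proportion z-test, computed on
    the cumulated volumes after each day of the A/B test."""
    n_a, c_a = np.cumsum(data["A"]["n"]), np.cumsum(data["A"]["c"])
    n_b, c_b = np.cumsum(data["B"]["n"]), np.cumsum(data["B"]["c"])
    lam_a, lam_b = c_a / n_a, c_b / n_b
    pooled = (c_a + c_b) / (n_a + n_b)
    z = (lam_b - lam_a) / np.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 2 * stats.norm.sf(np.abs(z))  # one p-value per day
```

Plotting these p-values against the day index, together with the cumulated λA and λB, gives a figure-2-like synthesis.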

Bayesian approach

After each day, we sum the unique visitors and conversions with the previous values of n and c that we have been accumulating since day 1. Then we compute the posterior probability density functions that lead to the joint posterior plotted in figure 3. It is a very interesting plot because we can visually understand which variant has the highest probability of giving the best conversion rate:

  • The orange line represents the space where the conversion rates of both variations are equal.
  • The upper part (above the orange line) is the region where λB > λA. We easily get P(λB > λA) by integrating the joint posterior (blue contour plot) over this upper area.
  • The upper part also enables us to compute the expected loss associated with the choice of variant A: all points where λB < λA contribute 0, and the other points are weighted by P(λA, λB), the joint posterior density, before integration. It means that we treat mistakes of different magnitudes differently, as illustrated by the red and black points in figure 3 below.
Create the joint posterior plot with python
figure 3: joint posterior after 8 days of A/B test (axes are conversion rates, blue line plots are beta distributions corresponding to each variant conversion rate distribution)
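
The original plotting gist is not reproduced here either; a minimal sketch of such a joint posterior contour plot (hypothetical counts, independent Beta posteriors as above) could be:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Hypothetical cumulated counts after a few days (not the article's data).
n_a, c_a, n_b, c_b = 3_000, 150, 3_000, 170

lam = np.linspace(0.03, 0.08, 300)
pdf_a = stats.beta(1 + c_a, 1 + n_a - c_a).pdf(lam)
pdf_b = stats.beta(1 + c_b, 1 + n_b - c_b).pdf(lam)
joint = np.outer(pdf_b, pdf_a)       # rows: λB, columns: λA (independent variants)

plt.contour(lam, lam, joint)
plt.plot(lam, lam, color="orange")   # the λA = λB line
plt.xlabel("λA")
plt.ylabel("λB")
plt.show()
```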

The red point corresponds to a case that has a high probability of happening and a small difference between the conversion rates. The black point is the opposite.

Visually, in figure 3 above, we understand that variation B seems more promising than variation A because most of the contour plot’s mass lies above the orange line. Let’s check this with some computations.

We used the importance sampling Monte Carlo integration method to compute the quantities of interest. You can learn more about this method and its Python implementation in this article. We set 𝜖 = 0.0001, which represents the relative drop of conversion rate we are willing to accept in case of a bad choice of variant. Thus, we would choose the variation that gives the lower expected loss, if it exists.

figure 4: Bayesian results

Conclusion: after only 30 days of A/B testing, the probability of making the right choice with B had already been around 80% for 12 days, and the associated expected loss was far below the drop threshold of 0.0001. Variant B could have been chosen at that point.

This case reveals two main advantages of the Bayesian method:

  • Bayesian A/B testing may always lead to usable results, whatever the volumes.
  • Interesting results may be reached sooner than with the frequentist approach.

Conclusion

In a context of very large volumes and no time constraints, the frequentist approach is easy to implement and gives satisfying results. However, the p-value concept is often misunderstood, and the business may think we are answering the following question when we are not: “what is the probability that version B is better than version A?”. Bayesian A/B testing tackles this specific question better.

However, it implies more complex calculations and numerical integration methods. It may be a good idea to implement this method mainly in the context of small volumes or when we need to conclude an A/B test quickly. By iterating faster, we may also accumulate marginal gains on top of each other.
