The Power of Bayesian A/B Testing

The data science team at Convoy believes that the frequentist methodology of experimentation isn’t ideal for product innovation. We switched to an A/B testing framework that uses Bayesian statistics because it allows us to innovate faster and improve more. Given our use case of continuous iteration, we find that Bayesian A/B testing better balances risk and speed.

In experiments where the improvement of the new variant is small, Bayesian methodology is more willing to accept the new variant. By using Bayesian A/B testing over the course of many experiments, we can accumulate the gains from many incremental improvements. Bayesian A/B testing accomplishes this without sacrificing reliability by controlling the magnitude of our bad decisions instead of the false positive rate.

In this post, I’ll cover the basics of experimentation, present the benefits of Bayesian A/B testing, and discuss the nuances of using it effectively.

Experimentation Background

In any A/B test, we use the data we collect from variants A and B to compute some metric for each variant (e.g. the rate at which a button is clicked). Then, we use a statistical method to determine which variant is better.

Frequentist Methods

In frequentist A/B testing, we use p-values to choose between two hypotheses: the null hypothesis — that there is no difference between variants A and B — and the alternative hypothesis — that variant B is different. A p-value measures the probability of observing a difference between the two variants at least as extreme as what we actually observed, given that there is no difference between the variants. Once the p-value achieves statistical significance or we’ve seen enough data¹, the experiment is over².
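To make the p-value concrete, here is a minimal sketch of a two-sided two-proportion z-test. The function name and the click counts are hypothetical, and the normal approximation is only one common way to obtain a p-value for conversion rates.

```python
import math

def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the null hypothesis that both rates are equal."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate: the best estimate of the shared rate if the null is true.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a standard normal value at least as extreme as z.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical data: 100/1000 clicks for A, 120/1000 clicks for B.
p_value = two_proportion_p_value(100, 1000, 120, 1000)
```

With these made-up counts, the p-value lands above the usual 0.05 cutoff, so a frequentist procedure would keep variant A even though B's observed rate is higher.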

Bayesian Methods

In Bayesian A/B testing, we model the metric for each variant as a random variable with some probability distribution. Based on prior experience, we might believe that the conversion rate for our website has some range of possible values.

After observing data from both variants, we update our prior beliefs about the most likely values for each variant. Below, I show an example of how the posterior distribution might look after observing data.

By calculating the posterior distribution for each variant, we can express the uncertainty about our beliefs through probability statements. For example, we can ask “What is the probability that the metric under variant B is larger than the metric under variant A?”. Interpretable output helps data scientists have productive discussions with other colleagues about the optimal business decision in ambiguous situations³.
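As a rough sketch of how such a probability statement can be computed, the snippet below assumes a conversion-rate metric with a Beta prior and estimates the probability that B beats A by sampling from both posteriors. The function name, the default Beta(1, 1) prior, and the counts are illustrative assumptions.

```python
import numpy as np

def prob_b_beats_a(successes_a, failures_a, successes_b, failures_b,
                   prior_successes=1.0, prior_failures=1.0,
                   n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(metric under B > metric under A)."""
    rng = np.random.default_rng(seed)
    # Beta prior + binomial data gives a Beta posterior for each variant.
    samples_a = rng.beta(prior_successes + successes_a,
                         prior_failures + failures_a, n_samples)
    samples_b = rng.beta(prior_successes + successes_b,
                         prior_failures + failures_b, n_samples)
    return float(np.mean(samples_b > samples_a))

# Hypothetical data: 100 conversions out of 1000 for A, 120 out of 1000 for B.
prob = prob_b_beats_a(100, 900, 120, 880)
```

The result reads directly as "the probability that B is better than A", which is the kind of statement that is easy to discuss with colleagues.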

Before explaining how Bayesian A/B testing decides which variant is better, I want to present an important scenario where frequentist methods struggle to provide a satisfying solution.

Problem With Frequentist Testing: The Slightly Better Model

One of the problems with frequentist experimentation is that it can unnecessarily favor the null hypothesis. For example, imagine that a data scientist wants to run an experiment that tests a new version of a model. After observing enough data, we find that the new model is only slightly better than the current model, leading to a p-value of 0.11. Under frequentist methodology, the proper procedure in this scenario is to keep the current model. However, since the new model is making better predictions than the current model, this decision is very unsatisfying and potentially costly.

Of course, there are scenarios where we want to stick with the null hypothesis when the treatment variant is marginally better than the control. If the treatment requires a lot of engineering maintenance or causes a disruption to the user experience, the costs of implementing the new variant might outweigh the small benefits⁴. However, in our experience at Convoy, this scenario is uncommon.

In scenarios similar to the one of the slightly better model, Bayesian methodology is appealing because it is more willing to accept variants that provide small improvements. Over the next few years, as we perform hundreds of experiments on the same handful of key business metrics, these marginal gains will accumulate on top of each other. Crucially, since we conclude an experiment once we are confident it will provide at least a small improvement, we can iterate more quickly and run more experiments overall.

By accepting variants that offer a small improvement, Bayesian A/B testing asserts that the false positive rate — the proportion of times we accept the treatment when the treatment is not actually better — is not very important. While this may be shocking to some statisticians, we agree with this sentiment because not all false positives are created equal. Choosing variant B when its conversion rate is 10% and the conversion rate for variant A is 10.1% is a very different mistake than choosing variant B when the conversion rates are 10% for B and 15% for A. Yet, under frequentist methodology, these would both count as a single false positive.

Instead, Bayesian A/B testing focuses on the average magnitude of wrong decisions over the course of many experiments. It limits the average amount by which your decisions actually make the product worse, thereby providing guarantees about the long run improvement of a metric. We believe that these types of guarantees are much more relevant to Convoy’s use case than the false positive guarantees made by frequentist procedures.

Bayesian Methodology

In this section, I explain how Bayesian A/B testing makes decisions and how it provides guarantees about long term improvement. This methodology is from a white-paper by Chris Stucchio.

Let α and β represent the underlying, unobserved true metrics for variants A and B, and let x represent the variant that we choose. Then, we can define a loss function for a given experiment as

    L(α, β, x) = max(β − α, 0)   if x = A
    L(α, β, x) = max(α − β, 0)   if x = B

Here, we visualize the loss of choosing variant A as a function of β − α.

If we choose variant A when α is less than β, our loss is β - α. If α is greater than β, we lose nothing. Our loss is the amount by which our metric decreases when we choose that variant.
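The loss described above can be written as a small function. This is a minimal sketch; the function name and the string labels for the variants are my own choices for illustration.

```python
def loss(alpha, beta, choice):
    """Amount by which the metric decreases if we pick `choice`.

    alpha, beta: true metrics for variants A and B.
    choice: "A" or "B".
    """
    if choice == "A":
        return max(beta - alpha, 0.0)
    if choice == "B":
        return max(alpha - beta, 0.0)
    raise ValueError("choice must be 'A' or 'B'")

# Two wrong decisions in favor of B with very different costs.
small_mistake = loss(0.101, 0.10, "B")  # A is barely better than B
large_mistake = loss(0.15, 0.10, "B")   # A is much better than B
```

Both calls would count as one false positive each, yet the second mistake costs roughly fifty times as much as the first.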

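Because we cannot observe the true values of α and β, we compute⁵ the expected loss over the joint posterior density, f(α, β). Writing the loss of choosing variant x as L(α, β, x), this is

    E[L](x) = ∫∫ L(α, β, x) f(α, β) dα dβ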

Once the expected loss for one of the variants drops below some threshold, ε, we stop the experiment. We have enough evidence to conclude that choosing that variant is a good business decision.
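Here is a hedged sketch of the stopping rule, assuming independent Beta posteriors for two conversion rates and approximating the expected loss by Monte Carlo instead of the analytic formulas. The function names, the posterior parameters, and the threshold are illustrative assumptions.

```python
import numpy as np

def expected_losses(posterior_a, posterior_b, n_samples=200_000, seed=0):
    """Return (expected loss of choosing A, expected loss of choosing B).

    posterior_a, posterior_b: (a, b) parameters of each Beta posterior.
    """
    rng = np.random.default_rng(seed)
    alpha = rng.beta(posterior_a[0], posterior_a[1], size=n_samples)
    beta = rng.beta(posterior_b[0], posterior_b[1], size=n_samples)
    loss_a = float(np.maximum(beta - alpha, 0).mean())
    loss_b = float(np.maximum(alpha - beta, 0).mean())
    return loss_a, loss_b

def decide(posterior_a, posterior_b, epsilon):
    """Return "A" or "B" once an expected loss is below epsilon, else None."""
    loss_a, loss_b = expected_losses(posterior_a, posterior_b)
    if min(loss_a, loss_b) >= epsilon:
        return None  # not enough evidence yet; keep collecting data
    return "A" if loss_a < loss_b else "B"

# Hypothetical posteriors after 1000 observations per variant.
decision = decide((101, 901), (121, 881), epsilon=0.001)
```

Calling `decide` after each new batch of data gives the stopping condition: the experiment continues while the function returns `None`.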

This stopping condition considers both the likelihood that β − α is greater than zero and also the magnitude of this difference. Consequently, it has two very important properties:

  1. It treats mistakes of different magnitudes differently. If we are uncertain about the values of α and β, there is a larger chance that we might make a big mistake. As a result, the expected loss would also be large.
  2. Even when we are unsure which variant’s metric is larger, we can still stop the test as soon as we are certain that the difference between the variants is small. In this case, if we make a mistake (i.e., we choose variant B when β < α), we can be confident that the magnitude of that mistake is very small (e.g. β = 10% and α = 10.1%). As a result, we can be confident that our decision will not lead to a large decrease in our metric.
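Both properties can be seen in a small sketch. It compares two hypothetical cases in which the evidence for A and B is identical, so the probability that B is better is about 50% in both, but the amount of data differs. The posterior parameters and the function name are assumptions chosen for illustration.

```python
import numpy as np

def summarize(posterior_a, posterior_b, n_samples=200_000, seed=0):
    """Return (P(beta > alpha), expected loss of choosing B) by Monte Carlo."""
    rng = np.random.default_rng(seed)
    alpha = rng.beta(posterior_a[0], posterior_a[1], n_samples)
    beta = rng.beta(posterior_b[0], posterior_b[1], n_samples)
    prob_b_better = float(np.mean(beta > alpha))
    expected_loss_b = float(np.mean(np.maximum(alpha - beta, 0)))
    return prob_b_better, expected_loss_b

# Identical evidence for both variants, with a rate near 10%.
little_data = summarize((11, 91), (11, 91))                    # ~100 observations each
lots_of_data = summarize((10_001, 90_001), (10_001, 90_001))  # ~100,000 observations each
```

With little data, the expected loss is large because a big mistake is still possible. With lots of data, we still cannot say which variant is better, but the expected loss is tiny, so the test can stop.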

Understanding The Methodology

Since the expected loss for a variant is the average amount by which our metric would decrease if we chose that variant, we should set ε to be so small that we are comfortable making a mistake of that magnitude. The idea is that we would rather make a mistake of magnitude ε and then move on to a more promising experiment than spend too much time on the first experiment.

To get a better understanding of ε and to see evidence of the guarantee that Bayesian A/B testing makes, it’s useful to run some simulations. Consider 250 experiments where, in each experiment, we stop the test once the expected loss of either variant is below ε. Assume that the true values of α and β in all experiments are drawn independently from the prior distribution shown above.

Once all experiments have finished, we use the true values of α and β to calculate our average observed loss. If we repeat this process for many different values of ε, we get the following picture, which can be replicated with this script on GitHub.

The graph demonstrates the guarantee that Bayesian A/B testing provides. When we stop an experiment, we can be confident that, on average, we are not making a decision that will decrease our metric by more than ε. This guarantee allows us to iterate quickly and watch as our metrics steadily increase from experiment to experiment.
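For readers who want to try the idea themselves, here is a simplified sketch of such a simulation. It is not the original script: the Beta(20, 80) prior, the batch size, the cap on batches, and the Monte Carlo estimate of expected loss are all assumptions made for this example.

```python
import numpy as np

PRIOR = (20.0, 80.0)  # hypothetical Beta prior, also used to draw the true rates

def run_experiment(rng, epsilon, batch_size=100, max_batches=50, n_samples=5_000):
    """Simulate one experiment and return the true loss of the decision made."""
    true_a, true_b = rng.beta(PRIOR[0], PRIOR[1], size=2)
    post_a = np.array(PRIOR)  # Beta parameters of the posterior for variant A
    post_b = np.array(PRIOR)  # Beta parameters of the posterior for variant B
    for _ in range(max_batches):
        # Collect one batch of observations per variant and update posteriors.
        wins_a = rng.binomial(batch_size, true_a)
        wins_b = rng.binomial(batch_size, true_b)
        post_a += (wins_a, batch_size - wins_a)
        post_b += (wins_b, batch_size - wins_b)
        # Estimate the expected loss of each choice from posterior samples.
        alpha = rng.beta(post_a[0], post_a[1], n_samples)
        beta = rng.beta(post_b[0], post_b[1], n_samples)
        loss_a = np.maximum(beta - alpha, 0).mean()
        loss_b = np.maximum(alpha - beta, 0).mean()
        if min(loss_a, loss_b) < epsilon:
            break
    # Pick the variant with the smaller expected loss (also if the cap was hit).
    if loss_a < loss_b:
        return float(max(true_b - true_a, 0.0))
    return float(max(true_a - true_b, 0.0))

def average_observed_loss(epsilon, n_experiments=250, seed=0):
    """Average true loss over many simulated experiments."""
    rng = np.random.default_rng(seed)
    losses = [run_experiment(rng, epsilon) for _ in range(n_experiments)]
    return float(np.mean(losses))
```

Calling `average_observed_loss` for a range of ε values and plotting the results against ε should give a picture in the same spirit as the one described above, as long as the prior used for the analysis matches the distribution the true rates are drawn from.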

Get the Most out of the Data

Justifying Prior Choice

Critics of a Bayesian analysis might argue that the choice of a prior distribution was not sufficiently justified and had a significant impact on the experiment. In fact, the simulation presented in the previous section assumed that we used the perfect prior distribution.

Fortunately, for companies that run A/B tests continuously, there is usually a wealth of prior information available. The metrics that Convoy monitors in our A/B tests are key performance indicators used throughout the entire company. By examining weekly averages of our data or the results of past experiments, we can develop a good understanding of the likely range of values that the metric can take.
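One simple way to turn historical data into a prior is the method of moments: match a Beta distribution to the mean and variance of past weekly rates. The sketch below assumes a rate-type metric; the function name and the weekly values are made up for illustration.

```python
import statistics

def beta_prior_from_history(rates):
    """Fit Beta(a, b) to historical rates by matching mean and variance."""
    mean = statistics.fmean(rates)
    var = statistics.variance(rates)  # sample variance
    # A Beta distribution with this mean must have variance below mean * (1 - mean).
    if not 0 < var < mean * (1 - mean):
        raise ValueError("variance is not compatible with a Beta distribution")
    strength = mean * (1 - mean) / var - 1  # equals a + b
    return mean * strength, (1 - mean) * strength

# Hypothetical weekly conversion rates.
weekly_rates = [0.38, 0.41, 0.40, 0.43, 0.39, 0.42, 0.37, 0.40]
prior_a, prior_b = beta_prior_from_history(weekly_rates)
```

The sum of the two parameters acts like a count of pseudo-observations, which makes it easy to see how strongly the prior will weigh against new data.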

Reaching Conclusions Faster

Using relevant prior information makes experiments conclude faster. Every piece of information that we embed into the prior is a piece of information that we do not need to learn from the data. By leveraging priors, Bayesian A/B testing often needs fewer data points to reach a conclusion than other methods.

For example, let’s say we use a Beta(1, 1) distribution as the prior for a Bernoulli distribution. After observing 40 successes and 60 failures, our posterior distribution is a Beta(41, 61)⁶. However, if we had started with a Beta(8, 12) distribution as our prior, we would only need to observe 33 successes and 49 failures in order to obtain the same posterior distribution.
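The arithmetic behind this kind of example is the standard Beta-Bernoulli update: successes are added to the first parameter and failures to the second. A minimal sketch, with function names of my own choosing:

```python
def update(prior, successes, failures):
    """Posterior Beta parameters after observing Bernoulli data."""
    a, b = prior
    return (a + successes, b + failures)

def observations_needed(prior, target):
    """(successes, failures) required to move from `prior` to `target`."""
    return (target[0] - prior[0], target[1] - prior[1])

flat_posterior = update((1, 1), successes=40, failures=60)
needed_with_informative_prior = observations_needed((8, 12), flat_posterior)
```

The stronger the prior, the fewer observations are needed to reach the same posterior.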

We do not recommend using a prior distribution so strong that it overwhelms any data that is observed. It’s generally good practice to choose priors that are a bit weaker than what the historical data suggest.
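One way to weaken a Beta prior, offered here as an assumption and not as a description of Convoy's practice, is to scale both parameters by the same factor. This keeps the prior mean and reduces the number of pseudo-observations.

```python
def weaken_prior(prior, factor):
    """Scale Beta pseudo-counts by `factor` in (0, 1], keeping the mean fixed."""
    if not 0 < factor <= 1:
        raise ValueError("factor must be in (0, 1]")
    a, b = prior
    return (a * factor, b * factor)

# Hypothetical prior suggested by historical data, then halved in strength.
historical_prior = (80, 120)
weaker_prior = weaken_prior(historical_prior, 0.5)
```

The weaker prior still centers on the same rate but lets observed data move the posterior more quickly.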

Conclusion

There are some disadvantages to using Bayesian methodology for A/B testing. It can be difficult to explain the notion of expected loss to others. Also, it took us a few experiments until we landed on standard values of ε for different types of experiments. As we get more comfortable with the decision-making process, it becomes a bit easier to reason about values of ε that appropriately balance risk and speed.

With that being said, we find that the benefits of Bayesian A/B testing outweigh the costs. Due to Convoy’s long backlog of experiments that we want to run, it’s important that we move on from inconclusive experiments quickly. We want to fail fast and keep iterating until we find a product innovation that is a big winner. For Convoy, Bayesian A/B testing is not a thought experiment. It is a key accelerator as we transform the transportation industry.

Footnotes

  1. Typically, we wait until the result is significant at the 5% level (95% confidence) or until we have enough data to make a decision with 80% power.
  2. We can use sequential analysis to avoid the problem of continuous monitoring.
  3. I built this Shiny app to empower colleagues to understand their data and learn about Bayesian A/B testing.
  4. Bayesian A/B testing allows us to directly quantify these costs. See the appendix for more details.
  5. Assuming we use conjugate priors, there are analytic formulas for these computations.
  6. The update rule for the posterior distribution can be found here.

References

The work done by Chris Stucchio and Evan Miller was instrumental in our adoption of a Bayesian A/B testing framework. I also found David Robinson’s post very helpful when reading other evaluations of Bayesian A/B testing.

Appendix

Bias Towards the Null Hypothesis

There are situations where we might not want to be indifferent between the control and treatment variants. Fortunately, the loss function used in Bayesian A/B testing is very customizable. For example, we can write:

With this loss function, δ is the amount by which β needs to be better than α in order for us to switch to variant B. For each experiment, δ can be quantified, with the default being δ = 0.
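One loss function consistent with this description, offered as an assumption and not as the exact formula used at Convoy, treats variant B as if its metric were β − δ:

```latex
\[
L_\delta(\alpha, \beta, x) =
\begin{cases}
\max(\beta - \alpha - \delta,\ 0) & \text{if } x = A \\
\max(\alpha - \beta + \delta,\ 0) & \text{if } x = B
\end{cases}
\]
```

With δ = 0, this reduces to the loss function from the Bayesian Methodology section.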