tl;dr Bayesian A/B Testing with Python

Anyone who’s run an A/B test knows statistics has something to say about whether it’s A or B that wins. The nice thing about Bayesian A/B testing is that it’s (relatively) clear how we make that decision.

Let’s pretend we have an experiment running: are Alpacas or Bears better at converting users on our site’s landing page?

To follow along, all you need is an IPython notebook and SciPy. The easy-mode way to get started is Anaconda.

The Beta Distribution

The broad idea behind Bayesian conversion rate testing is to start with two distributions, one per branch, that cover all possible conversion rates. As test results come in, we update each distribution and adjust our expectation of the most representative rate accordingly.

We can represent that here with a Beta distribution that has two parameters: α which represents successful conversions and β which represents people who exited without converting.

You can think of α and β like odds: 10:1, 2:3 etc. The one difference being that 4:6 represents a stronger belief in the same conversion rate than 2:3.
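To make the odds analogy concrete, here is a minimal sketch using `scipy.stats.beta`. The variable names are mine, chosen for illustration.

```python
from scipy.stats import beta

# Two Beta distributions with the same ratio of conversions to exits.
weak = beta(2, 3)     # 2 conversions : 3 exits
strong = beta(4, 6)   # 4 conversions : 6 exits

# Both have the same mean, alpha / (alpha + beta) ...
print("means:", weak.mean(), strong.mean())

# ... but the one built from more observations is narrower,
# which is what "a stronger belief" means here.
print("std devs:", weak.std(), strong.std())
```

The means match while the standard deviation shrinks as α and β grow, so the larger pair pins the rate down more tightly.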

You can read more about the details of the Beta distribution here.

What even are priors?

Before we start the experiment we don’t know whether Alpacas or Bears are better for conversion.

Say we had no clue at all and thought that for both branches any conversion rate was equally likely. We can represent this with a single conversion and a single exit or Beta(α=1, β=1).

Graphically, that distribution looks like this:
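A quick numerical check of that flat shape, as a sketch: the density of Beta(1, 1) should be the same at every conversion rate. The grid of rates is an arbitrary choice of mine.

```python
import numpy as np
from scipy.stats import beta

# A handful of candidate conversion rates strictly between 0 and 1.
rates = np.linspace(0.1, 0.9, 9)

# Beta(1, 1): one conversion, one exit, i.e. "no clue at all".
flat_prior = beta(1, 1)
densities = flat_prior.pdf(rates)

# Every rate gets the same density, so no rate is favoured.
print(densities)
```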

In reality, we know a little bit about the likely conversion rates. We can use this information to dismiss unlikely outcomes and speed up our test!

Let’s say we’ve tried Alpacas before and found they convert about 16% of the time. To represent this we can use the distribution Beta(16, 100 - 16). I’m skeptical about the performance, however, so let’s scale it down to Beta(8, 42), which keeps the same 16% rate but carries half the weight.

Both distributions are shown below. It’s worth playing around with these to see how the distribution changes.
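If you want to play with the numbers rather than the plots, this sketch compares the two priors. The 5% and 95% cut points are my own choice, picked only to show the spread.

```python
from scipy.stats import beta

confident_prior = beta(16, 100 - 16)  # 16 conversions out of 100
skeptical_prior = beta(8, 42)         # same rate, half the weight

for name, prior in [("Beta(16, 84)", confident_prior),
                    ("Beta(8, 42)", skeptical_prior)]:
    # ppf is the inverse of the CDF: it maps a percentile to a rate.
    low, high = prior.ppf([0.05, 0.95])
    print(f"{name}: mean={prior.mean():.3f}, "
          f"90% of the mass between {low:.3f} and {high:.3f}")
```

Both priors are centred on 16%, but the skeptical one spreads its mass over a wider range of rates, so new data can move it more easily.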

Our Pretend Experiment

It’s about time we added some “real” data to this experiment. But first I’m going to let you in on a secret: Bears are better for conversion.

We’re going to model this by generating a small number of random results between 0 and 1, then picking values below a certain cutoff.

The cutoff represents our true conversion rate and values below it are our conversions.
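Here is one way that simulation could look. Treat the specifics as assumptions: I’m using 16% as the true Alpaca rate, making Bears 10% better, and picking the sample size and random seed arbitrarily. `simulate_conversions` is a hypothetical helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for repeatability

def simulate_conversions(n_visitors, true_rate, rng):
    """Count how many of n_visitors convert at the given true rate."""
    draws = rng.random(n_visitors)         # uniform values in [0, 1)
    return int((draws < true_rate).sum())  # below the cutoff = converted

alpaca_rate = 0.16             # assumed true rate for Alpacas
bear_rate = alpaca_rate * 1.1  # assumed: Bears are 10% better

n_initial = 150                # assumed small initial sample per branch
alpaca_conversions = simulate_conversions(n_initial, alpaca_rate, rng)
bear_conversions = simulate_conversions(n_initial, bear_rate, rng)

print("Alpacas:", alpaca_conversions, "of", n_initial)
print("Bears:  ", bear_conversions, "of", n_initial)
```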

What does our initial data look like?

Let’s take a look at our experiment results. The first thing we do is take our small amount of initial data and add our prior beliefs to both branches. After that we generate the posterior distributions, and make some graphs of the results.
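The update step itself is just addition: conversions go onto α and exits go onto β. A sketch with made-up counts (the numbers in `results` are hypothetical, not output from a real run):

```python
from scipy.stats import beta

# The skeptical prior from earlier, applied to both branches.
prior_alpha, prior_beta = 8, 42

# Hypothetical initial results: (conversions, visitors) per branch.
results = {"Alpacas": (22, 150), "Bears": (27, 150)}

posteriors = {}
for branch, (conversions, visitors) in results.items():
    exits = visitors - conversions
    # Posterior = Beta(prior conversions + new conversions,
    #                  prior exits + new exits)
    posteriors[branch] = beta(prior_alpha + conversions, prior_beta + exits)
    print(branch, "posterior mean:", round(posteriors[branch].mean(), 4))
```

From here, plotting each posterior’s `pdf` over a grid of rates gives the kind of graph described above.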

This should generate results like the graph below. We can see a difference between the two branches, but they overlap so much it’s hard to say which is the true winner. (Note: They could randomly separate here but that would still be a weak result due to the low sample size.)

It’s pretty clear we need some more data to make an accurate decision.
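To see why more data helps, this sketch compares how wide the middle 80% of the posterior is for a small and a large sample with the same observed rate. The counts are hypothetical, and `interval_width` is a helper I made up for the illustration.

```python
from scipy.stats import beta

prior_alpha, prior_beta = 8, 42  # the skeptical prior from earlier

def interval_width(conversions, visitors):
    """Width of the central 80% of the posterior for one branch."""
    posterior = beta(prior_alpha + conversions,
                     prior_beta + visitors - conversions)
    low, high = posterior.ppf([0.1, 0.9])
    return high - low

# Same 16% observed rate, very different sample sizes (hypothetical).
small = interval_width(24, 150)
large = interval_width(1600, 10000)

print(f"150 visitors:    width {small:.4f}")
print(f"10,000 visitors: width {large:.4f}")
```

The larger sample gives a much narrower posterior, which is what lets the two branches pull apart.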

Much better!

But how do we know for sure? To do that we need some more tools.

Bayesian Error Bars

A simple approach to comparing the two distributions is to generate error bars for each branch’s distribution. To do that we use the extremely useful CDF or Cumulative Distribution Function which turns a value in the distribution into its percentile rank.

What this means in practice is we can look at the y-axis and say things like “there’s an 80% chance the true conversion rate is below the corresponding x-axis value”.

For error bars we want to find the range captured by some arbitrary percentage, e.g. the 80% credible interval (the Bayesian counterpart of a confidence interval). In this case we’d look for the median and the conversion rates at which the CDF reaches 10% and 90%.

For a weak representation of a 30% conversion rate we can see the wide range the error bars cover:
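As a sketch, take Beta(3, 7) as the weak 30% belief (my choice of parameters; any small α and β with that ratio would do). SciPy’s `ppf` is the inverse of the CDF, so it hands back the rate at a given percentile.

```python
from scipy.stats import beta

# A weak belief in a 30% conversion rate: 3 conversions, 7 exits.
weak_belief = beta(3, 7)

# Rates at the 10th, 50th and 90th percentiles.
low, median, high = weak_belief.ppf([0.1, 0.5, 0.9])
print(f"80% interval: {low:.3f} to {high:.3f}, median {median:.3f}")

# Round trip: the CDF turns the rate back into its percentile.
print("CDF at the lower bound:", round(weak_belief.cdf(low), 3))
```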

Let’s see what our error bars are for the Alpaca vs. Bear experiment:
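The same idea applied to both branches might look like the sketch below. The posterior parameters are hypothetical: the Beta(8, 42) prior plus 640 and 704 conversions out of 4,000 visitors each.

```python
from scipy.stats import beta

# Hypothetical posteriors: prior (8, 42) plus made-up experiment counts.
posteriors = {
    "Alpacas": beta(8 + 640, 42 + 4000 - 640),
    "Bears": beta(8 + 704, 42 + 4000 - 704),
}

bars = {}
for branch, posterior in posteriors.items():
    # (10th percentile, median, 90th percentile)
    bars[branch] = tuple(posterior.ppf([0.1, 0.5, 0.9]))
    low, median, high = bars[branch]
    print(f"{branch}: {low:.4f} | {median:.4f} | {high:.4f}")
```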

So it looks like the Bears win! But there’s still some overlap, so let’s double check to be really sure.

Bayesian p-values

All these Bayesian stats are making me miss p-values, so let’s make a Bayesian equivalent.

We want to answer the question “what is the probability that Alpacas are actually better than Bears?” We can estimate this by drawing samples from both distributions and checking which sample in each pair has the larger conversion rate. If we do this enough times, the share of pairs where Alpacas come out ahead gives a pretty accurate answer to the question above.
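A sketch of that sampling approach, using NumPy’s Beta sampler. The posterior parameters are hypothetical (the Beta(8, 42) prior plus 640 and 704 conversions out of 4,000 visitors each), and the seed and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
n_samples = 100_000

# Hypothetical posteriors, given as (alpha, beta) pairs.
alpaca_params = (8 + 640, 42 + 4000 - 640)
bear_params = (8 + 704, 42 + 4000 - 704)

# Draw plausible conversion rates from each posterior.
alpaca_samples = rng.beta(*alpaca_params, size=n_samples)
bear_samples = rng.beta(*bear_params, size=n_samples)

# Share of paired draws in which Alpacas beat Bears.
p_alpacas_better = np.mean(alpaca_samples > bear_samples)
print("P(Alpacas better than Bears) is about", p_alpacas_better)
```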

Our “p-value” is less than 0.05, so we can declare Bears the winner!

Small Victories

Doing testing to find really small wins is very expensive for most businesses. It makes sense to check just how much of an improvement we think the Experiment is vs. the Control.

To do this we can generate a CDF of the B samples over the A samples:
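A sketch of that calculation, with Alpacas as branch A and Bears as branch B. As before, the posterior parameters are hypothetical and the seed is arbitrary. Each ratio is one plausible value for how much better B is than A.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n_samples = 100_000

# Hypothetical posteriors: A = Alpacas, B = Bears.
a_samples = rng.beta(8 + 640, 42 + 4000 - 640, size=n_samples)
b_samples = rng.beta(8 + 704, 42 + 4000 - 704, size=n_samples)

# Ratio of B to A for each pair of draws; 1.0 means "no difference".
ratios = b_samples / a_samples

# Empirical CDF: sorted ratios on the x-axis, cumulative share on the y-axis.
sorted_ratios = np.sort(ratios)
cumulative = np.arange(1, n_samples + 1) / n_samples

median_lift = np.median(ratios)
share_a_wins = np.mean(ratios < 1.0)
print("median B/A ratio:", round(median_lift, 3))
print("share of draws where A wins:", round(share_a_wins, 3))
```

Plotting `cumulative` against `sorted_ratios` gives the CDF, and the x value where it crosses 0.5 is the median improvement.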

As expected there aren’t many values below 1 (the point where both branches perform equally), meaning the A branch didn’t win very often. Eyeballing the median we can see it’s pretty close to 1.1, agreeing with our initial “10% better” setting.

Winner!

So we’re pretty convinced that the branch we chose to win did, in fact, win.

This has been an overview-level description of why each of the steps is important, but I encourage you to check out the links to Will Kurt’s site Count Bayesie, from which a lot of this is adapted.