Do Bayesian A/B Testing With Me Bro

Shout-outs

If you work in growth, you will at some point be involved in A/B or multivariate testing. I've run my fair share and have learned some lessons about what to do and what not to do. In this post I will share a step-by-step process for running a multivariate test with the bandit R package and a Bayesian approach.

This post is based on a collection of ideas and methods from other blogs, so it is not really original. Nevertheless, my goal is to present an easy-to-follow guide for A/B/C/D testing with tools freely available to anyone.

It’s only fair to begin the post by acknowledging the blogs and sources that have been helpful.

Bayes Theorem Explained with Legos. An intuitive introduction to Bayesian theory that explains Bayes’ Theorem well. We use a Bayesian method to evaluate the results of our test for multiple reasons:

  • In some cases it saves time or money, since you can get a result faster
  • We can actually tell what we stand to lose by picking one variation over another

Evan Miller’s A/B testing. If you are using a frequentist method for A/B testing, you have to read his blog. Very useful ideas regarding A/B testing and Bayesian A/B testing. He has some code for declaring the winner in a Bayesian A/B test, but we will be using a different piece of code.

Chris Stucchio from VWO. The blog, the presentation and the paper he has written are all great. They explain VWO’s methodology and address some issues with Bayesian testing. Very insightful.

StoreMaven. It’s not really their blog, but their great product gave me the idea of using the bandit algorithm for our testing. They are a great tool for A/B testing app store pages on mobile, and I recommend everyone give them a shot. A great group of people to work with. (I’m a client.)

Bandit R Package. This is the package we will be using to do our Bayesian calculations. You can get into fancy models with MCMC and JAGS in R to calculate the results, but the point of this post is to enable anyone who knows nothing about Bayesian A/B testing to give it a shot. To pre-empt some comments: this tool is not a panacea and in some cases not 100% correct, but it’s better than nothing and, in my experience, better and cheaper than using a frequentist method.

Requirements & Assumptions

Ok, now that I have given the appropriate kudos and shout-outs, let’s move on to set up a system that does A/B/C/D testing for us. You will need:

Onion Pritikin Recipe
  • 2 Cups of Testing Data
  • 1 Tablespoon of R code
  • 1 Bottle of beer
  • a Tableau instance (good to have)

Some of the assumptions I’ll be making for this test:

  • We are running an A/B/C/D test on a mobile User Acquisition creative.
  • The data is magically tracked by our BI system and finds its way into our database. I won’t discuss how this happens, but all results are available to us.
  • There is no significant time-based influence in our data (no seasonality of any kind).
  • The data is obviously made up.
  • We don’t need to worry about our traffic, bidding type, bidding algorithm or other things that might be affecting our campaigns.
  • The data updates every 24 hours and looks like this:
[Screenshot: the daily test data per variation]

Prior Selection

The most common criticism of Bayesian methods is that the choice of prior adds subjective bias to the analysis.

We are testing creatives that receive thousands of impressions or clicks. Therefore it makes no real difference which prior we use, because we can reasonably expect the evidence to overwhelm it. Keep in mind that if your sample size is small, it’s important to put some thought into how you choose the prior. In our case we just use a very common Beta distribution with a = b = 1. This is called a flat prior and is used when someone has little knowledge about the data and wants the prior to have the least effect on the outcome of the analysis.
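To see why the flat prior barely matters at this scale, here is a minimal sketch (the conversion numbers are hypothetical) of how a Beta(1, 1) prior combines with binomial data into a Beta posterior:

a <- 1; b <- 1                      # flat Beta(1, 1) prior
conv <- 100; imp <- 1000            # hypothetical: 100 conversions out of 1,000 impressions
(a + conv) / (a + b + imp)          # posterior mean ~0.101, barely moved from the raw rate 0.1
qbeta(c(0.025, 0.975), a + conv, b + imp - conv)   # 95% credible interval for the conversion rate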

Step 1 — Load R

Launch an R instance.

For our purposes you will need to install and load the following packages:

# install.packages(c("scales", "ggplot2", "reshape2", "bandit"))  # first time only
library(scales)
library(ggplot2)
library(reshape2)
library(bandit)

Step 2 — Load Your Data

After we have installed and loaded the above libraries, it’s time to load our data. I’m not sure what sort of database you have, but if you are using Redshift you can use an R package called RJDBC to create a connection to your database and pull your data via a query.
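If you do happen to be on Redshift, a minimal sketch of that connection might look like the following (the driver jar path, connection string, credentials, and table/column names are all hypothetical placeholders):

library(RJDBC)
# Hypothetical driver jar and cluster details; replace with your own
drv <- JDBC("com.amazon.redshift.jdbc41.Driver", "/path/to/RedshiftJDBC41.jar")
conn <- dbConnect(drv, "jdbc:redshift://your-cluster:5439/yourdb", "your_user", "your_password")
# Hypothetical query returning conversions and impressions per variation
results <- dbGetQuery(conn, "SELECT variation, conversions, impressions FROM ab_test_daily")
dbDisconnect(conn)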

We are evaluating conversion results, so there are two types of data groups: the population exposed to each variation (Group N) and the users that converted (Group X). The results need to be manually imported into R:

x = c(100, 95, 96, 98)        # conversions per variation
n = c(1000, 1100, 1050, 950)  # users exposed per variation

  • X: The users that clicked/converted/installed during the test, per variation.
  • N: The total number of users that were exposed to the test, per variation.

Step 3 — Getting results

Now let’s use the bandit package to calculate the winners:

pb = sim_post(x, n, ndraws = 100000)   # simulate draws from the posterior of each variation
prob_winner(pb)                        # probability each variation is the best

[Screenshot: prob_winner output, the probability of each variation being the winner]

The table above gives us the Bayesian probability that each of the variations is the winner. So for Variation 1 the probability of being the winner is 35.04%, for Variation 2 the probability is 31.23%, and so on.

At this point we don’t have enough information to make a call. We have to wait a bit longer and get some more data.
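One way to see why a call would be premature is to look at the 95% credible interval for each variation straight from the simulated posterior; a quick sketch using the pb matrix from above:

apply(pb, 2, quantile, probs = c(0.025, 0.975))   # one column of bounds per variation
# The intervals overlap heavily at this sample size, so no variation stands out yet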

Step 4 — Repeat Steps 2 & 3

After a day we have the following data:

x1 = c(800, 730, 820, 765)
n1 = c(8000, 8500, 8050, 7850)

Let’s rerun the same code and see the results:

pb1 = sim_post(x1, n1, ndraws = 100000)
prob_winner(pb1)

[Screenshot: prob_winner output on the new data]

So Variation 2 has almost zero probability of being the winner. Let’s look at the data in a little more detail:

significance_analysis(x1,n1)

[Screenshot: significance_analysis output on the new data]

In this step we calculate, given the posterior results, the Bayesian probability that each variation is the winner.

We calculate:

  • The probability of each alternative outperforming the next-lower alternative (p_best)
  • The confidence interval on the estimated amount by which each variation outperforms the next alternative (lower & upper); see the sketch after this list for checking such pairwise comparisons directly from the posterior draws
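These pairwise comparisons can also be sanity-checked directly from the simulated posterior draws. A quick sketch using the pb1 matrix from above (column 2 corresponds to Variation 2):

colMeans(pb1 > pb1[, 2])    # probability each variation beats Variation 2 (Variation 2 vs itself is 0)
mean(pb1[, 3] > pb1[, 4])   # a single pairwise comparison, e.g. Variation 3 vs Variation 4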

A couple of notable things are happening here with Variation 2:

  • Its p_best is almost negligible
  • Moreover, the lower and upper bounds for the variation ranked 3rd (Variation 4 in our example) are positive, and the significance is at 1%

This is pretty important, as our decision to pause a variation depends on the following:

  • Value remaining is almost 0. The “value remaining” in an experiment is the amount of increased conversion rate you could still get by switching away from the winning variation.

So we can go ahead and stop Variation 2.
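If you want to put a number on that "value remaining is almost 0" statement already at this stage, you can run the same value_remaining call that we will use in Step 6 on today's data; a quick sketch:

vr1 <- value_remaining(x1, n1, alpha = 1, beta = 1, ndraws = 100000)   # posterior draws of the potential lift over the current leader
quantile(vr1, 0.95)   # 95th percentile of the lift we might still be giving up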

Let’s run the experiment one more day:

x2 = c(1300, 730, 1450, 1260)
n2 = c(13000, 8500, 13300, 12900)

pb2 = sim_post(x2, n2, ndraws = 100000)
prob_winner(pb2)

[Screenshot: prob_winner output after one more day]

All right, Var 2 had no chance of improving since we paused it. Variations 4 and 1 are looking bad too. Let’s see if we can make a call:

significance_analysis(x2,n2)

[Screenshot: significance_analysis output after one more day]

Step 5 — Visualization

It would be more helpful if we could graph the different variations so that we can see how much they overlap. You can do this with the following small piece of code:

dimnames(pb2)[2] <- list(c("Var 1", "Var 2", "Var 3", "Var 4"))

You need to name each column according to the variant. This step is important, as the rest of the visualization won’t work without named columns on the simulated posterior result.

melt_pb2 <- melt(pb2)   # long format: one row per draw per variation; the variation name ends up in column Var2

ggplot(melt_pb2, aes(x = value, fill = Var2)) +
  geom_density(colour = "black", size = 1, alpha = 0.40) +
  scale_x_continuous("Conversion %", labels = percent) +
  scale_y_continuous("Density") +
  scale_fill_discrete("Variations") +
  geom_hline(yintercept = 0, size = 1, color = "black")

[Plot: posterior conversion-rate densities for the four variations]

Things to Note.

  1. The more impressions, the narrower and more peaked the posterior distribution becomes; essentially, the more accurate our estimate will be.
  2. Var 2 (the one we paused) and Var 3 (the winning one) have essentially no overlap.
  3. Var 3 overlaps with Var 1 and Var 4. This means that although it’s winning, there is a possibility that if we pick it, one of the other two variations would actually do better (see the sketch after this list).
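That overlap can be quantified directly from the posterior draws. A small sketch using pb2 (columns 1, 3 and 4 correspond to Variations 1, 3 and 4):

mean(pb2[, 3] > pb2[, 1])   # probability the current leader (Var 3) beats Var 1
mean(pb2[, 3] > pb2[, 4])   # probability Var 3 beats Var 4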

Step 6 — Calculate The Remaining Value

At this point it would be a good idea to calculate the remaining value:

The remaining value is the distribution of the improvement another arm might have over the current best arm; the bandit package compares the best arm against the second best. Essentially it tells us the amount of improvement we might forgo by selecting the winning variation. This is a very important element of the Bayesian methodology: with this method we know how much we stand to lose if we select an alternative.

value_rem = value_remaining(x2, n2, alpha = 1, beta = 1, ndraws = 100000)
summarize_metrics(value_rem)

[Screenshot: summarize_metrics output for the value-remaining distribution]

Let’s create a graph for the remaining value as well:

ggplot(data.frame(value = value_rem), aes(x = value)) +   # wrap the draws in a data frame for ggplot
  geom_density(colour = "black", fill = "red", size = 1, alpha = 0.4) +
  scale_x_continuous("Conversion %", labels = percent, limits = c(0, 0.03))

[Plot: density of the remaining value]

Although not ideal, we can see that there is a little bit of remaining value left. This means that, in the worst case, if we picked the wrong variation (Var 3) as the winner, we would be forgoing a gain of about 0.01%. This is a pretty negligible gain, and at this point we can either call Var 3 the winner or wait one more day.

Ideally you want to call a winner when the remaining value is 0. If, though, you are running out of money or time (or both), then you have to evaluate the remaining value. If it’s small enough, you can pick the winner and move on.

In our case the difference is negligible, so I will pick Var 3 as the winner.

Great things about this approach:

  • You know how much you stand to lose
  • Peeking is less of a concern (although not eliminated)
  • You might get results faster
  • You can answer the “which one is better” question in a more understandable way

Please feel free to send me your comments and opinions at: growthtales at gmail.com.

If you liked this blog you will love our Sonic Game: