Optimizing Websites with Bandits

Using bandit policies to efficiently learn as you go

Joe Tenini, PhD
Red Ventures Data Science & Engineering
13 min read · Aug 14, 2019



Making good decisions requires information about your options. In the absence of information, people are generally understanding when you make mistakes — after all, mistakes are a great way to generate data about your options. For this reason, people say things like “don’t knock it till you try it.” But people also lose sympathy for you when you keep making the same mistakes over and over — “fool me once,” and all that. This reflects an intuition that there should be a balance between trying new things to gather information and taking advantage of the information that we get through our trials. In machine learning, this is sometimes talked about as the “explore vs. exploit” dilemma.

At Red Ventures we run a lot of websites — something like 50 sites serving about 115 million sessions per month. With each site we run, there are always options about which products we should feature, which special offers, which design — even things like colors and images can make a difference between a good experience and one you’d rather not stare at. For this reason, we need systems to efficiently test different experiences. This article will introduce some of these methods and provide example code, so that you can do your own exploration.

Take 1 — A/B Testing — A Standard Approach

If you’re thinking about testing things for the first time, your impulse might be to try each option a bunch of times and then use the one that does best. The main nuance with this approach comes in quantifying how many times “a bunch” is and in interpreting the results of your trial. There are a few ways to think this through, but here is the most straightforward one that I know of.

Suppose we found a coin and were wondering about the probability θ that it would come up heads (I’ll refer to this number as the “heads rate” for the coin). Let’s say we flipped it 10 times, and it came up heads twice. We could encode this as a tuple:

D = (T, T, T, H, T, T, H, T, T, T)

What would be a good model for the heads rate? If you’re thinking now that the heads rate is θ = 1/5 for this coin — you should ask yourself how sure you are. Maybe it’s actually a fair coin? A probability density function could describe our beliefs about the true heads rate of the coin. So, what should this density look like?

Some might say that the shape of this distribution should depend not just on the evidence we have before us (10 flips of the coin with two heads) but also our prior beliefs about the coin. For example, our interpretation of the 10 flips would likely be different if the coin were a US quarter straight from the Mint versus a Galleon handed to us by Harry Potter. Bayes’ rule gives us a way to reason about these two forces: (1) our prior beliefs and (2) the data we have.

The Likelihood Function and the Prior Distribution

We would like to encode the data we have gathered. One way to do this would be to consider the probability of seeing this data under the circumstance of having a given heads rate of θ. In this case, the probability would be given by:
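P(D | θ) = θ^a · (1 − θ)^b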

This is referred to as the “likelihood function,” where a is the number of heads and b is the number of observed tails. We will generally think of this as a function of our parameter of interest θ. What we would really like to know about is P(θ|D) — the distribution for θ given the data we’ve seen, which we call the posterior probability. The link between our known likelihood function and this posterior is given by Bayes’ rule (which follows directly from the definition of a conditional probability):
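P(θ | D) = P(D | θ) · P(θ) / P(D)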

D for us is fixed, so it makes sense to think of P(D) as a constant c — independent of θ — to normalize our posterior in the right way. We might also — for the moment — take for P(θ) a uniform distribution where P(θ)=1 for all θ between 0 and 1. With these reductions and substituting in our calculated likelihood, we have that:
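P(θ | D) = θ^a · (1 − θ)^b / c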

Moreover, we can calculate the value of c from the knowledge that our posterior should integrate over θ to a value of 1. So, we have:
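c = ∫₀¹ θ^a · (1 − θ)^b dθ, so that P(θ | D) = θ^a · (1 − θ)^b / ∫₀¹ θ^a · (1 − θ)^b dθ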

This is known as a beta distribution — and it emerges as a natural choice for us. It is specified by two parameters, which we can think of as the number of successes and failures. Convention is to write it as
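Beta(α, β): P(θ) = c · θ^(α − 1) · (1 − θ)^(β − 1)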

where c is the appropriate normalizing constant. Moreover, the beta distribution has the excellent property that, when we pair a beta prior with a coin-flip likelihood, Bayes’ rule forces the posterior to be a beta distribution as well. This is often summarized by saying that the beta distribution is a conjugate prior for the Bernoulli likelihood function.

Estimating the heads rate for our mystery coin

Let’s look at our coin story to see an example of this. We had data consisting of 10 flips with 2 heads and 8 tails. This gives us a likelihood of
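P(D | θ) = θ^2 · (1 − θ)^8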

The choice of prior should encode our belief about the situation before considering this data. Let’s say that our prior is given by a Beta(5,5) distribution, which is to say that we think it is a fair coin but have a very wide confidence interval on this assumption. The posterior can then be computed as:
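P(θ | D) ∝ θ^2 · (1 − θ)^8 · θ^4 · (1 − θ)^4 = θ^6 · (1 − θ)^12, which is a Beta(7, 13) distribution.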

By this same logic, we can see that if we had instead chosen a Beta(1,1) (that is, uninformative) prior, then we would arrive at a Beta(3,9) posterior. In this way there is an obvious mechanism by which the prior belief and empirical data are mixed to arrive at the posterior. Below are graphs of the various Beta distributions being discussed (created using R with the following code).
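A minimal sketch of such code, using base R’s dbeta:

```r
# Densities of the Beta distributions discussed above
theta <- seq(0, 1, length.out = 500)

plot(theta, dbeta(theta, 1, 1), type = "l", ylim = c(0, 4),
     xlab = "heads rate", ylab = "density")        # Beta(1, 1): uninformative prior
lines(theta, dbeta(theta, 5, 5), lty = 2)          # Beta(5, 5): "probably fair" prior
lines(theta, dbeta(theta, 3, 9), col = "red")      # Beta(3, 9): posterior under Beta(1, 1)
lines(theta, dbeta(theta, 7, 13), col = "blue")    # Beta(7, 13): posterior under Beta(5, 5)
legend("topright", c("Beta(1,1)", "Beta(5,5)", "Beta(3,9)", "Beta(7,13)"),
       lty = c(1, 2, 1, 1), col = c("black", "black", "red", "blue"))
```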

PDFs for different Beta Distributions

Now that we have distributions that describe the posterior probability of the heads rate of our coin under various priors, we can reason about what types of behavior we can reasonably expect from the coin. For example, we can use the associated cumulative distribution functions to estimate the probability that the heads rate of the coin in question is greater than 50%:
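P(θ > 0.5 | D) = 1 − F(0.5), where F is the cumulative distribution function of the posterior.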

In R, this could be accomplished by:
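```r
# One way to do this with base R's pbeta (values are approximate)
1 - pbeta(0.5, 3, 9)    # Beta(3, 9) posterior from the uninformative prior: about 0.03
1 - pbeta(0.5, 7, 13)   # Beta(7, 13) posterior from the Beta(5, 5) prior: about 0.08
```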

for the posterior given by the uninformative prior and the Beta(5,5) prior respectively.

Relationship with testing

So far, our discussion has revolved around understanding the heads rate of a single coin by data generated through trials (or flips) of this coin. A more interesting situation is when we have several coins to choose from and we want to determine the coin with the highest heads rate. This is a situation that comes up at Red Ventures frequently. In particular, we may have three different offers that could be shown on a website — each with its own conversion rate. We would like to understand these conversion rates, so that we know which offers resonate most with users and which offers will drive the most conversions.


In this case, we can choose priors for the offers, randomly serve offers to visitors to collect data, and then calculate posteriors for the different conversion rates as described above. We can then use these posterior distributions to reason about our options and decide how to move forward.

While this strategy is satisfactory from a knowledge gathering perspective, it leaves something to be desired from an efficiency perspective. To see this, suppose that we are testing three offers. After a day of showing offers randomly, we have the following conversion data:

Option 1: 100 views, 6 conversions

Option 2: 100 views, 12 conversions

Option 3: 100 views, 15 conversions

With uninformative (Beta(1,1)) priors, this data gives us the following posteriors: Beta(7, 95) for option 1, Beta(13, 89) for option 2, and Beta(16, 86) for option 3.

Posteriors for our 3 offers after 1 day of randomized testing

Looking at these posteriors, you may have the feeling that testing should continue tomorrow, but also that we’re losing conversions by showing offer 1. Maybe offers 2 and 3 should receive a greater portion of tomorrow’s traffic to help mitigate some of these losses?

Testing ideas in proportion to their performance is the core idea behind Thompson Sampling, which balances exploration (gaining information about your options) and exploitation (maximizing conversions over time by using the information you get along the way). We’ll use the next section to dive into this idea and try to quantify how much more efficient it can be.

Take 2 — A Bandit Approach

I tend to think about the testing-based view of decision optimization from a “knowledge-focused” perspective. That is, I’m focused on understanding each option and quantifying this understanding with very small error bars. When we have this understanding, the optimization part becomes a result of the knowledge. If we can quantify how well each action performs, then we know how to act. What if we instead prioritized, not knowledge, but performance over a period of time? This is — in essence — the bandit problem.

The bandit problem

The original phrasing of the bandit problem is as follows: Imagine three slot machines in front of you. Every minute you get to pull the handle on one slot machine and receive a “reward” — let’s say this reward is 0 or 1. Your goal is to maximize the sum of all rewards received.


It turns out that there are many flavors of this problem, but we’ll assume that the reward for slot i (for each i) is sampled i.i.d. from a static unknown distribution ν(i) (this assumption lands us in what is called the “stochastic multi-armed bandit problem” and precludes situations where rewards are being carefully chosen by some unknown adversary).

One thing you might be wondering is: “why is this called ‘the bandit problem?’” Bandit here is short for “multi-armed bandit.” At some point slot machines were known as “one-armed bandits” because they have one arm and tend to steal your money.

This problem is analogous to our offer selection problem: for each visitor, we select offer 1, 2, or 3 to show them, and then we receive (as a reward) either a conversion or a non-conversion. Our goal is to have as many conversions (or sales, or clicks, or signups) as possible.

Thompson Sampling

There are many approaches to this problem, but the method that stands out for its theoretical and empirical performance is also one that is quite old and simple. Proposed in 1933, Thompson Sampling works by constructing for each option (usually called arms in the bandit language) a posterior distribution for the unknown reward rate of that arm. When it is time to select an option for a user, we sample from each posterior and select the option with the highest sample. After the reward is received, the appropriate posterior may be updated.

My very own copy of Thompson’s original paper on the subject gifted by Keith Williams.

More formally, the problem is stated as:

The k-armed bandit problem
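At each time step t = 1, 2, …, T, the learner chooses an arm a_t from {1, …, k} and receives a reward r_t sampled from the unknown distribution ν(a_t). The goal is to choose arms so as to maximize the expected total reward E[r_1 + r_2 + … + r_T].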

And the proposed solution can be written as:
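In the Beta-Bernoulli setting used throughout this post, the procedure amounts to the following sketch. For each arm i, keep a running count of successes S_i and failures F_i, both starting at 0. Then, for each incoming user:

1. Sample θ_i from Beta(S_i + 1, F_i + 1) for each arm i.
2. Select the arm a with the largest sample θ_a and observe the reward r (0 or 1).
3. Update S_a to S_a + r and F_a to F_a + (1 − r).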

Let’s think through how Thompson Sampling would behave in the case of our three offers from above. At that point in time, we have:

Option 1: 100 views, 6 conversions

Option 2: 100 views, 12 conversions

Option 3: 100 views, 15 conversions

and so we would expect the Thompson Sampling algorithm to recommend all options — but not equally. We would expect options 3 and 2 to receive the most sessions and option 1 to receive far fewer sessions. To get a sense of this behavior, we can take 10,000 samples from each of these distributions and count how often each arm would be chosen. In R, this could be accomplished with:
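```r
# Draw 10,000 samples from each posterior and see how often each option wins
set.seed(2019)  # arbitrary seed; exact proportions will vary slightly

n_sim <- 10000
samples <- cbind(
  rbeta(n_sim, 7, 95),   # option 1: Beta(1 + 6, 1 + 94)
  rbeta(n_sim, 13, 89),  # option 2: Beta(1 + 12, 1 + 88)
  rbeta(n_sim, 16, 86)   # option 3: Beta(1 + 15, 1 + 85)
)

# For each draw, find the option with the largest sample, then tabulate
winners <- max.col(samples)
table(factor(winners, levels = 1:3)) / n_sim
```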

So we see that, at this point in time, the Thompson Sampling algorithm would select option 1 less than 1% of the time, option 2 about 27% of the time, and option 3 about 73% of the time.

Empirical Results

It’s important to remember here that the selling point of Thompson Sampling is not that it unlocks more knowledge than an A/B test, but that it learns more efficiently. To try to quantify this, let’s do an experiment. Suppose that I have three different offers that can be displayed on a web page and that I have allotted 10,000 sessions to testing these offers. Let’s say that the real conversion rates for the three offers are:

Offer 1: conversion rate 12%

Offer 2: conversion rate 5%

Offer 3: conversion rate 9%

One useful way to gauge the efficiency of a test is to consider the number of conversions lost when compared to showing the best option at all times. This concept is known as regret. The regret (or expected number of lost conversions) of splitting traffic equally between the three options can then be calculated as:
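Regret = (10,000 / 3) × [(0.12 − 0.12) + (0.12 − 0.05) + (0.12 − 0.09)] = (10,000 / 3) × 0.10 ≈ 333 conversions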

So in running a traditional split test on these sessions, we pay 333 conversions for whatever knowledge we get. What then is the regret incurred by using Thompson Sampling? We can answer this question with a simple simulation in R, sketched below.
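The sketch below assumes Beta(1, 1) priors for each offer and binary (conversion or no conversion) rewards:

```r
set.seed(1)  # arbitrary seed; the process is stochastic, so results will vary

true_rates <- c(0.12, 0.05, 0.09)  # true conversion rates for offers 1, 2, 3
n_sessions <- 10000

successes <- c(0, 0, 0)  # conversions observed for each offer
failures  <- c(0, 0, 0)  # non-conversions observed for each offer
chosen    <- integer(n_sessions)
rewards   <- integer(n_sessions)

for (t in 1:n_sessions) {
  # Thompson Sampling: draw one sample from each posterior Beta(1 + successes, 1 + failures)
  theta <- rbeta(3, successes + 1, failures + 1)
  arm   <- which.max(theta)

  # Show the chosen offer and observe a simulated conversion (1) or non-conversion (0)
  reward <- rbinom(1, 1, true_rates[arm])

  # Update the posterior counts for the arm that was shown
  successes[arm] <- successes[arm] + reward
  failures[arm]  <- failures[arm] + (1 - reward)

  chosen[t]  <- arm
  rewards[t] <- reward
}
```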

This code constructs naive priors for each option and then recommends offers based on the posterior distributions — updating the posteriors as new data becomes available along the way according to Bayes’ rule.

We can then look at the history of recommended arms
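```r
# Share of sessions in which each offer was recommended over the life of the experiment
prop.table(table(factor(chosen, levels = 1:3)))
```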

to see that this bandit policy quickly discards option 2 and spends a bit longer learning about option 3, but ultimately recommends option 1 (the optimal choice) about 93% of the time over the life of the experiment. (Note that, since this is a stochastic process, your output may look different).

The number of conversions under this policy can then be counted and compared to the number of conversions expected when only the best arm is shown:
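```r
# Expected conversions from always showing the best offer (12% of 10,000 sessions)
n_sessions * max(true_rates)

# Conversions actually collected under the Thompson Sampling policy
sum(rewards)

# Regret: conversions given up relative to always showing the best offer
n_sessions * max(true_rates) - sum(rewards)
```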

So, we see that we pay only 33 conversions for our knowledge in this case. Compared with a traditional split test we gain an extra 300 conversions by choosing this methodology.

Let’s look at the beta distributions learned under these two policies. Under the split test, we have:
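```r
# A sketch: simulate an even split of the 10,000 sessions and plot the resulting posteriors
split_n    <- round(n_sessions / 3)
split_conv <- rbinom(3, split_n, true_rates)  # conversions per offer under the even split

theta <- seq(0, 0.25, length.out = 500)
dens  <- sapply(1:3, function(i) dbeta(theta, split_conv[i] + 1, split_n - split_conv[i] + 1))
matplot(theta, dens, type = "l", lty = 1, col = c("black", "red", "blue"),
        xlab = "conversion rate", ylab = "density")
```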

and under Thompson Sampling, we get:
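```r
# A sketch: plot the posteriors from the counts accumulated by the bandit simulation above
theta <- seq(0, 0.25, length.out = 500)
dens  <- sapply(1:3, function(i) dbeta(theta, successes[i] + 1, failures[i] + 1))
matplot(theta, dens, type = "l", lty = 1, col = c("black", "red", "blue"),
        xlab = "conversion rate", ylab = "density")
```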

These pictures help to illustrate the fact that, in a traditional split test, we seek out information equally among all options. As a result, each distribution gives a relatively tight estimate of the true conversion rate. By contrast, with Thompson sampling we learn only enough about the less optimal options to know that they’re not better than the best option. This is evidenced by the wider degree of uncertainty captured for offers 2 and 3. In short, we don’t care to carefully quantify the conversion rates of our underperforming assets — we just want to know that they are underperforming.

What next?

There are many enhancements to the above strategy that we have found valuable at Red Ventures. While discussions of how these can be implemented will have to be postponed until a future post, here are some challenges that are worth handling with additional methodology. Interested in hearing more about a particular topic? Let me know in the comments!

Non-binary reward structures

In general, we might want to optimize toward a non-binary reward like revenue per session.

Dynamic reward structures

Just as fashion preferences drift over time, we might expect user preferences to be dynamic. You may want a way for information to decay over time, so that you don’t develop a policy that becomes out of date.

Contextual decisions

In reality, it is unlikely that all of your users will respond best to a single treatment. Ideally your policy should be able to distinguish between different users and learn to make personalized decisions. This is a problem called the contextual multi-armed bandit problem. Internally, we see — on average — an additional 9% lift in conversion rates when leveraging contextual optimization.

Abstracting your action space

It may make sense to think of your site modularly so that experiences with overlapping content can share learnings. Capturing this overlap can help to speed up learning and allow for the creation of meaningful analytical feedback to your creative teams.

Handling delayed feedback

In practical application, rewards are not immediately available, and we must wait for a period of time to allow users to provide feedback (not to mention latency in the data pipeline). Understanding how this lag affects the efficiency of the algorithm can help you to design efficient experiments.
