Smarter Multi-Armed Bandit Testing

Vlad-Doru Ion
3 min read · Apr 7, 2016


Recently there has been rising interest in the multi-armed bandit testing methodology. Articles like "20 lines of code that will beat A/B testing every time" have contributed greatly to this. However, the approach described in that article can be improved with a simple yet clever idea.

For those of you who have not read that article, I will briefly describe its approach.

def testing():
    if random.random() < 0.1:
        # Exploration!
        # Choose a random option 10% of the time.
        choice = random.choice(options)
    else:
        # Exploitation!
        # For each option, calculate the expected reward,
        # then choose the option with the greatest expectation.
        choice = max(options, key=expected_reward)
    display_choice(choice)
    # Do not forget to update the choice's usage count and reward.

What is wrong with that?

Let’s say our experiment chooses the position of a button on a website and we have 3 possible choices: A, B, C. Our reward is the click-through rate. For the sake of argument, let’s assume the following expected rewards after running the experiment for 1 day:

  • A: 4%
  • B: 11%
  • C: 10%

The way the algorithm works is that for 10% of the population it will randomly choose A, B, or C, each with probability 1/3. For the remaining 90% of the population we will always show the button in position B.
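The overall display probability of each option under this scheme can be computed directly; a minimal sketch (the function name and the `best` variable are illustrative, not from the original article):

```python
# Overall probability that each option is displayed under the
# epsilon-greedy scheme described above, with epsilon = 0.1.
epsilon = 0.1
options = ["A", "B", "C"]
best = "B"  # the option with the highest expected reward so far

def display_probability(option):
    # Every option gets an equal share of the exploration traffic;
    # only the current best also receives the exploitation traffic.
    p = epsilon / len(options)
    if option == best:
        p += 1 - epsilon
    return p

for option in options:
    print(option, round(display_probability(option), 4))
```

This prints roughly 0.9333 for B and 0.0333 for each of A and C, which is where the 3.33% figure below comes from.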

What does this hide?

Most experiment results change over time. For instance, in our example, position B may have been the default position for the button until now. However, position C shows great promise, since it almost matched B’s performance on the first day. But because B performed marginally better on day 1, the vast majority of users will continue to see it, while only 3.33% of users will see the button in position C.

What does this mean? It means it will take longer for C to move the experiment’s results and become the preferred choice, even though it may be the better option.

Suggested approach

Our approach suggests a smarter way of making the choice. Instead of always choosing the option with the highest expected reward for 90% of the population, we make a probabilistic choice weighted by the expected rewards. We can do this very easily by dividing each expected reward by the sum of all expected rewards. In our example, for 90% of the population we will choose A, B, or C with the following probabilities:

  • A: 0.16 = (4 / ( 4 + 11 + 10))
  • B: 0.44 = (11 / (4 + 11 + 10))
  • C: 0.40 = (10 / (4 + 11 + 10))
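A minimal sketch of this weighted draw, assuming the measured expected rewards are kept in a dict (the values are the percentages from the example above):

```python
import random

# Expected rewards measured so far, in percent.
expected_rewards = {"A": 4.0, "B": 11.0, "C": 10.0}

def weighted_choice(rewards):
    # random.choices normalizes the weights internally, so dividing by
    # the sum is implicit; B is drawn with probability 11 / 25 = 0.44.
    options = list(rewards)
    weights = [rewards[o] for o in options]
    return random.choices(options, weights=weights)[0]
```

Over many draws the empirical frequencies converge to the 0.16 / 0.44 / 0.40 split listed above.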

In my experience, this approach has yielded the best results in production. The pseudo-code for the algorithm looks like this:

def testing():
    if random.random() < 0.1:
        # Exploration!
        # Choose a random option 10% of the time.
        choice = random.choice(options)
    else:
        # Exploitation!
        # For each option, calculate the expected reward, then choose
        # probabilistically, weighting each option by its expectation.
        weights = [expected_reward(o) for o in options]
        choice = random.choices(options, weights=weights)[0]
    display_choice(choice)
    # Do not forget to update the choice's usage count and reward.
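The pseudo-code above can be made concrete with a small amount of bookkeeping. The sketch below is one possible implementation, assuming Bernoulli rewards (click / no click); the `Bandit` class and method names are illustrative, not from the original article:

```python
import random

class Bandit:
    def __init__(self, options, epsilon=0.1):
        self.epsilon = epsilon
        self.shows = {o: 0 for o in options}   # how often each option was displayed
        self.clicks = {o: 0 for o in options}  # how often it earned a reward

    def expected_reward(self, option):
        # Laplace smoothing keeps weights positive, so unseen or
        # zero-click options still get a chance to be drawn.
        return (self.clicks[option] + 1) / (self.shows[option] + 1)

    def choose(self):
        options = list(self.shows)
        if random.random() < self.epsilon:
            # Exploration: a uniformly random option.
            choice = random.choice(options)
        else:
            # Exploitation: a draw weighted by expected reward.
            weights = [self.expected_reward(o) for o in options]
            choice = random.choices(options, weights=weights)[0]
        self.shows[choice] += 1
        return choice

    def record_reward(self, option):
        # Call this when the displayed option earns a click.
        self.clicks[option] += 1
```

The smoothing constant is a design choice: without it, an option with zero recorded clicks would have weight zero and could never recover through the exploitation branch.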

Caveats

When implementing this methodology, please also take into consideration that the same user should see the button in the same place on the website, on each visit, for the duration of the experiment. Not doing so will produce non-independent observations and invalidate your experiment.
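One common way to achieve this stickiness without storing per-user state is to derive the random draw from a hash of a stable visitor identifier. This is a sketch of the idea, not part of the original article; the identifiers and experiment name are hypothetical:

```python
import hashlib

def sticky_uniform(user_id, experiment):
    # Hash the user id together with the experiment name and map the
    # result to a stable float in [0, 1): the same user always gets
    # the same value for the same experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 16 ** 8

def sticky_weighted_choice(user_id, experiment, rewards):
    # Use the stable uniform value against the cumulative reward
    # weights instead of a fresh random draw on every visit.
    u = sticky_uniform(user_id, experiment) * sum(rewards.values())
    cumulative = 0.0
    for option, weight in rewards.items():
        cumulative += weight
        if u < cumulative:
            return option
    return option  # fallback for rounding at the upper boundary
```

Note that if the measured rewards shift between a user's visits, the hash-derived value can land in a different bucket; in production you would typically persist the first assignment for the duration of the experiment.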
