Multi-armed Bandits: an alternative to A/B testing

How to accelerate the results of an A/B test and avoid unnecessary costs?

Bruno Eidi Nishimoto
6 min read · Jul 20, 2021

The Portuguese version of this article is available at Multi-armed Bandits: uma alternativa para testes A/B

Photo by Carl Raw on Unsplash

A/B testing is a practice widely used in marketing to compare two or more versions of a product (advertisement, design, landing page, etc.) and find the one with the best performance, that is, the one whose conversion rate is superior to the others'. In the case of online ads, for example, it would be the version that generates the highest click-through rate.

A/B testing illustration.

The figure above illustrates how A/B testing works: the different variants of the product are randomly assigned to users with a fixed split. Thus, part of the audience receives only version A of the product, while the other part receives only version B, for a fixed period of time. At the end of this period, statistical tests are applied to see which of the versions performs better.

However, there are some problems related to A/B testing:

  • there is a high cost associated with the “bad” versions of the product, because they continue to be offered to users until the end of the test even when their returns are poor;
  • there is also a risk of churning users, because those who received the “bad” version of the product throughout the test may have a poor experience with the product.

So, is there any alternative to A/B testing that avoids these problems? In other words, how can we reduce the costs associated with poor product versions?

Multi-armed Bandits

Animation for Multi-armed Bandits. Source: https://multithreaded.stitchfix.com/blog/2020/08/05/bandits/

Multi-armed Bandits (MaB) [1] is a specific, simpler case of the reinforcement learning problem in which you have k different options (or actions) A₁, A₂, …, Aₖ, each associated with a reward distribution R = {R₁, R₂, …, Rₖ} with means μ₁, μ₂, …, μₖ, respectively. At each step t, the agent executes an action Aₜ according to its policy and receives a numerical reward rₜ, sampled from the distribution associated with that action.

The goal is to maximize the rewards received during the interactions, that is, to find the action with the highest expected reward. In other words, the agent needs to learn the reward distribution associated with each action in order to identify the one with the highest expected return. A possible metric to assess the agent's learning over T interactions is the regret ρ, given by the expected difference between the sum of rewards of the optimal action and the sum of the rewards actually collected:

ρ = T·μ* − E[ Σₜ₌₁ᵀ rₜ ]

Regret of the agent over T interactions.

Where μ* is the mean of the reward distribution associated with the optimal action and rₜ is the reward received at instant t.
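To make the definition concrete, here is a minimal numeric sketch of the regret computation, assuming Bernoulli rewards and an agent that picks actions uniformly at random; the means and all names below are illustrative, not from the article:

```python
import numpy as np

# Toy setup: k = 3 actions with true (unknown to the agent) mean rewards.
true_means = np.array([0.2, 0.5, 0.35])
mu_star = true_means.max()          # mean reward of the optimal action

rng = np.random.default_rng(0)
T = 1000
# Suppose the agent simply picked actions uniformly at random.
actions = rng.integers(0, len(true_means), size=T)
rewards = rng.binomial(1, true_means[actions])   # Bernoulli rewards

# Regret: gap between always playing the optimal action and what was collected.
regret = T * mu_star - rewards.sum()
print(f"regret after {T} steps: {regret:.1f}")
```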

The base algorithm for MaB is simple:

1. Obtain an initial reward estimate for each of the k actions
2. For each episode t = 1, …, T:
   a. Select an action according to the policy
   b. Observe the reward for that action
   c. Update the reward estimate of the chosen action

The critical point of a MaB algorithm is the policy: it defines how the agent chooses an action given its current reward estimates. This choice depends on each policy's exploration and exploitation strategy, as in the sketch below.
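To make the base loop concrete, here is a minimal sketch in Python. It uses an epsilon-greedy policy purely as one example of a policy (specific policies are the topic of the follow-up article); the Bernoulli rewards, parameter values, and function name are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy_bandit(true_means, T=5000, epsilon=0.1, seed=0):
    """Base MaB loop with an epsilon-greedy policy (one possible choice)."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    estimates = np.zeros(k)   # step 1: initial reward estimate per action
    counts = np.zeros(k)

    for t in range(T):                           # step 2: iterate over episodes
        if rng.random() < epsilon:               # step a: policy picks an action
            a = int(rng.integers(k))             #   explore: random action
        else:
            a = int(np.argmax(estimates))        #   exploit: best estimate so far
        reward = rng.binomial(1, true_means[a])  # step b: observe the reward
        counts[a] += 1                           # step c: incremental mean update
        estimates[a] += (reward - estimates[a]) / counts[a]

    return estimates, counts

est, cnt = epsilon_greedy_bandit([0.04, 0.05, 0.07])
print("estimated conversion rates:", est.round(3))
print("pulls per action:", cnt)
```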

Exploration vs Exploitation

Imagine that you are in a new city and don’t know any of its restaurants. The first time, you go to restaurant A and like it a lot, meaning it yields a good reward. The next time, the dilemma is: should you go back to the same restaurant (exploitation), since you already know it is good, or should you visit a new restaurant (exploration), trying to find one that is possibly even better?

Ideally, more exploration should be done at the beginning, so that the agent can get to know the environment it is interacting with, and more exploitation at the end, to exploit the best action once the agent already knows the environment well, i.e., once it has a good reward estimate for each action.
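One simple, illustrative way to encode this shift is a decaying exploration rate for an epsilon-greedy-style policy; the schedule and parameter values below are arbitrary assumptions, shown only to make the idea tangible:

```python
# Decay epsilon over time so random (exploratory) actions become rare
# once the reward estimates have had time to improve.
def epsilon_at(t, eps_start=1.0, eps_min=0.01, decay=0.995):
    return max(eps_min, eps_start * decay ** t)

print(epsilon_at(0), epsilon_at(500), epsilon_at(2000))  # 1.0 -> ~0.08 -> 0.01
```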

It is precisely in this balance between exploration and exploitation that the different MaB algorithms differ. We will detail some MaB algorithms in a later article. For now, let’s see how we can apply MaB to A/B testing and what its advantages are :).

MaB and A/B testing

One application of MaB is precisely A/B testing: the different product versions are mapped to the actions, and the conversion rate of each version to the rewards. Since the true conversion rate of each version is not known a priori, MaB fits the problem of finding the optimal version well, i.e., the one that brings the greatest reward.

Because MaB is by nature iterative, its major advantage is that it can dynamically change the share of users exposed to each product version, prioritizing the versions that obtained better rewards, that is, assigning them a greater volume of users. In other words, the method lets you learn during the test instead of only at its end. This makes it possible to reduce the costs associated with the “bad” versions, as they are de-prioritized along the way.
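As an illustration of this dynamic allocation (and not a method prescribed by the article), the sketch below uses Thompson Sampling, one standard MaB policy, to split traffic between two hypothetical variants with unknown conversion rates; all rates and names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.04, 0.06]            # hypothetical conversion rates of A and B
successes = np.ones(2)               # Beta(1, 1) prior per variant
failures = np.ones(2)

for user in range(10_000):
    samples = rng.beta(successes, failures)   # sample a plausible rate for each
    variant = int(np.argmax(samples))         # show the variant that looks best
    converted = rng.random() < true_rates[variant]
    successes[variant] += converted
    failures[variant] += 1 - converted

print("users sent to each variant:", successes + failures - 2)
```

Over many users, most of the traffic ends up routed to the variant with the higher observed conversion rate, which is exactly the reallocation behaviour described above.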

Furthermore, we are able to obtain the result faster than with A/B testing, as the distribution progressively favors the variant with the highest conversion.

Comparison between A/B testing and MaB over time. Source: https://vwo.com/blog/multi-armed-bandit-algorithm/

That said, the focus of MaB is to maximize gains throughout the test period by dynamically adjusting the distribution of each product variant. This gives us the advantages already mentioned: lower costs and faster identification of the best version. The trade-off with respect to A/B testing is that it is not possible to perform statistical tests to demonstrate whether two versions are statistically distinct.

Conclusion

In this article we looked at what A/B testing is and some of its main problems, and we introduced MaB and how it can be applied as an alternative to A/B testing.

In a later article, we will present the major MaB algorithms, analyzing the strategy of each one to balance exploration and exploitation.

References

1. Slivkins, Aleksandrs. Introduction to Multi-Armed Bandits — https://arxiv.org/pdf/1904.07272.pdf

2. Gupta, Shubhankar. Multi-Armed Bandit (MAB) — A/B Testing Sans Regret — https://vwo.com/blog/multi-armed-bandit-algorithm/

3. Veleaev, Denis. How to generate a lot more leads and reduce costs? A/B Testing vs Reinforcement Learning. — https://towardsdatascience.com/reinforcement-learning-64730aaa46c9

4. Rafferty, Greg. A/B testing — Is there a better way? An exploration of multi-armed bandits — https://towardsdatascience.com/a-b-testing-is-there-a-better-way-an-exploration-of-multi-armed-bandits-98ca927b357d


Bruno Eidi Nishimoto

I graduated in Computer Engineering and work as a data scientist. My favorite topic is Reinforcement Learning. My hobbies are playing games and sports.