Multi-armed Bandit Algorithms

Enozeren · Published in Getir · 5 min read · Mar 5, 2023
Slot Machine (Single-armed Bandit)

The multi-armed bandit problem is a classic example of the exploration-exploitation dilemma. Even though we see slot machines (single-armed bandits) in casinos, the algorithms mentioned in this article are not useful for a gambler, since she would be kicked out of the casino if someone realized what she was doing 🙂

Before explaining what multi-armed bandit algorithms are, let me first explain what they are used for. These days, A/B testing is a crucial part of decision-making for most tech companies. In an A/B test, at least 2 variations of a feature/design/algorithm are compared and one of them is chosen. In my previous Medium article, I described a methodology for A/B testing. However, sometimes the conditions for an A/B test are not feasible for a business.

With A/B testing:

  • You allocate the same number of samples to each variant
  • You expect a statistical result backed by a statistical metric (the p-value, for example)
  • So you explore!

Keep in mind that in an A/B test there will inevitably be one variant that performs worse than another, but this is what you want! You want a big difference in your results so you can establish a clear statistical conclusion. Now imagine you have an e-commerce app and you run campaigns for short periods of time, such as Black Friday. You want to change your app icon, and you have 3 different, interesting designs for it. Since you can show these designs to customers for only one day, you cannot show the 3 icons to 3 groups and then decide on the best-performing one afterwards. You do not have TIME! You want an algorithm that shows all 3 of them at the beginning but quickly and automatically converges on the best-performing icon and shows it to everyone. This is exactly where you would use multi-armed bandit algorithms.

Therefore, with Multi-armed bandit algorithms:

  • You explore a little, but exploit the best variant most of the time
  • You optimize your revenue/conversions (or whatever your metrics are)
  • You don't need a statistical result

After understanding why you would want to use a multi-armed bandit algorithm, let’s see how they work!

Greedy Algorithm — A naive approach

To understand the following algorithms better, we can start with a naive and simple approach. As in the example above, let's say you have 3 variants to choose from. The easiest approach would be to try each of them once and then stick with the best-performing one. However, this approach permanently penalizes a variant that happens to perform badly in the first steps. For example, if you show app icon version 1 to a customer who does not click it, while another customer sees version 2 and does click it, your algorithm will never try version 1 again, since it has 0 conversions!

This algorithm can be seen as an epsilon-greedy algorithm with epsilon = 0.

Pseudo Code — Greedy (Naive) Algorithm
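The naive greedy loop above can be sketched in Python roughly as follows. The function name, the Bernoulli click simulation, and the example conversion rates are my own illustrative assumptions, not taken from the article's repository:

```python
import random

def run_greedy(true_rates, n_rounds, seed=0):
    """Naive greedy: try each variant once, then always show the one
    with the highest observed conversion rate."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [0] * k      # how many times each variant was shown
    rewards = [0.0] * k   # total conversions per variant
    for t in range(n_rounds):
        if t < k:
            choice = t  # initial pass: show each variant once
        else:
            means = [rewards[i] / counts[i] for i in range(k)]
            choice = means.index(max(means))  # pure exploitation
        # simulate a customer click as a Bernoulli draw
        reward = 1 if rng.random() < true_rates[choice] else 0
        counts[choice] += 1
        rewards[choice] += reward
    return counts
```

Note how a variant that gets unlucky in its single trial can end up with a mean of 0 and never be shown again, which is exactly the flaw described above.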

Epsilon-Greedy Algorithm

In this algorithm we set an epsilon value, let's say 10%. This means we show the version with the highest average metric value 90% of the time and a random version 10% of the time. This gives the other versions a chance to shine.

Github: Epsilon-Greedy Code

Pseudo Code — Epsilon Greedy Algorithm
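A minimal Python sketch of the epsilon-greedy idea might look like this (the helper name and simulated conversion rates are illustrative assumptions, not the repository's code):

```python
import random

def run_epsilon_greedy(true_rates, n_rounds, epsilon=0.1, seed=0):
    """Epsilon-greedy: exploit the best-looking variant most of the time,
    but show a random variant with probability epsilon."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [0] * k
    rewards = [0.0] * k
    for _ in range(n_rounds):
        if rng.random() < epsilon or all(c == 0 for c in counts):
            choice = rng.randrange(k)  # explore: pick any variant
        else:
            means = [rewards[i] / counts[i] if counts[i] else 0.0
                     for i in range(k)]
            choice = means.index(max(means))  # exploit the current best
        reward = 1 if rng.random() < true_rates[choice] else 0
        counts[choice] += 1
        rewards[choice] += reward
    return counts
```

With epsilon = 0.1, roughly 10% of impressions go to random variants, so even a variant that started badly keeps collecting data.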

Optimistic Initial Values Algorithm

With this approach, all versions start with high (optimistic, even impossible) metric values. Every version therefore looks attractive at first, and a version's estimate can only fall toward its true value as it is actually tried, so the algorithm is forced to explore each version before settling on one.

Github: Optimistic Initial Values Code

Pseudo Code — Optimistic Initial Values Algorithm
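A rough Python sketch of the optimistic-initial-values trick, under the same illustrative assumptions as before (Bernoulli clicks, made-up rates; the initial value of 5.0 is an arbitrary "impossible" choice for a conversion metric bounded by 1):

```python
import random

def run_optimistic(true_rates, n_rounds, initial=5.0, seed=0):
    """Start every variant with an impossibly high estimated mean,
    then always pick the current maximum; reality drags it down."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [1] * k           # one pseudo-observation per variant
    means = [initial] * k      # optimistic starting estimates
    for _ in range(n_rounds):
        choice = means.index(max(means))  # greedy on (optimistic) means
        reward = 1 if rng.random() < true_rates[choice] else 0
        counts[choice] += 1
        # incremental mean update pulls the estimate toward the data
        means[choice] += (reward - means[choice]) / counts[choice]
    return counts, means
```

Because every untried variant still carries its inflated estimate, the greedy rule itself guarantees exploration with no epsilon parameter needed.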

Upper Confidence Bound (UCB1) Algorithm

This algorithm adds an upper-confidence-bound term to the average performance of each version. This bonus gets smaller the more a bandit is played, so heavily played versions are judged mainly on their average, while rarely played versions still get a chance to be explored.

Github: UCB1 Code

Pseudo Code — UCB1 Algorithm
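The standard UCB1 selection rule (mean plus sqrt(2 ln t / n_i)) can be sketched in Python as follows; the simulation setup and function name are my own illustrative assumptions:

```python
import math
import random

def run_ucb1(true_rates, n_rounds, seed=0):
    """UCB1: pick the variant maximizing mean + exploration bonus.
    The bonus sqrt(2 ln t / n_i) shrinks as variant i is played more."""
    rng = random.Random(seed)
    k = len(true_rates)
    counts = [0] * k
    rewards = [0.0] * k
    for t in range(1, n_rounds + 1):
        if t <= k:
            choice = t - 1  # play each variant once so counts > 0
        else:
            ucb = [rewards[i] / counts[i]
                   + math.sqrt(2 * math.log(t) / counts[i])
                   for i in range(k)]
            choice = ucb.index(max(ucb))
        reward = 1 if rng.random() < true_rates[choice] else 0
        counts[choice] += 1
        rewards[choice] += reward
    return counts
```

A variant that has been shown only a few times keeps a large bonus, so it cannot be starved the way the naive greedy approach starves unlucky variants.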

Bayesian (Thompson Sampling) Algorithm 👑

(This method is my favorite one)

This algorithm uses Bayesian statistics: the distribution of the metric value for each version is modeled explicitly. At first, the distribution parameters are initialized identically for all versions, and then they are updated with the data from the experiment using conjugate prior & posterior formulas.

Github: Bayesian Code

Pseudo Code — Bayesian Algorithm
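For a conversion metric, the conjugate prior/posterior pair is Beta-Bernoulli, so Thompson sampling can be sketched like this (the uniform Beta(1, 1) prior, function name, and simulated rates are illustrative assumptions on my part):

```python
import random

def run_thompson(true_rates, n_rounds, seed=0):
    """Thompson sampling for conversions: keep a Beta posterior per
    variant, sample from each, and show the variant with the top draw."""
    rng = random.Random(seed)
    k = len(true_rates)
    alpha = [1] * k  # successes + 1 (Beta(1, 1) = uniform prior)
    beta = [1] * k   # failures + 1
    for _ in range(n_rounds):
        # sample a plausible conversion rate from each posterior
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        choice = samples.index(max(samples))
        if rng.random() < true_rates[choice]:
            alpha[choice] += 1  # conversion: update success count
        else:
            beta[choice] += 1   # no conversion: update failure count
    return alpha, beta
```

Exploration here is automatic: variants with wide posteriors occasionally produce the top sample, while a confidently strong variant wins most draws.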

This is a strong method with good interpretability. However, for metrics other than conversion the update formulas are different, and they must be implemented with the correct distribution for the metric's mean in mind.

Conclusion

All of these algorithms (excluding the naive one, of course 🙂) are useful tools for the use case described in the introduction. In this article I covered them only at the pseudo-code level. However, if you are curious and want to implement them, check out my GitHub repository for these algorithms. Some simulation results can be found in the repo as well.
