A brief overview of the Multi-Armed Bandit in Reinforcement Learning

Shivam Mohan
Published in Analytics Vidhya
3 min read · Mar 23, 2023

Introduction

Multi-armed bandits are one of the fundamental problems in reinforcement learning. The problem can be described as follows: there are multiple slot machines (arms), each with a different reward distribution, and the objective is to find the machine with the highest expected reward by playing them sequentially. The problem appears in many real-world applications, such as advertising, healthcare, and finance, where an agent needs to choose the best option out of a set of alternatives, and it has been widely studied in the literature.
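To make the setup concrete, here is a minimal Python sketch of a stochastic bandit environment. The class name, the Bernoulli reward model, and the arm probabilities are illustrative assumptions for this article, not part of any standard library.

```python
import numpy as np

class BernoulliBandit:
    """A stochastic multi-armed bandit whose arms pay 1 with fixed, hidden probabilities."""
    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs)          # true success probabilities, unknown to the agent
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        # Return a Bernoulli (0/1) reward for the chosen arm.
        return float(self.rng.random() < self.probs[arm])

# Hypothetical example: three arms with different hidden success rates.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(1)
```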


Types of Multi-Armed Bandits

There are two main types of multi-armed bandits: stochastic and adversarial. In stochastic multi-armed bandits, the rewards of the arms are generated from fixed probability distributions that are unknown to the agent. In adversarial multi-armed bandits, the rewards of the arms are chosen by an adversary, who tries to make the problem as hard as possible for the agent. The adversary can be either deterministic or randomized.

Algorithms for Multi-Armed Bandits

There are many algorithms for solving the multi-armed bandit problem, each with its own strengths and weaknesses. The best-known are epsilon-greedy, UCB1, and Thompson Sampling.

Epsilon-Greedy: Epsilon-greedy is a simple algorithm that selects the arm with the highest estimated reward with probability 1-epsilon and a random arm with probability epsilon. The value of epsilon is chosen to balance exploration and exploitation. If epsilon is set to zero, the algorithm always exploits its current estimates, which can lock it into a suboptimal arm when those estimates are inaccurate. If epsilon is set to one, the algorithm always explores at random and never exploits what it has learned.
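A minimal epsilon-greedy loop might look like the following sketch, assuming a bandit object with a pull(arm) method like the BernoulliBandit above; the function name and parameters are illustrative, not a standard API.

```python
import numpy as np

def epsilon_greedy(bandit, n_arms, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy on a bandit exposing pull(arm) -> reward (an assumed interface)."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_arms)        # times each arm has been played
    values = np.zeros(n_arms)        # running mean reward of each arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:           # explore with probability epsilon
            arm = int(rng.integers(n_arms))
        else:                                # otherwise exploit the current best estimate
            arm = int(np.argmax(values))
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean update
        total += reward
    return values, total

# Hypothetical usage with the BernoulliBandit sketched earlier:
# values, total = epsilon_greedy(bandit, n_arms=3, epsilon=0.1)
```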

UCB: UCB (Upper Confidence Bound) is an algorithm that balances exploration and exploitation by selecting the arm with the highest upper confidence bound estimate. The UCB estimate consists of two terms: the estimated reward and an exploration bonus. The bonus is proportional to the square root of the logarithm of the total number of plays divided by the number of times that arm has been played, so rarely played arms get a larger bonus. UCB has strong theoretical guarantees and performs well in stochastic environments.
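Here is a sketch of UCB1 under the same assumed pull(arm) interface; the exploration bonus sqrt(2 ln t / n) is the classic UCB1 form, while the surrounding code and parameter names are illustrative.

```python
import numpy as np

def ucb1(bandit, n_arms, steps=1000):
    """UCB1: play each arm once, then pick the arm maximising mean + sqrt(2 ln t / n)."""
    counts = np.zeros(n_arms)
    values = np.zeros(n_arms)
    for arm in range(n_arms):                 # initialise by playing every arm once
        values[arm] = bandit.pull(arm)
        counts[arm] = 1
    for t in range(n_arms, steps):
        bonus = np.sqrt(2 * np.log(t + 1) / counts)   # per-arm exploration bonus
        arm = int(np.argmax(values + bonus))
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return values, counts
```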

Thompson Sampling: Thompson Sampling is a Bayesian algorithm that maintains a posterior distribution over the reward parameters of each arm and updates it after each play. At every step it samples a plausible mean reward from each arm's posterior and plays the arm with the highest sample. Thompson Sampling performs well in stochastic environments, but its behaviour in adversarial environments is less well understood.
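A sketch of Thompson Sampling for Bernoulli (0/1) rewards follows, using Beta(1, 1) priors; the choice of prior and the pull(arm) interface are assumptions made for this example.

```python
import numpy as np

def thompson_sampling(bandit, n_arms, steps=1000, seed=0):
    """Thompson Sampling with Beta(1, 1) priors, assuming Bernoulli (0/1) rewards."""
    rng = np.random.default_rng(seed)
    alpha = np.ones(n_arms)   # 1 + number of observed successes per arm
    beta = np.ones(n_arms)    # 1 + number of observed failures per arm
    for _ in range(steps):
        samples = rng.beta(alpha, beta)    # one plausible mean reward per arm
        arm = int(np.argmax(samples))      # play the arm whose sample is highest
        reward = bandit.pull(arm)
        alpha[arm] += reward               # Beta is conjugate to the Bernoulli likelihood
        beta[arm] += 1 - reward
    return alpha / (alpha + beta)          # posterior mean estimate for each arm
```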

Extensions of Multi-Armed Bandits

There are many extensions of the multi-armed bandit problem, such as contextual bandits, collaborative filtering, and dynamic pricing. In contextual bandits, the reward depends not only on the arm played but also on a context vector observed by the agent before each decision. In collaborative filtering, the agent has to recommend items to users based on their preferences. In dynamic pricing, the agent has to set the price of a product to maximize revenue.
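As an illustration of the contextual case, here is a minimal sketch of LinUCB, one well-known contextual bandit algorithm; the class interface (choose/update) and parameter names are assumptions for this example, not a standard API.

```python
import numpy as np

class LinUCB:
    """Minimal LinUCB sketch: a per-arm linear reward model plus an exploration bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward-weighted context sums

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                            # ridge-regression estimate for this arm
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)                    # update only the arm that was played
        self.b[arm] += reward * x
```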

Conclusion

Multi-armed bandits are an important problem in reinforcement learning with many real-world applications. There are many algorithms for solving the problem, each with its own strengths and weaknesses, and the right choice depends on the specific problem and the properties of the reward distribution. There are also many extensions of the multi-armed bandit problem that make it more challenging and closer to real-world applications. The study of multi-armed bandits is an active area of research and is likely to remain so.

For a more in-depth treatment of reinforcement learning concepts, please read the excellent book Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto, which has also been the primary source of inspiration for this article.
