Exploring Multi-Armed Bandit Problem: Epsilon-Greedy, Epsilon-Decreasing, UCB, and Thompson Sampling

Yuki Minai
11 min read · Nov 20, 2023


Introduction

The multi-armed bandit problem is a classic dilemma in the realm of sequential decision-making under uncertainty. It finds applications in a wide array of domains, from recommendation systems and clinical trials to online advertising and resource allocation. At its core, the problem revolves around efficiently allocating resources (such as time, money, or clicks) to a set of options, referred to as “arms,” in order to maximize cumulative rewards.

What is the Multi-Armed Bandit Problem?

The multi-armed bandit problem embodies several key characteristics:

Stateless:
In the context of this problem, we operate in a stateless environment. This means that each decision made at any given time does not depend on prior decisions or states. It’s as if we are pulling levers on a row of slot machines (the “bandits”) independently.

No Change of Environment:
Unlike some decision-making scenarios where the environment can change or adapt over time, the multi-armed bandit problem (in general) assumes a static environment. The probabilities of receiving rewards from each arm do not change during the course of the problem. Thus, the challenge lies in discovering the arm that consistently yields the highest average reward.

Objective:
The primary objective in the multi-armed bandit problem is to identify and exploit the arm that, on average, produces the largest reward. This objective necessitates a careful balance between exploring the arms to gather information and exploiting the current best-known arm to maximize rewards.

Two Kinds of Environments Explored in the Context of the Multi-Armed Bandit Problem

There are two fundamental types of reward environments that can be addressed as a multi-armed bandit problem.

Discrete Reward:
In the discrete reward environment, each arm provides rewards from a finite set of possibilities. This discrete nature simplifies the problem to some extent but still presents interesting challenges in optimizing decision-making. An example of this type of reward is a click on an online advertisement: the reward is 1 if the user clicks the ad and 0 if they do not.

Continuous Reward:
The continuous reward environment extends the problem to situations where arms yield rewards from a continuous range of values. This adds complexity, as decisions now involve real-valued rewards and demand more sophisticated strategies. An example of this type of reward is a user’s level of engagement or satisfaction with content suggested by a recommendation system: this continuous measurement is high when the recommended content matches the user’s interests and increases engagement.

In this blog, we delve into the multi-armed bandit problem by focusing primarily on the discrete reward environment, which is a fundamental and commonly encountered scenario.

Algorithms We Will Learn Through This Blog

To tackle the multi-armed bandit problem effectively, we will learn several well-established algorithms, each with its own approach to balancing exploration and exploitation. These algorithms include:

Epsilon Greedy Algorithm:
The epsilon-greedy algorithm is a straightforward approach that balances exploration (randomly choosing an arm) and exploitation (choosing the arm with the highest estimated reward). It introduces an exploration parameter, epsilon (ε), which governs the trade-off between these two aspects.

Epsilon Decreasing:
Building upon the epsilon-greedy algorithm, epsilon-decreasing gradually reduces the exploration rate over time. This strategy allows for more exploration in the initial stages and transitions toward greater exploitation as the algorithm gathers more information.

Upper Confidence Bound (UCB):
The UCB algorithm calculates an upper confidence bound for each arm’s reward estimate, choosing the arm with the highest upper bound. This technique leverages uncertainty in reward estimates to guide exploration while favoring arms that are likely to be optimal.

Thompson Sampling:
Thompson Sampling is a probabilistic algorithm that models each arm’s reward distribution. It samples from these distributions to make decisions, naturally balancing exploration and exploitation by adapting to the observed rewards.

Throughout this blog, our primary focus will be on these algorithms applied to the discrete reward environment. By the end, you will have a comprehensive understanding of how to approach and address this intriguing problem in the context of discrete rewards. So let’s start!

First, we import packages.
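For the sketches in this post, a minimal set of imports might look like the following (NumPy and Matplotlib are assumed; the seed is an arbitrary choice for reproducibility, not part of the original code):

```python
import numpy as np
import matplotlib.pyplot as plt

# A single random generator shared across the sketches below
rng = np.random.default_rng(0)
```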

Prepare a discrete reward multi-armed bandit environment

To begin, we set up a discrete reward multi-armed bandit environment. In this environment, we create ten arms, each with a mean reward probability that is uniformly sampled from the range [0, 1]. Each arm yields a reward of 1 with a probability equal to its mean reward probability and a reward of 0 with a probability of 1 minus the mean probability.
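A minimal environment sketch consistent with this description might look like the following; the class and method names are illustrative rather than the original notebook’s code:

```python
class BernoulliBandit:
    """K-armed bandit where each arm pays 1 with a fixed, unknown probability."""

    def __init__(self, n_arms=10, rng=None):
        self.rng = rng if rng is not None else np.random.default_rng()
        self.n_arms = n_arms
        # Mean reward probability of each arm, sampled uniformly from [0, 1]
        self.probs = self.rng.uniform(0.0, 1.0, size=n_arms)
        self.best_prob = self.probs.max()

    def pull(self, arm):
        # Reward of 1 with probability probs[arm], 0 otherwise
        return int(self.rng.random() < self.probs[arm])


env = BernoulliBandit(n_arms=10, rng=rng)
```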

The above plot visualizes the mean reward probability of each arm. If we keep choosing arm1, we will receive a reward in 89% of the trials, which is the best performance we can achieve.

If we randomly select an arm on each trial, what will be the probability of obtaining a reward? This will serve as our baseline performance, against which we can later compare the other algorithms. Let’s implement this baseline, along with utility functions for running simulations.
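One possible sketch of this baseline and a simulation helper, reusing the environment above (the names and the helper’s interface are illustrative assumptions, not the original code):

```python
class RandomAgent:
    """Baseline: pick an arm uniformly at random on every trial."""

    def __init__(self, n_arms, rng=None):
        self.n_arms = n_arms
        self.rng = rng if rng is not None else np.random.default_rng()

    def step(self, env):
        arm = int(self.rng.integers(self.n_arms))  # pure random exploration
        reward = env.pull(arm)
        return arm, reward


def run_episodes(make_agent, env, n_episodes=100, n_trials=1000):
    """Average reward at each time step across independent episodes."""
    rewards = np.zeros((n_episodes, n_trials))
    for ep in range(n_episodes):
        agent = make_agent()  # fresh agent (fresh estimates) for each episode
        for t in range(n_trials):
            _, rewards[ep, t] = agent.step(env)
    return rewards.mean(axis=0)


avg_random = run_episodes(lambda: RandomAgent(env.n_arms, rng=rng), env)
print("Random exploration, mean reward:", avg_random.mean())
```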

The “step” function within this class implements a random exploration algorithm. Now, we will conduct 1000 trials across 100 episodes and assess the algorithm’s average performance.

The average reward probability of random exploration is about 0.44, which is the mean reward probability across all arms. In the following sections, we will explore different algorithms for solving the multi-armed bandit problem and check that their average reward probability exceeds this baseline.

Epsilon Greedy Algorithm: Choose the greedy option with probability 1-epsilon and explore randomly with probability epsilon

The multi-armed bandit problem poses a challenge: how can we maximize our rewards when faced with multiple options, each with an unknown probability of success? One straightforward approach is to try each option once and then focus on those that seem most promising based on our initial observations. For example, if we receive rewards [0, 0, 0, 0, 0, 1, 1, 1, 1, 1] from arms 0 to 9 in our first attempts, it might seem logical to randomly select one of the last 5 arms for subsequent trials.

However, there’s a catch. By doing this, we risk ignoring the first 5 arms, which yielded no rewards in the initial round. Their estimated reward probabilities remain at zero, while the last 5 arms have higher estimates due to their early successes. In such cases, we may miss out on arms with potentially higher rewards, such as arm1 in our example.

The Epsilon Greedy Algorithm offers a solution by introducing a hyperparameter called epsilon (ε). This parameter balances exploitation (choosing the arm with the highest expected reward) and exploration (trying different arms to discover their potential).

Let’s dive into implementing the Epsilon Greedy Algorithm.
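A minimal sketch of the Epsilon Greedy agent under the same interface (again, names are illustrative):

```python
class EpsilonGreedyAgent:
    def __init__(self, n_arms, epsilon=0.1, rng=None):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = rng if rng is not None else np.random.default_rng()
        self.counts = np.zeros(n_arms)    # times each arm has been pulled
        self.q_values = np.zeros(n_arms)  # running mean reward of each arm

    def select_arm(self):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.n_arms))  # explore
        return int(np.argmax(self.q_values))            # exploit

    def step(self, env):
        arm = self.select_arm()
        reward = env.pull(arm)
        # Incremental update of the sample-mean estimate for this arm
        self.counts[arm] += 1
        self.q_values[arm] += (reward - self.q_values[arm]) / self.counts[arm]
        return arm, reward


avg_eg = run_episodes(lambda: EpsilonGreedyAgent(env.n_arms, epsilon=0.1, rng=rng), env)
```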

As the number of time steps increases, we observe a gradual rise in the average reward, which eventually stabilizes around the highest achievable reward value. This suggests that the Epsilon Greedy algorithm effectively identifies the best-performing arm among the ten available options. This trend is clearly evident in the right panel, where the count of explored trials is maximized for the arm with the highest reward probability (i.e. arm1). Furthermore, as we see in the middle plot, the average reward over time significantly outperforms that of the random algorithm, averaging around 0.44.

In this simulation, we set the epsilon value to 0.1. However, choosing the optimal hyperparameter value for epsilon is crucial. To address this, we will conduct simulations using various epsilon values to find the most effective choice.
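One way to run such a sweep with the helper sketched earlier (the epsilon grid below is an illustrative choice):

```python
for eps in (0.0, 0.01, 0.1, 0.3, 1.0):
    avg = run_episodes(lambda: EpsilonGreedyAgent(env.n_arms, epsilon=eps, rng=rng), env)
    plt.plot(avg, label=f"epsilon={eps}")
plt.axhline(env.best_prob, ls="--", color="k", label="best arm")
plt.xlabel("Trial"); plt.ylabel("Average reward"); plt.legend(); plt.show()
```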

In fact, an epsilon value of 0.1 yielded the highest average reward, both at each time step and averaged over time steps. In the context of the Epsilon Greedy algorithm, an epsilon of 1 corresponds to pure random exploration, while an epsilon of 0 represents a purely greedy approach. Exploring different epsilon values lets us determine the optimal balance between exploitation and exploration.

Another noteworthy observation is that there remains a gap in the average reward between the best-performing Epsilon Greedy approach and the best arm reward. This discrepancy arises because the Epsilon Greedy algorithm continues to explore with a probability of epsilon, even after identifying the optimal arm. Ideally, we would prefer higher epsilon values at the beginning to promote exploration and gradually reduce epsilon’s value to emphasize exploitation and maximize our rewards. The Epsilon Decreasing algorithm, which we will learn next, offers a solution to address this issue.

Epsilon Decreasing Algorithm: Gradually reducing exploration

The Epsilon Decreasing algorithm introduces an additional hyperparameter, alpha (α), which takes a value within the range 0 < α < 1.

α is used to scale epsilon at each time step, gradually reducing its value over time. This approach controls a balance between exploration and exploitation, emphasizing exploration early on and favoring exploitation as we learn more.

Let’s see the implementation of this algorithm.
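A minimal sketch, building on the Epsilon Greedy agent above; the per-trial decay of epsilon is the only change:

```python
class EpsilonDecreasingAgent(EpsilonGreedyAgent):
    def __init__(self, n_arms, epsilon=0.5, alpha=0.99, rng=None):
        super().__init__(n_arms, epsilon=epsilon, rng=rng)
        self.alpha = alpha  # decay factor applied to epsilon after every trial

    def step(self, env):
        arm, reward = super().step(env)
        self.epsilon *= self.alpha  # gradually shift from exploration to exploitation
        return arm, reward
```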

Now, let’s explore the behavior of the Epsilon Decreasing algorithm by examining different combinations of epsilon and alpha values. As we covered, epsilon determines the initial exploration rate, while alpha dictates the rate at which epsilon decreases over time.
Let’s visualize how epsilon changes under different epsilon and alpha combinations.
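A quick way to plot the decay schedule (the epsilon and alpha values below are illustrative):

```python
trials = np.arange(1000)
for eps in (0.1, 0.5):
    for alpha in (0.99, 0.999):
        plt.plot(eps * alpha ** trials, label=f"epsilon={eps}, alpha={alpha}")
plt.xlabel("Trial"); plt.ylabel("epsilon"); plt.legend(); plt.show()
```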

For instance, consider epsilon=0.1 and alpha=0.99 (depicted by the blue line). Here, the algorithm rapidly transitions towards almost full exploitation after approximately 300 trials because the epsilon value approaches zero. Conversely, if we start with a higher epsilon, such as 0.5, it encourages a more extended period of exploration before eventually converging into a strong exploitation strategy. People tend to use higher epsilon values with Epsilon Decreasing compared to Epsilon Greedy. This design encourages substantial exploration in the initial phases, gradually shifting towards increased exploitation as the number of trials increases.

So what is the best combination of epsilon and alpha in this setup? As we did for Epsilon Greedy, let’s run a hyperparameter search.
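A sketch of such a grid search with the helpers above (the grid values are illustrative):

```python
results = {}
for eps in (0.1, 0.3, 0.5):
    for alpha in (0.99, 0.999):
        avg = run_episodes(lambda: EpsilonDecreasingAgent(env.n_arms, eps, alpha, rng=rng), env)
        results[(eps, alpha)] = avg.mean()
best = max(results, key=results.get)
print("Best (epsilon, alpha):", best, "with mean reward", round(results[best], 3))
```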

In this case, the parameter set {epsilon=0.5 and alpha=0.99} is the best-performing configuration (although this outcome may slightly vary depending on the simulated values). As expected, the average reward achieved with this optimal parameter set outperforms that of the Epsilon Greedy algorithm (which was about 0.81).

However, it’s essential to note that we now have an additional hyperparameter to fine-tune in this algorithm. While Epsilon Decreasing outperforms Epsilon Greedy, optimizing it requires increased computational resources to determine the most suitable hyperparameter values.

Upper Confidence Bound (UCB): Explore uncertain arms more frequently

So far, we have been randomly choosing arms during exploration trials. However, is this truly an effective strategy? Ideally, we would like to explore the options where we have a higher chance of observing good results rather than wasting time on options that are likely to perform poorly.

The Upper Confidence Bound (UCB) algorithm offers a solution to this problem. Instead of relying on random exploration, UCB prioritizes options with high uncertainty, as these have the potential to yield excellent results. In UCB, there is no clear distinction between exploitation and exploration trials; arms are selected based on the following equation:
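In its standard UCB1 form, with c as an exploration weight, the selection rule is:

$$a_t = \arg\max_a \left[\, Q(a) + c \sqrt{\frac{\ln t}{N(a)}} \,\right]$$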

In this equation, the first term represents the expected mean reward Q(a), while the second term quantifies the uncertainty associated with the option. The variable t represents the total number of trials conducted so far, and N(a) represents the number of times arm a has been selected. The second term in the equation will be relatively large for an arm that has been explored fewer times compared to the total number of trials. This encourages the algorithm to prioritize the exploration of arms with higher uncertainty.

Now, let’s implement this algorithm.
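A minimal sketch under the same interface as the earlier agents (names are illustrative):

```python
class UCBAgent:
    def __init__(self, n_arms, c=2.0, rng=None):
        self.n_arms = n_arms
        self.c = c  # exploration weight
        self.rng = rng if rng is not None else np.random.default_rng()
        self.t = 0
        self.counts = np.zeros(n_arms)
        self.q_values = np.zeros(n_arms)

    def step(self, env):
        self.t += 1
        if self.t <= self.n_arms:
            arm = self.t - 1  # pull every arm once before applying the UCB rule
        else:
            bonus = self.c * np.sqrt(np.log(self.t) / self.counts)
            arm = int(np.argmax(self.q_values + bonus))
        reward = env.pull(arm)
        self.counts[arm] += 1
        self.q_values[arm] += (reward - self.q_values[arm]) / self.counts[arm]
        return arm, reward
```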

Note that in the step function, we initially pull each arm once to prevent a zero value inside the logarithm and division by zero.

The parameter ‘c’ is a hyperparameter for the UCB algorithm and serves to balance exploitation and exploration. A larger ‘c’ value leads to more exploration, while ‘c=0’ corresponds to a purely greedy algorithm.

Let’s explore the best hyperparameter ‘c’.

With the best parameter, we observe a further improvement over Epsilon Decreasing!

Thompson Sampling: Use a Bayesian framework to guide exploitation and exploration

Lastly, I will introduce an algorithm that uses a Bayesian framework: Thompson Sampling. Thompson Sampling balances the exploration of uncertain options and the exploitation of known high-reward options using the Bayesian posterior. It updates a probability distribution over each arm’s reward probability (the posterior) and uses this distribution to make decisions.

Here’s a high-level overview of how Thompson Sampling works:

1. Initialization: Initially, the algorithm assigns a reward probability distribution (often a Beta distribution) to each arm. The distribution represents the uncertainty or belief about the true underlying reward probability of each arm.

2. Arm Selection: For each time step, the algorithm samples a value from each reward probability distribution. These sampled values represent hypothetical reward probabilities for each arm.

3. Exploration vs. Exploitation: Thompson Sampling then selects the arm with the highest sampled value, thus favoring actions with a higher probability of yielding a high reward. However, since the sampled values are probabilistic, Thompson Sampling naturally explores other options as well.

4. Updating Probability Distributions: After observing the actual reward of the chosen arm, Thompson Sampling updates the probability distribution for that arm. This update incorporates the observed reward and adjusts the belief about the arm’s true reward probability.

5. Repeat: The algorithm repeats steps 2–4 for a predefined number of rounds or until a stopping criterion is met.

Let’s implement and test this algorithm.
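A minimal sketch using Beta posteriors, under the same interface as the earlier agents (names and defaults are illustrative assumptions):

```python
class ThompsonSamplingAgent:
    def __init__(self, n_arms, prior_alpha=None, prior_beta=None, rng=None):
        self.n_arms = n_arms
        self.rng = rng if rng is not None else np.random.default_rng()
        # Beta(alpha, beta) posterior per arm; Beta(1, 1) is a uniform prior
        self.alpha = np.ones(n_arms) if prior_alpha is None else np.array(prior_alpha, dtype=float)
        self.beta = np.ones(n_arms) if prior_beta is None else np.array(prior_beta, dtype=float)

    def step(self, env):
        # Sample a hypothetical reward probability for each arm from its posterior
        samples = self.rng.beta(self.alpha, self.beta)
        arm = int(np.argmax(samples))
        reward = env.pull(arm)
        # Conjugate update: a success increments alpha, a failure increments beta
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward
        return arm, reward


avg_ts = run_episodes(lambda: ThompsonSamplingAgent(env.n_arms, rng=rng), env)
```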

One advantage of Thompson Sampling is its ability to incorporate prior knowledge. If we provide a well-informed prior, the algorithm can learn the optimal action more efficiently. Let’s examine how the choice of prior influences the algorithm’s performance. I will compare three scenarios (a sketch of one way to set them up follows the list):

  1. Uniform prior
  2. Good prior with many successful experiences with arm1 (i.e. the best arm)
  3. Poor prior with limited successful experiences with arm1
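One illustrative way to encode these three scenarios as Beta pseudo-counts (the specific counts are arbitrary; arm1 is the best arm in this simulation):

```python
n_arms = env.n_arms
priors = {
    "prior1 (uniform)": (np.ones(n_arms), np.ones(n_arms)),
    "prior2 (good)":    (np.ones(n_arms), np.ones(n_arms)),
    "prior3 (poor)":    (np.ones(n_arms), np.ones(n_arms)),
}
priors["prior2 (good)"][0][1] = 50  # many imagined successes on arm1
priors["prior3 (poor)"][1][1] = 50  # many imagined failures on arm1

for label, (a, b) in priors.items():
    avg = run_episodes(lambda: ThompsonSamplingAgent(n_arms, a, b, rng=rng), env)
    plt.plot(avg, label=label)
plt.xlabel("Trial"); plt.ylabel("Average reward"); plt.legend(); plt.show()
```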

A good prior (prior2) helps the algorithm focus on the best arm from the beginning, leading to optimal performance. In contrast, a poor prior (prior3) struggles to correct its initial misconceptions, taking longer to discover promising arms. This outcome underscores the significance of prior knowledge. However, it’s worth noting that even with a uniform prior, the algorithm performs as well as the other algorithms introduced thus far. Thus, Thompson Sampling is a useful algorithm for balancing exploitation and exploration even in the absence of prior knowledge, and if prior knowledge is available, its performance can be further enhanced.

Reference

  • Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.
  • Russo, D. J., Van Roy, B., Kazerouni, A., Osband, I., & Wen, Z. (2018). A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1), 1–96.

Code
