The Multi-Armed Bandit Problem in Reinforcement Learning

Amit Ranjan
Published in Analytics Vidhya · 4 min read · Dec 5, 2020


In this article you will learn about Reinforcement Learning, the famous Multi-Armed Bandit Problem, its applications, and some strategies to solve it.

The Multi-Armed Bandit Problem is a very popular problem in Reinforcement Learning. But hey, what is Reinforcement Learning?

Reinforcement Learning is a type of machine learning based on reward and punishment: for correct predictions we reward the machine, and for incorrect predictions we punish it.
Of course we will not give food or money to a machine as a reward, but we can provide a binary 1 for correct predictions and a binary 0 for incorrect ones.

Now coming to the famous Multi-Armed Bandit Problem. If you go back to the 90s, you will find that the slot machines used in casinos had a single-armed lever.


You would place a coin inside the machine and pull the lever, the single arm you can see in the picture above. Then the three reels would roll, and if you were lucky enough you would get the reward. If you want to see how it actually works, you can also look up a YouTube video. Here is the link.

These machines were called single-armed bandits. You might be wondering: these are slot machines, right, so why are they called bandits? Wouldn't it be more proper to call them single-armed slot machines?

The reason they are called bandits is that these slot machines were built to rob you of your money. Sounds funny, right? But that is why they are called bandits.

What are Multi-Armed Bandits?
When you went to a casino back in the 90s, you would find multiple of these single-armed slot machines. Suppose there were 5 slot machines, and your challenge was to figure out how to maximize your profit over the games played across all 5 machines.

So you have to explore all of these machines and exploit the best one to maximize the profits.

Let’s discuss more on these exploration and exploitation techniques.

Exploration is playing on all 5 slot machines enough times to understand which one will maximize your profit.

Exploitation comes when you have a pretty good idea of which machine gives the maximum profit, and you play only on that machine to maximize your profit.

This is the Multi-Armed Bandit Problem!

It can be applied to many real-life scenarios, such as:

  1. Picking the restaurant where eating gives you maximum happiness.
  2. Deciding which course to take as your college major.
  3. Finding the best portfolio in investment planning.

Some strategies in Multi-Armed Bandit Problem

Suppose you have 100 nickel coins and you have to maximize your return on investment across these 5 slot machines. Assume that only one of the 5 machines will give you the maximum profit.

Strategy 1: Explore only
In this strategy you divide the 100 coins evenly, 20 for each slot machine, and play with all of them. This might seem like a wise approach, but it will not give you the maximum profit: only one of the 5 machines pays the most, and the profit from the others might be very low.
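The explore-only strategy can be sketched as a small simulation. The win probabilities below are hypothetical, just to make the example runnable; in a real casino they are hidden from the player:

```python
import random

random.seed(0)  # for a reproducible run

# Hypothetical win probabilities of the 5 slot machines (unknown to the player).
true_probs = [0.10, 0.25, 0.15, 0.30, 0.55]

def pull(machine):
    """Play one coin on a machine; return 1 on a win, 0 otherwise."""
    return 1 if random.random() < true_probs[machine] else 0

# Explore only: spend 20 of the 100 coins on each of the 5 machines.
total_reward = 0
for machine in range(5):
    for _ in range(20):
        total_reward += pull(machine)

print(total_reward)
```

Notice that 80 of the 100 coins go into machines other than the best one, which is exactly why this strategy leaves profit on the table.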

Strategy 2: Exploit only
In this strategy you try all 5 machines once, then spend the rest of your coins on the machine where you found the maximum profit. But here is the catch: suppose you win on Machine 1 on the first try, while Machine 5 would actually give the maximum profit in the long run. With this strategy you would keep investing in Machine 1 and miss the higher return you might have gotten from Machine 5.
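The exploit-only strategy looks like this in code (again with hypothetical, hidden win probabilities). A single trial pull per machine is a very noisy estimate, so the player can easily lock onto a mediocre machine:

```python
import random

random.seed(1)  # for a reproducible run

# Hypothetical win probabilities, hidden from the player.
true_probs = [0.10, 0.25, 0.15, 0.30, 0.55]

def pull(machine):
    """Play one coin on a machine; return 1 on a win, 0 otherwise."""
    return 1 if random.random() < true_probs[machine] else 0

# Exploit only: one trial coin per machine...
first_results = [pull(m) for m in range(5)]
# ...then commit to whichever machine won first (first max on ties).
best_guess = first_results.index(max(first_results))

# Spend the remaining 95 coins on that single machine.
total_reward = sum(first_results) + sum(pull(best_guess) for _ in range(95))

print(best_guess, total_reward)
```

The weakness is visible in `best_guess`: it is chosen from just one pull per machine, so an unlucky pull on the best machine, or a lucky pull on a bad one, locks in the wrong choice for all 95 remaining coins.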

Strategy 3: ε-Greedy Method
In this strategy we start by setting some low value of ε (epsilon), like 5% or 10%. With ε = 5%, at any time there is a 5% chance that we pick a slot machine at random, and a 95% chance that we play on the slot machine which has historically given us the maximum profit so far.

Let's say at the end of the 20th coin, that is on the 21st coin, we run our random number generator and find that, based on the 95% chance, we should be exploiting our current knowledge. So we look through the experience of the first 20 coins, find which machine has given us the best profit, and play on that slot machine.

Now let's say it's the 22nd coin, and this time the random number generator tells us to explore: we pick one of the 5 slot machines with equal (1/5) chance, purely to gather knowledge about the other machines, so that we become even more confident in our decision the next time we exploit.

The performance will depend on the choice of epsilon ε itself, but this is generally a better strategy than Strategy 1 and Strategy 2 because it balances exploration with exploitation.
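The ε-greedy strategy described above can be sketched as follows. The win probabilities are hypothetical, and the sketch seeds each machine with one pull first so that every machine has at least one observation before we start comparing win rates:

```python
import random

random.seed(42)  # for a reproducible run

# Hypothetical win probabilities, hidden from the player.
true_probs = [0.10, 0.25, 0.15, 0.30, 0.55]
n_machines = len(true_probs)
epsilon = 0.05  # 5% chance to explore on any given coin

def pull(machine):
    """Play one coin on a machine; return 1 on a win, 0 otherwise."""
    return 1 if random.random() < true_probs[machine] else 0

counts = [0] * n_machines   # coins spent on each machine
rewards = [0] * n_machines  # wins observed on each machine

for coin in range(100):
    if coin < n_machines:
        # Seed phase: try every machine once so each has an observed win rate.
        machine = coin
    elif random.random() < epsilon:
        # Explore: pick any of the 5 machines with equal chance.
        machine = random.randrange(n_machines)
    else:
        # Exploit: play the machine with the best observed win rate so far.
        machine = max(range(n_machines), key=lambda m: rewards[m] / counts[m])
    counts[machine] += 1
    rewards[machine] += pull(machine)

total_reward = sum(rewards)
print(counts, total_reward)
```

With a well-chosen ε, most coins end up on the machine that looks best so far, while the occasional random pull keeps correcting that estimate, which is exactly the balance Strategies 1 and 2 each lack.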

There are other strategies too, like the UCB method and Thompson Sampling. These are even better strategies, which we will see in another article.

This article was purely on the Multi-Armed Bandit Problem!