Multi-Armed Bandits: Exploitation versus Exploration (DRL series part 5)

Luiza Santos
3 min read · Jun 15, 2024

This article covers the end of section 2.1 of chapter 2 (Multi-Armed Bandits) of the book Reinforcement Learning: An Introduction by Richard Sutton and Andrew Barto (pages 26–27).

Talk about a packed chapter...

In the previous article, I wrote about the basic definition of the multi-armed bandit problem and how we define the value function. Today, we’re discussing a big topic: exploitation vs. exploration.

Where’s the map?

When to exploit and when to explore has been debated everywhere from classrooms to LinkedIn posts.

To understand this problem, you first need to understand what a greedy action is.

We’ve talked about the value function. But in case you don’t remember: the value of an action is the expected reward you get from taking that action in a given state.
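In the book’s notation for the bandit case, the true value of an action a is the expected reward given that a is selected:

    q_*(a) = \mathbb{E}[ R_t \mid A_t = a ]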

Imagine we knew exactly the value of every single action we could take in every state (for the multi-armed bandit, there is only the one state). A greedy action is then to choose the action with the highest value.

Imagine I told you you have three options: you could buy stock A, B, or C. I can tell you with certainty that stock A will yield a 10% return, stock B a 5% return, and stock C a 7% return. If you want to take the greedy action, you buy stock A, since it gives the highest return.
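In code, a greedy action is just an argmax over the value estimates. Here is a minimal sketch in Python (the numbers are the toy stock returns from above, not real data):

    import numpy as np

    # Estimated values of the three actions: buy stock A, B, or C
    # (toy numbers from the example above).
    value_estimates = np.array([0.10, 0.05, 0.07])

    # The greedy action is the one with the highest estimated value.
    greedy_action = int(np.argmax(value_estimates))
    print("Greedy action: stock", "ABC"[greedy_action])  # stock A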

That’s a no-brainer, right? Then what’s the problem? Well, we don’t know for sure the value of every single action. We only have estimates of what we think the true values are. So if we take the greedy action but our estimates are wrong, guess what? We lose.

That’s where the concept of exploitation versus exploration lives.

Exploitation is always choosing the greedy action, while exploration is choosing a non-greedy action. For example, suppose you’re trying to get home as quickly as possible and you have four possible paths. You have beliefs about which path is the best: the fastest and most efficient way home. But just in case you’re wrong, every once in a while you explore and take a path you believe will take longer, purely to gather more information. Maybe you’re right, maybe you’re wrong, but you’ll never know unless you try it. Of course, I’m simplifying the situation a lot so you can see the difference between exploitation and exploration.

If you only explore, you never act on what you’ve learned and keep taking actions you believe are worse. And if you only exploit, you could be missing out on another action that yields a higher return. But if you balance the two, the hope is a bigger reward in the long run.

The principle is this: we need to balance exploration and exploitation so we can get the biggest reward in the long run. We choose to possibly take a hit at one time step so that, in the future, we’re collecting the highest rewards.
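As a small preview of what’s coming (the book develops this properly in the next section), one simple way to strike that balance is epsilon-greedy: explore with a small probability epsilon, exploit otherwise. A minimal sketch, reusing the toy stock estimates from above:

    import numpy as np

    rng = np.random.default_rng(0)

    def epsilon_greedy(value_estimates, epsilon=0.1):
        # With probability epsilon, explore: pick a uniformly random action.
        if rng.random() < epsilon:
            return int(rng.integers(len(value_estimates)))
        # Otherwise, exploit: pick the greedy action.
        return int(np.argmax(value_estimates))

    value_estimates = np.array([0.10, 0.05, 0.07])
    picks = [epsilon_greedy(value_estimates) for _ in range(1000)]
    # Roughly 93% of picks are the greedy action (index 0);
    # the rest are exploratory.

In the path-home example, this means taking the route you currently believe is fastest about nine times out of ten, and deliberately trying another one the rest of the time.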

The challenge: what is the right balance? Not sure yet, but stick with me and we may find out.

We shall not cease from exploration, and the end of all our exploring will be to arrive where we started and know the place for the first time.

T. S. Eliot

Next: Multi-Armed Bandits: Action-value Methods and the 10-armed testbed (DRL series part 6)
