Exploitation and Exploration in RL

Swasti Khurana
Published in Clique Community
Jun 25, 2020 · 3 min read

To understand these two terms let us take an example:

A restaurant's management can decide to carry on with its existing menu (all dishes remain the same), make only slight changes to some recipes to add flavor, and analyze customer behavior. Or the restaurant can decide to add or remove some dishes and then analyze customer behavior. This is precisely what exploitation and exploration are: sticking to existing options or trying out some new ones.

In the example above, the restaurant management, including the chefs, is the agent; customer satisfaction is the reward; and the menu the agent settles on is its chosen behavior (action).

So one question that might arise here is: why explore?

For supervised and unsupervised learning we have a defined dataset (input) that the model uses to train itself. Reinforcement learning has no dataset that defines right and wrong; in fact, it has no dataset at all. It evaluates the actions taken rather than instructing by giving the correct action. This creates a need to explore actions to find the best possible one.

[Figure: a visual representation of the choice between exploring and exploiting]

Let us take an example of a problem where you have n distinct options (actions) to choose from, i.e. an action space of size n, and this choice is presented repeatedly. Each of the n actions has an action value: the expected reward for selecting that action.

At = action selected at time step t (one of the n options)

Rt = the corresponding reward

q∗(a) = value of an arbitrary action a, defined as the expected reward given that a is selected: q∗(a) = E[Rt | At = a]

The goal is to bring the estimated value of action a, call it Qt(a), close to the actual value q∗(a).
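
To make the goal concrete, here is a minimal sketch in Python (the names and structure are illustrative, not from the article) of the classic sample-average estimate: Qt(a) is simply the mean of the rewards received so far when action a was selected.

```python
import numpy as np

n = 10                      # number of available actions (illustrative)
reward_sums = np.zeros(n)   # running sum of rewards observed for each action
counts = np.zeros(n)        # how many times each action has been selected

def update_estimate(action, reward):
    """Record one observed reward and return the updated sample-average Q(action)."""
    counts[action] += 1
    reward_sums[action] += reward
    return reward_sums[action] / counts[action]

# Example: action 3 is tried twice and yields rewards 1.0 and 0.0,
# so its estimated value settles at 0.5.
print(update_estimate(3, 1.0))  # 1.0
print(update_estimate(3, 0.0))  # 0.5
```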

The action values are not known with certainty, but we may have estimates of them. According to those estimates, there is at least one action in the action space whose estimated value is greatest. This is the greedy action. A higher estimate usually means the action has been chosen before and we have some knowledge about it. Choosing this action is exploiting our current knowledge of the action values.

If instead we select a non-greedy action, we are exploring. This allows us to improve our estimates of the non-greedy actions' values.
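
As a small illustration (the estimates below are made up), exploiting picks the action with the highest current estimate, while exploring deliberately tries something else:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = np.array([0.2, 1.5, -0.3, 0.9])           # made-up current estimates for 4 actions

greedy_action = int(np.argmax(Q))             # exploit: action with the highest estimate
exploring_action = int(rng.integers(len(Q)))  # explore: any action, to refine its estimate

print("exploit:", greedy_action, "| explore:", exploring_action)
```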

Choosing between the two of them

Striking the right balance between exploitation and exploration is the key. On a single step, exploitation might be the right thing to do, but for a greater total reward in the long run, exploration might produce better results (at least one of the other actions may well be better than the greedy one). During exploration the reward is lower in the short run but higher in the long run: after better actions have been discovered, they can be exploited many times. Getting the balance right can be tricky and depends on various factors, including the number of steps remaining and how much exploration and exploitation has already been done, to name a few.

Some algorithms that can be used to balance the two are:

  • Greedy algorithm
  • ϵ-greedy algorithm (a small sketch follows this list)
  • Upper Confidence Bound (UCB)
  • Thompson Sampling
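
Below is a minimal, self-contained sketch of the ϵ-greedy idea on an n-armed bandit. The reward distributions and parameter values here are made up purely for illustration: with probability ϵ the agent explores a random action, otherwise it exploits the greedy one, and the estimates are updated as incremental sample averages.

```python
import numpy as np

rng = np.random.default_rng(42)
n, steps, epsilon = 10, 1000, 0.1
true_values = rng.normal(0.0, 1.0, size=n)   # hidden q*(a) for each arm (unknown to the agent)
Q = np.zeros(n)                              # estimated action values
counts = np.zeros(n)                         # times each action has been selected

total_reward = 0.0
for t in range(steps):
    if rng.random() < epsilon:
        action = int(rng.integers(n))        # explore: pick any action at random
    else:
        action = int(np.argmax(Q))           # exploit: pick the greedy action
    reward = rng.normal(true_values[action], 1.0)
    counts[action] += 1
    Q[action] += (reward - Q[action]) / counts[action]   # incremental sample average
    total_reward += reward

print(f"average reward over {steps} steps: {total_reward / steps:.3f}")
print(f"best arm by estimate: {np.argmax(Q)}, true best arm: {np.argmax(true_values)}")
```

A larger ϵ means more exploration (better estimates, lower short-run reward); ϵ = 0 reduces this to the purely greedy algorithm.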

For algorithms that balance the two, refer to:

Intuition: Exploration vs Exploitation

Balancing Exploration and Exploitation in Agent Learning

References:

Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
