Optimistic Q-Learning

Yassine Yousfi
Sequential Learning
7 min read · May 9, 2018

Q-Learning is one of the most famous Reinforcement Learning (RL) algorithms.

In this story we will discuss an important part of the algorithm: the exploration strategy. But before that, let’s start with some introductory concepts and notation.

Reinforcement Learning (RL)

RL is a field of Machine Learning where an agent is connected to an environment via perception of states, choice of actions and receipt of rewards. At each step, the agent observes a state, then chooses and executes an action; this changes its state and produces a reward.

Markov Decision Processes (MDP)

In the most traditional setting, RL solves MDPs. Even though MDPs are at the heart of RL, we will not go into MDP theory in this story. Here is a nice introduction to MDPs.

We will solve the forest fire MDP available in the Python MDP Toolbox.

A forest is managed by two actions: ‘Wait’ and ‘Cut’. An action is decided each year with first the objective to maintain an old forest for wildlife and second to make money selling cut wood. Each year there is a probability p that a fire burns the forest (and 1-p that the forest grows).

Let’s name our MDP (S, A, P, R, γ) where:

  • S is the state space (finite): 3 age classes of trees: 0–20 years, 21–40 years, more than 40 years
  • A is the action space (finite): ‘Wait’ or ‘Cut’
  • P and R are the transition and reward matrices, whose closed forms can easily be expressed
  • γ is the discount factor representing the difference in importance between long term and short term rewards
  • A policy π is a stationary distribution over actions given the current state (Markovian property)

The goal is to find an optimal policy π* of the MDP without any knowledge of the MDP’s dynamics.

Note that if we had such knowledge, algorithms such as Value Iteration would be able to find an optimal policy.

The optimal policy here, which is to wait until the forest reaches its oldest age class and then cut it, is quite intuitive, since we made the reward for cutting the forest in its oldest state 5 times bigger than the growth reward (r1=10, r2=50).
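As a quick illustration, here is a minimal sketch using the toolbox’s forest example and Value Iteration; the fire probability p = 0.1 and the discount γ = 0.96 are our own choices, only r1 = 10 and r2 = 50 come from the setup above.

```python
import mdptoolbox.example
import mdptoolbox.mdp

# Forest management MDP: 3 age classes, actions 0 = 'Wait', 1 = 'Cut'.
# r1: reward for waiting in the oldest state, r2: reward for cutting it.
P, R = mdptoolbox.example.forest(S=3, r1=10, r2=50, p=0.1)

# With full knowledge of P and R, Value Iteration finds the optimal policy.
vi = mdptoolbox.mdp.ValueIteration(P, R, 0.96)
vi.run()
print(vi.policy)  # (0, 0, 1): wait, wait, then cut in the oldest age class
```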

Q-Learning

“Q” in Q-Learning stands for the Quality function of the policy π, which maps any state-action pair (s, a) to the expected accumulated discounted future reward obtained by selecting a when observing s and following π afterwards.

Q-Learning is said to be “model-free”: it does not try to model the dynamics of the MDP, it directly estimates the Q-value of each action in each state. The policy can then be derived by choosing the action with the highest Q-value in each state.

If the agent visits all state-action pairs an infinite number of times, Q-learning converges to the optimal Q-function [1].

Again, we will not get into all the details of Q-Learning. If you are not familiar with it, here is a video by the awesome Siraj Raval explaining it.

And here is our Python implementation of Q-Learning.

Note that the learning rate (alpha) used here is polynomial with ω = 4/5 following the results of [3].

The exploration strategy used here (ε-greedy) will be detailed later.
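The following is a minimal sketch of what such an implementation can look like on the forest matrices (the function name, the number of steps, ε and γ are our own choices; only ω = 4/5 follows [3]):

```python
import numpy as np
import mdptoolbox.example

def q_learning(P, R, gamma=0.96, n_steps=20000, eps=0.1, omega=0.8, seed=0):
    """Tabular Q-Learning with eps-greedy exploration and a polynomial learning rate."""
    rng = np.random.default_rng(seed)
    n_actions, n_states = P.shape[0], P.shape[1]
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))  # N(s, a), used for the learning rate
    s = 0
    for _ in range(n_steps):
        # eps-greedy: random action with probability eps, greedy action otherwise
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        # sample the next state and the reward from the MDP matrices
        s_next = rng.choice(n_states, p=P[a, s])
        r = R[s, a]
        # polynomial learning rate alpha = 1 / N(s, a)^omega with omega = 4/5 [3]
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a] ** omega
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next
    return Q

P, R = mdptoolbox.example.forest(S=3, r1=10, r2=50, p=0.1)
Q = q_learning(P, R)
print(Q.argmax(axis=1))  # greedy policy per state, 0 = 'Wait', 1 = 'Cut'
```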

Exploration/Exploitation trade-off

Sequential Learning algorithms involve a fundamental choice:

  • Exploitation: Make the best decision given current information
  • Exploration: Gather more information by making other decisions

Successfully balancing exploration and exploitation has a great impact on the agent’s learning performance. Too much exploration prevents the agent from maximizing short-term reward, because the selected exploratory actions may yield bad rewards from the environment. On the other hand, exploiting the environment with incomplete knowledge prevents the agent from maximizing long-term reward, since the selected exploitation actions may remain suboptimal.

ε-greedy exploration strategy

This strategy consists of choosing a random action with probability ε, and the greedy action (highest Q-value) with probability 1 − ε, at each step.

It is probably the simplest and most widely used exploration strategy. Exploration is tuned by ε. In many implementations, ε is set to decay over time, but in many cases a constant ε works well in practice.
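A minimal sketch of ε-greedy action selection, with one common (but by no means canonical) decay schedule:

```python
import numpy as np

def eps_greedy(Q_row, eps, rng):
    """Random action with probability eps, greedy action otherwise."""
    if rng.random() < eps:
        return int(rng.integers(len(Q_row)))
    return int(np.argmax(Q_row))

def decayed_eps(t, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Exponential decay: eps_t = max(eps_min, eps_start * decay**t)."""
    return max(eps_min, eps_start * decay ** t)
```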

“Optimism in the Face of Uncertainty” exploration strategies

This concept was first introduced to solve the Stochastic Multi-Armed Bandit (SMAB) problem, a very old decision-making setting where a gambler has to decide which machine to play at each step in order to maximize the expected cumulative reward given by the machines.

The gambler faces an exploration/exploitation trade-off: exploiting the machine (or arm) with the highest mean reward so far (Follow the Leader), or exploring other machines that have not been played enough, hoping to get higher rewards.

This SMAB formulation is very similar to the exploration problem in Q-Learning:

  • Exploitation: Choose the action with the highest Q-value in a given state
  • Exploration: Try other actions (those not visited enough, or not visited at all).

Optimism in the Face of Uncertainty (OFU) states: whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.

The intuition behind OFU is that:

  • If we are in the best possible world: OFU chooses the best arm (no regret)
  • If we are not in the best possible world: the uncertainty is reduced (optimally)

One of the most famous OFU algorithms is the Upper Confidence Bound algorithm (UCB) [2]. And here is how we adapted it to Q-Learning. Let:

  • Q(s, a): The (scaled) Q value of action a at state s
  • N(t, s, a): The number of times action a has been chosen at state s up to time t

The agent’s goal is to play Argmax {Q(s, a) : a ∈ A}, i.e. to choose the action with the maximum Q-value at state s. But the real Q(s, a) is unknown at time t:

We have: Q(s,a) = <Q(t, s, a)> + (Q(s,a) − <Q(t, s, a)>), with <Q(t, s, a)> being the estimated Q value at time t.

(Q(s,a) − <Q(t, s, a)>) is the error term, which we can upper bound in order to play OFU.

Hoeffding’s inequality is one way to bound this error. In fact, we can show that with high probability (w.h.p.), as t grows:

Q(s, a) ≤ <Q(t, s, a)> + β √(log(t) / N(t, s, a))

The optimistic strategy can then be written as: Argmax {Q+(t, s, a) : a ∈ A}, where

Q+(t, s, a) = <Q(t, s, a)> + β √(log(t) / N(t, s, a))

and β ≥ 0 tunes the exploration. When β = 0, the strategy only exploits past estimates (Follow the Leader strategy).

This bound is the most used in the field. Much other work has been done on refining such bounds (UCB-V, UCB*, KL-UCB, Bayes-UCB, BESA [4], etc.).

Here is our implementation of the classical UCB exploration strategy and the results of its use in Q-Learning.
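A minimal sketch of this UCB-style action selection, which can replace the ε-greedy choice in the Q-Learning loop sketched above (β and the handling of unvisited actions are our own choices):

```python
import numpy as np

def ucb_action(Q_row, N_row, t, beta=2.0):
    """Choose Argmax {Q(s, a) + beta * sqrt(log(t) / N(t, s, a)) : a in A}.

    Unvisited actions get an infinite bonus so each one is tried at least once.
    """
    bonus = np.where(
        N_row > 0,
        beta * np.sqrt(np.log(max(t, 2)) / np.maximum(N_row, 1)),
        np.inf,
    )
    return int(np.argmax(Q_row + bonus))
```

In the Q-Learning sketch above, this amounts to replacing the ε-greedy block with a = ucb_action(Q[s], visits[s], t, beta), where t is the current step.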

UCB exploration seems to reach high rewards very rapidly; however, the training process is disturbed by early exploration. For more complex MDPs, this early exploration can be beneficial, as it helps the agent get out of suboptimal solutions.

Let’s compare the two strategies more closely.

Conclusion and future work

Q-Learning is one of the most used algorithms in Reinforcement Learning. In this story we discussed the importance of exploration strategies and the use of the UCB exploration strategy instead of the well-known ε-greedy exploration. More refined optimistic strategies may be used in Q-Learning in order to better balance exploration and exploitation.

In our future story we will solve more complex problems using Deep Reinforcement Learning techniques, stay tuned!

References

[1] T. Jaakkola, M. I. Jordan, and S. P. Singh, “On the convergence of stochastic iterative dynamic programming algorithms”, Neural Computation, vol. 6, no. 6, pp. 1185–1201, 1994.

[2] P. Auer, “Using Confidence Bounds for Exploitation-Exploration Trade-offs”, Journal of Machine Learning Research, vol. 3, pp. 397–422, 2002.

[3] E. Even-Dar and Y. Mansour, “Learning Rates for Q-learning”, Journal of Machine Learning Research, vol. 5, pp. 1–25, 2003.

[4] A. Baransi, O.-A. Maillard (also our RL professor), and S. Mannor, “Sub-sampling for multi-armed bandits”, Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 115–131, 2014.
