Deep Double Q-Learning — Why you should use it

Ameet Deshpande
7 min read · May 26, 2018


Given the scale of the problems being tackled today, it is the exception rather than the norm not to use Artificial Neural Networks (ANNs) when training Reinforcement Learning (RL) agents. Large state spaces require generalization, something ANNs are naturally good at. But their instability means that every possible optimization and improvement should be used to help the agent learn better. Double Learning is one such technique for improving the performance of the agent. This story requires the reader to have a basic idea of Q-Learning, one of the most famous TD (Temporal Difference) methods in RL; here is one tutorial for the same. Most of the material discussed here has been influenced by Sutton and Barto’s book and the research paper by Van Hasselt et al. from DeepMind.

1. Maximization Bias in Q-Learning

Let’s understand some problems that occur with Q-Learning. Maximization Bias is a technical way of saying that the Q-Learning algorithm overestimates the value function estimates (V) and the action-value estimates (Q). Consider the following example to see why this happens.

Say the agent is in state A. If it takes the action “right”, it goes to a terminal state and gets a reward of 0; if it takes the action “left”, it goes to a state B from which there are a large number of actions, all with rewards sampled from 𝓝(-0.1, 1), a normal distribution with mean -0.1 and variance 1. In expectation, it is beneficial to take the action “right”, because it gets a higher expected reward of 0 (0 > -0.1). But the variance in the rewards causes something unexpected.

The update for Q-Learning is the following.
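
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]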

It involves a max over all the actions in the next state. In expectation, all the actions from B give a reward of -0.1. But we know from statistics that the sample mean reaches the true expected value only asymptotically. Given the large variance in the rewards, it is quite possible that the first few estimates of some actions come out positive (or far more negative than the truth). The following scenario is perfectly viable.
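
For instance (purely illustrative numbers): suppose two of B’s actions have each been sampled twice. Action 1’s rewards come out as -0.9 and -0.5 (sample mean -0.7), while Action 2’s come out as +0.4 and +0.8 (sample mean +0.6). Both true values are -0.1, yet Action 2 currently looks like a solidly positive choice.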

This toy example illustrates it with just 2 trials (far fewer than in practice), but it is easy to see that the problem dies down only asymptotically. So when the update for state A is performed, the max over the actions of B is taken, as illustrated in the following equation.
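
Q(A, \text{left}) \leftarrow Q(A, \text{left}) + \alpha \left[ 0 + \gamma \max_a Q(B, a) - Q(A, \text{left}) \right]

(In this example, as in Sutton and Barto’s version of it, the immediate reward for the A → B transition is 0.)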

The max over the actions in B means that the overestimated value of Action 2 will appear there, because it has the highest estimate now. The equation will look something like this (taking the average of the samples as the estimate).
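
Continuing the illustrative numbers above (and taking the discount factor γ as 1 for simplicity):

\max_a Q(B, a) = Q(B, \text{Action 2}) \approx +0.6

so the target for Q(A, \text{left}) is roughly +0.6, even though the true value of taking “left” is -0.1.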

This example shows that the max that is being taken over all the actions will almost always lead to some overestimation.
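
To make this concrete, here is a minimal simulation sketch (the number of actions, the sample counts, and the seed are arbitrary choices, not taken from the paper) showing that the max over noisy sample means is biased upwards even though every true mean is -0.1:

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 10      # actions available from state B (arbitrary)
n_samples = 2       # samples per action, as in the toy example
n_runs = 10_000     # repetitions used to estimate the bias

max_estimates = []
for _ in range(n_runs):
    # Each action's reward is drawn from N(-0.1, 1); estimate it by its sample mean.
    rewards = rng.normal(loc=-0.1, scale=1.0, size=(n_actions, n_samples))
    q_estimates = rewards.mean(axis=1)
    # Q-Learning's target takes the max over these noisy estimates.
    max_estimates.append(q_estimates.max())

print("True value of the best action    : -0.1")
print("Average of the max over estimates:", np.mean(max_estimates))  # clearly > -0.1
```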

2. The problem and the solution

The problem with Q-Learning is that the same samples are used to decide which action is the best (highest expected reward), and those very samples are also used to estimate that action’s value. The following break-up of the Q-Learning update rule will make the above statement clearer.
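
The max in the Q-Learning target can be rewritten to make the two roles explicit:

\max_a Q(S_{t+1}, a) = Q\big(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a)\big)

The same estimate Q both selects the greedy action and evaluates it. Double Q-Learning instead keeps two estimates, Q_1 and Q_2, and uses the target

R_{t+1} + \gamma \, Q_2\big(S_{t+1}, \operatorname{argmax}_a Q_1(S_{t+1}, a)\big)

when updating Q_1.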

What is happening here is that if an action’s value was overestimated, it will be chosen as the best action, and its overestimated value will be used as the target. Instead, we can maintain two different action-value estimates, as in the above equation: the “best” action is chosen based on the values of the first action-value estimate, and the target is computed using the second action-value estimate. How do we train, though?

Divide the samples into two halves. For the first half, we use the above equation to train, and for the second half we swap the “1” with the “2” in the equation. In practice, data keeps streaming into the agent, so a coin is flipped to decide whether the above equation or the swapped one should be used. How does this help?

Since the samples are stochastic, it is less likely that both the halves of the samples are overestimating the same action. The above method separates two crucial parts of the algorithm, choosing the best action and using the estimate of that action, and hence works well.
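
As a concrete sketch, the tabular coin-flip version of this procedure might look as follows in Python (the tables, the action set, and the `done` flag are assumed for illustration):

```python
import random
from collections import defaultdict

def double_q_update(Q1, Q2, s, a, r, s_next, done, actions, alpha=0.1, gamma=0.99):
    """One tabular Double Q-Learning update; a coin flip decides the roles."""
    if random.random() < 0.5:
        selector, evaluator = Q1, Q2   # Q1 selects the next action, Q2 evaluates it
    else:
        selector, evaluator = Q2, Q1   # roles swapped

    if done:
        target = r
    else:
        # Select the greedy next action with one estimate...
        best_next = max(actions, key=lambda a_: selector[(s_next, a_)])
        # ...but evaluate it with the other estimate.
        target = r + gamma * evaluator[(s_next, best_next)]

    selector[(s, a)] += alpha * (target - selector[(s, a)])

# Usage sketch: Q1, Q2 = defaultdict(float), defaultdict(float);
# behave ϵ-greedily with respect to Q1 + Q2 and call double_q_update on each transition.
```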

3. Reasons for Over-Estimation

Previous work on finding the reasons for overestimation attributed it mainly to two different aspects:

  • Insufficiently flexible function approximation
  • Noise or Stochasticity (in rewards and/or environment)

The main contribution of Van Hasselt et al. was to show that overestimation can happen in settings like Atari 2600 with ANNs as well, where the function approximators are flexible (Universal Approximation Theorem) and stochasticity is minimal (moves and outcomes are almost deterministic).

It is worthwhile to note that overestimating ALL the action-values uniformly does not change the policy learned, but when the overestimation is not controlled and differs across states, the learned policy will differ from the optimal one.

4. A few comments

Though the solution discussed so far does combat the overestimation problem, the number of samples used to learn each value estimate, Q_1 and Q_2, is cut in half. The less data, the worse the approximation. The DDQN paper uses a slightly different solution, which is discussed in the next section.

Also, why settle for just two different action-value estimates? We could use an n-sided die, keep n different action-value estimates, and choose the selector and the evaluator by rolling the die. There is no reason in principle not to do that, but the number of samples each estimate is trained on shrinks by a factor of n. I have not seen anyone use n > 2 in practice.

Van Hasselt et al. in their paper show that Deep Double Q-Learning not only improves accuracy in estimating the action-values but also improves the policy learned. As highlighted earlier, more accurate action-value estimates do not imply better policies. But by achieving the then state-of-the-art results on Atari 2600, they proved this point.

5. Deep Double Q-Learning

In this section, we discuss a few details of Van Hasselt et al.’s paper. Instead of using a second action-value estimate Q_2, they use the target network as a substitute. The target they use is the following.
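
Y_t^{\text{DoubleDQN}} \equiv R_{t+1} + \gamma \, Q\big(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta_t); \, \theta_t^-\big)

Here \theta_t are the weights of the online network (used to select the action) and \theta_t^- are the weights of the target network (used to evaluate it).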

If the reader is not familiar with target networks, which were introduced by DeepMind in their DQN paper, I would refer them to another Medium article. For completeness’ sake, had target networks not been used, the update would have been the following.
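
Y_t^{\text{DoubleQ}} \equiv R_{t+1} + \gamma \, Q\big(S_{t+1}, \operatorname{argmax}_a Q(S_{t+1}, a; \theta_t); \, \theta_t'\big)

where \theta_t' is a second, separately trained set of weights playing the role of Q_2.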

Though this is not completely theoretically grounded, the advantage is that one does not need to change much in existing DQN code. It is literally a change of just one line in the pseudo-code. No extra parameters are introduced, and unlike the issue discussed in the first paragraph of the previous section (about sample efficiency), the parameters are trained on all the data points. And well, when someone gets state-of-the-art results on a heavily experimented benchmark like Atari, you just stand and watch in amazement.
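
To show how small the change is, here is a sketch in PyTorch-style Python (names such as `online_net`, `target_net`, and the batch tensors are assumptions for illustration, not code from the paper):

```python
import torch

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates."""
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
    return rewards + gamma * (1.0 - dones) * next_q

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN target: the online network selects, the target network evaluates."""
    with torch.no_grad():
        next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # evaluation
    return rewards + gamma * (1.0 - dones) * next_q
```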

A few remarks

  • Overestimation is not the same as optimism in the face of uncertainty, where the action-value estimates are initialized to values larger than expected in order to promote exploration. If the reader wants to read more about the same, I would refer them to Page 34 (the section titled “Optimistic Initial Values”) here.
  • Does SARSA also get affected by overestimation? There is no max anywhere in SARSA’s update rule, so it should not, right? But remember that SARSA usually follows an ϵ-greedy policy, and thus a (1 - ϵ) fraction of the time the action with the best estimate is chosen. Thus, overestimated values will affect which action is chosen. So SARSA would benefit from Double Learning as well!

PS: This is my first story here. Feedback of any kind will be appreciated!
