Deep Q-Learning, Part 2: Double Deep Q Network (Double DQN)

An introduction and implementation tutorial with Python 3 and TensorFlow

Amber
Apr 26, 2019

In the last article, we discussed basic (Deep) Q-Learning and implemented it in a quick way. However, the basic version has some drawbacks, and various Q-Learning algorithms have been developed to overcome them.

Today, we focus on one problem of basic (Deep) Q-Learning, the overestimation of action values (Q-values), and introduce one popular solution called Double Deep Q Network (Double DQN), which is Double Q-Learning with a neural network architecture. For the sake of practicality, we only discuss the core concept of Double Q-Learning; the remaining details in the paper are left to readers who are really interested. Finally, we implement Double DQN with Python 3 and TensorFlow. Here are the sections we are going to cover.

  • What is Double Q-Learning
  • Double Q-Learning Algorithm
  • Double Deep Q Network (Double DQN)
  • Implement Double Deep Q Network (Double DQN)

What is Double Q-Learning

Double Q-Learning ([1] H. van Hasselt, 2010) was proposed to solve the problem of large overestimations of action values (Q-values) in basic Q-Learning.

Briefly, the problem of overestimation is that the agent tends to choose a non-optimal action in a given state only because it has the maximum estimated Q-value.

In basic Q-Learning, the Agent's optimal policy is always to choose the best action in any given state, on the assumption that the best action is the one with the maximum expected/estimated Q-value. However, the Agent knows nothing about the environment at the beginning, so it has to estimate Q(s, a) first and update those estimates at each iteration. Such Q-values contain a lot of noise, and we are never sure whether the action with the maximum expected/estimated Q-value is really the best one.

Unfortunately, in many cases the truly best action ends up with a smaller estimated Q-value than a non-optimal one. Following the optimal policy of basic Q-Learning, the Agent then takes the non-optimal action in that state only because it has the maximum estimated Q-value. This problem is called the overestimation of action values (Q-values).

When this problem occurs, the noise in the estimated Q-values causes large positive biases in the update procedure. As a consequence, the learning process becomes complicated and messy.

Understanding why positive biases happen

Recall: Loss Function of basic Q-Learning
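
Written out for a single transition (s, a, r, s'), a common form of this loss is the squared temporal-difference error, where θ denotes the network parameters:

L(θ) = ( r + γ max_{a'} Q(s', a'; θ) - Q(s, a; θ) )^2

The max over noisy estimates in the target term is exactly where the positive bias enters.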

The target (the best Q(s', a')) is estimated with the current Q function, which is very noisy, so the difference between the target and the current Q(s, a) in the loss function is noisy as well and contains positive biases caused by taking the maximum over these noisy estimates. These positive biases have a tremendous impact on the update procedure.

By the way, if the noise on all Q-values were uniformly distributed, or more specifically if all Q-values were equally overestimated, then overestimation would not be a problem, since the noise would not affect the difference between Q(s', a') and Q(s, a). More details are in [1], Section 2.

Double Q-Learning Algorithm

Double Q-Learning uses two different action-value functions, Q and Q', as estimators. Even if Q and Q' are both noisy, their noises are independent of each other, so selecting an action with one estimator and evaluating it with the other removes the positive bias. In this way, the algorithm solves the problem of overestimated action values; the proof is in [1], Section 3.

The update procedure is slightly different from the basic version.

  • The Q function selects the best action a, the one with the maximum Q-value in the next state.
  • The Q' function calculates the expected Q-value for the action a selected above.
  • The Q function is updated using the expected Q-value from the Q' function.
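
Concretely, following [1], the step that updates Q for a transition (s, a, r, s') can be written as below; when Q' is chosen for updating instead, the roles of Q and Q' are swapped:

a* = argmax_{a} Q(s', a)
Q(s, a) ← Q(s, a) + α [ r + γ Q'(s', a*) - Q(s, a) ]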

Complete Pseudocode

[1] H. van Hasselt. Double Q-learning. NIPS, 2010.
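
The pseudocode in [1] is tabular; a minimal Python sketch of the same procedure might look like the following. It assumes a Gym-style environment with discrete states, and the function and variable names are illustrative:

import random
from collections import defaultdict

def double_q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Two independent tabular estimators (illustrative names QA and QB).
    QA, QB = defaultdict(float), defaultdict(float)
    actions = list(range(env.action_space.n))

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Epsilon-greedy behaviour policy based on QA + QB.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: QA[(s, x)] + QB[(s, x)])
            s_next, r, done, _ = env.step(a)
            # With probability 0.5 update QA using QB as evaluator, otherwise the reverse.
            if random.random() < 0.5:
                a_star = max(actions, key=lambda x: QA[(s_next, x)])
                QA[(s, a)] += alpha * (r + gamma * QB[(s_next, a_star)] * (not done) - QA[(s, a)])
            else:
                b_star = max(actions, key=lambda x: QB[(s_next, x)])
                QB[(s, a)] += alpha * (r + gamma * QA[(s_next, b_star)] * (not done) - QB[(s, a)])
            s = s_next
    return QA, QB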

However, the method proposed there is tabular/matrix-based, and we discussed the drawbacks of that approach in Part 1. For the same reasons, we now need to go deep.

Double Deep Q Network (Double DQN)

A Double Q-Learning implementation with deep neural networks is called a Double Deep Q Network (Double DQN).

Double DQN was proposed in [2] H. van Hasselt et al., 2016. Inspired by Double Q-Learning, Double DQN uses two different deep neural networks: the Deep Q Network (DQN) and the Target Network.

Note that there is no learning rate α in the Q-value target here, because the learning rate is applied in the optimization stage that updates the parameters of the Deep Q Network.

  • Deep Q Network — selects the best action a, the one with the maximum Q-value in the next state.
  • Target Network — calculates the estimated Q-value for the action a selected above.
  • The Q-value of the Deep Q Network is updated toward the estimated Q-value from the Target Network.
  • The parameters of the Target Network are updated from the parameters of the Deep Q Network every several iterations.
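
Putting the two networks together, the update target for a transition (s, a, r, s') can be written as follows, where θ are the Deep Q Network parameters and θ⁻ the Target Network parameters:

y = r + γ Q( s', argmax_{a'} Q(s', a'; θ); θ⁻ )

The Deep Q Network is then trained to reduce the squared error ( y - Q(s, a; θ) )^2, with the learning rate applied by the optimizer rather than in the target itself.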

Implement Double Deep Q Network (Double DQN)

A step-by-step tutorial for Double DQN implementation.

Problem Definition

We use the Gym library to create the environment and solve the CartPole-v1 problem.
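
A minimal environment setup might look like the following; it uses the classic Gym API in which env.step returns four values (newer Gym/Gymnasium releases changed this interface slightly):

import gym

env = gym.make('CartPole-v1')
state = env.reset()                            # 4-dimensional state vector
n_features = env.observation_space.shape[0]    # 4
n_actions = env.action_space.n                 # 2: push cart left or right
next_state, reward, done, info = env.step(env.action_space.sample())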

Step 1 — Experience Replay Construction

Experience Replay is a Reinforcement Learning technique that lets the Agent learn from batches of its experiences/memories.

When the Agent takes an action in a given state, it receives a reward and some information. This information can be stored in the Agent as an experience. Each experience can be represented as the following list.

[state, action, reward, next_state, done]

With the Experience Replay technique, the Agent stores only a certain number of experiences due to the memory limit. Once the Agent has accumulated enough experiences, it starts to train itself on batches sampled from them. This technique is used in many studies, such as [3] V. Mnih et al., 2015.
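
A minimal replay buffer along these lines might look like this (the class and method names are illustrative, not the article's original code):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10000):
        # Old experiences are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append([state, action, reward, next_state, done])

    def sample(self, batch_size):
        # Uniformly sample a batch of stored experiences.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)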

Step 2 — Target Network Construction

The Target Network calculates the estimated Q-value for the action selected by the Deep Q Network (DQN).

For the CartPole-v1 problem, the network architecture is described below; a minimal sketch follows the list.

  • Input Layer — 4 units, since there are 4 features in the environment's state space.
  • Output Layer — 2 units, since there are 2 actions the Agent can take, right or left.
  • Hidden Layer 1 — 250 units, for demonstration.
  • Hidden Layer 2 — 250 units, for demonstration.
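
A minimal sketch of this network with tf.keras might look like the following (the helper name build_network is illustrative; since the architectures are identical, the same builder can be reused for the DQN in Step 3):

import tensorflow as tf

def build_network(n_features=4, n_actions=2, n_hidden=250):
    # Two hidden layers of 250 units each; linear outputs give one Q-value per action.
    return tf.keras.Sequential([
        tf.keras.layers.Dense(n_hidden, activation='relu', input_shape=(n_features,)),
        tf.keras.layers.Dense(n_hidden, activation='relu'),
        tf.keras.layers.Dense(n_actions, activation='linear'),
    ])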

Step 3 — Deep Q Network (DQN) Construction

The DQN selects the best action, the one with the maximum Q-value, in a given state.

The architecture of the Q Network (QNET) is the same as that of the Target Network (TNET).

  • Input Layer — 4 units, since there are 4 features in the environment's state space.
  • Output Layer — 2 units, since there are 2 actions the Agent can take, right or left.
  • Hidden Layer 1 — 250 units, for demonstration.
  • Hidden Layer 2 — 250 units, for demonstration.

Inside QNET, we create a TNET for calculating the estimated Q-value.

We also need to update the Target Network from the parameters of the Deep Q Network every several iterations.

As described in Step 1, learning is based on batches of experiences. Therefore, we create dedicated parameters and an optimization routine for batch learning; see the method _batch_learning_model(self).
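
A minimal sketch of the QNET construction might look like this, with a TNET built inside it and a method that copies the DQN parameters into the Target Network. The class, method, and hyperparameter names are illustrative, and the sketch reuses build_network from Step 2:

import tensorflow as tf

class QNET:
    def __init__(self, n_features=4, n_actions=2, lr=1e-3):
        self.n_actions = n_actions
        # Online network (DQN), trained at every batch-learning step.
        self.model = build_network(n_features, n_actions)
        self.model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss='mse')
        # Target network (TNET), updated only by copying parameters.
        self.target_model = build_network(n_features, n_actions)
        self.update_target()

    def update_target(self):
        # Copy the DQN parameters into the Target Network.
        self.target_model.set_weights(self.model.get_weights())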

Step 3-1 — Double DQN Algorithm for batch learning

When the optimization step is executed, the parameters of QNET update automatically.
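
The batch-learning step could be sketched as follows; it assumes a QNET like the one above and numpy arrays sampled from the replay buffer (states, actions, rewards, next_states, dones):

import numpy as np

def batch_learn(qnet, states, actions, rewards, next_states, dones, gamma=0.99):
    idx = np.arange(len(states))
    # 1) The DQN selects the best next action.
    best_next = np.argmax(qnet.model.predict(next_states, verbose=0), axis=1)
    # 2) The TNET evaluates the selected action.
    q_next = qnet.target_model.predict(next_states, verbose=0)
    # 3) Build targets; only the Q-value of the action actually taken is changed.
    targets = qnet.model.predict(states, verbose=0)
    targets[idx, actions] = rewards + gamma * q_next[idx, best_next] * (1.0 - dones)
    # 4) One optimization step on the DQN; its parameters update automatically.
    qnet.model.fit(states, targets, epochs=1, verbose=0)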

Step 3-2 — Other methods

These methods are used in the Agent's training stage. The Agent controls when to update the Target Network and whether to explore the environment according to the exploration rate hyperparameter.
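
These helpers might be sketched as follows (names are illustrative): an epsilon-greedy action choice driven by the exploration rate, and a counter-based trigger for the Target Network update:

import random
import numpy as np

def choose_action(qnet, state, exploration_rate):
    # Explore with probability exploration_rate, otherwise act greedily on the DQN.
    if random.random() < exploration_rate:
        return random.randrange(qnet.n_actions)
    q_values = qnet.model.predict(state[np.newaxis, :], verbose=0)
    return int(np.argmax(q_values[0]))

def maybe_update_target(qnet, step, update_every=100):
    # Copy DQN parameters into the Target Network every `update_every` learning steps.
    if step % update_every == 0:
        qnet.update_target()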

Step 4 — Agent Construction

The Agent trains itself by using the Double Deep Q Network.
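
Tying the pieces together, a minimal Agent built on the sketches above might look like this (the exploration schedule and batch size are illustrative choices):

import numpy as np

class Agent:
    def __init__(self, n_features=4, n_actions=2):
        self.qnet = QNET(n_features, n_actions)        # Double DQN networks
        self.memory = ReplayBuffer(capacity=10000)     # experience replay
        self.exploration = 1.0                         # initial exploration rate
        self.exploration_min = 0.001
        self.exploration_decay = 0.995
        self.batch_size = 32
        self.learn_steps = 0

    def act(self, state):
        return choose_action(self.qnet, state, self.exploration)

    def remember(self, state, action, reward, next_state, done):
        self.memory.store(state, action, reward, next_state, done)

    def learn(self):
        if len(self.memory) < self.batch_size:
            return
        batch = self.memory.sample(self.batch_size)
        states, actions, rewards, next_states, dones = map(np.asarray, zip(*batch))
        batch_learn(self.qnet, states, actions.astype(int), rewards.astype(float),
                    next_states, dones.astype(float))
        self.learn_steps += 1
        maybe_update_target(self.qnet, self.learn_steps)
        # Decay the exploration rate toward its minimum.
        self.exploration = max(self.exploration_min,
                               self.exploration * self.exploration_decay)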

Step 5 — Start Training
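
A minimal training loop built on the Agent above might look like the following (the episode count and logging format are illustrative), printing progress in the same shape as the log below:

import gym
import numpy as np

env = gym.make('CartPole-v1')
agent = Agent(n_features=4, n_actions=2)
window = []

for episode in range(600):
    state, done, total_reward = env.reset(), False, 0.0
    while not done:
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        agent.remember(state, action, reward, next_state, done)
        agent.learn()
        state, total_reward = next_state, total_reward + reward
    window.append(total_reward)
    if (episode + 1) % 100 == 0:
        print('episodes: %d to %d, average_reward: %.3f, exploration: %.3f'
              % (episode - 99, episode + 1, np.mean(window), agent.exploration))
        window = []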

# See the information of the training process
episodes: 0 to 100, average_reward: 31.782, exploration: 0.598
episodes: 100 to 200, average_reward: 136.340, exploration: 0.132
episodes: 200 to 300, average_reward: 271.540, exploration: 0.012
episodes: 300 to 400, average_reward: 289.860, exploration: 0.001
episodes: 400 to 500, average_reward: 345.220, exploration: 0.001
episodes: 500 to 600, average_reward: 332.240, exploration: 0.001
...
...

The result shows that Double DQN works well on the CartPole-v1 problem. Now it's your turn.

Summary

In this article, we discussed the most important parts of Double Deep Q-Learning. In the next part, we may discuss it further or introduce another Deep Q-Learning method.

Any feedback, thoughts, comments, suggestions, or questions are welcome! Thanks.

Reference

[1] H. van Hasselt. Double Q-learning. NIPS, 2010.

[2] H. van Hasselt, A. Guez, D. Silver. Deep Reinforcement Learning with Double Q-Learning. AAAI, 2016.

[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533, 2015.
