Double DQN Implementation to Solve OpenAI Gym’s CartPole-v0

Leo Simmons
Feb 5, 2019


This post describes a reinforcement learning agent that solves the OpenAI Gym environment CartPole-v0. The agent is based on a family of RL agents developed by DeepMind known as DQNs, which gave rise to an area of reinforcement learning called Deep Q Learning. This post covers the original structure of the DQN agent, the additional enhancements used in this agent, the code implementation, and the results of testing the implementation in the CartPole-v0 environment. It is assumed the reader has a basic familiarity with reinforcement learning. The implementation here was written using several other implementations as guides, which are listed at the bottom of the post.

Agent Structure

The paper introducing the original DQN described it as an agent designed to “learn successful policies directly from high-dimensional sensory inputs using end-to-end reinforcement learning” and showed that it performed much better than any previous RL agent at playing a set of classic Atari games [1]. The agent implemented here largely follows the structure of the original DQN introduced in that paper, but is closer to what is known as a Double DQN, an enhanced version introduced nine months later that uses its internal neural networks slightly differently in the learning process [2].

Technically the agent implemented here isn’t really a DQN, since its networks only contain a few layers and so are not actually that “deep”, but the distinguishing ideas of Deep Q Learning are still used. The agent also uses an additional enhancement to how its two neural networks work with each other, known as “soft target network updates”, which was introduced in another Deep Q Learning paper [3].

Q Learning

DQN agents develop their policies with Q learning, a reinforcement learning technique centered on the idea of Q values and a Q function: for every pair of a specific state of the environment and a specific action the agent can take, there is a Q value representing the value, or utility, of taking that action in that state. The Q value is calculated by the Q function, which has the general form:

Q = f(s, a), where s = state, a = action, Q = Q value

In the case of CartPole, the state is represented by 4 values — cart position, cart velocity, pole angle, and the velocity of the tip of the pole — and the agent can take one of two actions at every step — moving left or moving right.
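For reference, this state and action space can be sketched through the Gym API as follows. This is a minimal example assuming the classic gym interface in use around the time of this post (newer gym/gymnasium versions return a (state, info) tuple from reset()):

```python
import gym

env = gym.make('CartPole-v0')

# In the gym API of the time, reset() returns just the 4-value state;
# newer gym/gymnasium versions return a (state, info) tuple instead.
state = env.reset()

print(env.observation_space.shape)  # (4,) -> cart position, cart velocity, pole angle, pole tip velocity
print(env.action_space.n)           # 2   -> action 0 pushes the cart left, action 1 pushes it right
```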

Deep Q Learning

In Deep Q Learning, the agent uses two neural networks to learn and predict what action to take at every step. One network, referred to as the Q network or the online network, is used to predict what to do when the agent encounters a new state. It takes in the state as input and outputs Q values for the possible actions that could be taken. In the agent described here, the online network takes in a vector of four values (the state of the CartPole environment) as input and outputs a vector of two Q values: one for the value of moving left in the current state and one for the value of moving right in the current state. The agent chooses the action with the higher corresponding Q value output by the online network.
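A small network of this shape could be sketched in Keras roughly as follows; the layer sizes, learning rate, and the helper name build_network are illustrative assumptions rather than the exact values used in the implementation:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_network(learning_rate=0.001):
    """Small fully connected network mapping a 4-value state to 2 Q values."""
    model = Sequential([
        Dense(24, activation='relu', input_shape=(4,)),  # hidden layers -- sizes are illustrative
        Dense(24, activation='relu'),
        Dense(2, activation='linear'),                   # one Q value per action (left, right)
    ])
    model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))
    return model

# The online and target networks are two separate instances of the same architecture
online_network = build_network()
target_network = build_network()
target_network.set_weights(online_network.get_weights())  # start them off identical
```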

At every step, the online network is used in this way to choose the agent’s action; a sketch of such a method is shown below, and the exact code can be seen in the full implementation linked at the end of the post.
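The sketch assumes the agent stores its Keras model as self.online_network and receives states as 1-D NumPy arrays of length four (hypothetical names, not details taken from the post):

```python
import numpy as np

def choose_action(self, state):
    """Return the index of the action with the highest predicted Q value."""
    # The online network outputs one Q value per action: [Q(left), Q(right)]
    q_values = self.online_network.predict(state[np.newaxis, :], verbose=0)[0]
    # Implementations typically mix in exploration here (e.g. epsilon-greedy);
    # the pure exploitation step is simply the argmax over the Q values.
    return int(np.argmax(q_values))
```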

Experience Replay

DQN agents learn and improve themselves through a method called experience replay, which is where the second network, called the target network comes into play. In order to carry out experience replay the agent “remembers” each step of the game after it happens, each memory consisting of the action taken, the state that was in place when the action was taken, the reward given from taking the action, and the state that resulted from taking the action. These memories are also known as experiences. After each step is taken in the episode, the experience replay procedure is carried out (after enough memories have been stored) and consists of the following steps:

1) a random set of experiences (called a minibatch) is retrieved from the agent’s memory

2) for each experience in the minibatch, a new Q value is calculated for its state and action pair. If the action ended the episode, the new Q value is negative (bad); if the action did not end the game (i.e. it kept the agent alive for at least one more step), the new Q value is positive and is predicted by what is called a Bellman equation (described nicely here). The general form of the Bellman equation used by the agent implemented here is:

value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]

3) the online network is fit to associate each state in the minibatch with the new Q values calculated for the actions taken

A sketch of the experience replay method is shown below; again, the exact code is in the full implementation linked at the end of the post.
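The sketch assumes memories are stored as (state, action, reward, next_state, done) tuples in self.memory, with a minibatch size self.batch_size, a discount factor self.gamma, and Keras models self.online_network and self.target_network; all of these names are assumptions. It also assumes the negative value for episode-ending actions is applied to the reward when the experience is stored:

```python
import random
import numpy as np

def experience_replay(self):
    """Learn from a random minibatch of stored experiences (Double DQN update)."""
    if len(self.memory) < self.batch_size:
        return  # not enough memories stored yet

    minibatch = random.sample(self.memory, self.batch_size)
    states = np.array([experience[0] for experience in minibatch])
    next_states = np.array([experience[3] for experience in minibatch])

    # Start from the online network's current predictions and overwrite only
    # the Q value of the action that was actually taken in each experience.
    q_targets = self.online_network.predict(states, verbose=0)
    next_q_online = self.online_network.predict(next_states, verbose=0)  # SELECTS the next action
    next_q_target = self.target_network.predict(next_states, verbose=0)  # EVALUATES that action

    for i, (state, action, reward, next_state, done) in enumerate(minibatch):
        if done:
            q_targets[i][action] = reward  # terminal step: no future value to add
        else:
            best_next_action = np.argmax(next_q_online[i])
            q_targets[i][action] = reward + self.gamma * next_q_target[i][best_next_action]

    # Fit the online network to associate each state with its updated Q values
    self.online_network.fit(states, q_targets, epochs=1, verbose=0)
```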

Why are two networks needed to generate the new Q value for each action? The single online network could be used to generate the target Q value, but then each update would consist of the online network adjusting its weights to better predict values that it itself outputs. The agent would be chasing a target it defines, which can cause the network to update itself too drastically in an unproductive way. To avoid this, the Q values to update toward are taken from the output of the second, target network, which is meant to reflect the state of the online network but does not hold identical weights.

Soft Target Network Updates

The method used in the original DQN paper [1] to update the target network’s weights is to set them equal to the weights of the online network every fixed number of steps. Another established method, used in this implementation, is to incrementally update the target network weights toward the online network weights after every run of experience replay, with the following formula:

target_weights = target_weights * (1-TAU) + q_weights * TAU

where 0 < TAU < 1

This method of updating the target network is known as “soft target network updates” and was introduced in [3]. A sketch of such an update method is shown below; the exact code is in the full implementation linked at the end of the post.
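The sketch assumes Keras models stored as self.online_network and self.target_network and a blending factor self.tau (hypothetical names); it relies only on Keras’s get_weights() and set_weights() methods:

```python
def update_target_network(self):
    """Soft update: nudge the target network's weights toward the online network's."""
    online_weights = self.online_network.get_weights()
    target_weights = self.target_network.get_weights()
    # target_weights = target_weights * (1 - TAU) + q_weights * TAU, applied layer by layer
    blended = [
        target_layer * (1 - self.tau) + online_layer * self.tau
        for target_layer, online_layer in zip(target_weights, online_weights)
    ]
    self.target_network.set_weights(blended)
```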

What makes this network a Double DQN?

The Bellman equation this agent uses to calculate the Q values that update the online network is:

value = reward + discount_factor * target_network.predict(next_state)[argmax(online_network.predict(next_state))]

The Bellman equation used to calculate the Q value updates in the original (vanilla) DQN [1] is:

value = reward + discount_factor * max(target_network.predict(next_state))

The difference is that, using the terminology of the field, the second equation uses the target network for both SELECTING and EVALUATING the action to take whereas the first equation uses the online network for SELECTING the action to take and the target network for EVALUATING the action. Selection here means choosing which action to take, and evaluation means getting the projected Q value for that action. This form of the Bellman equation is what makes this agent a Double DQN and not just a DQN and was introduced in [2].
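Expressed concretely in code, the two target calculations differ only in which network selects the action. This is a sketch assuming Keras-style networks online_network and target_network, a NumPy state vector next_state, a scalar reward, and a discount factor gamma (illustrative names, not taken from the post):

```python
import numpy as np

# Q values for every action in the next state, from each network
next_q_online = online_network.predict(next_state[np.newaxis, :], verbose=0)[0]
next_q_target = target_network.predict(next_state[np.newaxis, :], verbose=0)[0]

# Double DQN: the online network SELECTS the action, the target network EVALUATES it
double_dqn_value = reward + gamma * next_q_target[np.argmax(next_q_online)]

# Vanilla DQN: the target network both SELECTS and EVALUATES the action
vanilla_dqn_value = reward + gamma * np.max(next_q_target)
```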

Summary

By running experience replay every time the agent takes an action, and updating the parameters of the online network each time, the online network begins to associate certain state/action pairs with appropriate Q values: the more promising it is to take a certain action in a certain state, the higher the Q value the model learns to predict for it, and the agent starts to survive for longer as it keeps playing the game.

“Solving” the CartPole-v0 environment is defined as achieving an average score of 195 across 100 consecutive episodes. Below is a graph showing how many episodes were played before the agent implemented here solved the environment in ten different trials. Of the ten trials, only one did not solve the environment; it was stopped after completing 2000 episodes.

The graph below displays the score of each episode in the second trial and shows the gradual improvement of the agent as it keeps playing and learning. Eventually (around episode 95) the agent starts playing CartPole well enough that each episode ends because the maximum number of steps per episode (200) is reached.

The full implementation of the agent can be seen here.

Sources

[1] https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
[2] https://arxiv.org/pdf/1509.06461.pdf
[3] https://arxiv.org/pdf/1509.02971.pdf

Consulted DQN Implementations

https://github.com/Khev/RL-practice-keras/blob/master/DDQN/write_up_for_openai.ipynb

https://gist.github.com/yingzwang/2c5b455907942c7bdf3c0fece640095b#file-deepq-cartpole-ipynb

https://towardsdatascience.com/cartpole-introduction-to-reinforcement-learning-ed0eb5b58288
