Deep Q-learning (DQN) Tutorial with CartPole-v0

Yuki Minai
Dec 15, 2023

In this series of articles, I have introduced various policy iteration algorithms for solving Markov Decision Processes (MDPs), such as Dynamic Programming, Monte Carlo methods, and TD learning. While these methods, which rely on tabular estimates of state/action values, work well in simpler settings, they face challenges in complex environments: the sheer number of states and actions makes learning with tabular estimates impractical.

One solution to this problem is to employ a function approximator to estimate values efficiently. In this article, I will introduce one such approach, known as Deep Q-learning (DQN). DQN utilizes a neural network to estimate action values (Q-values). To address the challenges that arise when a neural network is used to estimate the value function, DQN incorporates techniques such as experience replay and a target Q-network.

We will learn these crucial conceptual elements alongside the implementation. So, let's get started!

Prepare environment

In this article, we utilize the CartPole-v0 environment in Gymnasium.

Learn more about the CartPole-v0 environment

The goal of this environment is to balance a pole by applying forces in the left and right directions on the cart. It has a discrete action space:
- 0: Push cart to the left
- 1: Push cart to the right

Upon taking an action, either left or right, an agent observes a 4-dimensional state consisting of:
- Cart Position
- Cart Velocity
- Pole Angle
- Pole Angular Velocity

A reward of +1 is granted to the agent at each step while the pole is kept upright. The maximum reward an agent can earn in a single episode is 200.

The episode ends under the following conditions:
- Termination: Pole Angle is greater than ±12°
- Termination: Cart Position is greater than ±2.4 (center of the cart reaches the edge of the display)
- Truncation: Episode length exceeds 200 steps

In the code below, I provide an example of the agent randomly exploring this environment over 20 time steps.
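
A minimal sketch of this random exploration, assuming the standard Gymnasium API (the seed and the printed fields are illustrative, not necessarily the original notebook code):

```python
import gymnasium as gym

env = gym.make("CartPole-v0")
obs, info = env.reset(seed=0)

for t in range(20):
    action = env.action_space.sample()  # choose 0 (left) or 1 (right) at random
    obs, reward, terminated, truncated, info = env.step(action)
    print(f"t={t}, action={action}, reward={reward}, state={obs}")
    if terminated or truncated:  # pole fell over or the step limit was reached
        obs, info = env.reset()

env.close()
```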

Define Q-Network

Let's begin by preparing a model to estimate a value function. Here, we assume that the environment dynamics are unknown and that the agent needs to learn the optimal policy by interacting with the environment. It's important to note that the state space of CartPole-v0 is continuous, so the number of possible states is infinitely large. Therefore, applying a tabular estimation strategy, as done in Monte Carlo methods or TD learning, is not feasible. This is why we use a model to estimate the value based on the observed state.

For Deep Q-Network (DQN), we employ a deep learning model as a value function estimator. The estimated value function is referred to as the Q-value, and the neural network used for this purpose is known as the Q-network. This network takes the 4-dimensional state as input and outputs a Q-value for each of the 2 actions available to the agent. To keep it simple, we will utilize a fully connected neural network with ReLU activations, trained with the Adam optimizer.

The Xavier initialization (or Glorot initialization) is a popular technique for initializing weights in a neural network. For more information, you can check this article.
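
A minimal sketch of such a Q-network in PyTorch (the hidden layer size and learning rate are assumptions, not necessarily the values used in the original notebook):

```python
import torch
import torch.nn as nn
import torch.optim as optim

class QNetwork(nn.Module):
    """Fully connected Q-network: 4-dimensional state in, one Q-value per action out."""

    def __init__(self, state_dim=4, n_actions=2, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_actions),
        )
        # Xavier (Glorot) initialization of the linear layers
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_uniform_(m.weight)
                nn.init.zeros_(m.bias)

    def forward(self, state):
        return self.net(state)

q_network = QNetwork()
optimizer = optim.Adam(q_network.parameters(), lr=1e-3)  # assumed learning rate
```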

Define Replay Memory

Next, we introduce one of the key components of DQN, Experience Replay.

While deep learning models or other complex machine learning models for estimating a value function may seem appealing, they did not enjoy success in RL for a long time until Experience Replay was introduced. This is due to a correlation within the training dataset. In supervised learning, a crucial assumption is that all training data is independent and identically distributed (i.i.d). However, consecutive samples collected during the agent’s interactions with the environment in RL are highly correlated, breaking this assumption and causing issues in fitting a machine learning model.

Experience Replay mitigates this temporal correlation by randomly sampling experiences from a replay memory. This approach helps decorrelate training samples, promoting better convergence. By storing and reusing past experiences, Experience Replay enables the agent to learn from a diverse set of transitions, making more efficient use of the collected data. This diversity aids in better exploration and can lead to more stable and faster convergence of the Q-network.

Specifically, we first define a replay memory with a specified memory size. On each iteration, we store a new observation in this replay memory and randomly sample data from this memory to train a model. This random sampling from a replay memory breaks the correlational structure in the training data.

Because of the replay memory, DQN is classified as an off-policy method. On-policy methods use the same policy for both collecting samples and updating the value function. In contrast, off-policy methods use two distinct policies: one for collecting samples and one for updating the value function. In DQN, experience samples are collected with an epsilon-greedy policy based on the Q-network's value estimates at the time. The accumulated experience in the memory buffer is then used to update the Q-network, which in turn updates the policy (acting greedily with respect to the Q-network's estimates). Because we keep updating the Q-network, the data accumulated in the memory buffer was generated by policies that differ from the current one. Thus, the data used to update our target policy is collected with different policies (i.e., off-policy).
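
As a sketch, an epsilon-greedy behavior policy built on the Q-network defined above could look like this (the function name and signature are illustrative):

```python
import random
import torch

def select_action(q_network, state, epsilon):
    """Epsilon-greedy action selection: explore with probability epsilon,
    otherwise act greedily with respect to the current Q-value estimates."""
    if random.random() < epsilon:
        return random.randrange(2)  # CartPole has 2 discrete actions
    with torch.no_grad():
        state_t = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
        return int(q_network(state_t).argmax(dim=1).item())
```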

Below is the code to set up the replay memory.
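
A minimal version of such a replay memory, using a deque of named transition tuples (the class and field names here are illustrative), could look like this:

```python
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ("state", "action", "reward", "next_state", "done"))

class ReplayMemory:
    """Fixed-size buffer that stores transitions and returns random minibatches."""

    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)  # oldest experiences are dropped automatically

    def push(self, *args):
        self.memory.append(Transition(*args))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation between samples
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```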

Q-Value Function Update

As we discussed in the above section, to address the correlation problem in training data, we leverage a replay buffer. Samples extracted from the memory buffer are used to train our model. Let’s learn the detailed process of updating a model.

Firstly, we utilize the Temporal Difference (TD) error to update the value of a state and action pair, following the principles of Q-learning, which is why this approach is termed Deep “Q-learning”. Q-learning is an Off-policy TD learning algorithm.

The TD error is given by the formula:
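
δₜ = Rₜ₊₁ + γ maxₐ Q(Sₜ₊₁, a) − Q(Sₜ, Aₜ),

where γ is the discount factor.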

A crucial difference between original Q-learning and DQN is that the current estimates of the Q-values Q(Sₜ, Aₜ) and Q(Sₜ₊₁, a) are obtained from a deep learning model. The model updates its predictions to minimize this TD error; in other words, it minimizes the mean squared error between the estimated Q-value and the sum of the immediate reward and the discounted value of the best next state-action pair.

However, there is a challenge. While our aim is to update the Q-value estimate by minimizing the TD error, the computation of the next state value Q(Sₜ₊₁, a) within the TD error equation also relies on the same network, which we update at each iteration. As our value estimate for the next state fluctuates with every network update, the error fluctuates too, leading to instability in fitting the Q-network.

To overcome this issue, DQN introduces another neural network called the target network. This network is updated more slowly than the policy Q-network, which defines the agent's policy through its value estimates. Unlike the policy network, the target network is not updated at every iteration. Instead, it is updated only periodically, which stabilizes the learning process.
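
A minimal sketch of how the target network can be maintained (the update interval is an assumption; some implementations use a soft/Polyak update instead of a periodic hard copy):

```python
import copy

# The target network starts as a copy of the policy Q-network defined above.
target_network = copy.deepcopy(q_network)
target_network.eval()  # it is never trained directly

TARGET_UPDATE_EVERY = 10  # assumed sync interval (here: episodes)

def maybe_update_target(episode):
    # Periodically copy the policy network weights into the target network
    if episode % TARGET_UPDATE_EVERY == 0:
        target_network.load_state_dict(q_network.state_dict())
```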

Define DQN Agent

Now that we have covered all the essential components (the deep learning model, the replay memory, and the target network), we can proceed with the implementation to train a DQN agent. The training in the implementation below roughly follows these steps:
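
1. Initialize the replay memory, the policy Q-network, and the target network.
2. At each step of an episode, select an action with the epsilon-greedy policy based on the current Q-network.
3. Execute the action, observe the reward and next state, and store the transition in the replay memory.
4. Sample a random minibatch of transitions from the replay memory.
5. Compute the TD target using the target network and update the policy Q-network by minimizing the (mean squared) TD error.
6. Periodically copy the policy Q-network weights into the target network.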

Let’s explore the implementation.
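
As a sketch of the core update, combining the Q-network, replay memory, and target network defined above (the batch size, discount factor, and memory capacity are assumptions):

```python
import numpy as np
import torch
import torch.nn as nn

BATCH_SIZE = 64  # assumed minibatch size
GAMMA = 0.99     # assumed discount factor

memory = ReplayMemory(capacity=10000)  # replay memory class defined above
loss_fn = nn.MSELoss()

def optimize_step():
    """One gradient update of the policy Q-network from a random replay-memory minibatch."""
    if len(memory) < BATCH_SIZE:
        return  # wait until enough experience has been collected
    batch = Transition(*zip(*memory.sample(BATCH_SIZE)))

    states = torch.as_tensor(np.asarray(batch.state), dtype=torch.float32)
    actions = torch.as_tensor(batch.action, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(batch.reward, dtype=torch.float32)
    next_states = torch.as_tensor(np.asarray(batch.next_state), dtype=torch.float32)
    dones = torch.as_tensor(batch.done, dtype=torch.float32)

    # Q(S_t, A_t): value of the action actually taken, from the policy network
    q_values = q_network(states).gather(1, actions).squeeze(1)

    # TD target: R_{t+1} + gamma * max_a Q_target(S_{t+1}, a); no bootstrap at terminal states
    with torch.no_grad():
        next_q = target_network(next_states).max(dim=1).values
        targets = rewards + GAMMA * next_q * (1.0 - dones)

    loss = loss_fn(q_values, targets)  # mean squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```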

Train and Test DQN Agent

Lastly, let's train a DQN agent and evaluate its performance on the CartPole-v0 environment. In this environment, the maximum achievable reward per episode is 200. We will train the model over 200 episodes, evaluating its performance every 10 episodes with an additional 20 test episodes. This training process will be repeated five times, allowing us to assess the agent's average performance across runs.

Please note that this section may take a few minutes to complete.
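
A sketch of the training and evaluation loop for a single run, reusing the components defined above (the epsilon schedule is an assumption; the full experiment repeats this five times with freshly initialized networks and replay memory):

```python
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v0")
EPS_START, EPS_END, EPS_DECAY = 1.0, 0.05, 0.99  # assumed epsilon schedule

def run_episode(epsilon, learn=True):
    """Run one episode; optionally store transitions and update the Q-network."""
    state, _ = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = select_action(q_network, state, epsilon if learn else 0.0)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        if learn:
            memory.push(state, action, reward, next_state, float(terminated))
            optimize_step()
        state = next_state
        total_reward += reward
    return total_reward

epsilon = EPS_START
for episode in range(1, 201):
    run_episode(epsilon, learn=True)      # one training episode with exploration
    epsilon = max(EPS_END, epsilon * EPS_DECAY)
    maybe_update_target(episode)          # periodically sync the target network
    if episode % 10 == 0:                 # evaluate the greedy policy
        test_returns = [run_episode(0.0, learn=False) for _ in range(20)]
        print(f"episode {episode}: average test return {np.mean(test_returns):.1f}")
```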

The plot of the average return (the total reward received by the agent in an episode), averaged across the 5 runs, shows that the agent progressively learns a more effective strategy and achieves higher returns as the number of episodes increases!

Code

References

- Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529–533.

