Reinforcement Learning: Deep Q-Learning with Atari games

Cheng Xi Tsou · Nerd For Tech · Jul 8, 2021

In my previous post A First Look at Reinforcement Learning, I attempted to use Deep Q learning to solve the CartPole problem. In this post, I will be further exploring Deep Q learning but in the context of Atari games.

In 2013, the Deepmind team’s paper Playing Atari with Deep Reinforcement Learning (Mnih et al.) explored the idea of using Deep Q learning on Atari games. The goal was to present a deep learning model that could handle high-dimensional inputs and achieve superhuman benchmarks across different Atari games with no adjustment to the underlying architecture of the model. After reading the paper, I wanted to implement such a model and hopefully train an agent to play some classic Atari games such as Breakout or Pong. I will also be implementing ideas from the follow-up paper Human-level control through deep reinforcement learning, published in the journal Nature, and I will talk about a modification we can add called Double DQN.

Firstly, I needed to choose an environment for this problem. When I first started this project, I wanted to try and solve the game Tennis. However, after a couple of hours of training my model, the results were no better than random play. Looking at my training parameters, I had only trained my agent for about 50 episodes, where each episode took a maximum of 200 frames to finish. So in total, about 10,000 frames. Looking at the Deepmind paper, they had trained their agent over 10 million frames. While my 10,000 frames took about 1.5 hours to run, 10 million frames would’ve taken me 2 months! In the end, I decided to go with Pong, one of the easier games to learn.

A bot (left) playing against an RL agent (right)

Environment Setup

For this experiment, I will be using OpenAI’s gym library with prebuilt environments. Note that currently, the only ROM bundled with OpenAI’s atari-py package is Tetris, so you will have to import the other ROMs from elsewhere. I will be including the file in my code at the end of the post. Here is the environment:

import gym
#initial environment
env = gym.make('PongNoFrameskip-v4')
#Atari preprocessing wrapper
env = gym.wrappers.AtariPreprocessing(env, noop_max=30, frame_skip=4, screen_size=84, terminal_on_life_loss=False, grayscale_obs=True, grayscale_newaxis=False, scale_obs=False)
#Frame stacking
env = gym.wrappers.FrameStack(env, 4)

In the first line, I initialize the environment ‘PongNoFrameskip-v4’. The environment ID that gym takes is a concatenated string of [‘Game’][‘NoFrameskip’, ‘Deterministic’, None][‘-v0’, ‘-v4’]. Here’s a quick explanation of each term, with a short example after the list:

  • “NoFrameskip”: Each step of the environment is one frame
  • “Deterministic”: Each step executes the same action for k frames and returns the kth frame. k = 4.
  • None: Same as “Deterministic” but k is sampled from [2, 5].
  • “-v0”: Each step has a 0.25 probability of repeating the previous action
  • “-v4”: Each step will follow your issued action
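For example, assuming the Pong ROM has been imported, the same game can be created under each of these naming conventions:

env_noskip = gym.make('PongNoFrameskip-v4')           # one emulator frame per step, actions always followed
env_deterministic = gym.make('PongDeterministic-v4')  # each action repeated for k = 4 frames
env_sampled = gym.make('Pong-v4')                     # each action repeated for k sampled from [2, 5]
env_sticky = gym.make('Pong-v0')                      # 0.25 chance of repeating the previous action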

In the second line, I apply some preprocessing that was detailed in the Nature paper. The Atari wrapper follows the guidelines in Machado et al. (2018), “Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents”. A quick explanation of each parameter:

  • NoopReset: obtain initial state by taking a random number of no-ops (no action) on reset.
  • Frame skipping: 4 by default
  • Max-pooling: Takes the max pixel value of the two most recent observations to remove the flickering effect in some Atari games.
  • Termination signal when a life is lost: turned off by default. Not recommended by Machado et al. (2018).
  • Resize to a square image: 84x84 by default
  • Grayscale observation: Converts observation to grayscale
  • Scale observation: Scales observation to [0, 1]

Note that we will not scale the observation; we keep the value type as uint8 for more memory-efficient storage and only scale the values before feeding them into the model.

The third line stacks every k frames into one observation. This gives the model an idea of how an action plays out rather than just a still frame as input. Note that this does not mean we group every k frames into disjoint chunks; a stacked observation is produced at every frame, so consecutive observations overlap. Our observation dimension should be (4, 84, 84): images of size 84x84 stacked with the last 4 frames. An interesting thing to note about the preprocessed observations is that the Nature paper chose to keep the in-game score rather than crop it out.

The score is outlined in a black box
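As a quick sanity check of the wrapped environment (assuming the setup above), the stacked observation comes out as a uint8 array of shape (4, 84, 84), and we only convert and scale it when passing it to the network:

import numpy as np
obs = env.reset()
frames = np.asarray(obs)             # LazyFrames -> ndarray
print(frames.shape, frames.dtype)    # (4, 84, 84) uint8
model_input = frames.astype(np.float32) / 255.0   # scale to [0, 1] only when feeding the model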

Implementing Deep Q Learning

We will be using an epsilon-greedy algorithm for choosing the best action, where there is an epsilon chance of sampling a random action from the action space. Instead of using epsilon decay, we will be using linear annealing to decrease epsilon from 1 to 0.1 over 1 million frames by Deepmind’s specification.
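Here is a minimal sketch of the annealing schedule and action selection, assuming model is the Q-network defined in the next section and state has already been scaled to [0, 1]:

import numpy as np
EPS_START, EPS_END, ANNEAL_FRAMES = 1.0, 0.1, 1_000_000
def epsilon_at(frame):
    # linearly anneal epsilon from 1.0 to 0.1 over the first million frames, then hold at 0.1
    return EPS_START + min(frame / ANNEAL_FRAMES, 1.0) * (EPS_END - EPS_START)
def choose_action(state, frame):
    # with probability epsilon take a random action, otherwise the greedy one
    if np.random.rand() < epsilon_at(frame):
        return env.action_space.sample()
    return int(np.argmax(model.predict(np.expand_dims(state, axis=0))[0]))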

Similar to the baseline Deep Q-learning algorithm I described in my previous post, we will be using a neural network to learn the Q values of a particular state instead of a lookup Q table. This is because, in a more complex environment such as an Atari game, storing all possible states in a table is far too memory-inefficient, and the Q values would take far longer to converge. We will be using the following equation to get the target Q value and let the network update its weights.

Target Q value for a given state and action: y = r + γ · max_a′ Q(s′, a′; θ)
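In code, the target for a single transition looks roughly like this (a sketch; model is the Q-network defined in the next section, and next_state, reward, done, and discount_factor are placeholder names):

# one-step Bellman target: just the reward if the episode ended, otherwise bootstrap with the max Q value
if done:
    target = reward
else:
    target = reward + discount_factor * np.max(model.predict(np.expand_dims(next_state, axis=0))[0])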

As we are dealing with images as inputs, the network will be a convolutional neural network. As detailed in the Nature paper,

The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ. The first hidden layer convolves 32 filters of 8×8 with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 64 filters of 4×4 with stride 2, again followed by a rectifier nonlinearity. This is followed by a third convolutional layer that convolves 64 filters of 3×3 with stride 1 followed by a rectifier. The final hidden layer is fully-connected and consists of 512 rectifier units. The output layer is a fully-connected linear layer with a single output for each valid action.

My implementation in Keras is shown below:

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Lambda, Conv2D, Flatten, Dense
model = Sequential(
    [
        # convert NCHW inputs (batch, channel, height, width) to NHWC for CPU training
        Lambda(lambda tensor: tf.transpose(tensor, [0, 2, 3, 1]), output_shape=(84, 84, 4), input_shape=(4, 84, 84)),
        Conv2D(32, kernel_size=(8, 8), strides=4, activation="relu"),
        Conv2D(64, kernel_size=(4, 4), strides=2, activation="relu"),
        Conv2D(64, kernel_size=(3, 3), strides=1, activation="relu"),  # third convolutional layer from the Nature architecture
        Flatten(),
        Dense(512, activation="relu"),
        Dense(env.action_space.n, activation="linear"),
    ]
)
rms = tf.keras.optimizers.RMSprop(learning_rate=0.00025, rho=0.95, epsilon=0.01)
model.compile(loss="mse", optimizer=rms)

In my implementation, I included a Lambda layer that converts inputs from the NCHW format (batch, channel, height, width) into NHWC. This is because I’m training my model on the CPU (I couldn’t configure GPU support), and channels-first convolutions are not supported on the CPU. If you are training on a GPU, you can drop the Lambda layer and use (4, 84, 84) as the input shape of the first convolutional layer. The optimizer settings are detailed in the Nature paper. For more information about RMSprop, I found this article to be very helpful.

We will also be using experience replay and keep track of past transitions in a “memory” array. Following the parameters described in the paper, the memory stores up to 1 million transitions and is prefilled with 50,000 transitions from random play. During minibatch updates, a batch is uniformly sampled from the memory to be used for training. The paper also mentions a more sophisticated strategy that emphasizes transitions the model learns the most from, similar to prioritized sweeping; here we stick with uniform sampling, as sketched below.
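A minimal sketch of such a replay memory, assuming a simple (state, action, reward, next_state, done) tuple layout (my choice, not something prescribed by the paper):

import random
from collections import deque
memory = deque(maxlen=1_000_000)    # the oldest transitions are dropped once the buffer is full
def remember(state, action, reward, next_state, done):
    # frames are stored as uint8 arrays to keep the memory footprint manageable
    memory.append((state, action, reward, next_state, done))
def sample_minibatch(batch_size=32):
    # uniform sampling over the stored transitions
    return random.sample(memory, batch_size)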

Another modification is detailed in the paper: clone the network every C updates and use the cloned network as a target network to compute the target Q value, while training only the online network, rather than using the same network for both tasks. This reduces instability, since otherwise every update to the network also shifts the target. Using a target network ensures the target doesn’t change for C updates. As you can see below, the differentiated loss function uses model parameters θᵢ for the current Q value and cloned model parameters θᵢ⁻ for the target Q value.

Gradient when differentiating the loss function: ∇θᵢL(θᵢ) = E[(r + γ · max_a′ Q(s′, a′; θᵢ⁻) − Q(s, a; θᵢ)) ∇θᵢQ(s, a; θᵢ)]

A summary of the algorithm is shown below:
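The sketch below ties the pieces together in Python. It reuses the placeholder helpers from the earlier snippets (epsilon_at, choose_action, remember, sample_minibatch) and assumes target_model is a clone of model; the constants are illustrative rather than the paper’s exact schedule.

import numpy as np
TARGET_SYNC = 10_000      # clone the online network into the target network every C steps
REPLAY_START = 50_000     # prefill threshold before learning begins
UPDATE_EVERY = 4          # one minibatch update every 4 frames
discount_factor = 0.99
frame = 0
while frame < 1_000_000:
    state, done = env.reset(), False
    while not done:
        action = choose_action(np.asarray(state) / 255.0, frame)
        next_state, reward, done, _ = env.step(action)
        remember(np.asarray(state), action, reward, np.asarray(next_state), done)
        state, frame = next_state, frame + 1
        if len(memory) >= REPLAY_START and frame % UPDATE_EVERY == 0:
            states, actions, rewards, next_states, dones = map(np.array, zip(*sample_minibatch(32)))
            q = model.predict(states / 255.0)
            next_q = target_model.predict(next_states / 255.0)
            # Bellman targets, using the target network for the bootstrap term
            q[np.arange(len(actions)), actions] = rewards + (1.0 - dones.astype(np.float32)) * discount_factor * np.max(next_q, axis=1)
            model.train_on_batch(states / 255.0, q)
        if frame % TARGET_SYNC == 0:
            target_model.set_weights(model.get_weights())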

Double DQN

Another modification we can make to this algorithm is called Double Deep Q Learning. In the paper Deep Reinforcement Learning with Double Q-learning (van Hasselt et al., 2015), the problem of overestimating Q values is addressed. Although having overoptimistic values isn’t necessarily a bad thing, it can negatively affect the resulting policy if the overestimation is not uniform across actions.

The change proposed in the paper concerns the target value used during training. The standard DQN target is Yₜ = Rₜ₊₁ + γ · max_a Q(Sₜ₊₁, a; θₜ).

When we use a neural network to train our agent, we want to update our model based on the difference between what our model thinks the Q value is and what it thinks the Q value should be. The equation above shows that our target is the reward plus the discounted future utility, where we use a neural network to predict the action values and select the highest one. The problem here is the max operator: we use the same values both to select an action and to evaluate that action. This can lead to overoptimistic value estimates, so we want a way to decouple selection from evaluation.

To do this, the DQN target is adjusted to Yₜ = Rₜ₊₁ + γ · Q(Sₜ₊₁, argmax_a Q(Sₜ₊₁, a; θₜ); θ′ₜ). Instead of using the max operator to choose the best action and take its value at the same time, we use argmax to select an action with network Q under weights θₜ and then evaluate that action with another set of weights θ′ₜ. The paper combines the concept of Double Q-learning with DQN to create a simple Double DQN modification, where the target network provides the weights θ′ₜ and the online network provides the weights θₜ.

Let’s take a closer look at how the overestimation causes a suboptimal policy. Take this theorem proposed in the paper:

The proof can be found in the paper (see Sources below).

When the estimation errors are unbiased, meaning the differences between the arbitrary estimates Qₜ and the true optimal values V∗ sum to zero across actions but have mean squared error C > 0, the max operator overestimates V∗ by at least √(C/(m−1)), where m is the number of actions, and this lower bound is tight. For example, with m = 6 actions (Pong’s action space) and C = 1, the overestimation is at least √(1/5) ≈ 0.45. With Double Q-learning, the lower bound on the error is 0. In my implementation, the Double DQN target looks like this:

#action selection with online network
best_future_action = np.argmax(model.predict(np.expand_dims(new_state, axis=0)))
#action evaluation with target network
target = reward + discount_factor * target_model.predict(np.expand_dims(new_state, axis=0))[0][best_future_action]

My results

After implementing the algorithm, I was ready to train my agent. However, I severely underestimated the training time. In the Nature paper, the agent was trained for 50 million frames, the equivalent of 38 days of playing time. As for me, I chose an easier Atari game that would take a shorter time to converge, around 750,000 frames according to some examples I’ve seen. Even with the reduced training time, training on my local CPU would’ve taken days, and I ran into memory issues. I decided to train my model for 200,000 frames, with the other hyperparameters scaled to the number of training frames.

First, let’s see how we did with a random policy. Over 150 playthroughs, the random policy had an average score of -20.273, where the score is the agent’s points minus the CPU’s points and the game ends at 21 points. As expected, the random policy is not very good compared to the human benchmark of -3.

After training my model, I did not get the results I expected. The score did not improve over time, and the resulting policy was only as good as a random one. Checking the training loss history, though, the model did seem to be reducing its loss.

Training Pong over 150 episodes
Training Loss of Pong

So why was my model not training? I have a couple of hypotheses. First, I scaled down all the hyperparameters when moving from 10 million training frames to 200,000. Although everything is still in proportion, there is a major problem: the number of frames it takes to actually solve the game stays the same. Since epsilon is annealed over only the first 20,000 frames, the learned policy is still just as good as a random policy by the time exploration winds down.

Secondly, this could be due to the sparse nature of rewards in Pong. In the CartPole environment, we rewarded the agent for each timestep that the pole stayed upright. This kind of feedback allowed the agent to quickly distinguish between good and bad actions, and we even modified the reward to scale with the pole angle for more precise feedback. DQN is a different case, however: it was proposed as a general solution for all Atari game environments given only an image input. As such, we aren’t able to assign more precise rewards, and the reward in Pong is simply -1 or +1 when a point is scored. This means there are far fewer rewards for the agent to learn from in each playthrough, and there are many frames between each reward, which makes it harder for the agent to grasp the future utility of an action.

After a sanity check using the CartPole environment with hyperparameters similar to those in the last post, I had the following results:

Phew. At least the algorithm works and the agent will learn in an easier environment. Compared to a baseline DQN implementation as detailed in the 2013 paper, I got similar results, with an average score of 476 over 10 episodes.

Overall, this was a fun experience, and I got to go more in-depth into the DQN algorithm, adding target networks and a naive Double Q-learning implementation. There are still improvements to be made, such as prioritized sweeping and dueling networks, but those are to be explored in the future. Although DQN was the first model used for solving Atari games with image inputs, it does have a long training time and a slow convergence rate. In the future, I will be exploring other kinds of algorithms such as Deep Deterministic Policy Gradient (DDPG) for continuous actions, Proximal Policy Optimization, A3C, and more!

My code: https://github.com/chengxi600/RLStuff/blob/master/Q%20Learning/Atari_DQN.ipynb

Sources:

  • Mnih, V. et al. (2013). Playing Atari with Deep Reinforcement Learning.
  • Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature.
  • van Hasselt, H., Guez, A., & Silver, D. (2015). Deep Reinforcement Learning with Double Q-learning.
  • Machado, M. C. et al. (2018). Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents.
