Part 6 — Double Duelling Q Network with Experience Replay

Carsten Friedrich
Jun 6, 2018


In the previous part we discovered that just being a bit less greedy in our action policy was not enough. It may or may not have helped a bit in some cases, but it certainly didn’t solve the problem.

Maybe it’s time to look at more sophisticated network topologies. For guidance, let’s see what the experts do. For me this means the excellent blog post by Arthur Juliani: Simple Reinforcement Learning with Tensorflow Part 4: Deep Q-Networks and Beyond.

I encourage you to read the article yourself, but here is a summary of what he does there that we currently don’t:

  • Convolutional Layers
  • Experience Replay
  • Pre-training games
  • Separate Target Network (Double Q Network)
  • Duelling Q Network

Convolutional Layers

Convolutional network layers are very powerful for processing and analysing visual images: they identify patterns in stacks of 2D input arrays, typically the different colour bands of a 2D raster image.

Given the inherent 2D nature of a Tic Tac Toe board, this might help quite a lot.

Experience Replay

By storing previous games in an experience buffer and reusing this past experience when training the network we can increase the stability of the training. We avoid the network getting stuck in patterns where it keeps playing the same bad moves over and over again with little change to outcomes or learning. In particular, my hope is that this will help us in cases where positive experiences are very rare, e.g. playing second against the Min Max Player.
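To make this concrete, here is a minimal sketch of such a buffer. It is not the exact buffer used in ExpDoubleDuelQPlayer, just the general idea: store transitions, forget the oldest ones, and hand back random samples for training.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity=10000):
        # Oldest experiences are dropped automatically once capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampling uniformly at random breaks up the correlation between
        # consecutive moves of the same game.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```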

Pre-training games

This is a set of games played before training begins, in which we choose moves completely at random. This gives us a reasonably broad sample of possible positions and outcomes. If we start directly with a network-based move policy, we risk that the random weight initialisation creates a policy which is stuck in a local minimum and very unlikely to explore a wide set of actions — even with an ϵ-greedy strategy.
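Building on the buffer sketch above, pre-training can simply mean filling the buffer from purely random games before the network gets involved. `play_random_game()` and the number of games below are placeholders of mine, not the actual ExpDoubleDuelQPlayer API:

```python
# Hypothetical sketch: fill the replay buffer with transitions from games
# played with purely random moves before any network-based training starts.
# `play_random_game()` is assumed to return a list of
# (state, action, reward, next_state, done) tuples for one game.
buffer = ReplayBuffer(capacity=10000)

PRE_TRAIN_GAMES = 500  # arbitrary choice for illustration

for _ in range(PRE_TRAIN_GAMES):
    for transition in play_random_game():
        buffer.add(*transition)
```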

Separate Target Network

According to the experts, using the same Q-Network both for training and as the target Q function when computing the loss can become unstable and spiral out of control. By separating the target out into an independent second network, training should become more stable.

Our current topology looks like this, with the current state being s, the state after move a being s′, the reward of the current move being r, and the discount value for future expected rewards being γ:

After adding a target network it will look like this:

Every now and then we update the target network by copying the weights from the main network to it.
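Schematically, and assuming two tf.keras-style models with identical architecture (the constants below are illustrative, not the values used in ExpDoubleDuelQPlayer), the target computation and the occasional weight copy could look like this. The "double" part is the common trick of letting the main network pick the best next move while the target network scores it:

```python
import numpy as np

GAMMA = 0.95               # discount factor γ from above
TARGET_UPDATE_EVERY = 100  # illustrative value

def q_targets(batch, model, target_model):
    """Compute r + γ·Q_target(s', a*) for a sampled batch of transitions."""
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    # The main network chooses the best next move ...
    best_next = np.argmax(model.predict(next_states, verbose=0), axis=1)
    # ... but the target network, which is only refreshed occasionally,
    # provides its value, keeping the regression target stable.
    next_q = target_model.predict(next_states, verbose=0)
    next_values = next_q[np.arange(len(batch)), best_next]
    return rewards + GAMMA * next_values * (1.0 - dones)

def maybe_sync(step, model, target_model):
    """Every now and then copy the main network's weights into the target network."""
    if step % TARGET_UPDATE_EVERY == 0:
        target_model.set_weights(model.get_weights())
```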

Duelling Q Network

Currently the network produces an absolute quality value for every move. A potential improvement is to put the value of a move in relation to the value of the current board state: rather than looking at the absolute Q value of a move, we ask how much a move improves or worsens the value of the current board state. In other words: how much better or worse is a move compared to the other possible moves in this state? We call this the Advantage of the move, i.e.

Q(s,a)=V(s)+A(a)

where Q(s,a) is the Q value of move a in state s, V(s) is the Q value of state s, and A(a) is the Advantage of move a.
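To make the split concrete, here is a rough tf.keras sketch of such a duelling head for the 9 possible moves. The layer sizes, the flat board encoding, and the mean-subtraction when combining the two streams are my assumptions for illustration; the actual ExpDoubleDuelQPlayer code may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_duelling_q_network(input_size=27, hidden_size=64, num_actions=9):
    # Sketch only: sizes and the flat board encoding are illustrative assumptions.
    inputs = layers.Input(shape=(input_size,))
    hidden = layers.Dense(hidden_size, activation="relu")(inputs)

    # Value stream: a single estimate V(s) of how good the board state is.
    value = layers.Dense(1)(layers.Dense(hidden_size, activation="relu")(hidden))

    # Advantage stream: one A(a) per possible move.
    advantage = layers.Dense(num_actions)(
        layers.Dense(hidden_size, activation="relu")(hidden)
    )

    def combine(streams):
        v, a = streams
        # Q(s, a) = V(s) + A(a); subtracting the mean advantage is a common
        # extra step that keeps V and A from shifting against each other.
        return v + (a - tf.reduce_mean(a, axis=1, keepdims=True))

    q_values = layers.Lambda(combine)([value, advantage])
    return tf.keras.Model(inputs=inputs, outputs=q_values)
```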

The Duelling Q Network would look like this:

Putting it all together

If we wanted to be methodical about this, we should try each of those potential improvements independently and see if and how much they actually improve things. Then all combinations. Nobody has time for that, so we add Experience Replay, Pre-training games, a separate target network, and the Duelling Q Network topology all at once. We don’t quite do Convolutional Layers yet, however. The code for this is in ExpDoubleDuelQPlayer.
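For orientation, here is a rough sketch of how these pieces could fit together in one training loop. The helpers `reset_board`, `epsilon_greedy_move`, `env_step` and `train_on_batch_targets` are placeholders of mine, not the actual ExpDoubleDuelQPlayer API:

```python
BATCH_SIZE = 32
TRAINING_GAMES = 10000  # illustrative
total_steps = 0

for game in range(TRAINING_GAMES):
    state, done = reset_board(), False
    while not done:
        action = epsilon_greedy_move(model, state)           # explore / exploit
        next_state, reward, done = env_step(state, action)   # play the move
        buffer.add(state, action, reward, next_state, done)  # remember it
        state = next_state
        total_steps += 1

        if len(buffer) >= BATCH_SIZE:
            batch = buffer.sample(BATCH_SIZE)
            targets = q_targets(batch, model, target_model)  # targets via target net
            train_on_batch_targets(model, batch, targets)    # one gradient step

        maybe_sync(total_steps, model, target_model)         # occasional weight copy
```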

Let’s take it for a spin, starting with the easy case — Going first against the RandomPlayer and then trying our luck against the other suspects:

Better than before, but still not perfect. It seems to get somewhere around 80 ± 15 % draws in the end. Not the 100% we were looking for.

To summarize:

Player      | NN Player 1st         | NN Player 2nd
------------|-----------------------|-------------------------
Random      | Improved, not perfect | Improved, not perfect
Min Max     | Seems to work         | Seems to work
Rnd Min Max | Seems to work         | Better, but not perfect

In the next part we will add convolutional network layers and see if this helps us gain the last few missing percentage points.
