Part 8 — Tic Tac Toe with Policy Gradient Descent

Carsten Friedrich
7 min read · Jul 20, 2018

Do you really need values if you have good policies?

The aim of this part is to try a different Reinforcement Learning approach. Instead of learning the Q function and then basing our action policy on those Q values, we will try to learn a good policy directly.

Value based learning recap

The general approach we have used in the previous parts is a classic case of value based learning:

  • We know that there is a value function, the Q function, which can tell us exactly how good a certain move is in a certain state.
  • We train a Neural Network to learn this Q function.
  • When we have to make a move, we do so according to the following policy: we look at the Q values of all currently possible moves and choose the one with the highest Q value.
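
For reference, this value-based move selection boils down to something like the following sketch (plain numpy with made-up Q values and legal moves, not the actual code from the earlier parts):

```python
import numpy as np

# Hypothetical Q values predicted by the network for the 9 board positions,
# and the indices of the moves that are currently legal.
q_values = np.array([0.1, -0.4, 0.7, 0.0, 0.2, -0.1, 0.3, 0.5, -0.2])
valid_moves = [0, 2, 4, 7]

# Value-based policy: among the legal moves, play the one with the highest Q value.
best_move = valid_moves[int(np.argmax(q_values[valid_moves]))]
print(best_move)  # -> 2
```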

This, in the end, seemed to work reasonably well. However there are a few disadvantages to this approach:

  1. We do not know the Q function when we start training. This means we are actually trying to do two things at the same time: a) learn what the Q function is, and b) train the Neural Network to emulate the Q function.
  2. This means we are not actually training our Neural Network against the real Q function, but rather against our current best guess of what the Q function might be. If this best guess is not very accurate, the Neural Network will not perform well, no matter how well it emulates that guess.
  3. There are circumstances where, figuratively speaking, it’s quite hard to determine how good a given situation is, however comparatively easier to determine what the best action is. In such cases, trying to learn the Q function to determine the action might be more complicated than necessary.

Direct Policy Learning

The idea behind Direct Policy Learning based methods is to learn the best action policy directly without an explicit value function. We will do this by repeatedly playing the game and

  • rewarding moves that led to a positive outcome by increasing the probability that they are chosen, while
  • discouraging moves that led to a negative outcome by decreasing their probability.

In contrast to the Q learning approach we do not care how the Neural Network comes up with the policy. There may be a network layer which by chance computes something very similar to the Q function, but then again it may not. All we care about is that the Neural Network in the end outputs a policy, i.e. tells us which action to take to maximise our chance of a positive final outcome.

Deterministic policy vs stochastic policy

A policy can either be deterministic or stochastic. A deterministic policy tells us explicitly which action to take in a given situation. A stochastic policy, in contrast, tells us, for a given state and action, the probability with which we should take that action (with the probabilities of all actions in a given state summing to 1).

Why would we ever need a stochastic policy? Shouldn’t a deterministic policy be just fine as it tells us exactly what the best move is? Why would we ever take an action which is not the best?

Turns out, in some scenarios, being able to choose based on probabilities is essential. One extreme case in this regard is the game Rock-Paper-Scissors: there is no good deterministic strategy for this game, and it is absolutely essential that you choose your actions stochastically to be successful.

In our case, being stochastic has the benefit that it allows our Neural Network to explore options while it learns the best policy. A deterministic policy is much more likely to get stuck in a local minimum.
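
The difference is easy to see in a small sketch (the probabilities below are made up for illustration):

```python
import numpy as np

# Hypothetical stochastic policy output pi(s, a) over the 9 board positions;
# the probabilities sum to 1.
probs = np.array([0.05, 0.1, 0.4, 0.05, 0.2, 0.05, 0.05, 0.05, 0.05])

# Deterministic policy: always play the single most probable move.
deterministic_move = int(np.argmax(probs))

# Stochastic policy: sample a move according to the probabilities. Less likely
# moves still get played occasionally, which gives us exploration for free.
stochastic_move = int(np.random.choice(len(probs), p=probs))
```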

For the rest of this part, we will focus on stochastic policies.

We will denote policies by π(s,a), where π(s,a) stands for the probability of taking action a in state s under the policy.

Parameterized policies

Instead of talking about finding the best policy π, we will talk about finding the best parameters ϕ for our policy π. That is, the policy is implemented by the Neural Network, and the parameters that drive the policy are the learnable weights in the network. We write this parameterized policy as πϕ.
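
As a rough illustration, here is a minimal TensorFlow 1.x sketch of such a parameterized policy in the style of the networks from the previous parts; the layer size and the one-hot board encoding are assumptions for illustration, not the actual DirectPolicyAgent code:

```python
import tensorflow as tf  # TensorFlow 1.x, as used throughout this series

# The policy pi_phi is just a small network whose weights are the parameters phi.
state_in = tf.placeholder(shape=[None, 9 * 3], dtype=tf.float32)  # assumed one-hot board encoding
hidden = tf.layers.dense(state_in, 64, activation=tf.nn.relu)      # illustrative hidden layer size
logits = tf.layers.dense(hidden, 9)                                 # one logit per board position
policy = tf.nn.softmax(logits)                                      # pi_phi(s, a): move probabilities summing to 1
```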

Quality of policies

We previously mentioned that we want to find a policy that is good, without defining what exactly we mean by good. This is a bit too vague for our purposes and we need to tie this down: We define the quality of a policy as the reward we can expect to receive if we always take actions according to the policy. The reward is a single number, the higher, the better. We write the quality of a policy with parameter ϕ as:

J(ϕ) = Eπ[G]

where G = γ R1 + γ^2 R2 + … + γ^n Rn is the total (i.e. discounted accumulated) reward from here until the end of the game, and Eπ denotes its expected value when playing according to πϕ.
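
To make G concrete, here is a tiny sketch of how a sampled G can be computed from the rewards observed during one game (the discount factor and the reward sequence are made up for illustration):

```python
GAMMA = 0.95  # illustrative discount factor

def discounted_return(rewards, gamma=GAMMA):
    """Compute G = gamma*R1 + gamma^2*R2 + ... + gamma^n*Rn for one observed reward sequence."""
    return sum(gamma ** (t + 1) * r for t, r in enumerate(rewards))

# Example: no intermediate rewards, a final reward of 1 on the third move from now.
print(discounted_return([0.0, 0.0, 1.0]))  # 0.95**3 = 0.857375
```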

We can then define a policy πϕ1 to be better than a policy πϕ2 if it has a higher expected reward when applied to the same starting state.

Finding the best policy — Policy Gradient Descent

Having defined the quality of a policy and a way of comparing two policies, finding the best policy is now an optimisation problem. Ideally, we'd want to compute the gradient of J(ϕ) with respect to the parameters ϕ and follow it to a maximum.

So, how do we compute the policy gradient?

If you really want to know, the answer is rather mathematical and technical.

A lot of people have already written very good articles about it; if you are interested, have a look at one of the many derivations of the Policy Gradient Theorem available online.

We’ll jump straight to the result. It turns out that the gradient of the quality of the policy is the expected value of the gradient of the log probability of the chosen action, multiplied by the reward G that follows it:

∇ϕ J(ϕ) = Eπ[ ∇ϕ log(πϕ(s,a)) G ]

All of these things we have available: the probability π(s,a) is what our Neural Network outputs; it is differentiable, and TensorFlow can compute the derivative for us. We can sample G by repeated play, recording the observed rewards and computing G = γ R1 + γ^2 R2 + … + γ^n Rn, i.e. the discounted accumulated reward from the current state to the end of the game, similar to how we did this in Q learning.

This means we have all we need to implement this.

Policy Gradient in TensorFlow

We can recycle most of the code we used to build the Neural Network from our Q-learning efforts. We will reuse the non-convolutional network, but with added weight regularization loss. You can find the complete code in DirectPolicyAgent.py which, like most of the previous code, is closely based on the work by Arthur Juliani. The only parts we need to change from our previous network code are:

  • The loss function.
  • What we feed into the loss function. In particular we will feed sample rewards instead of Q value estimates.
  • A small change to the reward function.

The new loss function and training input

Let’s start with the loss function.
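
The exact snippet is part of DirectPolicyAgent.py; roughly, in TensorFlow 1.x style and with the placeholder names discussed below, it amounts to something like this (a sketch rather than the verbatim class code; the indexing trick for picking out the probability of the chosen action follows Arthur Juliani's approach):

```python
# Sketch only; self.output is the softmax output of the network, shape [batch, 9].
self.reward_holder = tf.placeholder(shape=[None], dtype=tf.float32)  # sampled discounted rewards G
self.action_holder = tf.placeholder(shape=[None], dtype=tf.int32)    # index of the move actually played

# Pick out pi(s, a) for the action that was actually taken in each sample.
indexes = tf.range(0, tf.shape(self.output)[0]) * tf.shape(self.output)[1] + self.action_holder
responsible_outputs = tf.gather(tf.reshape(self.output, [-1]), indexes)

# log(pi(s, a)) * G, averaged over the batch. The small constant 1e-9 keeps the
# logarithm away from log(0) = -inf, and the leading minus sign turns the
# maximisation into a minimisation for the TensorFlow optimizer. The weight
# regularization loss mentioned above would be added on top of this term.
self.loss = -tf.reduce_mean(tf.log(responsible_outputs + 1e-9) * self.reward_holder)
```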

Apart from the output of the Neural Network in self.output, we provide as input the action that was chosen in self.action_holder and the discounted reward we observed for that action in self.reward_holder.

We also add a small constant, 1e−9, to the output probabilities to avoid the logarithm going to negative infinity (log(0) = −∞).

Finally, we multiply the loss by -1 to account for the fact that the TensorFlow optimizer will try to minimise the function, whereas, as per the previous section, we want to maximise it.

The new reward function

Previously we have used the following values / rewards:

  • Winning action: 1
  • Losing action: -1
  • Draw action: 0
  • Other, non game-ending, action: discounted reward γ^t · finalReward, where t is the number of moves before the end of the game at which the action was played, and finalReward is the reward of the final move of the game.

This is nicely intuitive and symmetric, and in theory, I believe, it should work fine. In practice, however, I found it does not work so well for our example. I suspect this is because, for negative rewards, the minimum (and the gradient) of the loss function would go to -∞. This would be highly asymmetric compared to a positive reward, and probably also numerically rather unstable. So instead we use:

  • Winning action: 1
  • Losing action: 0
  • Draw action: 0.5
  • Other, non game-ending, action: discounted reward γ^t · finalReward, where t is the number of moves before the end of the game at which the action was played, and finalReward is the reward of the final move of the game.

Note that in this case all actions that lead to a loss are effectively ignored: the reward and discounted reward are 0, and 0 multiplied with the log probabilities is still 0, so the loss is always 0, with 0 gradient and no chance of learning anything. We thus rely completely on draws and wins to give the necessary signal to learn. It's not quite clear to me how to better factor in losses, as, for the reasons mentioned above, negative rewards do not seem to be the right way either.
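
To see the effect, here is a small sketch of how per-move reward targets could be derived from a finished game under this scheme (the function name, the discount factor, and the overall organisation are illustrative, not the actual agent code):

```python
GAMMA = 0.95  # illustrative discount factor

# Final reward under the new scheme.
FINAL_REWARD = {'win': 1.0, 'loss': 0.0, 'draw': 0.5}

def move_rewards(num_moves, outcome, gamma=GAMMA):
    """Reward target for each of our moves: gamma^t * finalReward, where t is the
    number of our moves between this move and the end of the game."""
    final = FINAL_REWARD[outcome]
    return [gamma ** (num_moves - 1 - i) * final for i in range(num_moves)]

print(move_rewards(3, 'win'))   # [0.9025, 0.95, 1.0]
print(move_rewards(3, 'loss'))  # [0.0, 0.0, 0.0] -> zero loss, zero gradient, nothing learned
```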

Time to take it for a spin

Some test runs for the Policy Gradient Descent Tic Tac Toe player

Conclusion

We have introduced Policy Gradient Descent as an alternative method to Q learning as a Reinforcement Learning method. We have also seen how to implement it in TensorFlow in our Tic Tac Toe example.

The results are quite solid, in particular keeping in mind that we have not used Convolutional Layers so far. We don't have a direct comparison with Q learning, though, as we also used Loss Regularisation here, which we didn't use in Q-learning without Convolutional layers. So there is a chance that it's really the Loss Regularisation that made all the difference, but I doubt it.

Ways to take this further would be:

  1. Add Convolutional layers to the Policy Gradient Descent method.
  2. Try Actor Critic Methods.

I might get around to trying this in the future. Or, as always, feel encouraged to do it and report back!
