Proximal Policy Optimization Tutorial (Part 2/2: GAE and PPO loss)

Let’s code a Reinforcement Learning football agent from scratch!

DG AI Team
deepgamingai
Jun 30, 2020


Part 1 link: Proximal Policy Optimization Tutorial (Part 1: Actor-Critic Method)

Welcome to the second part of the Reinforcement Learning math and code tutorial series. In the first part of this series, we saw how to set up the Google Football Environment and then implemented an Actor-Critic model framework to interact with and collect sample experiences from this game environment.

Today, we will complete the rest of the tutorial by using that batch of sample experiences to train our model to score some goals in the game. Recall from the code we implemented last time that we have collected the following information at every time step: the states we visited, the actions the actor took along with their predicted probabilities, the critic’s value estimates, the rewards received, and the masks marking where a game ended.

Using this information, we can now go ahead and calculate advantages.

Generalized Advantage Estimation (GAE)

Advantage is a way to measure how much better off we are by taking a particular action when we are in a particular state. We want to use the rewards that we collected at each time step and calculate how much of an advantage we were able to obtain by taking the action that we took. So if we took a good action like shooting towards the goal, we want to calculate how much better off we were by taking that action, not only in the short run but also over a longer period of time. This way, even if we do not score a goal in the very next time step after shooting, we still look a few time steps into the future after that action to see if we scored a goal.

In order to calculate this, we’ll use an algorithm known as Generalized Advantage Estimation or GAE. So let’s take a look at how this algorithm works using the batch of experiences we have collected.

Generalized Advantage Estimation Algorithm
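The algorithm figure from the original post is not reproduced in this text version. It presumably shows the standard GAE recursion from the Schulman et al. paper, which the bullets below walk through. In the notation used there (reward r_t, critic value estimate V(s_t), mask m_t, discount γ, smoothing λ), each time step t, processed backwards through the batch, computes

\delta_t = r_t + \gamma \, m_t \, V(s_{t+1}) - V(s_t)

A_t = \delta_t + \gamma \, \lambda \, m_t \, A_{t+1}

R_t = A_t + V(s_t)

where A_t is the advantage at time step t and R_t is the return that will later serve as the critic’s training target.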
  • Here, a mask value m is used because, if the game is over, the next state in our batch comes from a newly restarted game. We do not want that state’s value to leak into the current episode, so the mask is set to 0 at the end of a game (and 1 otherwise).
  • Gamma γ is the discount factor, a constant that reduces the value attributed to future states, since we want to emphasize the current state more than future ones. Think of it as scoring a goal in the present being more valuable than scoring a goal in the future; we discount future goals so that present goals carry more weight.
  • Lambda λ is a smoothing parameter that reduces the variance in training and makes it more stable; the value suggested in the paper is 0.95. Accumulating these smoothed terms gives us the advantage of taking an action both in the short term and in the long term.
  • In the last step, we simply reverse the list of returns, since we were looping from the last time step to the first, so that we recover the original chronological order.

That is the GAE algorithm, and it can be implemented in our code as follows.
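The code embed from the original post is not reproduced in this text version, so here is a minimal sketch of the computation described above. Function and parameter names (get_advantages, lmbda, and so on) are illustrative assumptions rather than the exact names used in the repository, and the values list is assumed to contain one extra entry for the state reached after the final step.

import numpy as np

def get_advantages(values, masks, rewards, gamma=0.99, lmbda=0.95):
    # values  : critic estimates V(s) for every state in the batch, plus one
    #           extra entry for the state reached after the last action
    # masks   : 0 where the game ended at that step, 1 otherwise
    # rewards : reward collected at each time step
    returns = []
    gae = 0
    # Loop from the last time step back to the first.
    for i in reversed(range(len(rewards))):
        # TD error: reward plus discounted (and masked) next value, minus current value.
        delta = rewards[i] + gamma * values[i + 1] * masks[i] - values[i]
        # Smoothed advantage; the mask stops it from leaking across game restarts.
        gae = delta + gamma * lmbda * masks[i] * gae
        # Return used as the critic's target: advantage + value estimate.
        returns.append(gae + values[i])
    # We looped backwards, so reverse to restore the original order.
    returns.reverse()
    advantages = np.array(returns) - np.array(values[:-1])
    # Normalizing the advantages is a common trick that helps stabilize training.
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-10)
    return returns, advantages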

Here’s a line-by-line explanation of this algorithm in the video below.

We now have everything we need in order to train our actor and critic models. So we will see how to use this information to calculate a custom PPO loss and use that loss for training the actor model.

Custom PPO loss

This is the most important part of the Proximal Policy Optimization algorithm. So let’s first understand this loss function.

Recall that π indicates the policy that is defined by our Actor neural network model. By training this model, we want to improve this policy so that it gives us better and better actions over time. Now, a major problem in some Reinforcement Learning approaches is that once our model adopts a bad policy, it only takes bad actions in the game, so we are unable to generate any good actions from there on, leading us down an unrecoverable path in training. PPO tries to address this by making only small updates to the model in each update step, thereby stabilizing the training process. The PPO loss can be calculated as follows.

Custom PPO loss calculation
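The loss figure is likewise not reproduced here; the points below walk through the standard clipped surrogate objective from the PPO paper. Writing r_t(θ) for the probability ratio between the new policy π_θ and the old policy, and A_t for the GAE advantage, it reads:

r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}

L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\, A_t,\ \text{clip}\big(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\big)\, A_t \right) \right]

L(\theta) = -L^{\text{CLIP}}(\theta) + c_1 \big(V_\theta(s_t) - R_t\big)^2 - c_2\, H\big[\pi_\theta\big](s_t)

where c_1 is the weighting coefficient on the critic’s mean-squared-error term and c_2 is the entropy beta mentioned below.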
  • PPO uses the ratio between the newly updated policy and the old policy in the update step. Computationally, it is easier to compute this ratio by subtracting the log probabilities and exponentiating the result.
  • Using this ratio, we can decide how much change in policy we are willing to tolerate. Hence, we use a clipping parameter epsilon ε to keep the ratio within [1 − ε, 1 + ε], so that the policy changes by only a limited amount at a time. The paper suggests keeping epsilon at 0.2.
  • The critic loss is simply the usual mean squared error between the predicted values and the returns.
  • We can combine the actor and critic losses if we want, using a weighting coefficient to bring them to the same order of magnitude. Adding an entropy term is optional, but it encourages our actor model to explore different policies, and the degree to which we want it to explore can be controlled by an entropy beta parameter.

This custom loss function can be defined with Keras using the following code.
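Since the code embed is not included in this text version, the following is a minimal Keras sketch of such a custom loss. It assumes the old policy probabilities, advantages, returns, and value estimates are fed to the actor as extra Input tensors so the loss closure can capture them; the names ppo_loss, clipping_val, critic_discount, and entropy_beta are illustrative assumptions, not necessarily those used in the repository.

import keras.backend as K

clipping_val = 0.2      # epsilon from the PPO paper
critic_discount = 0.5   # weight that brings the critic term to the actor loss scale
entropy_beta = 0.001    # weight of the entropy bonus that encourages exploration

def ppo_loss(oldpolicy_probs, advantages, returns, values):
    # Returns a Keras-compatible loss closure for the actor model.
    def loss(y_true, y_pred):
        # y_true: one-hot of the action actually taken, shape (batch, n_actions)
        # y_pred: the new policy's probabilities,        shape (batch, n_actions)
        adv = K.flatten(advantages)
        # Probability the new / old policy assigned to the taken action.
        new_prob = K.sum(y_true * y_pred, axis=-1)
        old_prob = K.sum(y_true * oldpolicy_probs, axis=-1)
        # Ratio computed in log space for numerical convenience.
        ratio = K.exp(K.log(new_prob + 1e-10) - K.log(old_prob + 1e-10))
        p1 = ratio * adv
        p2 = K.clip(ratio, 1 - clipping_val, 1 + clipping_val) * adv
        # Clipped surrogate objective, negated because Keras minimizes the loss.
        actor_loss = -K.mean(K.minimum(p1, p2))
        # Critic term: MSE between returns and value estimates (included here as the
        # article describes; it can be dropped if the critic is trained separately).
        critic_loss = K.mean(K.square(returns - values))
        # Entropy bonus: a more exploratory policy lowers the total loss.
        entropy = -K.mean(K.sum(y_pred * K.log(y_pred + 1e-10), axis=-1))
        return actor_loss + critic_discount * critic_loss - entropy_beta * entropy
    return loss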

Here’s the line-by-line explanation and implementation of this custom loss function in the video embedded below.

Model Training and Evaluation

Now we can finally start the model training. For this, let’s use the fit function of Keras as follows.
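The fit call itself is not shown in this text version. Assuming the actor was compiled with the ppo_loss closure above (its extra Input tensors passed into the closure) and the critic was compiled with a plain mean-squared-error loss, a hedged sketch might look like this; variable names such as states, old_action_probs, advantages, returns, values, and actions_onehot stand for the batch collected in part 1, and the number of epochs per batch is illustrative.

import numpy as np

# One training iteration on the batch of collected experience. The extra inputs
# feed the Input tensors captured by the custom ppo_loss closure at compile time.
model_actor.fit(
    x=[states, old_action_probs, advantages, np.array(returns), values],
    y=actions_onehot,                       # one-hot of the actions actually taken
    epochs=8, shuffle=True, verbose=True)

# The critic is simply regressed towards the returns computed by GAE.
model_critic.fit(
    x=states,
    y=np.reshape(returns, newshape=(-1, 1)),
    epochs=8, shuffle=True, verbose=True)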

You should now be able to see on your screen the model taking different actions and collecting rewards from the environment. At the beginning of the training process, the actions may seem fairly random as the randomly initialized model is exploring the game environment.

OK, now let’s implement some model evaluation code. This will tell us, during model training, how good the latest version of the model is in terms of successfully scoring a goal. To evaluate that, we’ll calculate the average reward, defined as the mean of the total rewards we obtain by playing the game from scratch multiple times. If we score a goal in, let’s say, 4 out of 5 games, our average reward will be 0.8, i.e. 80%. This can be implemented as follows.
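The evaluation code embed is also missing here; a minimal sketch under the same assumptions (a gym-style step API, and an actor that takes dummy placeholders for the extra inputs used only by the training loss) could look like this. The function and variable names are again illustrative.

import numpy as np

def test_reward(env, model_actor):
    # Play one game from scratch with the current actor and return its total reward.
    state = env.reset()
    done = False
    total_reward = 0.0
    # Placeholders for the extra inputs that are only needed by the training loss.
    dummy_n = np.zeros((1, env.action_space.n))
    dummy_1 = np.zeros((1, 1))
    while not done:
        state_input = np.expand_dims(state, axis=0)
        action_probs = model_actor.predict([state_input, dummy_n, dummy_1, dummy_1, dummy_1])
        action = np.argmax(action_probs)           # act greedily during evaluation
        state, reward, done, _ = env.step(action)
        total_reward += reward
    return total_reward

# Average reward over several games played from scratch, e.g. 4 goals in 5 games -> 0.8.
avg_reward = np.mean([test_reward(env, model_actor) for _ in range(5)])
print('Average test reward:', avg_reward)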

The testing phase will look like the following once the model begins to learn which set of actions produces the best long-term rewards. In our case, hitting the ball to the right is observed to have produced the best rewards, hence our Actor model produces the right-direction and shoot actions as its preferred outputs.

The rest of the code used to tie everything together can be found here in the train.py script of this GitHub repository.

If you want to learn this implementation with line-by-line explanation, you may watch the video below.

Conclusion

I hope this tutorial has given you a good idea of the basic PPO algorithm. You can now build upon this by executing multiple environments in parallel to collect more training samples, and by tackling more complicated game scenarios like the full 11-vs-11 mode or scoring from corner kicks. Some useful references that can help with that, and which I also used for this tutorial, can be found here and here. Best of luck!

Thank you for reading. If you liked this article, you may follow more of my work on Medium, GitHub, or subscribe to my YouTube channel.

Note: This is a repost of the article originally published with towardsdatascience in 2019.

