Proximal Policy Optimization Tutorial (Part 1/2: Actor-Critic Method)

Let’s code from scratch a Reinforcement Learning football agent!

DG AI Team
deepgamingai
7 min read · Jun 30, 2020


Welcome to the first part of a math and code tutorial series. I’ll be showing how to implement a Reinforcement Learning algorithm known as Proximal Policy Optimization (PPO) for teaching an AI agent how to play football/soccer. By the end of this tutorial, you’ll have an idea of how to apply an on-policy learning method in an actor-critic framework in order to learn to navigate any game environment. We shall see what these terms mean in the context of the PPO algorithm and also implement them in Python with the help of Keras. So, let’s first start with the installation of our game environment.

Note: The code for this entire series is available in the GitHub repository linked below.

Setting up the Google Football Environment

Google Football Environment released for RL research

I’m using the Google Football Environment for this tutorial, but you can use any game environment; just make sure it supports OpenAI’s Gym API in Python. Please note that the football environment only supports the Linux platform at the time of writing this tutorial.

Start by creating a virtual environment named footballenv and activating it.

>> virtualenv footballenv
>> source footballenv/bin/activate

Now install the system dependencies and Python packages required for this project. Make sure you select the correct CPU/GPU version of gfootball appropriate for your system.

>> sudo apt-get install git cmake build-essential libgl1-mesa-dev \
libsdl2-dev libsdl2-image-dev libsdl2-ttf-dev libsdl2-gfx-dev libboost-all-dev \
libdirectfb-dev libst-dev mesa-utils xvfb x11vnc libsqlite3-dev \
glee-dev libsdl-sge-dev python3-pip
>> pip3 install gfootball[tf_gpu]==1.0
>> pip3 install keras

Running the Football Environment

Now that we have the game installed, let’s test whether it runs correctly on your system.

A typical Reinforcement Learning setup works by having an AI agent interact with our environment. The agent observes the current state of our environment and, based on some policy, makes the decision to take a particular action. This action is then relayed back to the environment, which moves forward by one step. This generates a reward which indicates whether the action taken was positive or negative in the context of the game being played. Using this reward as feedback, the agent tries to figure out how to modify its existing policy in order to obtain better rewards in the future.

Typical RL agent

So now let’s go ahead and implement this for a random-action AI agent interacting with this football environment. Create a new Python file named train.py and execute the following using the virtual environment we created earlier.
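A minimal sketch of such a random-agent script, assuming the gfootball v1.0 Gym API and the scenario settings described below, could look like this:

import gfootball.env as football_env

# Create the environment for the empty-goal academy scenario,
# observing raw RGB pixels and rendering the game window.
env = football_env.create_environment(
    env_name='academy_empty_goal', representation='pixels', render=True)

state = env.reset()
done = False
while not done:
    # Take a random action from the environment's action space.
    action = env.action_space.sample()
    state, reward, done, info = env.step(action)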

This creates an environment object env for the academy_empty_goal scenario, where our player spawns at the half-line and has to score in an empty goal on the right side. representation='pixels' means that the state our agent observes is an RGB image of the frame rendered on the screen. If you see a player on your screen taking random actions in the game, congratulations, everything is set up correctly and we can start implementing the PPO algorithm!

Here are the same installation steps in video format if that’s more your thing.

Proximal Policy Optimization (PPO)

The PPO algorithm was introduced by the OpenAI team in 2017 and quickly became one of the most popular RL methods, usurping Deep Q-Learning. It involves collecting a small batch of experiences by interacting with the environment and using that batch to update its decision-making policy. Once the policy is updated with this batch, the experiences are thrown away and a newer batch is collected with the newly updated policy. This is the reason it is an “on-policy learning” approach, where the experience samples collected are only useful for updating the current policy once.

The key contribution of PPO is ensuring that a new update of the policy does not change it too much from the previous policy. This leads to less variance in training at the cost of some bias, but ensures smoother training and also makes sure the agent does not go down an unrecoverable path of taking senseless actions. So, let’s go ahead and break down our AI agent in more detail and see how it defines and updates its policy.
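As a preview of the custom loss we will build in Part 2, this constraint is exactly what the clipped surrogate objective from the original PPO paper expresses:

L^{CLIP}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}

Here \hat{A}_t is the advantage estimate (which we will compute with Generalized Advantage Estimation in Part 2) and \epsilon is a small constant, typically around 0.2, that caps how far the new policy may move away from the old one in a single update.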

The Actor-Critic Method

We’ll use the Actor-Critic approach for our PPO agent. It uses two models, both Deep Neural Nets, one called the Actor and the other called the Critic.

PPO Agent

The Actor model

The Actor model performs the task of learning what action to take under a particular observed state of the environment. In our case, it takes the RGB image of the game as input and gives a particular action like shoot or pass as output.

The Actor model

Let’s implement this first.
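Here is a minimal sketch of such an Actor network in Keras, reusing the env object created earlier; the helper name get_model_actor and the dense-layer size are illustrative choices, and the full MobileNet convolutional base stands in for the feature extractor:

from keras.applications.mobilenet import MobileNet
from keras.layers import Dense, Flatten, Input
from keras.models import Model
from keras.optimizers import Adam

state_dims = env.observation_space.shape  # shape of the observed RGB frame
n_actions = env.action_space.n            # number of discrete actions available

def get_model_actor(input_dims, output_dims):
    state_input = Input(shape=input_dims)

    # Pre-trained MobileNet used as a frozen feature extractor.
    feature_extractor = MobileNet(include_top=False, weights='imagenet')
    for layer in feature_extractor.layers:
        layer.trainable = False

    # Trainable classification head that outputs action probabilities.
    x = Flatten()(feature_extractor(state_input))
    x = Dense(1024, activation='relu')(x)
    out_actions = Dense(output_dims, activation='softmax')(x)

    model = Model(inputs=[state_input], outputs=[out_actions])
    # Mean-squared error is a placeholder loss; it becomes the custom PPO loss later.
    model.compile(optimizer=Adam(lr=1e-4), loss='mse')
    return model

actor_model = get_model_actor(state_dims, n_actions)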

Here, we are first defining the input shape state_input for our neural net, which is the shape of our RGB image. n_actions is the total number of actions available to us in this football environment and will be the total number of output nodes of the neural net.

I’m using the first few layers of a pretrained MobileNet CNN in order to process our input image. I’m also making these layers’ parameters non-trainable since we do not want to change their weights. Only the classification layers added on top of this feature extractor will be trained to predict the correct actions. Let’s combine these layers as a Keras Model and compile it using a mean-squared error loss (for now; this will be changed to a custom PPO loss later in this tutorial series).

The Critic model

We send the action predicted by the Actor to the football environment and observe what happens in the game. If something positive happens as a result of our action, like scoring a goal, then the environment sends back a positive response in the form of a reward. If an own goal occurs due to our action, then we get a negative reward. This reward is taken in by the Critic model.

The Critic model

The job of the Critic model is to learn to evaluate whether the action taken by the Actor led our environment to a better state or not, and to give its feedback to the Actor, hence its name. It outputs a real number indicating a rating (Q-value) of the action taken in the previous state. Using this rating obtained from the Critic, the Actor can compare its current policy with a new policy and decide how it wants to improve itself to take better actions.

Let’s implement the Critic.
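A minimal sketch of the Critic, mirroring the Actor above (again, the helper name get_model_critic and the layer sizes are illustrative):

def get_model_critic(input_dims):
    state_input = Input(shape=input_dims)

    # Same frozen MobileNet feature extractor as in the Actor.
    feature_extractor = MobileNet(include_top=False, weights='imagenet')
    for layer in feature_extractor.layers:
        layer.trainable = False

    x = Flatten()(feature_extractor(state_input))
    x = Dense(1024, activation='relu')(x)
    # Single real-valued output: the rating (Q-value) of the observed state.
    out_value = Dense(1, activation='tanh')(x)

    model = Model(inputs=[state_input], outputs=[out_value])
    model.compile(optimizer=Adam(lr=1e-4), loss='mse')
    return model

critic_model = get_model_critic(state_dims)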

As you can see, the structure of the Critic neural net is almost the same as the Actor’s. The only major difference is that the final layer of the Critic outputs a single real number. Hence, the activation used is tanh and not softmax, since we do not need a probability distribution here like we do with the Actor.

Now, an important step in the PPO algorithm is to run through this entire loop with the two models for a fixed number of steps, known as PPO steps. So essentially, we are interacting with our environment for a certain number of steps and collecting the states, actions, rewards, etc. which we will use for training.

Tying it all together

Now that we have our two models defined, we can use them to interact with the football environment for a fixed number of steps and collect our experiences. These experiences will be used to update the policies of our models once we have a large enough batch of such samples. Here is how to implement the loop that collects such sample experiences.
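A sketch of such a collection loop, reusing the env, actor_model, critic_model, and n_actions defined above (the variable names are illustrative), might look like this:

import numpy as np

ppo_steps = 128  # number of environment steps collected per training batch

# Containers for the batch of experiences.
states, actions, actions_onehot, actions_probs = [], [], [], []
values, masks, rewards = [], [], []

state = env.reset()

for itr in range(ppo_steps):
    state_input = np.expand_dims(state, axis=0)

    # Actor predicts a probability distribution over the available actions.
    action_dist = actor_model.predict(state_input)
    # Critic rates the current state with a single value.
    q_value = critic_model.predict(state_input)

    # Sample an action from the predicted distribution and one-hot encode it.
    action = np.random.choice(n_actions, p=action_dist[0])
    action_onehot = np.zeros(n_actions)
    action_onehot[action] = 1

    # Step the environment and record the transition.
    next_state, reward, done, info = env.step(action)

    states.append(state)
    actions.append(action)
    actions_onehot.append(action_onehot)
    actions_probs.append(action_dist)
    values.append(q_value)
    masks.append(not done)  # False (0) when the episode ended, True (1) otherwise
    rewards.append(reward)

    state = next_state
    if done:
        state = env.reset()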

As you can see in the code above, we have defined a few Python list objects that are used to store information like the observed states, actions, rewards, etc. while we interact with our environment for a total of ppo_steps. This gives us a batch of 128 sample experiences that will be used later on for training the Actor and Critic neural networks.

The following two videos explain this code line by line and also show what the end result looks like on the game screen.

To be continued…

That’s all for this part of the tutorial. We installed the Google Football Environment on our Linux system and implemented a basic framework to interact with this environment. Next, we defined the Actor and Critic models and used them to interact with and collect sample experiences from this game. I hope you were able to keep up so far; if you were held up by something, let me know down below in the comments and I’ll try to help.

Next time, we’ll see how to use the experiences we collected to train and improve the Actor and Critic models. We’ll go over the Generalized Advantage Estimation algorithm and use that to calculate a custom PPO loss for training these networks. So stick around!

EDIT: Here’s PART 2 of this tutorial series.

Thank you for reading. If you liked this article, you may follow more of my work on Medium, GitHub, or subscribe to my YouTube channel.

Note: This is a repost of the article originally published on Towards Data Science in 2019.
