Understanding Proximal Policy Optimization (PPO) and its implementation on Mario Game Environment

Explained the concept and idea behind Proximal Policy Optimization and its implementation on Mario gym environment

Published in

Analytics Vidhya

5 min readAug 12, 2019

Reinforcement learning is basically divided into two categories namely Policy gradients and Value Functions, each having their own pros and cons. In this post, we shall talk about the state of the art policy optimization technique namely PPO or Proximal Policy Optimization.

A quote from OpenAI on PPO:

Proximal Policy Optimization (PPO), which perform comparably or better than state-of-the-art approaches while being much simpler to implement and tune.

Before diving into the details of PPO we need to understand a few things and among those things is the concept of surrogate function which will help us understand the motive behind using PPO.

Surrogate Function:

Surrogate function can be termed as an approximate form of the gradient which is more like the gradient of a new object.

We use this innovative gradient so that we can perform gradient ascent to update our policy and which can be considered as directly maximizing the surrogate function.

Surrogate function helps achieve an optimal policy (Image from Udacity Deep Reinforcement Learning nanodegree)

But using the Surrogate function still leaves us with a problem that if we keep reusing past trajectories and simultaneously keep updating our policy we will see that at some point the new policy might sway away enough from the old one such that all the previous approximations we made with the surrogate function could just become invalid. This is where the real advantage of using PPO comes into place.

Clipping the Surrogate Function:

This is where the three words:

Policy( the optimal policy to be achieved)
Proximal( clipping of the surrogate function)
Optimization( use of the surrogate function) come into the light and their actual meaning which leads to the naming of the algorithm.

Clipping of the surrogate function(Image from Udacity Deep Reinforcement Learning nanodegree)

Clipping of the surrogate function is flattening it so that it makes convergence to the optimal policy easier and convenient. Under this clipping when we start applying gradient ascent to our current policy, the update remains the same as it happens in a normal surrogate function but the update stops when we hit the plateau. Promptly, because the reward function is flat and the gradient is zero which directly imply that the policy update will stop and our optimal policy will be achieved.

After understanding that such a complex reinforcement learning algorithm could be understood so easily, mind definitely equals blown.

We also have the code implementation on the Mario environment, so keep steady and be focused.

Installing and Running Mario Environment

The following command will help you install the Super Mario bros environment-

pip install gym-super-mario-bros

This snippet will help you render the env and let you play with it and get used to the action and state-space-

For more details on the environment refer to this.

Coding PPO for Super Mario Bros

For convenience, we will be using the baseline given by OpenAI since they have a huge collection of RL algorithms and keep updating their GitHub repositories.

Note on pip usage- pip for Python 2 and pip3 for Python3

We will first download the required packaged and then the Baselines repository for RL codes-

sudo apt-get install zlib1g-dev libopenmpi-dev ffmpeg
sudo apt-get update

pip3 install opencv-python cmake anyrl gym-retro joblib atari-py git clone https://github.com/openai/baselines.git

Install Tensorflow ( CPU or GPU as per your requirements)

pip install tensorflow-gpu # for GPUpip install tensorflow # for CPU

Finally, install the baseline packages-

cd baselinespip3 install -e .

The syntax code for using the RL codes given in the baselines is always like this-

python -m baselines.run --alg=<name of the algorithm> --env=<environment_id> [additional arguments]

For example, if we have wanted to train a fully-connected network controlling MuJoCo humanoid using PPO2 for 20M timesteps we shall write it as follows-

python -m baselines.run --alg=ppo2 --env=Humanoid-v2 --network=mlp --num_timesteps=2e7

After this make sure your gym-retro and atari-py have been successful. For more information on these refer to RETRO and ATARI.

Note — For running the Mario directly with baseline codes as a gym environment please do so:
In order to import ROMS, you need to download Roms.rar from the Atari 2600 VCS ROM Collection and extract the .rar file. Once you've done that, run:
python -m atari_py.import_roms <path to folder>
This should print out the names of ROMs as it imports them. The ROMs will be copied to your atari_py installation directory.

When you almost are done with the installations and suddenly some error comes up.

Now to begin the training we use the following command-

python3 -m baselines.run --alg=ppo2 --env=SuperMarioBros-Nes 
--gamestate=Level3-1.state --num_timesteps=1e7

For saving the model during the training add the following parameter at the end and the same goes for loading the model after training-

--save_path=./PATH_TO_MODEL--load_path=./PATH_TO_MODEL

There you go you can begin training you Mario to rescue the princess.

Understanding Proximal Policy Optimization (PPO) and its implementation on Mario Game Environment

Explained the concept and idea behind Proximal Policy Optimization and its implementation on Mario gym environment

Installing and Running Mario Environment

Coding PPO for Super Mario Bros

Written by Ujwal Tewari