Days 11–14 of the OpenAI Retro Contest
Digging into the PPO2 baseline code.
It didn’t take long for my eyes to start glazing over when trying to work through the Rainbow DQN baseline implementation for the OpenAI Retro Contest. The PPO2 code seems a bit easier to understand, so it looks like a better stepping stone in my journey toward understanding AI retro gaming.
I felt like my line-by-line approach was part of what did me in on the Rainbow code, so for this baseline I am going to focus on the execution path, and less on things like import statements.
Background Reading
I started off by reading through the OpenAI paper on Proximal Policy Optimization Algorithms. The paper is pretty equation-heavy, with things like gradient estimators that seem more useful for theory than practice.
In fact, the equations seem to go on for the entire paper, so I set out to find something better to teach me what policy gradients were all about. Scholarpedia gave a definition of Policy Gradient Methods:
Policy gradient methods are a type of reinforcement learning techniques that rely upon optimizing parametrized policies with respect to the expected return (long-term cumulative reward) by gradient descent.
The first thing I had to look up in that definition was gradient descent, but that seems pretty easy to understand: a method for reaching optimal values by repeatedly stepping in the steepest downhill direction. To learn more about what a “policy” meant in this context I found these slides (and video) from David Silver’s course at University College London/Google. Policies seem to be generalized rules for obtaining rewards in a given system, and gradient descent is the strategy for finding the best policies/parameters most quickly.
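To make that concrete, here is a toy one-dimensional gradient descent (my own example, unrelated to the contest code) minimizing f(x) = (x - 3)^2:

```python
# A minimal sketch of gradient descent: minimize f(x) = (x - 3)^2.
# The gradient f'(x) = 2 * (x - 3) points uphill, so we step the other way.

def gradient_descent(x0, lr=0.1, steps=100):
    x = x0
    for _ in range(steps):
        grad = 2 * (x - 3)  # derivative of (x - 3)^2
        x -= lr * grad      # move against the gradient
    return x

print(gradient_descent(0.0))  # converges toward the minimum at x = 3
```

Policy gradient methods apply the same idea, except the thing being adjusted is the policy's parameters and the thing being climbed is the expected return.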
The Code
Now that I kind of know what the code is supposed to be doing, I am eager to see what it looks like, especially around policy discovery/creation. The main function in ppo2_agent.py was pretty concise:
"""Run PPO until the environment throws an exception."""
I am guessing that these exceptions the code can run into could take the form of a traditional code exception or error, some kind of maximum number of runs, or perhaps even creating a model that is deemed “successful enough”.

config = tf.ConfigProto()
This appears to configure TensorFlow to use the current system hardware, from skimming the definition, and I imagine that it is a common piece of boilerplate for most TensorFlow code.

config.gpu_options.allow_growth = True
Here the above config is overridden (or perhaps ensured to be set) to allow GPU memory to be allocated progressively over time instead of all at once at the beginning. I might have to configure these differently to run locally on my MacBook's CPU.

with tf.Session(config=config):
I wasn't that familiar with Python's `with` statement, but this handy answer on StackOverflow cleared things up for me. It seems like a really useful language feature. Here, the indented code runs after TensorFlow's `__enter__()` code has run, and when the indented code is finished, TensorFlow's `__exit__()` function runs. I think that code is here and mainly deals with establishing a session.
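To see the mechanics without TensorFlow, here is a toy context manager (the class name is mine) that records when `__enter__` and `__exit__` run relative to the indented block:

```python
# A toy context manager illustrating what the `with` block does for
# tf.Session: __enter__ runs before the indented code, __exit__ after.

events = []

class ToySession:
    def __enter__(self):
        events.append("enter")  # tf.Session would set itself up here
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        events.append("exit")   # tf.Session would release resources here
        return False            # don't swallow exceptions

with ToySession():
    events.append("body")       # the indented code runs in between

print(events)  # ['enter', 'body', 'exit']
```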
The rest of the code is configuration for one method, `ppo2.learn()`, and all of the different arguments it takes. From the import statements at the top, the source code for the function signature can be found here. I'll go through the arguments to try and understand what is going on.
policy=policies.CnnPolicy
This is another piece of code mentioned in the imports, from the same repo as the rest of the PPO2 implementation. Something to note about this object is its `step()` and `value()` functions. I imagine that they are necessary for the training, as the other policies incorporate them as well.

env=DummyVecEnv([make_env])
Still from the same baseline code. I couldn't find anything that explained what "VecEnv" might stand for in OpenAI's PPO paper, so it must be more general (but it surely seems like "vectorized environment"). Looking back at the definition for `ppo2.learn()`, I came to the conclusion that it must be an "environment" that contains an "observation space" and an "action space". Is this the same as the game environments that I have been using to watch and run the Sonic ROM? I took a peek at sonic_util.py, since that is where `make_env` comes from, along with the rest of the code for VecEnv, and discovered that it does indeed appear to be a "vectorized environment"; this dummy wrapper initializes most of the data as a bunch of zeros.

nsteps=4096
This code seems the most obvious yet! From this line, `nsteps` determines how many times the model runs its `step()` function, which appears to be how many times the policy's `step()` function (in our case `policies.CnnPolicy.step()`) gets run.

nminibatches=8
Reading into the code, there seems to be some interplay between the number of steps, batches, and environments created. `nminibatches` seems to primarily adjust the number, or maybe proportion, of training batches, and it must divide evenly into the number of environments multiplied by `nsteps`.
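The bookkeeping I think is going on can be sketched like this (the variable names are my guesses, not the exact baselines internals):

```python
# A sketch of the rollout bookkeeping: a vectorized env with nenvs copies
# collects nsteps samples each, and that batch must split evenly into
# nminibatches chunks for training.

nenvs = 1          # DummyVecEnv([make_env]) wraps a single environment
nsteps = 4096      # steps collected per environment before an update
nminibatches = 8   # how many chunks the batch is split into

nbatch = nenvs * nsteps
assert nbatch % nminibatches == 0, "nminibatches must divide nenvs * nsteps"
nbatch_train = nbatch // nminibatches
print(nbatch_train)  # 512 samples per minibatch
```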
lam=0.95
I started out thinking that this value was the learning rate, until I saw the `lr` parameter further down. The paper refers to it as the GAE parameter, but I think the best description I found was from Tom Breloff:
The hyperparameter γ allows us to control our trust in the value estimation, while the hyperparameter λ allows us to assign more credit to recent actions.
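Breloff's description can be made concrete with a rough sketch of generalized advantage estimation (GAE), showing where gamma and lam enter. This is my reading of the formula in the paper, with made-up rewards and values, not the baselines code:

```python
# Sketch of GAE: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), and the
# advantage A_t = delta_t + gamma * lam * A_{t+1}, computed backwards.

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # one-step TD error; gamma controls trust in the value estimate
        delta = rewards[t] + gamma * next_value - values[t]
        # lam controls how much credit flows back to earlier actions
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages

advs = gae([1.0, 0.0, 1.0], [0.5, 0.5, 0.5], last_value=0.5)
```

With lam close to 1 each advantage sums up many future TD errors; with lam at 0 it collapses to the one-step error.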
gamma=0.99
Tom's summary seems like a good explanation for gamma as well. I'll have to play around with these to see what impact they have.

noptepochs=3
Based on this line, I'm guessing that this is the number of epochs, where each epoch is some kind of larger training iteration. It doesn't seem to be as linked to other variables as the number of steps.

log_interval=1
Only found in one line, this looks like it controls how many updates to make before logging output.

ent_coef=0.01
Looking here, this appears to be a factor that is multiplied by the `entropy`, some value that TensorFlow spits out. I'm guessing that it impacts the amount of random influence on the game.

lr=lambda _: 2e-4
This is surely the learning rate. It's interesting that it is represented as a function instead of the value directly; perhaps that offers more flexibility for dynamic learning rates?

cliprange=lambda _: 0.1
Another function. The clip range puts limits on how far each update can change the policy, by clipping the ratio between the new and old policies.

total_timesteps=int(1e7)
This looks like the total number of timesteps the network trains for, and it is the dividend when grouping into batches.
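Putting the last few arguments together, here is a rough sketch (my own simplification, not the baselines code) of why `lr` and `cliprange` are functions and what the clipping does; my reading is that the real code calls them with a training-progress fraction, so a constant lambda and a decaying schedule are interchangeable:

```python
# Schedules as functions of a progress fraction (an assumption about how
# the baselines code calls them):
lr = lambda frac: 2e-4                  # constant, as in the agent
annealed_lr = lambda frac: 2e-4 * frac  # a hypothetical decaying schedule

def clipped_objective(ratio, advantage, cliprange=0.1):
    # PPO's clipped surrogate: keep the policy ratio within
    # [1 - cliprange, 1 + cliprange] so each update stays small.
    clipped = max(1.0 - cliprange, min(1.0 + cliprange, ratio))
    return min(ratio * advantage, clipped * advantage)

print(clipped_objective(1.5, 1.0))  # ratio clipped down to 1.1
```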
Well, that was quite the research into the function signature and the rest of the PPO2 agent code. I haven't been able to get the code up and running yet (I'm on some slow wifi and can't build my Docker containers). Next up for team Bobcats: I think we will try to get this agent up and running and then try out some hyperparameter tuning.
Thanks for reading! You might be interested in the rest of this series:
- Day 1: Getting the Basics Set Up
- Day 3: Running the Jerk Agent
- Days 4 & 5: Getting TensorFlow & Docker to work on my MacBook
- Day 6: Playback Tooling for .bk2 files
- Days 9 & 10: Failing with the Rainbow DQN baseline code.
- Days 11–14: Reading the PPO2 code
- Days 16–18: Running the PPO2 baseline code, and failing at TensorFlow & Docker optimization.
- Days 22–25: A Deep Dive into the Jerk Agent
- Days 26–29: Visualizing batches of sonic runs
- Days 38–53: Discovering Q-Learning
- My final submission: the improved JERK agent