Using Joint PPO with Ray

AurelianTactics
Aug 5, 2018
Multiple Sonic environments. From Gotta Learn Fast, the paper describing Joint PPO

Joint PPO is a modification of Proximal Policy Optimization (PPO) and was used by the winner of OpenAI’s Retro Contest. Joint PPO in a few lines:

During meta-training, we train a single policy to play every level in the training set. Specifically, we run 188 parallel workers, each of which is assigned a level from the training set. At every gradient step, all the workers average their gradients together, ensuring that the policy is trained evenly across the entire training set.

Joint PPO is a distributed PPO variant with the distinction that the workers don’t all train on the same environment. In the above paper, workers train on a variety of Sonic levels. Here’s a GitHub repo from someone who implemented Joint PPO with OpenAI’s baselines PPO algorithm and MPI (blog post write-up here).
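The core trick in the quote above is that every worker computes gradients on its own level and then all workers average those gradients before the shared policy is updated. As a toy sketch (not that repo’s actual code), an MPI all-reduce does exactly this:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def average_gradients(local_grad):
    # Each worker computes gradients on its own level; summing across all
    # workers and dividing by the worker count gives the averaged gradient
    # that keeps the single policy trained evenly across the training set.
    averaged = np.zeros_like(local_grad)
    comm.Allreduce(local_grad, averaged, op=MPI.SUM)
    return averaged / comm.Get_size()
```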

Since ray, a high-performance Reinforcement Learning (RL) library, has a PPO implementation and excels at distributed execution, adding Joint PPO is a natural extension of ray. I asked about using Joint PPO with ray, and within hours one of the ray developers was kind enough to explain how to implement it and provide instructions for those who want multi-environment setups with ray (see the ‘Configuring Environments’ section).

As readers of this blog would expect, I’ve created a script that lets you use Joint PPO with OpenAI’s Retro Gym (I promise I’ll write about other environments and algorithms soon). I modified OpenAI’s sonic-on-ray repo. The key part is here:
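In rough outline (the game id, helper names, and index arithmetic in this sketch are illustrative and may differ from the repo), the environment creator picks a save state based on which worker and which env copy is asking for it:

```python
import retro
from ray.tune.registry import register_env

# The five Bust A Move Challenge Play save states used for Joint PPO.
STATES = ['BustAMove.Challengeplay{}'.format(i) for i in range(5)]

def env_creator(env_config):
    # env_config is RLlib's EnvContext: worker_index identifies the rollout
    # worker and vector_index the env copy within that worker, so each
    # parallel environment can be handed a different save state.
    idx = (env_config.worker_index + env_config.vector_index) % len(STATES)
    return retro.make(game='BustAMove-Snes', state=STATES[idx])

register_env('bustamove_joint_ppo', env_creator)
```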

In ray, you can set the number of workers in the configuration (example: ‘num_workers’: 4). The MultiEnv class takes the config and, based on the worker index or the vector index (which comes from the “num_envs_per_worker” argument), assigns the desired environment to each worker. In this Bust A Move example, I assigned the workers one of the 5 Challenge Play states, ‘BustAMove.Challengeplay0’ through ‘BustAMove.Challengeplay4’.
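For a concrete picture, here’s a hedged sketch of how that configuration could be wired into a training run. The trainer class name depends on the ray version (older releases call it PPOAgent), and the config values are just examples:

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer  # named PPOAgent in older ray releases

ray.init()

config = {
    'num_workers': 4,          # parallel rollout workers
    'num_envs_per_worker': 1,  # raise this to vectorize several envs per worker
}

trainer = PPOTrainer(env='bustamove_joint_ppo', config=config)
for i in range(100):
    result = trainer.train()
    print(i, result['episode_reward_mean'])
```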

Challenge Play is a mode where you try to play as long as you can. However, the game randomizes the starting configuration of the level from the main menu selection rather than from loading a save state. Thus I can train an RL agent to perform okay on one specific state (100 bubbles popped), but the performance doesn’t generalize to other starting configurations. The agent overfits to that one state. The hope is that by using Joint PPO on a variety of starting states, a more robust policy can be learned.
