Using Ray for Reinforcement Learning

AurelianTactics
Aug 4, 2018

[Image: Population Based Training, from DeepMind]

I’ve been exploring Ray for Reinforcement Learning (RL) the past couple of weeks. Ray provides a ‘high-performance distributed execution engine’ and comes with RLlib (Scalable Reinforcement Learning) and Ray Tune (Hyperparameter Optimization Framework). RLlib includes RL algorithms such as Ape-X (DDPG and DQN), PPO, A3C, ES, and PG. You can train these algorithms on gym and Atari environments (see the examples in the package) or, with some modifications, train your own custom environment (like Retro Gym).
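To give a feel for what that looks like in code, here is a minimal sketch of training RLlib’s PPO on a gym environment through Ray Tune. The exact API has shifted across Ray versions, and the environment and stopping condition here are just placeholder choices:

```python
import ray
from ray import tune

ray.init()

# Train RLlib's PPO on a standard gym environment for a few iterations.
# Results are written to ~/ray_results by default.
tune.run(
    "PPO",
    stop={"training_iteration": 10},
    config={
        "env": "CartPole-v0",
        "num_workers": 2,
    },
)
```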

I did some simple training runs to get a feel for Ray. A few things of note:

  • While ‘pip install ray’ works, I have been using the git clone installation. Since the package is frequently updated, I had a couple of instances where the pip-installed version clashed with the example scripts.
  • Ray has a Jupyter notebook that shows CPU and cluster usage for your algorithm.
  • Ray outputs results to TensorBoard. Type ‘tensorboard --logdir=~/ray_results’ to visualize your results. Hyperparameter tuning comes with more visualization options.
  • The train.py script lets you train algorithms with customization options like the number of workers, CPUs and GPUs per worker, and a variety of algorithm-specific arguments. After training (with the save-to-checkpoint option selected), the rollout.py script lets you load the trained model and watch it render as it performs (see the sketch after this list).
  • Ray can also be used for more general Machine Learning. Some of the examples show hyperparameter tuning on classic ML problems like MNIST and CIFAR-10.
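Rather than the CLI scripts, roughly the same options can be set through Tune in Python. A sketch, assuming the tune.run-style API; the worker, GPU, and checkpoint settings are arbitrary examples:

```python
import ray
from ray import tune

ray.init()

# Roughly what the train.py script does: train PPO, saving checkpoints
# so a trained model can be loaded and rendered later (e.g. via rollout.py).
tune.run(
    "PPO",
    stop={"training_iteration": 50},
    checkpoint_freq=5,        # save a checkpoint every 5 iterations
    checkpoint_at_end=True,
    config={
        "env": "CartPole-v0",
        "num_workers": 4,     # parallel rollout workers
        "num_gpus": 0,        # GPUs for the learner
    },
)

# Then inspect the run with: tensorboard --logdir=~/ray_results
```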

What I was most interested in using Ray for was hyperparameter tuning. A necessary yet dull and time-consuming process is made much easier with Ray (once I figured out the tuning algorithms and what their arguments did). I ran some simple hyperparameter tuning experiments to find hyperparameters for a Retro Gym game called Bust A Move. Unfortunately, I had some issues getting good results on the game: I could only get to level 8 out of 99. Still, I think the scripts I created below can be modified and put to use for other RL attempts.

Grid Search

Example scripts: Sonic, Bust A Move, MNIST

In grid search you list every option for each hyperparameter that you want to test. When the experiment runs, every possible combination of those lists is tried. Grid search is slow, and the number of combinations grows rapidly. Based on lectures I’ve seen (the Stanford vision class and the Berkeley Deep RL class) and what I’ve read (like the Deep Learning book), it is generally regarded as inferior to random search. I made a grid search script but didn’t actually run the full experiment, as it would have taken too long. In grid search’s defense, it is simple to understand and implement. I also recall a Q&A where an experienced Deep RL lecturer, asked which hyperparameter search he uses, said grid search: it helps him build an intuition for the problem and better understand the results.
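With Ray Tune, a grid over a couple of PPO hyperparameters might look something like the sketch below (the parameter names and values are only illustrative; the 3 x 3 grid already means 9 trials):

```python
import ray
from ray import tune

ray.init()

# Every combination of these lists is run: 3 * 3 = 9 trials.
tune.run(
    "PPO",
    stop={"training_iteration": 50},
    config={
        "env": "CartPole-v0",
        "lr": tune.grid_search([1e-3, 1e-4, 1e-5]),
        "train_batch_size": tune.grid_search([1000, 2000, 4000]),
    },
)
```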

Random Search

Example scripts: Bust A Move

Basically, replace the grid search lists with a distribution that each hyperparameter is drawn from. Example: [0.0, 0.1, …, 1.0] replaced with random.uniform(0.0, 1.0). Random search is often used in combination with more advanced algorithms like Hyperband and Population Based Training (PBT). I preferred it over grid search in this case because I could set the hyperparameter ranges to the mins and maxes I wanted and still have the experiment finish in a reasonable time. With grid search, I had over 600 combinations I wanted to try, which would have taken far too long. With random search, I instead set the number of trials and the length of each trial to fit the time I had budgeted.
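In Tune terms, that swap looks roughly like this sketch (num_samples controls how many random configurations are drawn; the ranges are placeholders):

```python
import random

import ray
from ray import tune

ray.init()

# Instead of fixed lists, each trial samples its hyperparameters from a
# distribution; num_samples sets how many trials to draw.
tune.run(
    "PPO",
    num_samples=20,
    stop={"training_iteration": 50},
    config={
        "env": "CartPole-v0",
        "lr": tune.sample_from(lambda spec: random.uniform(1e-5, 1e-3)),
        "clip_param": tune.sample_from(lambda spec: random.uniform(0.1, 0.4)),
    },
)
```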

Hyperband Tuning

Examples: the paper, MNIST, Bust A Move

Runs a series of trials and, based on the results, halts poorly performing trials while letting the well-performing ones continue. The authors frame the algorithm as a multi-armed bandit exploration problem. Users tune a few hyperparameters of the algorithm itself, which decide how much of the resource budget to devote to each trial and how aggressively to cut short poor performers.
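Tune exposes Hyperband as a trial scheduler. A sketch, with placeholder metric and budget settings (older Ray releases spell some of these arguments differently):

```python
import random

import ray
from ray import tune
from ray.tune.schedulers import HyperBandScheduler

ray.init()

# HyperBand allocates training iterations across trials and stops the poor
# performers early, judged here by mean episode reward.
hyperband = HyperBandScheduler(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    max_t=100,  # most iterations any single trial can receive
)

tune.run(
    "PPO",
    num_samples=20,
    scheduler=hyperband,
    config={
        "env": "CartPole-v0",
        "lr": tune.sample_from(lambda spec: random.uniform(1e-5, 1e-3)),
    },
)
```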

Population Based Training

Examples: blog, the paper, Humanoid (PPO)

Combines the parallelized training of random search with hand-tuning-style optimization to efficiently find hyperparameters and hyperparameter schedules. It does so by sharing results between models as training progresses and by using exploit-and-explore techniques: well-performing models are copied (exploit) and their hyperparameters perturbed (explore).
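Tune ships a PBT scheduler as well. A sketch, with placeholder perturbation interval and mutation ranges:

```python
import random

import ray
from ray import tune
from ray.tune.schedulers import PopulationBasedTraining

ray.init()

# PBT periodically copies weights from well-performing trials into poor ones
# (exploit) and perturbs their hyperparameters (explore), so each trial
# effectively follows a hyperparameter schedule.
pbt = PopulationBasedTraining(
    time_attr="training_iteration",
    metric="episode_reward_mean",
    mode="max",
    perturbation_interval=10,
    hyperparam_mutations={
        "lr": lambda: random.uniform(1e-5, 1e-3),
        "clip_param": lambda: random.uniform(0.1, 0.4),
    },
)

tune.run(
    "PPO",
    num_samples=8,   # population size
    scheduler=pbt,
    config={
        "env": "CartPole-v0",
        "lr": 1e-4,
        "clip_param": 0.2,
    },
)
```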
