RAdam: A New State-of-the-Art Optimizer for RL?

Chris Nota
Autonomous Learning Library
7 min read · Aug 19, 2019

The choice of optimizer is a somewhat under-studied topic in reinforcement learning. DeepMind reports using RMSprop in most papers (e.g., in the original DQN paper), whereas OpenAI seems to prefer Adam (e.g., the PPO paper). However, both Google's Dopamine library and OpenAI's Baselines repository seem to default to Adam for newer algorithms, suggesting that it is the de facto standard in RL. Adam was originally published in 2014, a lifetime ago in the current fast-paced era of machine learning. Is it time to move on?

Artist’s rendition of Adam being banished from state-of-the-art implementations.

Recently, there has been some buzz about a modification to Adam called “Rectified Adam” (RAdam), including claims that it represents a “new state of the art.” A number of comments on the linked article claim that it instantly improved accuracy on a wide range of test sets compared to vanilla Adam, on the basis that “[adaptive methods] suffer from a risk of converging into poor local optima — if a warm-up method is not implemented.” That’s all well and good, but how does RAdam stand up to Deep RL, AKA the demon that actively searches for the laziest possible local optima? Let’s find out!

Creating an RAdam Agent

The autonomous-learning-library (all) provides the pieces we need to build PyTorch-based RL agents, and the creators of RAdam were kind enough to provide a PyTorch implementation of their optimizer. Building a new all preset that uses RAdam under the hood will be a piece of cake!

Pictured: The difficulty of testing RAdam using the autonomous-learning-library.

First, we can simply copy radam.py into our local workspace. We won't need to change any of the existing classes from the autonomous-learning-library; we just need to make sure the library is installed in our local Python environment.
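As a quick sanity check that everything is wired up, we can construct the optimizer and take a single step with a throwaway model. This is just a smoke test, assuming radam.py sits in the working directory; the authors' RAdam class is intended as a drop-in replacement for torch.optim.Adam, so it follows the standard optimizer interface:

```python
import torch
from radam import RAdam  # the file copied from the RAdam authors' repository

# A throwaway model, just to verify the optimizer constructs and steps correctly.
model = torch.nn.Linear(4, 2)
optimizer = RAdam(model.parameters(), lr=1e-3)

loss = model(torch.randn(8, 4)).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```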

We’ll create our new preset in a file called a2c_radam.py. I chose to test RAdam using A2C because it is one of the simplest RL algorithms that still achieves good results on the Atari games we will test against. However, there are a number of details we need to get right in order to achieve good performance, so we’ll base our implementation on the atari.a2c preset provided by all. The resulting file is shown below:

Gist for A2C with RAdam preset

This code may seem a little intimidating if you are not familiar with the autonomous-learning-library, but all we are doing here is assembling existing components and configuring them to our liking. We imported the RAdam optimizer we just copied in, as well as several objects from all. We composed a FeatureNetwork, ValueNetwork, and SoftmaxPolicy from models imported from all, each trained with the RAdam optimizer. Finally, we returned an A2C agent wrapped in a DeepmindAtariBody, which gives the agent the same Atari preprocessing as the original DQN agent.
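In case the gist doesn't render, here is a condensed, illustrative sketch of the preset. The class names are the ones described above, but the module paths, constructor arguments, and placeholder models are assumptions made for illustration, not a verbatim copy of the gist; the real preset in all is the authoritative version.

```python
# Illustrative sketch of a2c_radam.py. Constructor arguments, module paths, and
# models are approximations, not the library's documented API; see the gist.
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from all.agents import A2C
from all.bodies import DeepmindAtariBody
from all.approximation import FeatureNetwork, ValueNetwork
from all.policies import SoftmaxPolicy
from radam import RAdam  # the optimizer we copied into our workspace


def a2c_radam(lr=7e-4, eps=1e-8, clip_grad=0.5, final_step=10_000_000):
    # An all preset is a function that builds an agent for a given environment.
    def _a2c_radam(env, writer=None):
        # Placeholder models; the real preset imports DQN-style convolutional
        # models from all.
        feature_model = torch.nn.Sequential(
            torch.nn.Flatten(), torch.nn.Linear(4 * 84 * 84, 512), torch.nn.ReLU()
        )
        value_model = torch.nn.Linear(512, 1)
        policy_model = torch.nn.Linear(512, 18)  # 18 = full Atari action set

        feature_optimizer = RAdam(feature_model.parameters(), lr=lr, eps=eps)
        value_optimizer = RAdam(value_model.parameters(), lr=lr, eps=eps)
        policy_optimizer = RAdam(policy_model.parameters(), lr=lr, eps=eps)

        features = FeatureNetwork(
            feature_model,
            feature_optimizer,
            clip_grad=clip_grad,
            scheduler=CosineAnnealingLR(feature_optimizer, final_step),
        )
        v = ValueNetwork(
            value_model,
            value_optimizer,
            loss_scaling=0.5,  # scale the value loss, as in the A3C paper
            clip_grad=clip_grad,
            scheduler=CosineAnnealingLR(value_optimizer, final_step),
        )
        policy = SoftmaxPolicy(
            policy_model,
            policy_optimizer,
            clip_grad=clip_grad,
            scheduler=CosineAnnealingLR(policy_optimizer, final_step),
        )

        # Wrap A2C with the standard DeepMind-style Atari preprocessing.
        return DeepmindAtariBody(A2C(features, v, policy))

    return _a2c_radam
```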

Notice that we’ve used a few cool features of all: we’ve defined a learning rate schedule, scaled the loss of the value function, and enabled gradient clipping. The paper which introduced A3C (the asynchronous version of A2C) used all three, so it’s important for us to include them.

Learning rate schedules are often overlooked, but they can greatly improve the stability of learning. I chose to use cosine annealing, as in my experience it performs slightly better than the linear schedule often found in RL papers.
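Cosine annealing is available out of the box in PyTorch's torch.optim.lr_scheduler, and it attaches to RAdam the same way it attaches to Adam. Here is a minimal sketch; the model and the T_max value are just placeholders, with T_max assumed to be one scheduler step per training update:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR
from radam import RAdam

model = torch.nn.Linear(4, 2)                   # stand-in model
optimizer = RAdam(model.parameters(), lr=7e-4)
# Smoothly anneal the learning rate from 7e-4 toward 0 over T_max scheduler steps.
scheduler = CosineAnnealingLR(optimizer, T_max=10_000)  # placeholder training length

for _ in range(3):
    loss = model(torch.randn(8, 4)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()                            # one scheduler step per update
```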

Running the Experiment

As mentioned above, we’re going to be testing against the Arcade Learning Environment (ALE), a standard benchmark suite for RL algorithms based on games from the Atari 2600. It is a good benchmark because the tasks are vision-based and provide a reasonable degree of difficulty while requiring a tractable amount of computing power. Nevertheless, testing against every game in the ALE is fairly expensive, so for small-scale experiments researchers often choose a subset of games.

All 55 games in the ALE. Picture credit to Aaron Defazio and Thore Graepel. Source.

For the most part, only one trial can be run per machine due to the high computational requirements of deep RL algorithms. As such, it’s highly beneficial to have access to a compute cluster. Fortunately, my university provides grad students with access to a powerful GPU cluster. The cluster is managed by a job scheduler called Slurm. Slurm usually requires some finagling with bash scripts, but the autonomous-learning-library provides a class called SlurmExperiment that handles everything under the hood. All we have to do is write a simple Python script that defines the agents and environments we want to run:

Script for creating and executing an experiment on a Slurm cluster.

Note some minor details of the experiment. First of all, we are only running the experiment for 40 million frames, compared to the original 200 million in the DQN paper. This is a fairly common choice: it is relatively quick to run while still allowing the agent to achieve good performance. However, in the call to SlurmExperiment we divide this number by four, because the frameskip strategy used by most papers causes each timestep to represent 4 frames. As such, 40 million frames = 10 million timesteps, which is the value we pass to the experiment. Papers and codebases are often quite unclear about which of the two they are referring to.

Second of all, I pulled a fast one on you and tweaked the a2c preset we defined above such that it accepts an optimizer parameter. This is so we can compare Adam and RAdam using the same preset, while guaranteeing that everything else is equal.
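For completeness, here is a sketch of what that script looks like. SlurmExperiment and the general shape follow the description above, but the exact argument names, the AtariEnvironment class usage, and the list of games (only Pong is confirmed in the text) are assumptions for illustration; the gist above is the real script.

```python
# Illustrative sketch of the experiment script. Argument names and the game list
# are assumptions, not the library's documented API; see the gist for the actual code.
from torch.optim import Adam
from all.environments import AtariEnvironment
from all.experiments import SlurmExperiment
from radam import RAdam
from a2c_radam import a2c_radam  # the preset above, tweaked to accept an optimizer

# Pong was one of the six games tested; the rest here are placeholder choices.
envs = [AtariEnvironment(name) for name in ('Pong', 'Breakout', 'SpaceInvaders')]

# 40 million frames at 4 frames per timestep = 10 million timesteps.
frames = 40_000_000
timesteps = frames // 4

agents = [
    a2c_radam(optimizer=Adam),   # baseline: vanilla Adam, default eps=1e-8
    a2c_radam(optimizer=RAdam),  # the rectified variant under test
]

# SlurmExperiment generates and submits the sbatch jobs under the hood.
SlurmExperiment(agents, envs, timesteps)
```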

The particularly astute reader may notice that the comparison is not 100% fair. Adam and other adaptive step-size methods accept a stability parameter, called eps in PyTorch, that increases the numerical stability of the method by ensuring the denominator of the update (the square root of the variance estimate) never falls below a certain level. By default, this value is set to 1e-8. However, in deep RL eps is often set to a much, much larger value. For example, in the original DQN paper it was set to 0.01 (see Extended Data Table 1), six orders of magnitude greater than the default. RAdam also accepts this parameter, with the same default.

I have not seen much discussion of this in the literature, but it is a somewhat irksome extra hyperparameter that is easy to forget. Part of the purpose of this experiment was to see whether RAdam removes the need for deep RL researchers to set eps manually, as RAdam allegedly improves the inherent stability of Adam. Therefore, I left eps at its default value for both optimizers to see if that helped.
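To make the knob concrete, here is the difference in PyTorch terms (RAdam exposes the same eps parameter with the same default):

```python
import torch

model = torch.nn.Linear(4, 2)

# PyTorch default: a tiny constant added to the denominator for numerical stability.
adam_default = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-8)

# DQN-style setting: six orders of magnitude larger. A large eps damps the adaptive
# step size whenever the second-moment estimate is small, e.g. under sparse rewards.
adam_dqn_style = torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-2)
```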

Results

After dumping the experiment on Slurm and waiting a few hours, the results are in! Is RAdam the breakthrough RL researchers have been waiting for? Ehhhh… not exactly. For 5 of the 6 environments we tested against, the performance of the two agents was nearly indistinguishable. To make sure, I reran the experiment and achieved nearly identical results. The results of the first trial are shown below:

Comparison of performance of RAdam and Adam on A2C over 1 trial. Each data point represents the average return over the previous 100 episodes, and the shaded region represents the sample standard deviation over the same data. Displaying these shaded regions stops reviewers from complaining that there are no error bars without actually adding error bars. (Protip: For 1 trial, there aren’t any.)

The one interesting result was that A2C with Adam completely failed to learn Pong in both trials. Pong is considered one of the easiest environments in the ALE, so this is somewhat surprising. However, the reward signal is also fairly sparse, which may present an issue for adaptive methods. This is especially true if we consider the positive reward dynamics: a random agent achieves an average return of around -20.7, and since an episode ends when the opponent reaches 21 points, that works out to only about 0.3 points scored per game. In other words, the agent sees a positive reward roughly once every 3 episodes.
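Here is the back-of-the-envelope arithmetic behind that claim, assuming standard first-to-21 Pong scoring with +1 for each point the agent wins and -1 for each point it concedes:

```python
# A Pong episode ends when one side reaches 21 points. A random agent essentially
# never wins, so its return is (points scored) - 21.
random_return = -20.7
points_per_episode = random_return + 21                   # ~0.3 positive rewards per episode
episodes_per_positive_reward = 1 / points_per_episode     # ~3.3 episodes
print(points_per_episode, episodes_per_positive_reward)
```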

A2C Adam experiment repeated with eps=1e-3 on Pong. Setting this parameter properly seems to be important for Adam, whereas RAdam works well with the default of 1e-8.

Since this particular failure replicated across both trials, it was worth exploring in more detail. A decent algorithm such as A2C should not fail at Pong, and the only major difference between our implementation and the paper was the choice of eps. I re-ran the Pong experiment with Adam and eps=1e-3. This time, it learned with no trouble.

Conclusions

It’s hard to say that RAdam made much of a difference in our experiment, but it certainly didn’t hurt. It’s nice that it spares us the need to set eps, but overall the gains seem marginal, and nonexistent more often than not. However, it’s worth keeping in mind that A2C itself was designed and tuned with RMSprop and Adam in mind, so it’s not surprising that Adam worked well. Perhaps we could do away with some of the finer details, such as gradient clipping, if we used RAdam instead of vanilla Adam. Overall, I’d recommend giving RAdam a shot! It won’t hurt!
