Evolution Strategies as a Scalable Alternative to Reinforcement Learning — Paper Summary

Benjamin
4 min read · May 26, 2018


“Flowers in a field near a small mountain pond area.” by Jonas Verstuyft on Unsplash

In 2017 OpenAI published a paper on Evolution Strategies as a Scalable Alternative to Reinforcement Learning. One of the paper's researchers, Ilya Sutskever, gave this talk on the work OpenAI was doing with Evolution Strategies (ES), and OpenAI itself produced a blog post on the paper. This is a personal summary of the research.

The paper details the performance of the ES algorithm when used to train a deep neural network on Reinforcement Learning problems and compares those results to the more conventional Reinforcement Learning approach of training with back-propagation. It suggests that for some problems specific to Reinforcement Learning, ES has several advantages. Ilya stresses in his talk that the advantages ES has in the Reinforcement Learning domain do not carry over to supervised problems, where gradient descent via back-prop already works well.

Parameter Space and Action Space

Before getting into ES, let’s talk about parameter space and action space. Neural networks are usually trained using back-propagation to update the weights. When applied to Reinforcement Learning, the neural network explores different actions in the environment, and the rewards those actions return are used to update the weights of the network. This can be considered searching in the action space. ES instead searches directly for the best weights of the network, treating the environment as a black box that returns a single score for the performance of any particular set of parameters. This is known as searching the parameter space.
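
To make the black-box view concrete, here is a minimal sketch of such a fitness function (my own illustration, not code from the paper; it assumes the classic Gym-style reset/step interface and a simple linear policy):

```python
import numpy as np

def evaluate(params, env, n_steps=1000):
    """Black-box fitness: run one episode with a linear policy defined by
    `params` and return the total reward. ES only ever sees this scalar;
    it never looks at per-action gradients inside the environment.
    (Assumes the classic Gym interface: reset() -> obs,
    step(action) -> (obs, reward, done, info).)"""
    act_dim = env.action_space.shape[0]
    obs_dim = env.observation_space.shape[0]
    W = np.asarray(params).reshape(act_dim, obs_dim)

    obs = env.reset()
    total_reward = 0.0
    for _ in range(n_steps):
        action = W @ obs                      # deterministic linear policy
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```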

What is Evolution Strategies?

ES is somewhat unfortunately named, as I see it more as a variation on Random Search than as anything genuinely related to the principles of evolution. Before describing ES, I think it’s helpful to understand Random Search. Consider the problem of finding weights for a neural network in a Reinforcement Learning task, where the network is given a score based on how well it performs in a particular environment. Random Search samples points (candidate weights) in a hyper-sphere around the origin. The ‘best’ point is selected, a hyper-sphere around that best point is sampled, the next best point is found, and the cycle repeats until some stopping criterion is met. ES is similar to Random Search, but instead of the new ‘best’ point being the single best-performing sample, the new ‘best’ is calculated by taking the differences between the previous ‘best’ and the sampled points, weighting them by their ‘fitness’ (the value returned from the environment), and averaging. Pseudo code is below. In this way, the new ‘best’ moves in the direction that maximizes performance in the environment. This can be viewed as following a finite-difference estimate of the gradient, even though no gradient is ever calculated analytically.
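
A minimal version of this update, sketched in Python (my own illustration rather than the paper's distributed implementation; the hyper-parameter values and the score normalisation are assumptions):

```python
import numpy as np

def evolution_strategies(fitness, theta0, sigma=0.1, alpha=0.01,
                         population=50, iterations=200, seed=0):
    """`fitness` is any black-box function mapping a parameter vector to a
    scalar score; the environment never exposes a gradient."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        # Sample a population of Gaussian perturbations around the current point.
        eps = rng.standard_normal((population, theta.size))
        scores = np.array([fitness(theta + sigma * e) for e in eps])
        # Normalise scores so the step size is insensitive to the reward scale.
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)
        # Move theta along the fitness-weighted average of the perturbations:
        # a finite-difference estimate of the direction that improves fitness.
        theta = theta + alpha / (population * sigma) * (eps.T @ scores)
    return theta
```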

The paper describes the mathematics of why the ES algorithm works, but I personally found it hard to follow because it leaves out a lot of details. David Barber has written a blog entry that clearly goes through all the steps and is worth reading if you would like to understand the maths behind the algorithm.

What’s so good about ES?

ES has several advantages compared to training by back-propagation.

  • It saves computation because you don’t have to calculate the gradient.
  • It’s less susceptible to confusion on tasks involving long time horizons or delayed rewards, which are a problem for Reinforcement Learning with back-prop.
  • It’s easy to parallelize and can therefore be much faster.

What’s not so good about ES?

  • It needs more data for similar results: from the experimental results, on the order of 3 to 10 times as much data for some Atari games.

Parallelization

It is possible to distribute the individual networks in the ES population across different CPU cores so that each network can run an episode in the environment at the same time. Before the algorithm starts, the random seeds used to create the initial population and the perturbations (mutations) are fixed. During training, every CPU worker knows these seeds, so workers only need to communicate their fitness score, a single scalar, to one another in order to update their parameters.
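
A rough sketch of this shared-seed trick (again my own illustration; the seed-derivation scheme and hyper-parameters are hypothetical, not taken from the paper):

```python
import numpy as np

def worker_update(theta, n_workers, all_scores, iteration,
                  sigma=0.1, alpha=0.01, base_seed=123):
    """Every worker holds the same seed table, so after exchanging only the
    scalar fitness scores it can regenerate every other worker's perturbation
    locally and apply an identical update to its own copy of theta."""
    scores = (all_scores - all_scores.mean()) / (all_scores.std() + 1e-8)
    update = np.zeros_like(theta)
    for i in range(n_workers):
        # Regenerate worker i's perturbation from the shared seeds;
        # no parameter vectors ever cross the network.
        rng = np.random.default_rng(base_seed + i * 100003 + iteration)
        eps_i = rng.standard_normal(theta.size)
        update += scores[i] * eps_i
    return theta + alpha / (n_workers * sigma) * update
```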

Experiments

The researchers applied ES to MuJoCo tasks (continuous robotic control, e.g. walking, hopping, swimming) and several Atari games. ES was able to solve the MuJoCo tasks and performed competitively on the Atari games. Matching the network architecture and amount of computation used in the paper Asynchronous Methods for Deep Reinforcement Learning, ES performed better on 23 games and worse on 28. Thanks to parallelization, it was possible to solve the 3D Humanoid walking task in just ten minutes using 1,440 CPU cores.

Conclusion

This paper shows that ES can be successfully applied to training deep neural networks for Reinforcement Learning problems. It has several benefits in this domain, including scaling well across parallel workers and not needing to calculate a gradient, which can compensate for its need for more data. It is exciting to see work demonstrating that algorithms that search the parameter space can be effective in Reinforcement Learning.
