Reinforcement Learning or Evolutionary Strategies? Nature has a solution: Both.
A few weeks ago OpenAI made a splash in the Deep Learning community with the release of their paper “Evolution Strategies as a Scalable Alternative to Reinforcement Learning.” The work contains impressive results suggesting that looking beyond Reinforcement Learning (RL) methods may be worthwhile when training complex neural networks. It sparked a debate around the importance of Reinforcement Learning, and perhaps its less-than-essential status as the go-to technique for learning to solve tasks. What I want to argue here is that instead of being seen as two competing strategies, one necessarily better than the other, they are ultimately complementary. Indeed, if we think a little forward to the goal of Artificial General Intelligence (AGI), and to systems that can truly perform lifelong learning, reasoning, and planning, what we find is that a combined solution is almost certainly going to be necessary. And indeed, it is just this solution that nature arrived at for endowing mammals and other complex animal life with intelligence.
The basic premise of the OpenAI paper was that instead of using Reinforcement Learning coupled with traditional gradient backpropagation, they successfully trained neural networks to perform difficult tasks using what they call Evolution Strategies (ES). This ES approach consists of maintaining a distribution over network weight values and having a large number of agents act in parallel using parameters sampled from that distribution. Each agent acts in its own environment, and once it finishes a set number of episodes, or steps of an episode, its cumulative reward is returned to the algorithm as a fitness score. With these scores, the parameter distribution can be moved toward that of the more successful agents and away from that of the unsuccessful ones. By repeating this process millions of times, with hundreds of agents, the weight distribution moves to a space that provides the agents with a good policy for solving the task at hand. Indeed, the most impressive result in the paper shows that with a thousand workers in parallel, humanoid walking can be learned in under half an hour (something that takes even the best traditional RL methods hours to solve). For more insight, I suggest reading their great blog post, as well as the paper itself.
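The loop described above can be sketched in a few lines. Here is a minimal, toy version of an ES update (the function names, hyperparameter values, and toy fitness function are my own illustrative choices, not taken from the paper):

```python
import numpy as np

np.random.seed(0)

def evolution_strategy(fitness, theta, npop=50, sigma=0.1, alpha=0.01, iterations=300):
    """Minimal ES loop: sample a population of parameter perturbations,
    score each with the fitness function, and shift the parameter mean
    toward the perturbations that scored best."""
    for _ in range(iterations):
        noise = np.random.randn(npop, theta.size)       # one perturbation per "worker"
        rewards = np.array([fitness(theta + sigma * n) for n in noise])
        advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
        # Move the mean along the fitness-weighted sum of perturbations
        theta = theta + alpha / (npop * sigma) * noise.T @ advantages
    return theta

# Toy fitness: negative squared distance to a hidden target vector
target = np.array([0.5, -0.3, 1.2])
solution = evolution_strategy(lambda w: -np.sum((w - target) ** 2),
                              theta=np.zeros(3))
```

Each row of `noise` plays the role of one parallel worker, and the only information that flows back to the update is the scalar fitness score per worker — which is exactly what makes the approach so cheap to distribute.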
The Black Box
The great benefit of this approach is that it is parallelizable with little effort. Whereas RL methods such as A3C need to communicate gradients back and forth between workers and a parameter server, ES only requires fitness scores and high-level parameter distribution information to be communicated. It is this simplicity that allows the technique to scale up in ways current RL methods cannot. All of this comes at a cost, however: the cost of treating the network being optimized as a black box. By black box, I mean that the inner workings of the network are ignored during training, and only the overall outcome (episode-level reward) is used to determine whether to propagate the specific network weights to further generations. In situations where we don’t get a strong feedback signal from the environment, such as many traditionally posed RL problems with only sparse rewards, turning the problem from a “mostly black box” into an “entirely black box” is worth the performance improvements that can come along with it. “Who needs gradients when they are hopelessly noisy anyway?” So the thinking goes.
In situations with richer feedback signals, however, things don’t go so well for ES. OpenAI describes training a simple MNIST classification network using ES and finding it to be “1000 times slower.” This comes from the fact that the gradient signal in image classification is extremely informative about how to improve the network’s classifications. The problem then has less to do with Reinforcement Learning itself, and more to do with sparse environmental rewards providing noisy gradient estimates.
Appealing to nature for inspiration in AI can sometimes be seen as a problematic approach. Nature, after all, is working under constraints that computer scientists simply don’t have. It is often believed that a purely theoretical approach to a given problem can provide more efficient solutions than empirical ones. Despite this, I think it can still be worthwhile to examine how a dynamic system under certain constraints (earth) can arrive at agents (animals, specifically mammals) with flexible and complex behavior. While some of those constraints don’t hold in simulated worlds of data, many of them still do.
If we look at intelligent behavior in mammals, we find that it comes from a complex interplay of two ultimately intertwined processes: inter-life learning and intra-life learning. The first is what is typically thought of as evolution via natural selection, but I use a broader term to include things like epigenetics and microbiomes, which are passed between animals without being part of their genetic material per se. The second, intra-life learning, is all of the learning that takes place during the lifetime of an animal, conditioned explicitly on that animal’s interaction with the world. This is also referred to as experience-dependent learning. The category includes everything from learning to recognize objects visually, to learning to communicate with language, to learning that Napoleon was crowned on Sunday, December 2, 1804.
Roughly speaking, these two processes in nature can be compared to the two approaches in neural network optimization. Evolution Strategies, in which no gradient information is used to update the organism, correspond to inter-life learning. Likewise, gradient-based methods, in which specific experiences change the agent in specific ways, can be compared to intra-life learning. If we think about the kinds of intelligent behaviors or capacities that each of these two approaches enables in animals, the comparison becomes more intelligible. In nature and in networks alike, evolutionary methods enable the learning of reactive behaviors that achieve a certain level of fitness (enough to stay alive). Learning to walk or to play Breakout is in many ways equivalent to the more “instinctual” behaviors that come genetically hard-wired in many animals. Evolutionary methods also allow for dealing with extremely sparse reward signals, such as the successful rearing of offspring. In a case like that, it is impossible to assign credit for the success to any specific set of actions that may have taken place years earlier. On the other hand, if we look at the failure case of ES, image classification, we find something remarkably comparable to the kind of learning animals have demonstrated in countless behavioral psychology experiments over the past hundred-plus years.
Learning In Animals
The techniques employed in Reinforcement Learning are in many ways directly inspired by the psychological literature on operant conditioning that came out of animal psychology. (In fact, Richard Sutton, one of the two founders of Reinforcement Learning, received his Bachelor’s degree in Psychology.) In operant conditioning, animals learn to associate rewarding or punishing outcomes with specific behavior patterns. Animal trainers and researchers can manipulate this reward association in order to get animals to demonstrate their intelligence or behave in certain ways. Operant conditioning as applied in animal research, however, is nothing more than an explicit form of the conditioning that guides learning for all mammals throughout their lifetimes. We are constantly receiving reward information from the environment and adjusting our behaviors accordingly. Indeed, many neuroscientists and cognitive scientists believe that humans and other mammals go a step further and constantly learn to predict the effects of their behaviors on future rewards and situations.
The central role of prediction in intra-life learning changes the dynamics quite a bit. What was before a somewhat sparse signal (occasional reward) becomes an extremely dense one. The theory goes something like this: at each moment, mammalian brains are predicting the results of the complex flux of sensory stimuli and actions in which the animal is immersed. The outcome of the animal’s behavior then provides a dense signal for adjusting those predictions and the behavior going forward. All of these signals are put to use in the brain to improve predictions (and consequently the quality of actions). For an overview of this approach, see the excellent “Surfing Uncertainty” by cognitive scientist and philosopher Andy Clark. If we apply this way of thinking to learning in artificial agents, we find that RL isn’t somehow fundamentally flawed; rather, the signal being used isn’t nearly as rich as it could (or should) be. In cases where the signal can’t be made richer (perhaps because it is inherently sparse, or concerns low-level reactivity), learning through a highly parallelizable method such as ES is likely the better fit.
Richer Learning In Neural Networks
Taking insight from the constantly predicting neural systems in mammalian brains, there have been some advances in Reinforcement Learning in the past year which incorporate the role of prediction. Two in particular come to mind: “Learning to Act by Predicting the Future” and “Reinforcement Learning with Unsupervised Auxiliary Tasks.”
In both of these papers the authors augment their network’s typical policy outputs with predictions about the future state of the environment. In the case of “Learning to Act,” the predictions concern a set of measurement variables, and in the case of “Unsupervised Auxiliary Tasks,” they concern changes in the environment and the agent itself. In both cases the sparse reward signal becomes a much richer, more informative signal, enabling both quicker learning and the learning of more complex behaviors. These kinds of enhancements are only available to methods that utilize a gradient signal, as opposed to black-box methods like ES.
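As a schematic illustration of this idea (every layer size and weight name below is invented for the example, not drawn from either paper), imagine a shared trunk feeding both a policy head and an auxiliary head that predicts the next observation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: observation -> shared features -> policy + auxiliary heads
obs_dim, feat_dim, n_actions = 8, 16, 4
W_shared = rng.normal(scale=0.1, size=(obs_dim, feat_dim))
W_policy = rng.normal(scale=0.1, size=(feat_dim, n_actions))
W_aux = rng.normal(scale=0.1, size=(feat_dim, obs_dim))   # predicts the next observation

def forward(obs):
    features = np.tanh(obs @ W_shared)      # shared trunk
    logits = features @ W_policy            # policy head: action preferences
    predicted_next_obs = features @ W_aux   # auxiliary head: what happens next?
    return logits, predicted_next_obs

obs, next_obs = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
logits, pred = forward(obs)

# The auxiliary loss is dense: every single transition yields a full
# prediction error, even when the environment's reward is zero.
aux_loss = np.mean((pred - next_obs) ** 2)
```

The point of the sketch is the last line: gradients of `aux_loss` flow back through the shared trunk on every step, shaping the same features the policy head uses, whether or not a reward arrived.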
Intra-life learning and gradient-based methods are also much more data-efficient. Even in the cases where the ES method learned a task more quickly in wall-clock time than the RL method, it did so at the cost of consuming many times more data. When thinking about animal-level learning, inter-life learning requires generations to produce changes, whereas a single event can change the behavior of an animal when it takes place through intra-life learning. While this kind of one-shot learning is not yet entirely within the grasp of traditional gradient-based methods, it is much closer than with ES. For example, approaches like Neural Episodic Control, which stores Q-values during learning and queries them when taking actions, allow a gradient-based method to learn to solve tasks much more quickly than before. In that paper the authors point to the human hippocampus, which is able to store events after even a single exposure, and consequently plays a critical role in their later recall. Such mechanisms require access to the internals of the agent, and as such are also impossible for ES.
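A toy, table-based caricature of the episodic idea might look as follows. To be clear, Neural Episodic Control itself uses learned embeddings with nearest-neighbour lookup over a differentiable neural dictionary; the class below (names entirely my own) only shows how a single stored experience can immediately change behavior:

```python
from collections import defaultdict

class EpisodicMemory:
    """Toy episodic Q-store: remember the best return seen for each
    state-action pair, and query the table greedily when acting."""

    def __init__(self):
        self.q = defaultdict(dict)  # state_key -> {action: best return so far}

    def write(self, state_key, action, episodic_return):
        # Keep the highest return ever observed for this state-action pair
        best = self.q[state_key].get(action, float("-inf"))
        self.q[state_key][action] = max(best, episodic_return)

    def act(self, state_key):
        values = self.q[state_key]
        if not values:   # unseen state: fall back to a default action
            return 0
        return max(values, key=values.get)

memory = EpisodicMemory()
memory.write("start", action=1, episodic_return=10.0)  # a single rewarding experience...
action = memory.act("start")                           # ...immediately shapes behavior
```

One good experience at `"start"` is enough to change the chosen action on the very next visit — the one-shot quality the paper attributes to the hippocampus, and exactly the kind of internal machinery a black-box method cannot reach into.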
So, Why Not Both?
While much of this article perhaps sounds like a championing of RL methods, I ultimately think that the best long-term solution will involve a combination of both, each used for what it does best. It is clear that for many reactive policies, or situations with extremely sparse rewards, ES is a strong candidate, especially if you have access to the computational resources that allow for massively parallel training. On the other hand, gradient-based methods using RL or supervision are going to be useful when a rich feedback signal is available and we need to learn quickly from less data.
If we look to nature, we find that the former actually enables the latter. That is to say that through evolution mammals arrived with brains capable of learning from the complex signals of the world around them in extremely efficient ways. Who knows? It may end up being the case that evolutionary methods help us arrive at efficient learning architectures for gradient-based learning systems as well. Perhaps nature’s solution isn’t so inefficient after all…