Model-based reinforcement learning for Atari

Written by Piotr Kozakowski and Piotr Miłoś

Piotr Kozakowski
Acta Schola Automata Polonica
10 min read · Jul 27, 2020


Reinforcement Learning

The field of Artificial Intelligence (AI) aspires to create autonomous agents, able to perceive their surroundings and act independently to achieve desired goals. Such agents can control essentially anything, from autonomous robots, through voice assistants, to players in video games. The range of potential applications of AI systems is enormous. Reinforcement Learning (RL) might be the most promising research direction to make them possible.

Atari joystick traced from US patent 254,544

Reinforcement learning is a very general paradigm of machine learning, focused on training independent agents via interaction. Such agents operate in a given environment, gaining new experience and improving their behavior. A simpler term might be “learning by trial and error”.

A wide range of problems can be framed in terms of agents interacting with environments, promising wide-scale applicability of reinforcement learning methods in real-world problems. The recent renaissance of neural networks in AI research sparked the emergence of a new incarnation of RL, deep reinforcement learning. Research in this area constantly pushes the boundary between the problems computers excel at and those where humans are still better. Spectacular examples include AlphaZero in board games, and OpenAI Five and AlphaStar in complex video games.

Reinforcement learning is seen by many as an important tool in research on Artificial General Intelligence (AGI) — a hypothetical artificial system matching or surpassing human intelligence. One measure of this is the fact that DeepMind and OpenAI, two companies well-known for their missions of building AGI systems, devote large portions of their research efforts to reinforcement learning.

Reinforcement learning takes inspiration from the way humans and other animals learn by trial and error, which has been studied in psychology and biology for a long time. Now it provides intuitions and means of testing hypotheses for important topics in cognitive science, such as language emergence and theory of mind.

Reinforcement learning loop.

Reinforcement learning is formalized as an interaction loop between an environment and a learning agent. The agent observes the current state of the environment (gets an observation) and based on that decides on an action to take. The environment grants the agent a reward for its performance and changes its state, yielding a new observation. The process repeats. The agent needs to learn an optimal policy — a mapping from observations to actions that leads to the best possible total sum of rewards.
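To make this loop concrete, here is a minimal sketch in Python, assuming a Gym-style environment with `reset`/`step` methods and a hypothetical `policy` object with `act` and `learn` methods; none of these names come from the paper.

```python
# Minimal sketch of the agent-environment interaction loop,
# assuming a Gym-style environment and a hypothetical policy object.

def run_episode(env, policy):
    total_reward = 0.0
    observation = env.reset()          # initial observation of the environment state
    done = False
    while not done:
        action = policy.act(observation)                 # policy maps observation -> action
        observation, reward, done, _ = env.step(action)  # environment returns reward and next observation
        policy.learn(observation, reward, done)          # agent improves from the new experience
        total_reward += reward
    return total_reward
```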

Model-based reinforcement learning

Reinforcement learning methods can be roughly categorized into model-free and model-based. In model-free methods the agent doesn’t make any explicit assumptions about the environment. It has to learn everything about the environment by interacting with it and observing the results. This mode is very compelling, because it doesn’t require us to provide any additional knowledge about the world to the agent — it learns everything it needs on its own. It’s the most common approach to RL, used by many state-of-the-art systems, such as OpenAI Five and AlphaStar. However, since the agent can learn only by trial and error, it typically requires vast amounts of data to attain meaningful behaviors (e.g. 10,000 years of play for OpenAI Five). We say that model-free methods typically have low sample efficiency.

In contrast, model-based methods assume access to a model of the environment, which is a (possibly approximate) internal copy of the environment. The agent can use the model to “imagine” what will happen after performing certain actions. There are various ways of utilizing such models. One is planning — searching for the best sequence of actions according to the model, and then executing it in the real environment. AlphaZero implements this method, using a variant of Monte-Carlo Tree Search as a planning algorithm. Another way is to generate fictitious experience samples, which can substitute interactions with the real environment. It is especially useful if the latter are expensive or hard to obtain. In the extreme case, the agent can learn entirely using the model (inside the simulation) and then be transferred directly into the real environment without collecting any additional experience. This approach, called sim2real transfer, is commonly used for robotic control and autonomous vehicles, where real experience is prohibitively expensive. Typically, model-based approaches allow the agent to test some decisions without real interaction, which significantly reduces the amount of data needed from the real environment — enabling higher sample efficiency.
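As an illustration of planning with a model, the sketch below uses simple random shooting rather than the Monte-Carlo Tree Search mentioned above: it samples candidate action sequences, unrolls each one in the model, and executes the first action of the best-scoring sequence. The `model.predict(state, action)` interface, the horizon, and the number of candidates are assumptions made for this example.

```python
import random

# A sketch of planning by "imagining" rollouts in a learned model (random shooting).
# model.predict(state, action) -> (next_state, reward) is a hypothetical interface.

def plan_action(model, state, actions, horizon=10, n_candidates=100):
    best_return, best_first_action = float("-inf"), None
    for _ in range(n_candidates):
        sequence = [random.choice(actions) for _ in range(horizon)]
        s, total = state, 0.0
        for a in sequence:                   # unroll the model, not the real environment
            s, reward = model.predict(s, a)
            total += reward
        if total > best_return:
            best_return, best_first_action = total, sequence[0]
    return best_first_action                 # execute only the first action, then replan
```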

A model can be either specified by the human, or learned by the agent itself. The latter formulation bridges the model-free and model-based approaches, providing the best of both worlds. On the one hand, we don’t need to encode any information about the environment into the agent in advance. On the other hand, the agent can still utilize a model to better inform its decisions. In many cases learning a model is easier than learning an optimal strategy — in the former case we only need to predict what will happen in the next moment, while in the latter we need to predict all of the future outcomes of the current decision. The fundamental problem is dealing with the imperfections of a learned model, which can hamper the planning process. Another challenge is that learning a model might still not be easy, particularly for high-dimensional observation spaces, such as a camera image or a computer screen.

SimPLe

In our recent paper, Model Based Reinforcement Learning for Atari, which was chosen for a spotlight presentation at the ICLR 2020 conference, we tackle the problem of learning visual models and using them to aid the agent in the context of Atari games, which have rich and diverse visual input. We show that using a model can significantly improve the sample efficiency of RL algorithms, even in high-dimensional visual observation spaces.

We introduce a very simple (pun intended) method of learning a model-based agent: SimPLe — Simulated Policy Learning. We simultaneously learn a model of the environment and a policy, and use the model to generate simulated experience for improving the policy. During training, the policy re-experiences situations from various points of real collected trajectories, but in each of them it can make different decisions and see their outcomes as predicted by the model.

Our algorithm consists of three phases:

  • Running the agent on the real environment to collect some data
  • Training the world model on the collected data
  • Training the agent in the environment simulated by the world model

The model of the environment is trained using supervised learning. The policy is trained using Proximal Policy Optimization (PPO), a widely used model-free reinforcement learning algorithm. The only difference is that we use simulated experience instead of that coming from the real environment — we feed observations predicted by the model into the policy. As you can see, given implementations of those two building blocks and the model architecture, it’s indeed very simple to implement SimPLe.

Phases of the SimPLe algorithm.
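Schematically, one iteration per loop pass, the whole procedure might look like the sketch below; `collect_real_data`, `train_world_model`, `SimulatedEnv`, and `ppo_train` are illustrative placeholders, not actual functions from our codebase.

```python
# High-level sketch of the SimPLe loop: collect real data, fit the world model,
# then improve the policy purely inside the learned simulation.
# All helper functions below are illustrative placeholders.

def simple(real_env, policy, n_iterations):
    real_data = []
    for _ in range(n_iterations):
        # Phase 1: run the current policy in the real environment.
        real_data += collect_real_data(real_env, policy)
        # Phase 2: supervised training of the world model on all data collected so far.
        world_model = train_world_model(real_data)
        # Phase 3: train the policy with PPO inside the simulated environment.
        sim_env = SimulatedEnv(world_model, start_states=real_data)
        policy = ppo_train(policy, sim_env)
    return policy
```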

Running the algorithm for multiple iterations allows us to progressively improve both the model and the policy. In each iteration the agent becomes better, so it collects trajectories with higher rewards. The model, trained on those trajectories, improves its accuracy in the region of interest of the state space. Thus, the policy learns using simulated experience of higher quality, so it can improve further.

There are several issues associated with training policies using simulated experience from an imperfect model. One of them is compounding model errors. It arises when approximate predictions of the model are fed back to it to generate predictions for future timesteps, increasing the error further until the output becomes unusable. A consequence of this phenomenon is that, to maintain a high-quality learning signal, the model can only be unrolled for short rollout horizons. On the other hand, for many tasks the typical trajectory is much longer than any rollout horizon that keeps compounding errors under control.

We implement a simple solution to this problem — random starts: we branch simulated rollouts out of trajectories collected in the real environment, which are also used to train the model. In each epoch we repeatedly sample a sequence of initial observations uniformly from the collected trajectories, feed it into the model and generate the rest of the rollout. Thus, the policy is trained on different simulated scenarios, each starting from a state encountered in a real trajectory collected before.
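A rough sketch of how one such branched rollout could be generated is shown below; the context length, the rollout horizon, and the `model.step` and `policy.act` interfaces are assumptions made for illustration.

```python
import random

# Sketch of a "random start" simulated rollout: seed the model with a window of
# real observations, then let the policy and the learned model take over.

def simulated_rollout(real_trajectories, model, policy, context_len=4, horizon=50):
    trajectory = random.choice(real_trajectories)           # a list of real observations
    start = random.randrange(len(trajectory) - context_len)
    context = list(trajectory[start:start + context_len])   # real frames seeding the model
    rollout = []
    for _ in range(horizon):                  # short horizon limits compounding errors
        action = policy.act(context)
        next_obs, reward = model.step(context, action)      # predicted, not real, transition
        rollout.append((context[-1], action, reward, next_obs))
        context = context[1:] + [next_obs]    # slide the observation window
    return rollout
```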

Video model

As a model of the environment we use an architecture similar to a conditional variational autoencoder for video prediction. Our model is autoregressive — based on several previous observations, it predicts one next observation. During training, the future frames are encoded into a posterior over a latent variable, in a fashion similar to a variational autoencoder (VAE), commonly used for stochastic video prediction. Based on a sample from the posterior and on the previous frames, the model predicts the future frames. At inference time, this process is looped — the predicted next frame is fed as input to the model in the next timestep, and so on.

Architecture of our stochastic video model.
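A sketch of the inference loop is shown below; `prior.sample` and `model.predict_next_frame` are hypothetical stand-ins for the components described above, and the four-frame context is an illustrative choice.

```python
# Sketch of autoregressive video prediction at inference time:
# each predicted frame is appended to the context used to predict the next one.

def predict_rollout(model, prior, context_frames, actions):
    frames = list(context_frames)              # last few real frames that seed the model
    for action in actions:
        latent = prior.sample(frames)          # sample a latent from the learned prior
        next_frame = model.predict_next_frame(frames[-4:], action, latent)
        frames.append(next_frame)              # feed the prediction back as input
    return frames
```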

An important point is that our model is stochastic, which means that it can predict multiple different outcomes for the same input trajectory. It’s not obvious why this matters in a deterministic domain such as Atari. It’s true that Atari environments are deterministic, but only with respect to the state of the emulator, which the model doesn’t have access to. We only see a fixed number of previous observations, and given those alone the next observation is non-deterministic, which warrants the use of a stochastic model. Using a stochastic model also has the advantage of generating various different future outcomes based on the same history and the same actions performed by the agent. While not always factually correct, those simulated futures are plausible enough to provide a meaningful learning signal to the agent, as demonstrated in some of the videos below.

In contrast to a standard VAE, instead of assuming a prior distribution and pushing the posterior to it using KL-divergence, we learn a prior using an autoregressive LSTM network, and sample from this network during inference. We found this approach to be significantly more robust in our setting, as it does not require tuning of the KL weight parameter on a per-task basis.

Another difference between our model and a standard VAE is that we use a discrete latent variable. This worked significantly better for us than a continuous one. We conjecture that the discrete latent allows our model to make discrete non-deterministic decisions when generating future frames. We can see some evidence for that in the videos.
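To make the discrete-latent idea more concrete, here is a toy sketch (in PyTorch, which is not what our implementation uses) of sampling latent bits one at a time from a learned autoregressive prior; the layer sizes and the bit-by-bit LSTM are illustrative assumptions rather than the exact architecture from the paper.

```python
import torch
import torch.nn as nn

# Toy sketch: an autoregressive prior over discrete latent bits.
# At inference, bits are sampled one at a time, each conditioned on the previous ones.

class BitPrior(nn.Module):
    def __init__(self, n_bits=128, hidden=256):
        super().__init__()
        self.n_bits = n_bits
        self.cell = nn.LSTMCell(1, hidden)     # input: the previously sampled bit
        self.head = nn.Linear(hidden, 1)       # predicts P(next bit = 1)

    @torch.no_grad()
    def sample(self, batch_size=1):
        h = torch.zeros(batch_size, self.cell.hidden_size)
        c = torch.zeros(batch_size, self.cell.hidden_size)
        prev = torch.zeros(batch_size, 1)
        bits = []
        for _ in range(self.n_bits):
            h, c = self.cell(prev, (h, c))
            p = torch.sigmoid(self.head(h))    # probability of the next bit being 1
            prev = torch.bernoulli(p)          # sample the bit
            bits.append(prev)
        return torch.cat(bits, dim=1)          # shape: (batch_size, n_bits)

latents = BitPrior().sample(batch_size=4)      # different latent codes -> different futures
```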

Results

In evaluation, we focused on measuring the sample-efficiency of our method. We established a limit of 100k samples (i.e. interactions with the environment) our algorithm is allowed to collect (that’s around 2 hours of play), and measured the number of samples the baselines need to match our score. We compared our method against Rainbow, the state-of-the-art model-free algorithm for Atari at the time, and vanilla PPO, which we use to train the policy in the simulated environment.

The results are shown in the following plots, with the number of samples on the X axis and the red line denoting our sample limit.

SimPLe sample efficiency in comparison to Rainbow.
SimPLe sample efficiency in comparison to PPO.

Videos

In this section we showcase the predictive power of our model. The videos are split into three panes: on the left the prediction of our model, in the middle the ground truth frames from the real environment, and on the right the difference between the two.

In some games, our model is able to predict the future perfectly over short time horizons. One such game is Pong:

Due to the stochasticity with respect to the small number of most recent frames, our model sometimes does not predict the future correctly, but still simulates a plausible scenario useful for policy training. This is often the case in Kung Fu, where our model predicts different opponents than in the real environment, but the fights are still realistic and allow the agent to learn:

We conjecture that the model is able to non-deterministically choose between different discrete future scenarios due to the use of discrete latent variables.

For more videos, we encourage you to visit the webpage of our paper.

Conclusion

In this post we described SimPLe, a method showing that, given powerful enough model architectures, model-based reinforcement learning is possible even in complex visual domains such as Atari games, and one that achieved state-of-the-art sample efficiency on this benchmark. Our research has inspired further work that improved upon and extended our results, such as: van Hasselt et al., 2019, introducing a more sample-efficient version of Rainbow; Hafner et al., 2020, devising an efficient way of planning in a latent space; and Schrittwieser et al., 2019, combining tree search with a learned model. We hope that further advances in model-based reinforcement learning will bring us closer to wide-scale application of RL methods in real-world problems.

If you enjoyed this post, please hit the clap button below and follow our publication for more interesting articles about ML & AI.
