Scalable agent alignment via reward modeling

By Jan Leike

This post provides an overview of our new paper that outlines a research direction for solving the agent alignment problem. Our approach relies on the recursive application of reward modeling to solve complex real-world problems in a way that aligns with user intentions.

In recent years, reinforcement learning has yielded impressive performance in complex game environments ranging from Atari, Go, and chess to Dota 2 and StarCraft II, with artificial agents rapidly surpassing the human level of play in increasingly complex domains. Games are an ideal platform for developing and testing machine learning algorithms. They present challenging tasks that require a range of cognitive abilities to accomplish, mirroring skills needed to solve problems in the real world. Machine learning researchers can run thousands of simulated experiments on the cloud in parallel, generating as much training data as needed for the system to learn.

Crucially, games often have a clear objective, and a score that approximates progress towards that objective. This score provides a useful reward signal for reinforcement learning agents, and allows us to get quick feedback on which algorithmic and architectural choices work best.

The agent alignment problem

Ultimately, the goal of AI progress is to benefit humans by enabling us to address increasingly complex challenges in the real world. But the real world does not come with built-in reward functions, and performance on real-world tasks is not easily defined. We need a good way to provide feedback so that artificial agents can reliably understand what we want and help us achieve it. In other words, we want to train AI systems with human feedback in such a way that the system’s behavior aligns with our intentions. For our purposes, we define the agent alignment problem as follows:

How can we create agents that behave in accordance with the user’s intentions?

The alignment problem can be framed in the reinforcement learning framework, except that instead of receiving a numeric reward signal, the agent can interact with the user via an interaction protocol that allows the user to communicate their intention to the agent. This protocol can take many forms: the user can provide demonstrations, preferences, optimal actions, or communicate a reward function, for example. A solution to the agent alignment problem is a policy that behaves in accordance with the user’s intentions.
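The feedback forms this protocol can take are easy to make concrete as data types. The sketch below is purely illustrative (none of these class names come from the paper); it simply encodes the four channels mentioned above, any mix of which a concrete protocol might support:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# Illustrative stand-ins: in a real system these would be rich
# observation/action types rather than integers.
Observation = int
Action = int
Trajectory = List[Tuple[Observation, Action]]

@dataclass
class Demonstration:        # "here is how to do it"
    trajectory: Trajectory

@dataclass
class Preference:           # "I like trajectory `a` better than `b`"
    a: Trajectory
    b: Trajectory

@dataclass
class OptimalAction:        # "in this state, do this"
    observation: Observation
    action: Action

@dataclass
class RewardFunction:       # "maximize this"
    table: Dict[Tuple[Observation, Action], float]

# One user feedback message is any of the above channels.
Feedback = Union[Demonstration, Preference, OptimalAction, RewardFunction]
```

A solution to the alignment problem then consumes a stream of such `Feedback` messages and produces a policy; the rest of the post focuses on the preference channel.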

With our new paper we outline a research direction for tackling the agent alignment problem head-on. Building on our earlier categorization of AI safety problems as well as numerous problem expositions on AI safety, we paint a coherent picture of how progress in these areas could yield a solution to the agent alignment problem. This opens the door to building systems that can better understand how to interact with users, learn from their feedback, and predict their preferences — both in narrow, simpler domains in the near term, and also more complex and abstract domains that require understanding beyond human level in the longer term.

Alignment via reward modeling

The main thrust of our research direction is based on reward modeling: we train a reward model with feedback from the user to capture their intentions. At the same time, we train a policy with reinforcement learning to maximize the reward from the reward model. In other words, we separate learning what to do (the reward model) from learning how to do it (the policy).

Schematic illustration of reward modeling: a reward model is trained from the user’s feedback to capture their intentions; this reward model provides rewards to an agent trained with reinforcement learning.
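As a minimal sketch of this separation, the toy example below fits a linear reward model to simulated pairwise preferences, then uses it to score candidate behaviors. The linear setup, the Bradley-Terry-style preference model, and all names are our own simplification for illustration, not the method as specified in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# The user's hidden intent: a linear reward over 3 trajectory features.
# The learner never sees w_true, only pairwise preferences derived from it.
w_true = np.array([1.0, -2.0, 0.5])

# Each trajectory is summarized by its feature sum phi (a 3-vector).
phis = rng.normal(size=(200, 3))

# Simulated user feedback: for random trajectory pairs, the user prefers
# the one with higher true return (label 0 = first preferred, 1 = second).
pairs = rng.integers(0, len(phis), size=(500, 2))
labels = (phis[pairs[:, 1]] @ w_true > phis[pairs[:, 0]] @ w_true).astype(float)

# "Learning what to do": fit reward weights w_hat by logistic regression
# on the preference labels (a Bradley-Terry-style comparison model).
w_hat = np.zeros(3)
for _ in range(2000):
    diff = phis[pairs[:, 1]] @ w_hat - phis[pairs[:, 0]] @ w_hat
    p = 1.0 / (1.0 + np.exp(-diff))  # model's P(second trajectory preferred)
    grad = ((p - labels)[:, None] * (phis[pairs[:, 1]] - phis[pairs[:, 0]])).mean(0)
    w_hat -= 0.1 * grad

# "Learning how to do it": a real policy would be trained with RL to
# maximize w_hat; here we just pick the candidate it scores highest.
candidates = rng.normal(size=(50, 3))
best = candidates[np.argmax(candidates @ w_hat)]

# The learned reward direction should align with the user's hidden intent.
cosine = w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true))
```

The point of the split is visible even in this toy: the preference data only ever trains `w_hat`, while behavior selection only ever consults `w_hat`, so the feedback mechanism and the optimization algorithm can be improved independently.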

For example, in previous work we taught agents to do a backflip from user preferences, to arrange objects into shapes from goal-state examples, and to play Atari games from user preferences and expert demonstrations. In the future we want to design algorithms that learn to adapt to the way users provide feedback (e.g. using natural language).

Scaling up

In the long run, we would like to scale reward modeling to domains that are too complex for humans to evaluate directly. To do this, we need to boost the user’s ability to evaluate outcomes. We discuss how reward modeling can be applied recursively: we can use reward modeling to train agents to assist the user in the evaluation process itself. If evaluating outcomes is easier than producing the behavior, this could allow us to bootstrap from simpler tasks to increasingly general and more complex tasks. This can be thought of as an instance of iterated amplification.

Schematic illustration of recursive reward modeling: agents trained with recursive reward modeling (smaller circles on the right) assist the user in the evaluation process of outcomes produced by the agent currently being trained (large circle).

For example, imagine we want to train an agent to design a computer chip. To evaluate a proposed chip design, we train other “helper” agents with reward modeling to benchmark the chip’s performance in simulation, calculate heat dissipation, estimate the chip’s lifetime, try to find security vulnerabilities, and so on. Collectively, the outputs of these helper agents enable the user to train the chip designer agent by assisting in the evaluation of the proposed chip design. While each of the helper agents has to solve very difficult tasks that are far out of reach for today’s ML systems, these tasks are easier to perform than designing a chip in the first place: to design a computer chip you have to understand each of these evaluation tasks, but the reverse is not true. In this sense, recursive reward modeling could enable us to “scaffold” our agents to solve increasingly harder tasks while remaining aligned with user intentions.
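The assisted-evaluation loop in the chip-design example can be sketched as follows. Everything here is an illustrative assumption: the helper names, the stub scoring functions, and the rule that the user prefers the design with the higher average helper score all stand in for trained agents and real human judgment:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Evaluation:
    """Per-aspect scores the user sees instead of the raw design."""
    scores: Dict[str, float]

def evaluate_design(design: Dict[str, float],
                    helpers: Dict[str, Callable]) -> Evaluation:
    # Each helper agent evaluates one aspect of the proposed design.
    return Evaluation({name: h(design) for name, h in helpers.items()})

def user_feedback(eval_a: Evaluation, eval_b: Evaluation) -> int:
    # The user compares the two assisted evaluations; here, a stub user
    # prefers the design with the higher average helper score.
    mean = lambda e: sum(e.scores.values()) / len(e.scores)
    return 0 if mean(eval_a) >= mean(eval_b) else 1

# Stub helpers standing in for agents trained with reward modeling
# (higher score = better on that aspect).
helpers = {
    "benchmark": lambda d: d["throughput"],
    "heat": lambda d: -d["watts"],
    "lifetime": lambda d: d["mtbf_years"],
}

chip_a = {"throughput": 3.2, "watts": 45.0, "mtbf_years": 8.0}
chip_b = {"throughput": 2.9, "watts": 30.0, "mtbf_years": 10.0}
pref = user_feedback(evaluate_design(chip_a, helpers),
                     evaluate_design(chip_b, helpers))
# `pref` is then a preference label for training the chip-designer
# agent's reward model, exactly as in plain reward modeling.
```

The recursion enters where the stubs are: each helper is itself an agent trained with (possibly recursive) reward modeling on its own, easier, evaluation task.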

Research challenges

There are several challenges that will need to be addressed in order to scale reward modeling to such complex problems. Five of these challenges are listed below and described in more depth in the paper, along with approaches for addressing them.

Challenges we expect to encounter when scaling reward modeling (left) and promising approaches to address them (right).

This brings us to the final important component for agent alignment: when deploying agents in the real world, we need to provide evidence to the users that our agents are indeed sufficiently aligned. The paper discusses five different research avenues that can help increase trust in our agents: design choices, testing, interpretability, formal verification, and theoretical guarantees. An ambitious goal would be the production of safety certificates: artifacts that can be used to prove responsible technology development and give users confidence in relying on the trained agents.


While we believe that recursive reward modeling is a very promising direction for training aligned agents, we currently don’t know how well it will scale (more research is needed!). Fortunately, several other research directions for agent alignment are being pursued in parallel; their similarities to and differences from recursive reward modeling are explored further in the paper.

Just as proactive research into robustness of computer vision systems to adversarial inputs is crucial for ML applications today, so could alignment research be key to getting ahead of future bottlenecks to the deployment of ML systems in complex real-world domains. We have reason to be optimistic: while we expect to face challenges when scaling reward modeling, these challenges are concrete technical research questions that we can make progress on. In this sense our research direction is shovel-ready today for empirical research with deep reinforcement learning agents.

Making progress on these research questions is the subject of ongoing work at DeepMind. If you are a researcher, engineer, or talented generalist interested in joining us, please see our open positions and note your interest in alignment research when you apply.

Thanks to David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg, and many others at DeepMind, OpenAI, and the Future of Humanity Institute who contributed to this effort.