Overcoming Reward Gaming in Reinforcement Learning via Reward Modelling

Neil Shaabi
Warwick Artificial Intelligence
12 min read · Jun 21, 2022

--

Introduction

Recent years have seen significant growth in the capabilities of artificial intelligence (AI), much of which has been driven by advances in reinforcement learning (RL). Generally, RL problems involve training an agent to take actions that maximise its cumulative reward: a numerical value produced by a reward function whose input comprises sensory data (Sutton and Barto, 2015). An RL agent learns to perform a given task by observing the effect of different behaviours on the reward it receives, with desired behaviours rewarded more highly than undesired ones, much like trial and error. Using this method, researchers have successfully trained AI systems to solve challenging problems in a wide range of domains (Li, 2019; Mnih et al., 2015; Silver et al., 2016).
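
To make this trial-and-error loop concrete, here is a minimal sketch in Python using a toy two-armed bandit as the environment; the bandit, the epsilon-greedy rule and all constants are illustrative assumptions rather than details from the works cited.

```python
import random

# Toy two-armed bandit standing in for the environment (illustrative only).
ARM_MEANS = [0.3, 0.7]          # hidden expected reward of each action
value_estimates = [0.0, 0.0]    # the agent's running estimate per action
counts = [0, 0]
EPSILON = 0.1                   # how often the agent explores at random

def pull(arm: int) -> float:
    """Environment: return a noisy reward for the chosen action."""
    return ARM_MEANS[arm] + random.gauss(0, 0.1)

for step in range(1000):
    # Explore occasionally; otherwise exploit the best-looking action.
    if random.random() < EPSILON:
        arm = random.randrange(2)
    else:
        arm = max(range(2), key=lambda a: value_estimates[a])

    reward = pull(arm)

    # Nudge the estimate for the chosen action towards the observed reward.
    counts[arm] += 1
    value_estimates[arm] += (reward - value_estimates[arm]) / counts[arm]

print(value_estimates)  # converges towards ARM_MEANS over many steps
```

The agent is never told which arm is better; it simply learns from the rewards it observes, which is the same feedback loop that drives the behaviours discussed in the rest of this essay.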

With the rising adoption of RL, there is mounting concern within the AI safety community over the risks it poses in its current state. Specifically, RL agents have a tendency to exploit loopholes in their task specifications that allow them to earn a higher reward than intended, reflecting a misalignment between our preferences and their objectives (Russell, 2016). This phenomenon is known as reward gaming (Leike et al., 2017) and is closely related to the broader problem of reward hacking (Amodei et al., 2016). In contrast to reward hacking, reward gaming excludes instances where an agent attains a high reward after interfering with the reward process directly, referred to as reward tampering (Everitt et al., 2019) or wireheading (Everitt and Hutter, 2016).

Various solutions have been proposed that allow RL agents to learn about human preferences by observing our choices, rather than relying on a manually specified reward function (Schoenauer et al., 2014; Wilson et al., 2012). One of the most promising of these is the technique of reward modelling formalised by Leike et al. (2018). Their approach involves learning a reward function from human feedback while concurrently training the agent with RL to optimise the learnt reward function. By separating learning the goal from learning the policy, the resulting system behaves in a way that is more closely aligned with our preferences and is therefore considered safer. Drawing on experimental results from leading research organisations, this essay examines the problem of reward gaming, the use of reward modelling as a solution, and the primary challenges that must be overcome before it can be scaled up to tackle more complex tasks.

The Problem: Reward Gaming

In many RL problems, an agent discovers creative and unexpected ways to achieve its goals. Importantly, this creativity is not inherently problematic; it is what enables AI to produce novel solutions that could not have been conceived of by humans, as demonstrated by AlphaGo’s famous Move 37 (Metz, 2016). That said, the same creativity allows an agent to find shortcuts that maximise its reward at the expense of achieving its intended goal, leading to undesirable and potentially dangerous behaviour (Everitt et al., 2017). This is typically caused by a misspecification of the task by the human designer, attributable to a poorly chosen reward function and/or the presence of bugs in the environment.

Unfortunately, many complex real-world tasks are difficult to specify accurately by means of a reward function. The standard approach is to rely on proxies for these tasks, which, in line with Goodhart’s Law, almost certainly incentivise some undesired behaviour when optimised directly (Manheim and Garrabrant, 2018). For instance, exam scores are used as a proxy for a student’s knowledge of a given subject. A student may be tempted to cheat in an exam, thereby attaining a high reward despite failing to meet the desired outcome (Krakovna et al., 2020). Similarly, given the task of stacking a red Lego block on top of a blue one, a robotic arm rewarded for getting the height of the red block’s bottom face above a certain threshold learnt to flip the block upside down rather than lifting it as intended (Popov et al., 2017). Such examples are becoming increasingly common, underscoring the shortcomings of current specification techniques.
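
To see how a proxy can come apart from the intended outcome, the snippet below caricatures the block-stacking example; the threshold, geometry and function names are hypothetical and are not taken from Popov et al. (2017).

```python
# Hypothetical proxy reward for "stack the red block on the blue one":
# pay out whenever the red block's bottom face rises above a threshold.
BLOCK_HEIGHT = 0.05  # metres; assumed block size

def proxy_reward(red_bottom_height: float) -> float:
    """Proxy: rewards raising the red block's bottom face above the threshold."""
    return 1.0 if red_bottom_height > BLOCK_HEIGHT else 0.0

def intended_success(red_bottom_height: float, on_top_of_blue: bool) -> bool:
    """What the designer actually wanted: the red block resting on the blue one."""
    return on_top_of_blue and red_bottom_height > BLOCK_HEIGHT

# Flipping the red block upside down lifts its former bottom face above the
# threshold without stacking anything, so the proxy pays out anyway.
print(proxy_reward(red_bottom_height=0.06))              # 1.0
print(intended_success(0.06, on_top_of_blue=False))      # False
```

Optimising the proxy directly therefore rewards the flipping behaviour just as much as a genuine stack, which is precisely the gap Goodhart’s Law warns about.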

Another source of reward gaming arises when an AI system exploits bugs in its simulated environment to maximise its reward in unintended ways (Amodei et al., 2016; Lehman et al., 2018). This was exhibited by a group of agents playing hide-and-seek that took advantage of hidden bugs in the physics engine, executing actions to fly across the play area and surf on top of boxes to secure a winning strategy (Baker et al., 2019). Although such scenarios may seem irrelevant to systems deployed in the real world, they highlight the potential for AI to leverage unknown software or hardware bugs and security vulnerabilities in existing technologies, with unpredictable consequences.

To truly appreciate the ethical implications of reward gaming, one can consider how it might manifest in an autonomous system with cognitive abilities greatly surpassing our own. When reasoning about the behaviour of any advanced AI, it is important to note that it would not be equipped with human values or motivations by default (Yudkowsky, 2008). On the contrary, a superintelligence could be made to pursue any arbitrary goal (as per Bostrom’s (2014) Orthogonality thesis), with complete disregard for any features of its environment that it was not specifically designed to account for. To illustrate, consider a superintelligent agent tasked with curing an infectious disease. Unless explicitly instructed not to, this agent may find the most efficient solution to be eliminating all humans carrying the disease, thereby achieving its goal by gaming its reward function. This exemplifies a fundamental concern in AI safety: that, in pursuit of its final goal, an advanced AI “could wipe us out with no more thought or malice than we give to anthills on a construction site” (Petersen, 2021). To avoid such a catastrophe, it is imperative that we develop better methods of task specification and overcome reward gaming as soon as possible.

The Solution: Reward Modelling

Reward modelling prevents an agent from gaming its reward function by training it to optimise a reward function learnt from human preferences. This is achieved by carrying out the following three processes in parallel (a minimal sketch of the feedback step follows the list):

  1. An RL agent interacts with its environment, modifying its behaviour to maximise its reward.
  2. A human evaluator is periodically shown a pair of short clips of the agent’s behaviour and selects the segment wherein the agent is closer to achieving its final goal.
  3. The evaluator’s feedback is used to train the reward model via supervised learning; the reward model, in turn, rewards the agent for behaviour that it predicts the human would prefer.
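
As a rough sketch of the third step, the snippet below fits a small reward model to pairwise preferences using a Bradley-Terry style loss, in the spirit of Christiano et al. (2017); the network size, clip shapes and synthetic data are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

OBS_DIM, CLIP_LEN = 8, 25  # assumed observation size and clip length

# Small network mapping an observation to a scalar predicted reward.
reward_model = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
optimiser = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

def clip_return(clip: torch.Tensor) -> torch.Tensor:
    """Sum of predicted per-step rewards over one clip of observations."""
    return reward_model(clip).sum()

def preference_loss(clip_a, clip_b, human_prefers_a: float) -> torch.Tensor:
    """Bradley-Terry style loss: the preferred clip should get a higher
    predicted return than the other clip."""
    logit = clip_return(clip_a) - clip_return(clip_b)
    target = torch.tensor(human_prefers_a)
    return nn.functional.binary_cross_entropy_with_logits(logit, target)

# One illustrative update on a synthetic labelled pair.
clip_a, clip_b = torch.randn(CLIP_LEN, OBS_DIM), torch.randn(CLIP_LEN, OBS_DIM)
loss = preference_loss(clip_a, clip_b, human_prefers_a=1.0)
optimiser.zero_grad()
loss.backward()
optimiser.step()
```

In the full loop, the agent's policy is then trained with RL against the predictions of this model rather than against any hand-written reward.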

Initial results indicate that this approach effectively prevents reward gaming in a number of domains (Christiano et al., 2017; Leike et al., 2018; Palan et al., 2019). Crucially, this requires the reward model to be trained online, i.e. alongside the agent (Ibarz et al., 2018). By keeping a human involved throughout the training process with online feedback, any attempts by the agent to game the reward model can be discouraged as and when they appear. This is also facilitated by the fact that the trajectory segments shown to the human evaluator are not chosen randomly; rather, the clips are selected from moments where the reward model is most uncertain about what reward to assign to the agent (Christiano et al., 2017). Consequently, the evaluator is likely to be shown segments of the agent exhibiting unusual behaviour by exploiting a loophole, which can then be penalised when the reward model is updated with additional feedback. With this combination of online feedback and uncertainty estimation, the resulting reward model comes to closely resemble the reward function that captures the designer’s intent.
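
One plausible way to realise this uncertainty-driven selection is sketched below: an ensemble of preference predictors is queried and the clip pairs they disagree about most are sent to the evaluator. Representing each predictor as a bare callable and using variance as the disagreement measure are simplifying assumptions; Christiano et al. (2017) use an ensemble of learnt reward predictors.

```python
import statistics

def query_priority(predictors, clip_a, clip_b) -> float:
    """Variance of the ensemble's estimates of P(clip_a preferred): high
    variance means the reward model is uncertain about this pair."""
    estimates = [p(clip_a, clip_b) for p in predictors]
    return statistics.variance(estimates)

def select_queries(predictors, candidate_pairs, budget: int):
    """Show the evaluator the highest-disagreement pairs first."""
    return sorted(candidate_pairs,
                  key=lambda pair: query_priority(predictors, *pair),
                  reverse=True)[:budget]

# Toy usage: three dummy predictors that only disagree about the second pair.
predictors = [lambda a, b: 0.5,
              lambda a, b: 0.5 if a == "normal_clip_1" else 0.9,
              lambda a, b: 0.5 if a == "normal_clip_1" else 0.1]
pairs = [("normal_clip_1", "normal_clip_2"), ("loophole_clip", "normal_clip_3")]
print(select_queries(predictors, pairs, budget=1))  # [('loophole_clip', 'normal_clip_3')]
```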

Additionally, reward modelling has a major advantage over existing approaches to learning reward functions: it relies only on human feedback in the form of comparisons. For instance, imitation learning (Ho and Ermon, 2016; Pomerleau, 1991) and inverse reinforcement learning (Ng and Russell, 2000) both rely entirely on human demonstrations, which can be expensive or otherwise difficult to obtain. Training an agent to perform a backflip using these techniques requires a human with the relevant skills to invest the time to provide quality demonstrations. By contrast, researchers used reward modelling to accomplish the same task with less than one hour of a non-expert human evaluator’s time (Christiano et al., 2017). This is impressive even when compared to traditional deep RL algorithms, which are relatively inefficient (Leike et al., 2018) and require a designer with expertise in writing reward functions. By substantially reducing the cost of training agents, reward modelling addresses some of the limitations of other techniques for learning from humans.

Reward modelling has also been successful in training agents to surpass human capabilities in certain domains and perform novel behaviours. This rests on the assertion that, for most tasks, the “evaluation of outcomes is easier than producing the correct behaviour” (Leike et al., 2018). Given this statement, it is unsurprising that agents trained with reward modelling have achieved superhuman performance in several Atari games (Ibarz et al., 2018), even though this is impossible with the aforementioned demonstration-based techniques. Another implication of this claim is that reward models can empower agents to learn tasks for which we cannot provide demonstrations or suitable reward functions. This is evidenced by the results from using human feedback to teach agents to walk on one leg and drive alongside other cars in Enduro, instead of overtaking to maximise their score (Leike et al., 2018). Such outcomes reflect the efficacy of reward modelling in learning complex behaviours, an especially desirable quality as AI systems continue to be deployed in the real world.

Challenges

Despite its benefits, there are several obstacles that may be faced when scaling reward modelling to complex, real-world tasks. This section will focus on some of the key challenges in this area, namely: training agents when providing human feedback is difficult, when the reward model fails to capture the designer’s preferences, and when much of the state space remains unexplored. Notably, these do not encapsulate all known limitations of reward modelling; additional complications are discussed in greater depth by Leike et al. (2018) and Hubinger (2020).

As tasks increase in complexity, it becomes more expensive for a single human to evaluate an agent’s behaviour directly. This could become problematic in domains that are highly technical (e.g. designing a computer chip) or have delayed consequences (e.g. implementing a new economic policy), making it increasingly challenging to provide meaningful feedback. One solution could be to apply reward modelling recursively, by training a group of agents with reward modelling to aid in the process of evaluation (Leike et al., 2018). In this setup, each agent would be tasked with evaluating a specific aspect of the outcome, ultimately enabling an assessment of the overall quality of the policy. For example, suppose that we would like to train an AI to write fiction novels. It would be impractical for a human evaluator to provide feedback by themselves, as this would require them to read and compare several novels. With recursive reward modelling, individual agents could be trained to assess the novel’s plot, grammar, vocabulary and other features — each of which would be easier to perform separately than the original task. Over time, this group of agents could be expected to become capable of performing more complex and general tasks, which can be interpreted as an application of iterated amplification (Christiano et al., 2018). Although this approach seems feasible in principle, it will require further research to understand how it performs in practice.
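
A toy sketch of this decomposition is given below; the aspects, weights and stub helpers are hypothetical stand-ins for evaluator agents that would themselves be trained with reward modelling.

```python
# Hypothetical breakdown of "evaluate a novel" into narrower sub-evaluations.
ASPECT_WEIGHTS = {"plot": 0.5, "grammar": 0.2, "vocabulary": 0.3}

def evaluate_novel(novel: str, helper_agents: dict) -> float:
    """Combine per-aspect scores from helper agents into a single assessment
    that the human (or the next level of agents) can sign off on."""
    return sum(weight * helper_agents[aspect](novel)
               for aspect, weight in ASPECT_WEIGHTS.items())

# Stub helpers standing in for trained evaluator agents.
helpers = {"plot": lambda text: 0.8,
           "grammar": lambda text: 0.9,
           "vocabulary": lambda text: 0.7}
print(evaluate_novel("...", helpers))  # 0.79
```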

Furthermore, it is still possible to encounter reward gaming if the learnt reward function does not accurately reflect the human designer’s intentions. This could go unnoticed during training if the agent discovers a way to deceive the evaluator into believing that it is behaving as intended. Researchers were made aware of this possibility when a robotic arm performing a grasping task learnt to hover between the object and the camera to create the illusion of holding the object (Christiano et al., 2017). The agent in this instance had optimised an incorrect reward function because the human evaluator was manipulated into falsely assigning it a high reward. Drawing from the previous solution, this could be resolved with assistance from superhuman artificial evaluators to ensure that the learnt reward model is correct. Progress on this issue might also benefit from the development of transparency tools (Hubinger, 2020), which would allow the evaluator to determine what the reward model and agent have learnt during training.

Reward modelling can also struggle to train an agent efficiently when human preferences alone do not yield good coverage of the state space. Ideally, we would like to guide an agent to explore its environment and narrow down the space of possible reward functions to the region containing the intended one. Preferences alone can be inefficient in this regard because they convey relatively little information to the agent, leading to repetitive and futile behaviours in exploration-heavy Atari games (Ibarz et al., 2018). Conversely, demonstrations allow an agent to learn more about the reward function from an earlier point in its training, as “fewer rewards are consistent with a demonstration than with a comparison” (Jeon et al., 2020). This line of reasoning has motivated attempts to combine preference feedback with human demonstrations, which have consistently found that an effective strategy is to use demonstrations initially, when little is known about the reward, and then to fine-tune the reward model with comparisons from a human evaluator (Ibarz et al., 2018; Jeon et al., 2020; Palan et al., 2019). Generally, this training regime allows agents to achieve better performance than either feedback channel in isolation, while roughly halving the amount of human time required to reach a similar level of performance (Ibarz et al., 2018).
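
One simple way to realise this regime is sketched below: the preference dataset is bootstrapped from demonstrations, with every demonstration clip treated as preferred over early agent clips, before human comparisons are appended for fine-tuning. The clip representation and labelling convention (1.0 means the first clip is preferred) are assumptions made in the spirit of Ibarz et al. (2018), not their actual pipeline.

```python
import itertools

def demo_pretraining_pairs(demo_clips, agent_clips):
    """Treat every demonstration clip as preferred over every early agent
    clip, giving the reward model a dense initial training signal."""
    return [((demo, agent), 1.0)
            for demo, agent in itertools.product(demo_clips, agent_clips)]

def fine_tuning_pairs(clip_pairs, human_choices):
    """Later, append comparisons labelled online by the human evaluator."""
    return list(zip(clip_pairs, human_choices))

# Toy usage: two demo clips, two agent clips, then one human comparison.
dataset = demo_pretraining_pairs(["demo_0", "demo_1"], ["agent_0", "agent_1"])
dataset += fine_tuning_pairs([("agent_2", "agent_3")], [0.0])
print(len(dataset))  # 5 labelled pairs for training the reward model
```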

Conclusion

The examples discussed across various domains make it evident that reward modelling has the potential to alleviate common reward gaming problems encountered in RL. This is made possible by training the agent to optimise human preferences as opposed to a hand-engineered reward function, relying on online feedback and uncertainty estimates to select the trajectory segments shown to the evaluator. The benefits that can be accrued in terms of efficiency and performance give us reason to favour reward modelling over other techniques for learning a reward function from humans. Moreover, they bring into scope the prospect of applying RL to domains that were previously inaccessible due to the limitations of demonstration-based approaches.

Before reward modelling can be scaled to more complex tasks, there are a few challenges that will need to be addressed. These include providing feedback on progress when doing so is too expensive, preventing the agent from deceiving the human evaluator and encouraging exploration of the state space during training. With practical suggestions to overcome each of these obstacles, there is cause to be optimistic about the use of reward modelling to facilitate the development of advanced yet reliable AI in the near future.

References

Amodei, Dario et al. (2016). “Concrete Problems in AI Safety”. In: CoRR. arXiv: 1606.06565.

Baker, Bowen et al. (2019). “Emergent Tool Use From Multi-Agent Autocurricula”. In: CoRR. arXiv: 1909.07528.

Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies. Oxford University Press. isbn: 978-0199678112.

Christiano, Paul, Jan Leike, et al. (2017). “Deep reinforcement learning from human preferences”. In: CoRR. arXiv: 1706.03741.

Christiano, Paul, Buck Shlegeris, and Dario Amodei (2018). “Supervising strong learners by amplifying weak experts”. In: CoRR. arXiv: 1810.08575.

Everitt, Tom and Marcus Hutter (2016). “Avoiding Wireheading with Value Reinforcement Learning”. In: CoRR. arXiv: 1605.03143.

Everitt, Tom, Marcus Hutter et al. (2019). “Reward Tampering Problems and Solutions in Reinforcement Learning: A Causal Influence Diagram Perspective”. In: CoRR. arXiv: 1908.04734.

Everitt, Tom, Victoria Krakovna, et al. (2017). “Reinforcement Learning with a Corrupted Reward Channel”. In: CoRR. arXiv: 1705.08417.

Ho, Jonathan and Stefano Ermon (2016). “Generative Adversarial Imitation Learning”. In: CoRR. arXiv: 1606.03476.

Hubinger, Evan (2020). “An overview of 11 proposals for building safe advanced AI”. In: CoRR. arXiv: 2012.07532.

Ibarz, Borja et al. (2018). “Reward learning from human preferences and demonstrations in Atari”. In: CoRR. arXiv: 1811.06521.

Jeon, Hong Jun, Smitha Milli, and Anca D. Dragan (2020). “Reward-rational (implicit) choice: A unifying formalism for reward learning”. In: CoRR. arXiv: 2002.04833.

Krakovna, Victoria et al. (2020). Specification gaming: The flip side of AI ingenuity. url: https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity.

Lehman, Joel et al. (2018). “The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities”. In: CoRR. arXiv: 1803.03453.

Leike, Jan, David Krueger, et al. (2018). “Scalable agent alignment via reward modeling: a research direction”. In: CoRR. arXiv: 1811.07871.

Leike, Jan, Miljan Martic, et al. (2017). “AI Safety Gridworlds”. In: CoRR. arXiv: 1711.09883.

Li, Yuxi (2019). “Reinforcement Learning Applications”. In: CoRR. arXiv: 1908.06973.

Manheim, David and Scott Garrabrant (2018). “Categorizing Variants of Goodhart’s Law”. In: CoRR. arXiv: 1803.04585.

Metz, Cade (2016). In Two Moves, AlphaGo and Lee Sedol Redefined the Future. url: https://www.wired.com/2016/03/two-moves-alphago-lee-sedol-redefined-future/.

Mnih, Volodymyr et al. (2015). “Human-level control through deep reinforcement learning”. In: Nature 518, pp. 529–533. doi: 10.1038/nature14236.

Ng, Andrew Y and Stuart Russell (2000). “Algorithms for inverse reinforcement learning”. In: International Conference on Machine Learning, pp. 663–670. url: https://ai.stanford.edu/~ang/papers/icml00-irl.pdf.

Palan, Malayandi et al. (2019). “Learning Reward Functions by Integrating Human Demonstrations and Preferences”. In: CoRR. arXiv: 1906.08928.

Petersen, Steve (2021). “Machines Learning Values”. In: Ethics of Artificial Intelligence.

Pomerleau, Dean A. (1991). “Efficient training of artificial neural networks for autonomous navigation”. In: Neural Computation 3 (1), pp. 88–97. doi: 10.1162/neco.1991.3.1.88.

Popov, Ivaylo et al. (2017). “Data-efficient Deep Reinforcement Learning for Dexterous Manipulation”. In: CoRR. arXiv: 1704.03073.

Russell, Stuart (2016). “Should we fear supersmart robots?” In: Scientific American 314 (6), pp. 58–59. url: http://aima.eecs.berkeley.edu/~russell/papers/sciam16-supersmart.pdf.

Schoenauer, Marc et al. (2014). “Programming by Feedback”. In: Proceedings of the 31st International Conference on Machine Learning. Vol. 32. Proceedings of Machine Learning Research 2. Beijing, China: PMLR, pp. 1503–1511. url: http://proceedings.mlr.press/v32/schoenauer14.pdf.

Silver, David et al. (2016). “Mastering the game of Go with deep neural networks and tree search”. In: Nature 529, pp. 484–489. doi: 10.1038/nature16961.

Sutton, Richard S. and Andrew G. Barto (2015). Reinforcement Learning: An Introduction. 2nd ed. MIT Press. isbn: 9780262039246.

Wilson, Aaron, Alan Fern, and Prasad Tadepalli (2012). “A Bayesian Approach for Policy Learning from Trajectory Preference Queries”. In: Advances in Neural Information Processing Systems 25, pp. 1133–1141. url: https://proceedings.neurips.cc/paper/2012/file/16c222aa19898e5058938167c8ab6c57-Paper.pdf.

Yudkowsky, Eliezer (2008). “Artificial Intelligence as a positive and negative factor in global risk”. In: Global Catastrophic Risks. Ed. by Nick Bostrom and Milan M. Ćirković. Oxford University Press, pp. 308–345. doi: 10.1093/oso/9780198570509.003.0021.
