Victoria Krakovna, Jonathan Uesato, Vladimir Mikulik, Matthew Rahtz, Tom Everitt, Ramana Kumar, Zac Kenton, Jan Leike, Shane Legg

This article is cross-posted on the DeepMind website.

Specification gaming is a behaviour that satisfies the literal specification of an objective without achieving the intended outcome. We have all had experiences with specification gaming, even if not by this name. Readers may have heard the myth of King Midas and the golden touch, in which the king asks that anything he touches be turned to gold — but soon finds that even food and drink turn to metal in his hands. …

By Siddharth Reddy and Jan Leike. Cross-posted from the DeepMind website.

TL;DR: We present a method for training reinforcement learning agents from human feedback in the presence of unknown unsafe states.

When we train reinforcement learning (RL) agents in the real world, we don’t want them to explore unsafe states, such as driving a mobile robot into a ditch or writing an embarrassing email to one’s boss. …

By Tom Everitt, Ramana Kumar, and Marcus Hutter

From an AI safety perspective, having a clear design principle and a crisp characterization of the problem it solves means that we don't have to guess which agents are safe. In this post and the accompanying paper, we describe how a design principle called current reward function (current-RF) optimization avoids the reward function tampering problem.

Reinforcement learning (RL) agents are designed to maximize reward. For example, chess and Go agents are rewarded for winning the game, while a manufacturing robot may be rewarded for correctly assembling a given set of pieces. …


DeepMind Safety Research

We research and build safe AI systems that learn how to solve problems and advance scientific discovery for all.
