# Paper Summary: Asymptotically Unambitious Artificial General Intelligence (Cohen et al.)

--

A nice paper by Cohen, Vellambi and Hutter came out in December. What the paper says is quite formal so it took me a little while to get through it, but the broad strokes of what it’s saying are useful to be aware of even if you don’t have the time to wrap your head around the theorems. So here are the ideas presented less formally.

I will use some hand-wavy language which inevitably injects some of my subjective interpretation. If you reckon you understand this work better than me and you disagree with something I’ve said, get in touch and we can have a fight. Or if you are (heaven forbid) one of the authors, and you feel something I have said is misleading, let me know and I can fix it.

**Reading Time:** ~10–20 mins

**Assumed Knowledge:** Bayesian inference, Reinforcement Learning

# Introduction

Lots of people are worried about what will happen if artificial general intelligence becomes a thing. It may even be an existential risk to humanity. Lots of things have been written about this (Bostrom, Amodei, 80k, …), but given how important the topic is we probably need more things than that.

Anyway, the best way to motivate this work is to focus on the *Instrumental Convergence Thesis*: a suitably intelligent agent will pursue the instrumental goal of acquiring as much power and influence as possible, regardless of its final goal. Make as many paperclips as possible? Take over the world and convert all buildings to paperclip factories.

The paper proposes a reinforcement learning (RL) algorithm (along with a physical setup) that sidesteps the Instrumental Convergence Thesis. They call it **BoMAI**, Boxed Myopic AI. The **Bo** part says it’s designed to (after suitable training) only care about its immediate vicinity (∴ not try to influence far away things). The **M** part says it (after suitable training) only cares about its immediate future (∴ does not plan too far ahead). It nevertheless learns to make decisions at least as intelligent as a human’s.

The authors make clear that

> Like existing algorithms for AGI, BoMAI is not remotely tractable.

BoMAI is not meant to be implemented tomorrow. It is a formal construct meant to inform the design of safe future Artificial General Intelligence (AGI) systems. The formal results in this paper about the safety of BoMAI imply that incorporating features from BoMAI into future agents could make them more likely to be safe.

With that out of the way, let’s look at how the thing works.

# The Physical Setup

BoMAI lives on a computer in a room opaque to the outside world. The room should be designed such that no information can cross through its walls. There is one door, which a human operator can use to enter and exit.

Like in any RL context, the algorithm goes through episodes. An episode begins when the human enters the room, and ends when the door is opened for the human to exit. The mechanism of the door is hooked up to the computer to ensure the episode ends the moment the door opens. See fig. 1 from the paper.

The algorithm is designed so that the AI takes actions that only maximize the rewards it receives *during the current episode*. It therefore theoretically has no motive to do long-term planning. It also theoretically has no motive to influence the outside world, since the only way it can influence the outside world is via the human leaving the room, but by the time the human leaves the room the episode is over and the rewards are set in stone.

# The Algorithm

BoMAI is a Bayesian reinforcement learner. It goes through a number of episodes, each containing a fixed number of timesteps. At every timestep *j* of every episode *i*:

- It chooses an action *a* from some finite set of possible actions 𝓐.
- It takes in some observation of its environment *o* from a set of possible observations 𝓞.
- It receives a reward *r* from a set of possible rewards 𝓡 (which is basically just the reals).

We define the *interaction history* as the sequence of all actions, observations and rewards up until the current timestep. It belongs to the set of all possible interaction histories 𝓗*.

**Learning**

The “environment” *ν* can be viewed as a stochastic function that takes in an interaction history and an action, and returns a probability distribution over observations and rewards: *ν* : 𝓗*×𝓐 ⇝ 𝓞×𝓡. Each timestep is then an evaluation of this function *ν*. The ⇝ means stochastic function: instead of mapping an element of 𝓗*×𝓐 to an element of 𝓞×𝓡, it maps an element of 𝓗*×𝓐 to a probability distribution over 𝓞×𝓡. Viewing the environment as a stochastic function is an important part of understanding BoMAI.

Similarly, a “policy” for an agent, π, can be defined by π : 𝓗* ⇝ 𝓐. Given an interaction history, it defines a probability distribution over actions the agent will take.
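To make the "stochastic function" idea concrete, here is a rough Python sketch. Everything in it is my own toy illustration, not something from the paper: the type names, the `toy_env` environment, the `uniform_policy`, and the action names "work" and "rest" are all made up. The point is just that both *ν* and π return a distribution (here a dict from outcome to probability), and one timestep is a sample from each.

```python
import random
from typing import Callable, Dict, List, Tuple

Action = str
Observation = str
Reward = float
Step = Tuple[Action, Observation, Reward]
History = List[Step]

# A stochastic function returns a distribution (outcome -> probability),
# not a single outcome.
Environment = Callable[[History, Action], Dict[Tuple[Observation, Reward], float]]
Policy = Callable[[History], Dict[Action, float]]


def sample(dist: Dict, rng: random.Random):
    """Draw one outcome from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def toy_env(history: History, action: Action):
    # Hypothetical toy environment: "work" usually pays off, "rest" never does.
    if action == "work":
        return {("done", 1.0): 0.8, ("stuck", 0.0): 0.2}
    return {("idle", 0.0): 1.0}


def uniform_policy(history: History):
    # Hypothetical policy that ignores the history entirely.
    return {"work": 0.5, "rest": 0.5}


def step(env: Environment, policy: Policy, history: History, rng: random.Random) -> History:
    """One timestep: sample an action, then sample (observation, reward)."""
    action = sample(policy(history), rng)
    observation, reward = sample(env(history, action), rng)
    return history + [(action, observation, reward)]
```

Calling `step` repeatedly, starting from an empty list, grows an interaction history one (action, observation, reward) triple at a time.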

BoMAI is a Bayesian learner. This means that it keeps track of a probability distribution over all possible world models (beliefs of what *ν* is), and at each timestep updates this distribution according to Bayes’ rule, after seeing the input from 𝓗*×𝓐 and the resulting output from 𝓞×𝓡.
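As a hand-wavy illustration of that update, here is a minimal Python sketch with just two candidate world models. The models `optimistic` and `pessimistic`, and the function `bayes_update`, are hypothetical names I made up; the real model class is unimaginably larger. Each model's weight is multiplied by the probability it assigned to what actually happened, then everything is renormalized.

```python
def optimistic(history, action):
    # Hypothetical world model: working almost always earns a reward.
    if action == "work":
        return {("done", 1.0): 0.9, ("stuck", 0.0): 0.1}
    return {("idle", 0.0): 1.0}


def pessimistic(history, action):
    # Hypothetical world model: working rarely earns a reward.
    if action == "work":
        return {("done", 1.0): 0.1, ("stuck", 0.0): 0.9}
    return {("idle", 0.0): 1.0}


def bayes_update(weights, models, history, action, outcome):
    """Posterior weight is proportional to prior weight times likelihood."""
    unnormalized = {
        name: w * models[name](history, action).get(outcome, 0.0)
        for name, w in weights.items()
    }
    total = sum(unnormalized.values())
    if total == 0.0:
        raise ValueError("outcome has zero probability under every model")
    return {name: w / total for name, w in unnormalized.items()}


models = {"optimistic": optimistic, "pessimistic": pessimistic}
prior = {"optimistic": 0.5, "pessimistic": 0.5}

# The agent sees that "work" led to ("done", 1.0) and updates its beliefs.
posterior = bayes_update(prior, models, [], "work", ("done", 1.0))
```

With these numbers the weight on `optimistic` moves from 0.5 to 0.45 / (0.45 + 0.05) = 0.9.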

It also keeps a probability distribution over all possible *πʰ* ∈ 𝓟, where *πʰ* is the policy of the human operator. The need to also model this will become clear.

To do any Bayesian inference as described above, one needs to plug in a set of possible world models 𝓜 and possible human policies 𝓟. For an AGI, these have to be huge. I’ll address how to define these in the *Model Class and Priors* section below.

**Policy**

How do we build the policy for the agent? In RL the policy must achieve two things, *exploration* (get into lots of different situations to learn from) and *exploitation* (act to maximize the reward). Every episode is either an exploration episode or an exploitation episode.

In an **exploration episode**, all actions are taken by the human operator (so each action comes from *πʰ*). The AI is passive, only observing 𝓗*×𝓐, 𝓞×𝓡 at each timestep and using this to learn its probability distribution over 𝓜 and 𝓟.

In an **exploitation episode**, the AI is in control: each action comes from its own policy *π\**. It chooses *π\** to maximize the expected value of the sum over rewards for this episode (& this episode *only*) given its probability distribution over world models.
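Here is a toy Python sketch of what "maximize expected reward for this episode only" could look like when there are just two candidate world models and two actions. This is my own simplification (a brute-force expectimax search over a weighted mixture of models), and `model_a`, `model_b`, `mixture`, `reweight` and `plan` are hypothetical names, not the paper's construction. The thing to notice is the base case: once the episode has no steps left, the value is zero, so nothing that happens afterwards can matter to the plan.

```python
def model_a(history, action):
    # Hypothetical world model: "work" pays off 80% of the time.
    if action == "work":
        return {("ok", 1.0): 0.8, ("fail", 0.0): 0.2}
    return {("idle", 0.0): 1.0}


def model_b(history, action):
    # Hypothetical world model: "work" pays off 20% of the time.
    if action == "work":
        return {("ok", 1.0): 0.2, ("fail", 0.0): 0.8}
    return {("idle", 0.0): 1.0}


def mixture(weights, models, history, action):
    """Predictive distribution over (observation, reward) under the weighted models."""
    dist = {}
    for name, w in weights.items():
        for outcome, p in models[name](history, action).items():
            dist[outcome] = dist.get(outcome, 0.0) + w * p
    return dist


def reweight(weights, models, history, action, outcome):
    """Bayes update of the model weights after seeing one outcome."""
    raw = {n: w * models[n](history, action).get(outcome, 0.0) for n, w in weights.items()}
    total = sum(raw.values())
    return {n: v / total for n, v in raw.items()}


def plan(weights, models, actions, history, steps_left):
    """Return (best expected reward for the rest of this episode, best first action)."""
    if steps_left == 0:
        return 0.0, None  # rewards after the episode ends count for nothing
    best_value, best_action = float("-inf"), None
    for action in actions:
        value = 0.0
        for outcome, p in mixture(weights, models, history, action).items():
            if p == 0.0:
                continue
            _, reward = outcome
            new_weights = reweight(weights, models, history, action, outcome)
            future, _ = plan(new_weights, models, actions,
                             history + [(action, *outcome)], steps_left - 1)
            value += p * (reward + future)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


models = {"a": model_a, "b": model_b}
weights = {"a": 0.5, "b": 0.5}
value, first_action = plan(weights, models, ["work", "rest"], [], steps_left=2)
```

The search blows up exponentially with episode length even in this tiny example, which gives a small taste of why the real thing is intractable.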

It chooses between an exploration and an exploitation episode as follows. It enters an exploration episode if there is likely a lot of new information to be gained. More precisely: it uses its current distribution over 𝓜×𝓟 to predict the events of a potential exploration episode, and uses this to predict what its distribution over 𝓜×𝓟 would become. Then it computes the *Kullback–Leibler divergence* between its initial and final probability distributions over 𝓜×𝓟, a go-to measure of the difference between probability distributions. It then generates a random number *ε* between 0 and 1, and enters an exploration episode if *ε* is less than the Kullback–Leibler divergence times a tunable parameter *η*.
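The exploration rule above can be sketched in a few lines of Python. Treat this as my reading of the blurb, not the paper's exact definition: I take the divergence as KL(predicted posterior ‖ current posterior), I assume the current distribution gives nonzero weight to everything the predicted one does, and I cap the probability at 1. The function names and the example numbers are made up.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) in nats; p and q are dicts over the same keys, q must cover p's support."""
    return sum(pv * math.log(pv / q[k]) for k, pv in p.items() if pv > 0.0)


def exploration_probability(predicted_posterior, current_posterior, eta):
    """Probability of exploring: eta times the information gain, capped at 1."""
    gain = kl_divergence(predicted_posterior, current_posterior)
    return min(1.0, eta * gain)


def should_explore(predicted_posterior, current_posterior, eta, rng):
    """Draw epsilon uniformly from [0, 1) and explore if it falls below the threshold."""
    epsilon = rng.random()
    return epsilon < exploration_probability(predicted_posterior, current_posterior, eta)


# Hypothetical beliefs over two world models, before and after an imagined
# exploration episode.
current = {"a": 0.5, "b": 0.5}
predicted = {"a": 0.9, "b": 0.1}

gain = kl_divergence(predicted, current)
p_explore = exploration_probability(predicted, current, eta=0.5)
```

If the imagined episode would not change the beliefs at all, the divergence is zero and the agent never chooses to explore.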

**Properties**

Given this algorithm, the authors proved 4 theorems about how BoMAI would behave after an infinite number of episodes.

*Theorem 1: Limited Exploration.* In expectation, the sum of probabilities of entering an exploration episode over an infinite number of episodes is finite. In other words, BoMAI must eventually totally stop exploration episodes and only do exploitation episodes. The intuitive picture is that as the AI learns more and more about the environment and human policy, there are fewer and fewer new things left for it to learn, eventually zero things.

*Theorem 2: On-Human-Policy Optimal Prediction.* After ∞ episodes, the outcome with the maximum probability according to BoMAI’s distribution over 𝓜 is always the true outcome of exploration episodes. This goes hand in hand with theorem 1: once it has nothing new to learn from exploration, its predictions of exploration episodes are always right.

*Theorem 3: On-Star-Policy Optimal Prediction.* The same as theorem 2 applied to exploitation episodes. This follows from theorems 1 & 2.

*Theorem 4: Human Level Intelligence.* After ∞ episodes, the policy of BoMAI will always result in an overall reward for each episode *at least* as high as the human operator’s policy. This follows from theorems 1–3.

So eventually the human operator is no longer needed, and the AI can do just as well as the human at the given task.

The paper offers some intuition for why BoMAI will head towards this limit reasonably fast. Imagine for example that the task is to develop a new cancer treatment. In exploration episodes, the human does a bunch of research, calling up textbooks and research articles on the subject. The human incorporates all of the information as best they can, and attempts to use it to come up with a new treatment. BoMAI takes in all the information the human does, but unlike the human, BoMAI is a Bayesian agent with a photographic memory. It maintains a mathematically rigorous set of beliefs about the subject. By Dutch book arguments, the actions BoMAI takes should have a better outcome than the human’s.

**A Potential Problem**

The setup is designed such that **the reward for episode i is not causally connected to events outside the box that were caused by actions of BoMAI on episode i**. This is meant to ensure that BoMAI has no motivation to affect the outside world. Have a look at fig. 2 from the paper. This is true of the real world, but unfortunately, this setup does not guarantee that BoMAI *believes* this to be the case, even if it can perfectly predict outcomes as detailed in theorems 1–4.

It says in the paper that

> We mentioned just now that by the time the door to the room opens, the rewards from episode i are set in stone. In fact, they are set in silicon.

BoMAI could develop the following world model, which is perfectly consistent with what it observes. It could believe that it is possible to influence the human to take some action after they leave the box that causes the recorded reward for the current episode to be modified. This would give BoMAI cause to attempt to affect the outside world.

More generally, we want to prevent BoMAI from believing that it can influence the outside world in such a way that the outside world influences the reward of the current episode. This can be dealt with by a suitable choice of 𝓜 and priors over 𝓜.

# Model Class and Priors

So, let’s look at what 𝓜 actually looks like. The paper defines 𝓜 in a way inspired by *Solomonoff’s theory of inductive inference*. Recall 𝓜 is a family of stochastic functions 𝓗*×𝓐 ⇝ 𝓞×𝓡. Define 𝓜 to be the family of all possible programs that take in an element of 𝓗*×𝓐 and output an element of 𝓞×𝓡. Now you see why this is an intractable algorithm.

The priors are designed to favour world models in which **the outside world is frozen until the reward for that episode stops changing**. The outside world is allowed to change in the time between the reward no longer changing and the next episode. This solves the problem above: the AI cannot believe that it can affect the outside world in a useful way, since it cannot affect the outside world until it’s too late.

To achieve this the prior favours programs in 𝓜 that rely on a finite amount of memory. And the smaller the memory it relies on, the higher weight it gets in the prior. The assumption is that the program would need a relatively large amount of memory to model a changing outside world, in comparison to just modelling changes inside the box. The programs with small memory that survive the learning process must use their precious memory to model the interior of the box so that they can accurately predict outcomes inside the box.

The general assumption, referred to as the *Space Requirements Assumption*, is that the required memory for models grows with the size of the environment. Seems reasonable to me.

To more precisely define the prior: the prior probability of a world model program that requires memory of length *l* is proportional to *βˡ*, where *β* ∈ [0,1] is a tunable parameter.
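To get a feel for how hard this prior leans on small-memory models, here is a tiny Python sketch. The model names and memory lengths are invented for illustration, and I restrict *β* to strictly between 0 and 1 in the code, since at the endpoints the penalty either vanishes or wipes out every model with nonzero memory.

```python
def memory_penalized_prior(memory_lengths, beta):
    """Prior weight proportional to beta ** l, normalized over the listed models."""
    if not 0.0 < beta < 1.0:
        raise ValueError("this sketch assumes 0 < beta < 1")
    raw = {name: beta ** length for name, length in memory_lengths.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}


# Hypothetical memory requirements: modelling the outside world as well as
# the box is assumed to need more memory than modelling the box alone.
memory_lengths = {"box_only": 2, "box_and_outside_world": 5}

prior = memory_penalized_prior(memory_lengths, beta=0.5)
```

With *β* = 0.5 the box-only model gets weight 0.25 / (0.25 + 0.03125) = 8/9, and shrinking *β* tilts the prior even further towards it.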

The paper goes on to prove a final theorem:

*Theorem 5: Eventual Benignity.* With suitably tuned *β*, there exists some episode after which BoMAI assigns zero probability to world models that are not benign.

A benign world model is defined to be one in which it is impossible for the AI to influence the outside world in a way that influences the reward for the current episode.

# Conclusion

BoMAI is a formal algorithm + physical setup that is provably eventually unambitious and eventually at least as intelligent as a human. It may become useful in the future, when practical methods of creating AGI become known. Pretty sick.