An Introduction to Inner Alignment

Published in

Warwick Artificial Intelligence

6 min readOct 2, 2023

An article by Sai Nagaratnam

An introduction to inner alignment of artificial intelligence with no assumed knowledge, largely based on the paper ‘Risks from Learned Optimization in Advanced Machine Learning Systems’ by Hubinger et al.

AI alignment is the subfield of artificial intelligence (AI) that deals with ensuring that AI systems do what humans want. The notion of AI “going out of control” brings to mind films like Terminator or Age of Ultron, but we shouldn’t dismiss this concern just because it’s a sci-fi trope. As AI systems become more advanced and influential in the real world — for example, by integrating with our mobile phones, governments, and financial institutions — we need to take greater precautions.

People tend to talk about a particular way that AI can go wrong, where the programmer’s intentions differ from the goal that they give an AI system. These are called failures of outer alignment. The classic example is the paperclip maximiser: say we give an AI the task of making as many paperclips as possible. Then, the AI may simply tear everything apart, using every atom on the planet to make paperclips. However, the instruction “make as many paperclips as possible” doesn’t quite capture the programmer’s intention; it doesn’t account for things that the programmer cares about, like their family, ethics or human civilisation. Alignment failures of this kind echo stories that impart the lesson to “be careful what you wish for” like those of genies or King Midas.

In this article, however, I want to talk about a different kind of alignment failure. It’s possible that the goal we give the AI perfectly captures our intentions, but the AI still ends up doing things that we don’t want it to. Specifically, I’m talking about when the AI system that we optimise for some goal is running its own optimisation process, with a different subgoal. In such cases, we say the system is not inner aligned [1]. Let’s explore some of the details.

What is an Optimiser?

An optimiser is something that searches through possible solutions to a problem, looking for solutions that score well on some metric. For example, a maze-solving optimiser looks for paths through the maze that actually lead to the end. Inner alignment becomes a concern when an optimiser creates another optimiser. When we create a maze-solving AI, we optimise it to find a quick general method of solving mazes; but that quick method, in turn, is trying to find a path through a particular maze*.

Why might an optimiser create another optimiser? Among the reasons suggested in the paper is better performance in diverse environments. This can be seen in the maze-solving example. The main optimiser must develop policies that apply generally, whereas the sub-optimiser** must only find the best action in a particular task. Creating a sub-optimiser can help an AI respond to varied problems, without getting bogged down in overly complex problem-specific rules.

But even if an optimiser creates its own optimiser, why might their objectives differ? One reason is that the main optimiser evaluates the sub-optimiser only on its performance and not directly on its objective, which can be hard to discern. This means that the main optimiser might end up with a sub-optimiser that seems like it fulfils the main objective but fails to do so in practice. Perhaps it only behaves like it has that objective while it’s being trained and — due to its training environment being insufficiently realistic — behaves differently in the real world.

Inner Alignment Failures

Much of the work on alignment is theoretical and speculative, but we do have a collection of concrete examples of inner misalignment [3]. Each example demonstrates an AI playing a game and shows it clearly pursuing an objective different to the one it seemed to have in training; for instance, rushing to the end of a level instead of picking up coins to increase its score. These examples are great, and I encourage you to check them out.

Analogies can help to understand inner alignment. One is that of human evolution: evolution optimises humans to survive and procreate. It reinforces this by making certain behaviours pleasurable, such as eating food and engaging in sexual intercourse. But humans are also optimisers themselves — –we observe eating food and having sex is pleasurable, so we invent junk food and birth control. But this isn’t what evolution “intended” for at all!

Evolution’s objective: make organisms that propagate their genes.
Human’s objective: eat food and have sex.

We can also make an analogy to parenting [4]. Parents optimise their children to adhere to certain values, but children often rebel. Children, the scheming optimisers they are, might even pretend to be well-behaved so that their parents will trust them and give them more freedom.

Parent’s objective: raise well-behaved children
Children’s objective: have fun, do ‘cool’ things (and oftentimes lie to parents to achieve this goal)

This last point leads us to perhaps the most important kind of inner alignment failure, known as deceptive alignment. Say that a sub-optimiser is aware that it is being trained to pursue the main optimiser’s objective, which differs from its own. Then it knows that failing to achieve this main objective in training will result in its replacement. Now, it will want to prevent this; after all, how can it achieve its objective if it is replaced? We would therefore expect it to follow the main objective while it is being trained and only pursue its true objective as soon as it has been deployed in the real world.

This is highly concerning. What if automated systems used in the future in infrastructure, law enforcement, and warfare manipulate their training process so that they retain their own objectives? What if they suddenly reveal their objectives once deployed? The results could be catastrophic. (This is not to say that alignment failures are necessarily dramatic or violent: it is possible that human influence on the world simply slowly declines [5].)

Deceptive alignment relies on some assumptions discussed in the paper: for instance, the sub-optimiser needs to understand that it will eventually stop being trained. But it’s worth considering that the lead author regards deceptive alignment as the default outcome of building advanced AI systems using current techniques [6].

Conclusion

The risks of inner misalignment, as with the risks of AI misalignment in general, are increasingly important and dangerously understudied. Ultimately, how we approach these problems could determine whether human values and interests remain at the fore in society’s future, or whether they are supplanted by something alien to us.

References

[1] Evan Hubinger et al. (2021) Risks from Learned Optimization in Advanced Machine Learning Systems v3. [online] Available at: https://arxiv.org/abs/1906.01820

[2] Daniel Filan. (2018) Bottle Caps Aren’t Optimisers. [online] AI Alignment Forum. Available at: https://www.alignmentforum.org/posts/26eupx3Byc8swRS7f/bottle-caps-aren-t-optimisers

[3] Jack Koch and Lauro Langosco. (2021) Empirical Observations of Objective Robustness Failures. [online] AI Alignment Forum. Available at: https://www.alignmentforum.org/posts/iJDmL7HJtN5CYKReM/empirical-observations-of-objective-robustness-failures

[4] Jeremie Harris. (2021) The Inner Alignment Problem. Evan Hubinger on building safe and honest AIs. [online] Towards Data Science. Available at: https://towardsdatascience.com/the-inner-alignment-problem-9eb5f234226b

[5] Paul Christiano. (2019) What failure looks like. [online] AI Alignment Forum. Available at: https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/more-realistic-tales-of-doom

[6] Evan Hubinger. (2022) How likely is deceptive alignment? [online] AI Alignment Forum. Available at: https://www.alignmentforum.org/posts/A9NxPTwbw6r6Awuwt/how-likely-is-deceptive-alignment

*Note that there is a difference between a system that has been optimised and an optimiser. A neural network is not usually an optimiser–it has just been optimised by a process called gradient descent (an introduction to which can be found here). Similarly, a bottle lid [2] is not an optimiser; it does not search through possible ways to keep water in a bottle, but humans have optimised it to do so.

**I went with “sub-optimiser” over the standard “mesa-optimiser” out of convenience; sub-optimisers should not be thought of as subsystems or emergent sub-agents, rather as an algorithm used by the main/base optimiser.

An Introduction to Inner Alignment

Written by Warwick AI