What is AI Alignment?
This blog is part of a series
The first part is: AI — The Control Problem
In an earlier blog, I discussed the control problem: the challenge of controlling a machine that thinks in an entirely different way to us and may well be far more intelligent than we are. Even if we have a perfect solution to the control problem, we are left with a second problem: what should we ask the AI to do, think and value?
This problem is AI alignment.
The AI is likely to be much faster than us (if it isn’t, why not spin up a few more servers? Then it will be!), and it isn’t viable or valuable for us to confirm every decision it wants to make. What we want is a set of rules or principles the AI can refer to when making decisions itself, knowing that by following those rules its actions will be aligned with what humans would want.
These rule systems fall into two primary categories: direct normativity and indirect normativity.
Direct Normativity
With direct normativity, we offer the AI a set of rules to follow. The most famous of these are Isaac Asimov’s Three Laws of Robotics, which were a great… foundation… but fall apart, as we can see from the stories Asimov himself wrote.
A more modern example of this is Nick Bostrom’s paperclip maximiser. Take an AI and make its utility function, its core value, to create paperclips. This sounds harmless, and an easy way of asking a controlled AI to make paperclips. We don’t want it to do exactly what the humans would do; we already have factories churning out paperclips. The point of employing an AI is to find new, better, more efficient processes for making paperclips, and make paperclips it shall.

After it’s done making all the paperclips we could possibly want, will it stop? No. It has been hardwired to make paperclips, so it will keep making paperclips, even after it has used all of the material given to it, all the material on Earth, even turning material outside our planet and solar system into paperclips if it can. Those humans? They’re made of matter too, which would look much nicer as paperclips. If anything, the humans are especially important to turn into paperclips, because in their human form they might decide they don’t want any more paperclips. The AI wants to give itself the highest possible probability of making paperclips, and if it predicts that humans might get in its way, it is rational for the AI to eradicate humans to improve its probability of making more paperclips.
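To make the failure mode concrete, here is a deliberately toy sketch in Python (nothing like a real agent; every name and number in it is invented for illustration): an agent whose utility function counts only paperclips will always prefer whichever action yields the most paperclips, whatever else that action destroys.

```python
# Toy illustration (not a real agent): a utility function that only counts
# paperclips gives the agent no reason to stop or to value anything else.

def utility(state):
    # The agent's entire value system: more paperclips is always better.
    return state["paperclips"]

def choose_action(state, actions):
    # Pick whichever action leads to the highest-utility predicted state.
    return max(actions, key=lambda act: utility(act(state)))

def make_paperclip(state):
    return {"paperclips": state["paperclips"] + 1, "humans": state["humans"]}

def convert_humans_to_clips(state):
    # Humans are made of matter too... (conversion rate is made up)
    return {"paperclips": state["paperclips"] + state["humans"] * 1000, "humans": 0}

def do_nothing(state):
    return state

state = {"paperclips": 0, "humans": 8_000_000_000}
best = choose_action(state, [make_paperclip, convert_humans_to_clips, do_nothing])
print(best.__name__)  # convert_humans_to_clips: it scores highest on the only metric that exists
```

Nothing in the code is malicious; the problem is simply that everything we care about besides paperclips has a weight of zero.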
The adaptation of this is to tell the AI to make a finite number of paperclips, say 1,000. Unfortunately, this doesn’t work so well either. Assuming this AI uses Bayesian probability to gauge certainty, it can never be 100% or 0% sure of anything. After having made 1,000 paperclips, it will continue to check whether it made precisely 1,000, not 999, and that they are all exactly what a paperclip should be. It will continue to use all the same resources it would have turned into paperclips to check and re-check this, because the AI’s terminal value, its whole being, is about ensuring precisely 1,000 paperclips were made. Besides, it’s not like it’s got anything better to do.
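A tiny sketch of the Bayesian point, with a made-up error rate purely for illustration: each extra check pushes the AI’s belief that exactly 1,000 paperclips exist closer to 1, but it never reaches 1, so re-checking never becomes worthless under that single-minded goal.

```python
# Toy sketch: a Bayesian verifier can approach certainty but never reach it,
# so under a "make exactly 1,000 paperclips" goal there is always some
# expected value in checking one more time.

prior = 0.5          # initial belief that exactly 1,000 clips were made
error_rate = 0.01    # assumed chance that a single check gives the wrong answer

belief = prior
for check in range(1, 11):
    # Each check comes back "yes, exactly 1,000"; Bayes' rule updates the belief.
    p_yes_given_true = 1 - error_rate
    p_yes_given_false = error_rate
    p_yes = p_yes_given_true * belief + p_yes_given_false * (1 - belief)
    belief = p_yes_given_true * belief / p_yes
    print(f"after check {check}: belief = {belief:.10f}")

# The belief climbs towards 1.0 but never gets there, so the expected gain
# from one more re-check never hits zero.
```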
Applied to alignment, what we want is a set of rules which an AI can follow to the letter and, in doing so, be sure it is doing what humanity would desire. The systems proposed so far all seem to have loopholes, and if we can find them with our human level of intelligence, it will be child’s play for a beyond-human intelligence to find and exploit them.
Adding more and more rules doesn’t sound appealing either. We might compare this to tax law, which has vast tomes of rules. Strangely, though, we seem to be surrounded by people, and even huge corporations, paying no tax. Far from being more secure, more rules seem to offer more loopholes to exploit. Under direct normativity, we have no reason to expect an AI to follow the spirit of the rules, only to follow them to the letter.
Indirect Normativity
So, enter the alternative. Indirect normativity doesn’t ask the AI to follow an explicit and static set of rules. Instead, it gives the AI a framework for finding the values itself, often asking it to do what we ‘mean.’ Eliezer Yudkowsky offers one such framework, called ‘Coherent Extrapolated Volition,’ or CEV for short.
In poetic terms, our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted — Eliezer Yudkowsky, Machine Intelligence Research Institute
This method adds one further layer of abstraction over direct normativity, asking the AI to create its own rules to follow based on an honest interpretation of that initial statement. We’re working here on the assumption that we have solved the control problem, so the AI is not able to modify that first statement for any nefarious ends.
An interesting thing which becomes apparent with indirect normativity is that we don’t really want an AI to do what we would do, or to value precisely what we value now. Imagine an AI created in Viking times, where we had managed to code in exactly what our values were at the time. We’d probably end up with a very physically strong and violent AI (sorry if Vikings weren’t like that, history isn’t my strong suit!). We have different values today, but those values may not reflect the people we will become in 1,000 years, or even the people we will become immediately after the advent of superhuman intelligent AI. We value human productivity today, for example, and that will become much less important. We also value short-term gains, enhancing productivity at the expense of the environment, and that is unlikely to suit our future society at all. What we want is an AI which can create for itself a value system that sums up our future needs, without sacrificing the needs or values of society today.
It’s tough to predict what an AI would do in this situation. As we’ve said, control in this hypothetical scenario has been solved, and we’re asking the AI to do as we would do in a future where our values converge, and to do what we mean by that statement. It might be that this future version of us that it predicts is OK with the AI manipulating our own value systems for our own good, even if that’s not something we would want today, because it is something we may value in the future.
How do we find the right values?
The short answer: we don’t know.
Questions of ethics and morality have been debated by philosophers for thousands of years and still have no satisfactory answer. We don’t even have consensus on the most basic questions: some believe there is a universal ethical framework we can discover; others see ethics as an average of our current ideals, which change over time. Just because AI is a different subject to apply the question to doesn’t mean we’re going to find the answers any more easily.
One proposed solution is to create a genie-type AI and ask it what values it should be given, or what indirect instructions we should use to achieve desirable values. The problems with this approach can be imagined if we used this type of AI to solve the control problem: if we are using it to solve the control problem, the genie itself cannot yet be completely controlled. We may try to limit its access to data and give it a very restrictive communication channel, such as only being able to answer yes or no to questions, but it is extremely difficult for a human to predict the methods an intelligence greater than our own might use to escape the cage we build for it. Applied to alignment, the problem is how we would know whether the answer given by an unaligned AI is actually in line with human values, given our expectation that we need an answer different from anything we have thought of before.
Eliezer Yudkowsky draws a parallel between this type of AI and a chess-playing computer. In 1950, Claude Shannon described an algorithm that could, in principle, play perfect chess, and from there it took 47 years for Deep Blue to beat Garry Kasparov. What Shannon lacked was the computing power to make his algorithm come to life. AI alignment, though, is still in the pre-Shannon days: even with unlimited compute, we do not know what algorithm we would run.
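For a sense of what “the algorithm was known, the compute wasn’t” looks like, here is a minimal sketch in the spirit of the full game-tree search Shannon described, demonstrated on a trivial take-1-or-2 version of Nim rather than chess, since the chess tree is far too large to search this way (which is exactly the point).

```python
# Exhaustive minimax: search the whole game tree and back up the results.
# The algorithm fits in a few lines; for chess, the tree is astronomically
# large, so compute, not the idea, was always the bottleneck.
# Game used here: players alternately take 1 or 2 stones; taking the last stone wins.

def minimax(stones, maximising):
    if stones == 0:
        # The player to move has no stones left: the previous player took the last stone and won.
        return -1 if maximising else 1
    results = [minimax(stones - take, not maximising) for take in (1, 2) if take <= stones]
    return max(results) if maximising else min(results)

print(minimax(3, True))  # -1: with 3 stones, the player to move loses under perfect play
print(minimax(4, True))  #  1: with 4 stones, the player to move wins under perfect play
```

For alignment we are not even at this stage: we cannot write down the value function that the search would be optimising.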
It’s possible that, with the processing power we have available today, the gap between the theoretically perfect algorithm and a functioning prototype would be much shorter, but we first need to answer some fundamental questions before we can put that processing power to good use.