Self-Improving AI: The Future of Computation
November 1st, 2007
Today’s software has been criticized for being buggy and insecure, both too expensive and time-consuming to create. As computers become more complex and parallel, today’s development paradigm appears increasingly incapable of matching the pace of accelerating technological change. Steve Omohundro of Self-Aware Systems describes in his October 24, 2007 Stanford University Computer Systems Colloquium a new approach to “software synthesis,” in which artificially intelligent machines take over many of the tasks of software development. The approach is based on “self-improving systems” which improve themselves by learning from their own operation. These same systems have the potential to develop radically improved hardware based on nanotechnology, leading to profound technological and social consequences.
The following transcript of Steve Omohundro’s EE380 Computer Systems Colloquium delivered at Stanford University October 24, 2007 has been edited for clarity by the author.
Self-Improving AI: The Future of Computation
We’re going to cover a lot of territory today and it may generate some controversy. I’m happy to take short questions while we’re going through it, but let’s hold the more controversial ones until the end.
Let’s start by looking at the state of today’s computer software. On June 4th, 1996, an Ariane 5 rocket worth $500 million blew up 40 seconds after takeoff. It was later determined that this was caused by an overflow error in the flight control software as it tried to convert a 64-bit floating point value into a 16-bit signed-register.
In November 2000, 28 patients were over-irradiated in the Panama City National Cancer Institute. 8 of these patients died as a direct result of the excessive radiation. An error in the software which computes the proper radiation dose was responsible for this tragedy.
On August 14, 2003, the largest blackout in U.S. history shut off power for 50 million people in the Northeast and in Canada and caused financial losses of over $6 billion. The cause turned out to be a race condition in the General Electric software that was monitoring the systems.
Microsoft Office is used on 94% of all business computers in the world and is the basis for many important financial computations. Last month it was revealed that Microsoft Excel 2007 gives the wrong answer when multiplying certain values together.
As of today, the Storm Worm trojan is exploiting a wide range of security holes and is sweeping over the internet and creating a vast botnet for spam and denial of service attacks. There is some controversy about exactly how many machines are currently infected, but it appears to be between 1 and 100 million machines. Some people believe that the Storm Worm Botnet may now be the largest supercomputer in the world.
We had a speaker last quarter who said that two out of three personal computers are infected by malware.
Wow! Amazing! Because of the scope of this thing, many researchers are studying it. In order to do this, you have to probe the infected machines and see what’s going on. As of this morning, it was announced that apparently the storm worm is starting to attack back! When it detects somebody trying to probe it, it launches a denial of service attack on that person and knocks their machine off the internet for a few days.
If mechanical engineering were in the same state as software engineering, nobody would drive over bridges. So why is software in such a sorry state? One reason is that software is getting really, really large. The NASA space shuttle flight control software is about 1.8 million lines of code. Sun Solaris is 8 million lines of code. Open Office is 10 million lines of code. Microsoft Office 2007 is 30 million lines of code. Windows Vista is 50 million lines of code. Linux Debian 3.1 is 215 million lines of code if you include everything.
But programmers are still pretty darn slow. Perhaps the best estimation tool available is Cocomo II. They did empirical fits to a whole bunch of software development projects and they came up with a simple formula to estimate the number of person months required to do a project. It has a few little fudge factors for how complex the project is and how skilled the programmers are. Their website has a nice tool where you can plug in the parameters of your project and see the projections. For example, if you want to develop a 1million line piece of software today, it will take you about 5600 person months. They recommend using 142 people working for three years at a cost of $89 million. If you divide that out you discover that average programmer productivity for producing working code is about 9 lines a day.
Why are we so bad at producing software? Here are a few reasons I’ve noticed in my experience. First, people aren’t very good at considering all the possible execution paths in a piece of code, especially in parallel or distributed code. I was involved in developing a parallel programming language called pSather. As a part of its runtime, there was a very simple snippet of about 30 lines of code that fifteen brilliant researchers and graduate students had examined over a period of about six months. Only after that time did someone discover a race condition in it. A very obscure sequence of events could lead to a failure that nobody had noticed in all that time. That was the point at which I became convinced that we don’t want people determining when code is correct.
Next, it’s hard to get large groups of programmers to work coherently together. There’s a classic book The Mythical Man Month that argues that adding more programmers to a project often actually makes it last longer.
Next, when programming with today’s technology you often have to make choices too early. You have to decide on representing a certain data structure as a linked list or as an array long before you know enough about the runtime environment to know which is the right choice. Similarly, the requirements for software are typically not fixed, static documents. They are changing all the time. One of the characteristics of software is that very tiny changes in the requirements can lead to the need for a complete reengineering of the implementation. All these features make software a really bad match with what people are good at.
The conclusion I draw is that software should not be written by people! Especially not parallel or distributed software! Especially not security software! And extra especially not safety-critical software! So, what can we do instead?
The terms “software synthesis” and “automatic programming” have been used for systems which generate their own code. What ingredients are needed to make the software synthesis problem well-defined? First, we need a precisely-specified problem. Next, we need the probability distribution of instances that the system will be asked to solve. And finally, we need to know the hardware architecture that the system will run on. A good software synthesis system should take those as inputs and should produce provably correct code for the specified problem running on the specified hardware so that the expected runtime is as short as possible.
There are a few components in this. First, we need to formally specify what the task is. We also need to formally specify the behavior of the hardware we want to run on. How do we do that? There are a whole bunch of specification languages. I’ve listed a few of them here. There are differences of opinion about the best way to specify things. The languages generally fall into three groups corresponding to the three approaches to providing logical foundations for mathematics: set theory, category theory, and type theory. But ultimately first-order predicate calculus can model all of these languages efficiently. In fact, any logical system which has quickly checkable proofs can be modeled efficiently in first-order predicate calculus, so you can view that as a sufficient foundation.
The harder part, the part that brings in artificial intelligence, is that many of the decisions that need to made in synthesizing software have to be made in the face of partial knowledge. That is, the system doesn’t know everything that is coming up and yet has to make choices. It has to choose which algorithms to run without necessarily knowing the performance of those algorithms on the particular data sets that they are going to be run on. It has to choose what data structures to model the data with. It has to choose how to assign tasks to processors in the hardware. It has to decide how to assign data to storage elements in the hardware. It has to figure out how much optimization to do and where to focus that optimization. Should it compile the whole thing at optimization -05? Or should it highly optimize only the parts that are more important? How much time should it spend actually executing code versus planning which code to execute? Finally, how should it learn from watching previous executions?
The basic theoretical foundation for making decisions in the face of partial information was developed back in 1944 by von Neumann and Morgenstern. Von Neumann and Mergenstern dealt with situations in which there are objective probabilities. In 1954, Savage and in 1963, Anscombe and Aumann extended that theory to dealing with subjective probabilities. It has become the basis for modern microeconomics. The model of a rational decision-maker that the theory gives rise to is sometimes called “Homo economicus.” This is ironic because human decision-making isn’t well described by this model. There is a whole branch of modern economics devoted to studying what humans actually do called behavioral economics. But we will see that systems which self-improve will try to become as close as possible to being rational agents because that is how they become the most efficient.
What is rational economic behavior? There are several ingredients. First, a rational economic agent represents its preferences for the future, by a real valued utility function U. This is defined over the possible futures, and it ranks them according to which the system most prefers. Next, a rational agent must have beliefs about what the current state of the world is and what the likely effects of its actions are. These beliefs are encoded in a subjective probability distribution P. The distribution is subjective because different agents may have a different view of what the truth is about the world. How does such an agent make a decision? It first determines the possible actions it can take. For each action, it considers the likely consequences of that action using its beliefs. Then it computes the expected utility for each of the actions it might take and it chooses the action that maximizes its expected utility. Once it acts, it observes what actually happens. It should then update its beliefs using Bayes’ theorem.
In the abstract, it’s a very simple prescription. In practice, it is quite challenging to implement. Much of what artificial intelligence deals with is implementing that prescription efficiently. Why should an agent behave that way? The basic content of the expected utility theorem of von Neumann, Anscombe and Aumann is that if an agent does not behave as if it maximizes expected utility with respect to some utility function and some subjective probability distribution, then it is vulnerable to resource loss with no benefit. This holds both in situations with objective uncertainties, such as roulette wheels, where you know the probabilities, and in situations with subjective uncertainties, like horse races. In a horse race, different people may have different assessments of probabilities for each horse winning. It is an amazing result that comes out of economics that says a certain form of reasoning is necessary in order to be an effective agent in the world.
How does this apply to software? Let’s start by just considering a simple task. We have an algorithm that computes something, such as sorting a list of numbers, factoring a polynomial, or proving theorems. Pick any computational task that you’d like. In general there is a trade-off between space and time. Here, let’s just consider the trade-off between the size of the program and the average execution time of that program on a particular distribution of problem instances. In economics this curve defines what is called the production set. All these areas above the curve are computational possibilities, whereas those below the curve are impossible. The curve defines the border between what is possible and what is impossible. The program which is the most straightforward implementation of the task lies somewhere in the middle. It has a certain size and a certain average execution time. By doing some clever tricks, say by using complex data compression in the program itself, we can shrink it down a little, but then uncompressing at runtime will make it a little bit slower on average. If we use really clever tricks, we can get down to the smallest possible program, but that costs more time to execute.
Going in the other direction, which is typically of greater interest, because space is pretty cheap, we give the program more space in return for getting a faster execution time. We can do things like loop unrolling, which avoids the some of the loop overhead at the expense of having a larger program. In general, we can unfold some of the multiple execution paths, and optimize them separately, because then we have more knowledge of the form of the actual data along each path. There are all sorts of clever tricks like this that compilers are starting to use. As we get further out along the curve, we can start embedding the answers to certain inputs directly in the program. If there are certain inputs that recur quite a bit, say during recursions, then rather than recomputing them each time, it’s much better to just have those answers stored. You can do that at runtime with the technique of memoization, or you can do it at compile time and actually store the answers in the program text. The extreme of this is to take the entire function that you are trying to compute and just make it into a big lookup table. So program execution just becomes looking up the answer in the table. That requires huge amounts of space but very low amounts of time.
What does this kind of curve look like in general? For one thing, having more program size never hurts, so it’s going to be a decreasing (or more accurately non-increasing) curve. Generally the benefit we get by giving a program more space decreases as it gets larger, so it will have a convex shape. This type of relationship between the quantities we care about and the resources that we consume, is very common.
Now let’s say that now we want to execute two programs as quickly as possible. We can take the utility function to be the negative of total execution time. We’d like to maximize that while allocating a certain amount of fixed space S between these two programs. How should we do that? We want to maximize the utility function subject to the constraint that the total space is S. If we take the derivative with respect to the space we allocate to the first program and set that to zero, we find that at optimal space allocation the two programs will have equal marginal speedup. If we give them a little bit more space, they each get faster at the same rate. If one improved more quickly, it would be better to give it more space at the expense of the other one. So a rational agent will allocate the space to make these two marginal speedups equal. If you’ve ever studied thermodynamics you’ve seen similar diagrams where this is a piston between two gases. In thermodynamics, this kind of argument shows that the pressure will become equilibrated between the chambers. It’s a very analogous kind of a thing here.
That same argument applies in much greater generality. In fact it applies to any resource that we can allocate between subsystems. We have been looking at program size, but you can also consider how much space the program has available while it is executing. Or how to distribute compilation time to each component. Or how much time should be devoted to compressing each piece of data. Or how much learning time should be devoted to each learning task. Or how much space should be allocated for each learned model. Or how much meta-data about the characteristics of programs should be stored. Or how much time should you spend proving different theorems. Or which theorems are worthy of storing and how much effort should go into trying to prove them. Or what accuracy should each computation be performed at. The same kind of optimization argument applies to all of these things and shows that at the optimum the marginal increase of the expected utility as a result of changing any of these quantities for every module in the system should be the same. So we get a very general “Resource Balance Principle”.
While that sounds really nice in theory, how do we actually build software systems that do all this? The key insight here is that meta-decisions, decisions about your program, are themselves economic decisions. They are choices that you have to make in the face of uncertain data. So a system needs to allocate its resources between actually executing its code and doing meta-execution: thinking about how it should best execute and learning for the future.
You might think that there could be an infinite regress here. If you think about what you are going to do, and then think about thinking about what you are going to do, and then think about thinking about thinking about what you are going to do… but, in fact, it bottoms out. At some point, actually taking an action has higher expected utility than thinking about taking that action. It comes straight out of the underlying economic model that tells you how much thinking about thinking is actually worthwhile.
Remember I said that in the software synthesis task, the system has to know what the distribution of input instances are. Generally, that’s not something that is going to be handed to it. It will just be given instances. But that’s a nice situation in which you can use machine learning to estimate the distribution of problem instances. Similarly, if you are handed a machine, you probably need to know the semantics of the machine’s operation. You need to know what the meaning of a particular machine code is, but you don’t necessarily have to have a precise model of the performance of that machine. That’s another thing that you can estimate using machine learning: How well does your cache work on average when you do certain kinds of memory accesses? Similarly, you can use machine learning to estimate expected algorithm performance.
So now we have all the ingredients. We can use them to build what I call “self-improving systems.” These are systems which have formal models of themselves. They have models of their own program, the programming language they’re using, the formal logic they use to reason in, and the behavior of the underlying hardware. They are able to generate and execute code to solve a particular class of problems. They can watch their own execution and learn from that. They can reason about potential changes that they might make to themselves. And finally they can change every aspect of themselves to improve their performance. Those are the ingredients of what I am calling a self-improving system.
You might think that this is a lot of stuff to do, and in fact it is quite a complex task. No systems of this kind exist yet. But there are at least five groups that I know of who are working on building systems of this ilk. Each of us has differing ideas about how to implement the various pieces.
There is a very nice theoretical result from 2002 by Marcus Hutter that gives us an intellectual framework to think about this process. His result isn’t directly practical, but it is interesting and quite simple. What he showed is that there exists an algorithm which is asymptotically within a factor of five of the fastest algorithm for solving any well-defined problem. In other words, he has got this little piece of code in theory and you give me the very best algorithm for solving any task you like, and his little piece of code if you have a big enough instance asymptotically will run within a factor of five of your best code. It sounds like magic. How could it possibly work? The way it works is that the program interleaves the execution of the current best approach to solving the problem with another part that searches for a proof that something else is a better approach. It does the interleaving in a clever way so that almost all of the execution time is spent executing the best program. He also shows that this program is one of the shortest programs for solving that problem.
That gives us the new framework for software. What about hardware? Are there any differences? If we allow our systems to not just try and program existing hardware machines but rather to choose the characteristics of the machines they are going to run on, what does that look like? We can consider the task of hardware synthesis in which, again, we are given a formally specified problem. We are also again given a probability distribution over instances of that problem that we would like it to solve, and we are given an allowed technology. This might be a very high level technology, like building a network out of Dell PCs to try and solve this problem, or it might go all the way down to the very finest level of atomic design. The job of a hardware synthesis system is to output a hardware design together with optimized software to solve the specified problem.
When you said “going down to a lower level” like from Dell PCs, did you mean to the chip level?
Yes, you could design chips, graphics processors, or even, ultimately, go all the way down to the atomic level. All of those are just differing instances of the same abstract task.
Continued in “Self-Improving AI: Social Consequences.”