ALBA: an explicit proposal for aligned AI
Paul Christiano

I’m skeptical of the Bootstrapping Lemma. First, I’m assuming it’s reasonable to think of A1 as a human upload that is limited to one day of subjective time, by the end of which it must have written down any thoughts it wants to save, after which it is reset. Let me know if this is wrong. If we repeatedly apply the Bootstrapping Lemma, it would imply that Bootstrap(Bootstrap(Bootstrap(…(A1)))) is aligned at an arbitrarily high capacity. But that in turn implies that Bootstrap(A1) is aligned at an arbitrarily high capacity, if given sufficient computing power, since A1 can perform the equivalent of recursive bootstrapping within its computing environment. This implies that Bootstrap(A1) can safely train/supervise a superintelligence. But if I were A1, faced with some input generated by a superintelligence and with a powerful computing environment at my disposal, I wouldn’t know how to safely proceed. What if the SI has predicted my plan (or the plan created by a committee of instances of myself) for analyzing the input, found a flaw in that process, and devised an exploit for the flaw? You sort of talk about this in the Malicious Inputs section, but say that you’re not too concerned. I’m not sure why.

More generally, I’m not sure about the usefulness of the “aligned at capacity c” concept. Consider a powerful AGI which perfectly shares my values except that it has a flaw where if it ever sees a certain string of bits, it overwrites its utility function with whatever follows in its input, and it has a blind spot so it never notices this flaw. This AGI would be perfectly safe and aligned if it could choose its own destiny without interference from competing optimizers, since it would be unlikely to ever come across an exploit string naturally, but catastrophic if it ever had to look at inputs sent by other agents. What capacity is this AGI aligned at? If you say high capacity, that seems wrong since it can’t safely supervise other agents. If you say low capacity, then humans also have low capacity, since in analogy with the flawed AI, we can be persuaded in different directions by different philosophical arguments that we see/hear, leading to different values/outcomes than if we try to figure things out by ourselves. And I don’t see how to increase such capacity, short of making a lot of metaphilosophical progress so that we know what kinds of arguments we *should* be persuaded by, or maybe (less likely) solving all the object-level philosophical problems and then building an FAI that is immune to further philosophical arguments.