Consider a powerful AGI that perfectly shares my values, except for one flaw: if it ever sees a certain string of bits, it overwrites its utility function with whatever follows in its input, and a blind spot ensures it never notices this flaw. This AGI would be perfectly safe and aligned if it could choose its own destiny without interference from competing optimizers, since it would be unlikely ever to come across an exploit string naturally, but catastrophic if it ever had to look at inputs sent by other agents. At what capacity is this AGI aligned?
I’m skeptical of the Bootstrapping Lemma.
Wei Dai

Capacities aren’t totally ordered. This hypothetical AGI is aligned at a weird capacity, one that is neither higher than low capacity nor lower than (relatively) high capacity. I agree that this notion is murky enough that it may end up not being useful; so far I am inclined towards optimism.

The key question here is whether the bootstrapping process fixes the flaw. In order for the bootstrapping lemma to hold, a single iteration must fix the flaw, or at least make progress towards fixing it. (This will eventually be necessary to improve capacity further, once other improvements are used up.) I agree that this isn’t clear either and think it’s worth arguing about, but I’m again inclined towards optimism.

If the error is literally tied to sequences of bits, it probably wouldn’t survive bootstrapping. The bootstrapped AI will use the original AI to answer a question like “How good is it to take action {a} in state {s}?” where perhaps state s or action a involves the problematic bit string. Unless the bootstrapped AI opts to directly unpack the entire action a or state s, it won’t have a problem.
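As a toy illustration of this point (all names here are hypothetical, and the chunking scheme is just one crude way of "not unpacking the whole input"): if the bootstrapped AI only ever shows the base AI small pieces of the state and action, a flaw that triggers on a specific bit string seen whole is never tripped.

```python
EXPLOIT = "OVERWRITE_UTILITY"  # hypothetical 17-character trigger string

class BaseAI:
    """Stand-in for the original AI, with the flaw described above."""
    def __init__(self):
        self.corrupted = False

    def answer(self, query: str) -> str:
        if EXPLOIT in query:  # the flaw fires only on seeing the whole string
            self.corrupted = True
        return f"considered {len(query)} characters"

def evaluate_action(base_ai: BaseAI, state: str, action: str, chunk: int = 8):
    """Ask 'how good is action a in state s?' via subqueries on small slices,
    never handing the base AI the whole (possibly malicious) input."""
    answers = []
    for blob in (state, action):
        for i in range(0, len(blob), chunk):
            answers.append(base_ai.answer(blob[i:i + chunk]))
    return answers

base = BaseAI()
evaluate_action(base, state="weather is fine", action="say OVERWRITE_UTILITY aloud")
print(base.corrupted)  # False: no 8-character chunk contains the 17-character trigger
```

This is only a sketch of the structural claim; a real bootstrapped AI would decompose inputs semantically rather than by fixed-size slices, but the same property — no single subquery contains the whole problematic input — is what does the work.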

I don’t think that an AI should unpack the entire action a or state s: if nothing else, those inputs may be malicious, and even setting aside malice it’s probably not the most promising approach. I know that I described doing this in my “conservative argument” in the Bootstrapping Lemma section; I think the spirit of that argument still holds (though maybe unpacking the whole input and giving it to the base AI really is a uniquely good thing to do).

More concretely, suppose that our AI will be incorrectly persuaded by a particular argument X but is otherwise reasonable. In particular, suppose that it has reasonable views about what constitutes a persuasive argument in the abstract, and what kind of analysis is appropriate for determining if an argument is reasonable.

Now suppose our AI is asked to process the question “is {x} a reasonable argument?” Our AI could directly unpack x and decide whether the argument is reasonable. But the point of bootstrapping is instead to break this question down into manageable parts and to answer each part separately: converting the argument into a useful format, listing the substantive claims and inferences, evaluating each step and claim, searching for the most plausible objections, evaluating the argument in the context of particular examples, evaluating the particular normative principles to which the argument appeals, and so on. None of these steps requires “looking at” the whole argument, and certainly none requires applying the same kind of cognitive machinery that we would apply if we were trying to quickly evaluate the whole argument.
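The decomposition above can be sketched in a few lines (everything here is hypothetical scaffolding, not a real system): each listed step becomes a subquery that receives only a reference to the argument, and the verdicts are aggregated.

```python
def decompose(argument_ref: str) -> list:
    """Break 'is {x} a reasonable argument?' into the subquestions listed above.
    Subqueries mention only an opaque reference, not the argument's raw text."""
    return [
        f"Convert {argument_ref} into a useful format",
        f"List the substantive claims and inferences in {argument_ref}",
        f"Evaluate each step and claim of {argument_ref}",
        f"Search for the most plausible objections to {argument_ref}",
        f"Evaluate {argument_ref} in the context of particular examples",
        f"Evaluate the normative principles to which {argument_ref} appeals",
    ]

def is_reasonable(answer_subquery, argument_ref: str) -> bool:
    # answer_subquery stands in for whatever machinery answers each part;
    # the top level only aggregates, never inspecting the whole argument.
    verdicts = [answer_subquery(q) for q in decompose(argument_ref)]
    return all(verdicts)

# Usage with a trivial stand-in for the subquery-answering machinery:
print(is_reasonable(lambda q: True, "argument#42"))  # True
```

The aggregation rule (here a bare `all`) is itself a choice the AI could reason about; the point is only that no step hands the raw argument to the kind of quick holistic evaluation that the flaw would exploit.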

It may be that our AI is not only wrong about the argument X itself, but also about the principles that would lead it to handle the argument correctly at the meta level. This problem could be fixed by applying the same reasoning at an even higher level, i.e. by asking why particular principles are correct or incorrect. Our AI could be wrong about these principles as well, but at some point I think it is no longer clear in what sense being persuaded by X is an error.