Intuitively, Bootstrap(A) will behave “like a copy of A that gets to think for longer.” For example, if I imagine myself in the position of A, it seems clear that Bootstrap(me) is significantly smarter than I am, to roughly the same degree that thinking longer makes me smarter.
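This intuition can be stated concretely. Below is a minimal sketch, in which the Agent interface, the numeric "thinking budget," and the bootstrap wrapper are all assumptions made for illustration; the claim in the text is only that Bootstrap(A) behaves like A given more time to think.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    # Hypothetical interface: maps (question, thinking budget) to an answer.
    answer: Callable[[str, int], str]


def bootstrap(agent: Agent, amplification: int = 10) -> Agent:
    """Return an agent that behaves like `agent`, but spends
    `amplification` times as long thinking about every question."""
    def boosted(question: str, budget: int) -> str:
        return agent.answer(question, budget * amplification)

    return Agent(answer=boosted)
```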

Consider a sequence of humans. The first human is moral but a poor programmer. This human could train the final agent A_N, providing feedback on its choices, but could not reliably train any earlier A_i to serve as an effective B_i, because the A_i trained by this person can't use the programming environment correctly when bootstrapped. The second human is moral and can program, but cannot do any meta-programming. This human can train A_N, and can train A_(N-1) to serve as a good bootstrapped agent for training A_N, but cannot reliably train any earlier agent.

The sequence continues, adding one layer of meta each time.
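To keep the indexing straight, here is a minimal sketch of the sequence being discussed, reusing the Agent and bootstrap definitions from the sketch above. The `train` step is a hypothetical stand-in for whatever procedure produces an agent from an overseer, not ALBA's actual training procedure; the only structure assumed is the one described in the text, namely that B_i = Bootstrap(A_i) oversees the training of A_(i+1).

```python
from typing import Callable, List


def alba_sequence(initial_overseer: Agent,
                  n: int,
                  train: Callable[[Agent], Agent]) -> List[Agent]:
    """Build A_1, ..., A_n.  The initial overseer (a human, in the
    argument above) trains A_1 directly; after that, each
    B_i = Bootstrap(A_i) oversees the training of A_(i+1)."""
    agents = [train(initial_overseer)]       # A_1
    for _ in range(1, n):
        b_i = bootstrap(agents[-1])          # B_i = Bootstrap(A_i)
        agents.append(train(b_i))            # A_(i+1)
    return agents
```

In this picture, the first human above can only reliably stand in as the overseer at the final step, the second at the step before that, and so on; the question is how far back up the sequence a real human's competence extends.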

The point is that, according to this model, humans “run out of meta” after some number of levels. If so, the lemma may hold for several iterations and then fail abruptly, say beyond i = 5.