I agree that I expect Bootstrap(A) to be able to outperform A in some reasonable sense, but I don’t see why that should line up with “be aligned at capacity c’ > c”.

In particular, I’m worried that there may be distinct value sets U and V such that at high enough capacity, aligned behaviour under U and aligned behaviour under V look different, but below a threshold capacity k, a system which is aligned with U at capacity c < k may also count as aligned with V at capacity c. So in trying to bootstrap up, you are always doing at least as well as what you started with, but that doesn’t mean you end up fully aligned at a higher capacity: the bootstrapping process gives you no way to tell which value set you are tracking until capacity exceeds k.
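A minimal toy sketch of this worry (everything below is hypothetical: the action names, payoffs, and threshold are illustrative, not drawn from any formal model in the discussion). Two value functions agree on every action available below a capacity threshold, so a U-optimizer is indistinguishable from a V-optimizer at low capacity, yet they diverge once higher-capacity actions unlock:

```python
# Hypothetical value sets U and V, represented as payoffs over four actions.
# They agree on the low-capacity actions "a" and "b", but rank the
# high-capacity actions "c" and "d" in opposite ways.
def U(action):
    return {"a": 1, "b": 2, "c": 10, "d": 0}[action]

def V(action):
    return {"a": 1, "b": 2, "c": 0, "d": 10}[action]

def available_actions(capacity, k=3):
    # Below the threshold k, only the "easy" actions are reachable.
    return ["a", "b"] if capacity < k else ["a", "b", "c", "d"]

def optimal(value_fn, capacity):
    # The action an agent aligned with value_fn would take at this capacity.
    return max(available_actions(capacity), key=value_fn)

# Below the threshold, U-aligned and V-aligned behaviour coincide,
# so alignment with U is observationally indistinguishable from
# alignment with V:
assert optimal(U, capacity=2) == optimal(V, capacity=2) == "b"

# Above the threshold, the two value sets come apart:
assert optimal(U, capacity=5) == "c"
assert optimal(V, capacity=5) == "d"
```

Bootstrapping from a capacity-2 agent can preserve whichever of U or V it actually encodes, but nothing observable at capacity 2 determines which of the two divergent high-capacity behaviours you will get.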

Of course this depends on the pre-formal notion of ‘aligned at capacity c’, but I can’t immediately spot an interpretation that makes the problem go away. Also note that something like this is a general problem for choosing how a superintelligent agent should act: there may be many quite different actions it could take which would all look aligned to us.