In general, any fixed system will be aligned with different values at different capacities. The only limit on this phenomenon is that some values are incompatible, in the sense that effectively pursuing U is bad for V. In that case, any system aligned with and informed about U is not aligned with V.
The bootstrapped system is not intended to be “fully” aligned in any sense, just to be aligned at a higher capacity than the system it started from. That is, the decisions of B should serve the user’s interests more effectively than the decisions of A do. This is what I’m arguing for here.
At low capacities, agents can’t effectively distinguish between similar values U and V, so being aligned with U automatically implies being aligned with V, unless the two are distinguished by the environment (and the dependence on the environment is simple enough for the low-capacity agent to understand).
The definition of “aligned” covers being optimally motivated but not optimally informed: if there is no information available to distinguish some incompatible U and V, then being aligned with U is intended to mean pursuing some compromise between them, since that is the best you could do if you wanted to maximize U but weren’t told anything else. This is certainly very fuzzy and incompletely specified (and you could read my description either way).
In this case, the agent does have access to lots of extra information, since it can consult the user. So leaving out information doesn’t seem like a big deal, provided the weak agents have enough information to know that deferring to the user is aligned with U.
I don’t know if it matters whether there are many actions that “look aligned to us.”