The fact that they can’t actually trace the evaluations down to the object level doesn’t seem to be a fatal problem — probably they shouldn’t have been trying to do that most of the time anyway.
I agree that the meta-level problem is different from the object-level problem.
Paul Christiano
I don’t see the intuition there. My intuition is that they have to trace those consequences to keep things aligned; otherwise it’s just the skill of training an automatic-programmer AI that is capable of training another (which is in turn capable of training another, ad infinitum).
Adding that recursive skill to plain alignment (in a subhuman agent) doesn’t obviously yield recursive alignment on its own.