This works for levels of oversight that Hugh can actually provide. But it doesn’t seem to scale to really outlandish levels of oversight. Even on the ideal Bayesian perspective, the prior will assign some probability to a breakdown at unimplementable levels of oversight (e.g. essentially all adversarial hypotheses will break down only at unimplementable levels).
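To spell the worry out with a minimal sketch (the notation here is mine, not from the post): suppose oversight comes in levels $\ell$, we can only ever run levels $\ell \le \ell_{\max}$, and an adversarial hypothesis $h_{\text{bad}}$ agrees with the intended hypothesis $h_{\text{good}}$ on everything overseen at level $\ell \le \ell_{\max}$ but breaks down at some level above that. Then any data $D$ we can actually collect has the same likelihood under both hypotheses, so the ideal Bayesian update leaves their relative weights exactly where the prior put them:

$$\frac{P(h_{\text{bad}} \mid D)}{P(h_{\text{good}} \mid D)} = \frac{P(D \mid h_{\text{bad}})\,P(h_{\text{bad}})}{P(D \mid h_{\text{good}})\,P(h_{\text{good}})} = \frac{P(h_{\text{bad}})}{P(h_{\text{good}})}.$$

No amount of feedback gathered at implementable levels of oversight drives that ratio down; only oversight beyond $\ell_{\max}$ would.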
I care about really extensive levels of oversight, because it seems much safer to have AI systems optimize against them.
For me the important question is whether we need to actually implement an oversight process that is sufficiently robust that we can do approval-direction, or whether it would suffice to abstractly define such an oversight process. I’d be most interested in the idea from this post if it let us use a robust oversight process that is too expensive for us to actually implement.
If we can implement a sufficiently robust oversight process, then I feel pretty good about our prospects in general, though we might still want to use a scheme like this one to help get maximal use out of expensive training data.