I agree that the meta problem is different from the object-level problem. So an overseer who is equipped to handle the object-level problem may not be able to handle the meta problem. (And even worse, an RL agent that is able to handle the object-level problem may not be able to handle the meta problem.)

But I think that the meta-meta problem is pretty similar to the meta problem, and even more similar to the meta-meta-meta problem. Moreover, I think it's fine to use basically the same strategy for all of them. And this strategy doesn't need to be implemented perfectly at all (though it does need to clear some threshold of quality that many overseers may not be able to meet). So overall I don't feel especially troubled, though I agree there is a lot to think about here and it's not yet clear if it all fits together.

Convergence of levels of meta. Consider the infinitely meta level: evaluating an action which is itself the evaluation of an action, which is itself the evaluation of an action… It seems to me that humans can do OK at this game. It's basically the same as the meta problem, except that every time they would have dropped down to the object level, they stay at the meta level instead. The fact that they can't actually trace the evaluations down to the object level doesn't seem to be a fatal problem; probably they shouldn't have been trying to do that most of the time anyway.
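
To make the "stay at the meta level" move concrete, here is a toy sketch (illustrative only; the Action type, its procedure_quality field, and both evaluation strategies are invented stand-ins, not anything defined above). It contrasts an evaluator that insists on tracing every evaluation down to the object level, which only terminates for a finite stack, with one that judges each evaluation on meta-level features of the evaluation procedure and so works at any depth.

```python
# Toy model only: "Action", "procedure_quality", and both strategies are
# illustrative stand-ins, not definitions taken from the discussion above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Action:
    quality: float                     # object-level quality, if object-level
    target: Optional["Action"] = None  # the action this one evaluates, if any
    procedure_quality: float = 0.0     # how sound the evaluation procedure looks

def evaluate_by_descending(a: Action) -> float:
    """Trace the chain of evaluations all the way down to the object level.
    Only terminates when the stack of evaluations is finite."""
    return a.quality if a.target is None else evaluate_by_descending(a.target)

def evaluate_at_meta_level(a: Action) -> float:
    """Stay at the meta level: judge the evaluation procedure itself rather
    than chasing the chain to the bottom, so depth never matters."""
    return a.quality if a.target is None else a.procedure_quality

# An evaluation of an evaluation of an object-level action.
obj = Action(quality=0.9)
meta = Action(quality=0.0, target=obj, procedure_quality=0.8)
meta_meta = Action(quality=0.0, target=meta, procedure_quality=0.7)

print(evaluate_by_descending(meta_meta))  # 0.9 -- requires a finite stack
print(evaluate_at_meta_level(meta_meta))  # 0.7 -- never drops to the object level
```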

Moreover, it seems like human performance on any particular meta-meta-meta-…-meta level is lower-bounded by performance on the infinitely meta level. So I’m not too concerned about “running out of meta” completely, just about whether the overseer is able to oversee this infinitely meta task.
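
One hedged way to write the lower bound down (the notation P_n and P_infinity is mine, not anything from the text above): the level-n game offers every option of the infinitely meta game plus the extra option of dropping to the object level, so any infinitely meta strategy can still be played at level n without modification.

```latex
% Illustrative notation, not from the post: P_n is overseer performance at the
% n-fold meta level, P_\infty at the infinitely meta level.
\documentclass{article}
\usepackage{amsmath}
\begin{document}
\[
  P_n \;\ge\; P_\infty \qquad \text{for every finite } n .
\]
\end{document}
```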

Necessary quality for first agent. If the oversight is weak, then the initial agent must be even weaker. But as long as it is in fact weaker, it won't actively pursue malicious goals. And it's OK if the initial agent is very weak, so long as it is smart enough to satisfy the bootstrapping lemma, i.e. smart enough that it can control a computation somewhat smarter than itself.

That argument is too informal, imprecise, and possibly inaccurate, and the threshold may be higher than that description suggests. Thinking about this issue more carefully seems important, both because the question is directly relevant and because working through it may reveal weaknesses in the argument. But I expect the overall shape of the conclusion to stay intact: there is a threshold above which things will be OK, and it looks like that threshold is probably not too high.
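
As a rough numerical illustration of that threshold claim (a toy sketch under invented assumptions: the threshold value, the 10% amplification gain, and the decay below the threshold are all made up), the point is just that capability compounds upward once the initial agent clears the threshold and fizzles out otherwise.

```python
# Toy sketch only: the threshold, the 10% gain above it, and the 10% loss
# below it are invented numbers, not anything from the argument above.
THRESHOLD = 1.0  # assumed minimum capability for amplification to pay off

def amplify(capability: float) -> float:
    """Stand-in for an agent controlling a computation somewhat smarter than
    itself: above the threshold each round yields a modest gain; below it the
    process loses ground."""
    return capability * 1.1 if capability >= THRESHOLD else capability * 0.9

def bootstrap(initial: float, rounds: int = 20) -> float:
    capability = initial
    for _ in range(rounds):
        capability = amplify(capability)
    return capability

print(round(bootstrap(1.05), 3))  # clears the threshold: capability compounds upward
print(round(bootstrap(0.95), 3))  # misses the threshold: the process fizzles out
```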