Agents teaching each other to manipulate is a serious concern for such a sequence of supervisors.
Paul Christiano

I think the feasibility of avoiding goal-directed agents entirely is questionable. An approval-directed agent has to be powerful enough to make meaningful use of a model of its overseer, which is very challenging if the overseer is a human. I'm not sure constructing such an agent "manually" (without using self-improvement) is feasible within the relevant timeframe. In other words, if it is much easier to reach superintelligence via goal-directed agents, then a friendly project that avoids goal-directed agents is unlikely to outrace competing unfriendly projects.
