This deserves a long response, but I only wrote a middle-sized one, so sorry about that.
Point one: I suspect part of the root cause of this disagreement traces to a more basic disagreement about how far modern AI algorithms are, structurally, from the algorithms likely to appear in powerful AIs that can play a role in pivotal achievements or disasters. I expect powerful optimization processes not to have properties resembling those of gradient descent or hill-climbing via local testing. The way you conclude absence of collusion is by looking at a greedy local optimization process that adopts a point mutation if it produces superior performance on the very next round; I worry that this is exactly the sort of property that very powerful optimization algorithms, or agents or subsystems *produced* by very powerful optimization algorithms, won’t have. Natural selection is greedy and local like this, yes. Natural selection is also supremely inefficient *and* it produced eusocial organisms *and* it didn’t put anything on the Moon until after it had output human beings who weren’t always greedy and local and eventually invented logical decision theories. A lot of our disagreement about your current particular, exact approach to approval-based agents comes from the fact that I don’t think you can, in practice, do anything remotely like scaling up deep learning to do supervised learning of how to predict which science papers a human being would write.
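To make "greedy and local" concrete, here is a toy sketch. It is not taken from the proposal under discussion; the fitness landscape and every function name are made up for illustration. A process that adopts a point mutation only if it helps on the very next round gets stuck on a local peak, while the same process with one extra step of lookahead will accept a temporarily worse state as a means to a better one.

```python
# Hypothetical 1-D fitness landscape: local peak at index 2,
# a dip at index 3, and the global peak at index 5.
LANDSCAPE = [0, 1, 2, 1, 3, 5]


def fitness(x: int) -> int:
    return LANDSCAPE[x]


def neighbors(x: int) -> list[int]:
    """Point mutations: move one step left or right, staying in bounds."""
    return [c for c in (x - 1, x + 1) if 0 <= c < len(LANDSCAPE)]


def best_reachable(x: int, depth: int) -> int:
    """Highest fitness reachable from x within `depth` further mutations."""
    best = fitness(x)
    if depth > 0:
        for c in neighbors(x):
            best = max(best, best_reachable(c, depth - 1))
    return best


def climb(start: int, depth: int, max_rounds: int = 20) -> int:
    """Repeatedly adopt the mutation with the best outcome within `depth` steps.

    depth=1 is the greedy case: a mutation is adopted only if it is
    strictly better on the very next round.
    """
    x = start
    for _ in range(max_rounds):
        value, candidate = max((best_reachable(c, depth - 1), c) for c in neighbors(x))
        if value <= fitness(x):
            break  # no mutation looks like an improvement at this horizon
        x = candidate
    return x


print(climb(0, depth=1))  # greedy: stops on the local peak
print(climb(0, depth=2))  # one step of lookahead: crosses the dip
```

The only difference between the two runs is the planning horizon. An argument that relies on the optimizer never taking a locally worse step stops applying once that horizon is longer than one round.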
Point two: The class of catastrophe I’m worried about mainly happens when a system design is supposed to contain two consequentialists that are optimizing for different consequences, powerfully enough that they will, yes, backchain from Y to X whenever X is a means of influencing or bringing about Y, doing lookahead on more than one round, and so on. When someone talks about building a system design out of having two of *those* with different goals, and relying on their inability to collude, that is the point at which I worry that we’re placing ourselves into the sucker’s game of trying to completely survey a rich strategic space well enough to outsmart something smarter than us.
The problem doesn’t necessarily arise if we have a system with a memory subsystem optimized for allocating memory well, and a sensory subsystem optimized for processing sensory data. Unless there are two actual consequentialist subagents in there, in which case, yes, we might have a problem on our hands when we try to scale the system. The problem doesn’t come from optimizing two different parts of the system on a small scale using two different local objectives; it comes from relying on a system design that has two different optimizers with different global objectives. That’s the point at which we have to start worrying if the optimizers are smart.
This, of course, goes back to what I suspect to be the central disagreement in point one. If you imagine that deep belief networks scale up to do all the interesting things while maintaining their greedy local properties, you can imagine that you get all the nice things without ever trying to build a system out of opposed consequentialists. I worry, first, that things that are greedy and local will never do the pivotal things we want them to do, and second, that this is the equivalent of imagining water that’s wet but not wet. If by hook or by crook you can get a system to predict what chess moves you’ll make and what science papers you’ll write, it contains consequentialist capabilities in it somewhere, and the difficulties of managing powerful consequentialists are likely to come into play.