Presumably the vulnerable A_n will still end up looking at some functions of a and s, and if A_{n+1} can predict what those functions are, it can work backwards and figure out what a and s need to be such that A_n ends up looking at the exploit string.
>I don’t think that an AI should unpack the entire action a or state s
Wei Dai

If A has vulnerabilities then B may have some different vulnerabilities. For the particular argument you gave, note that actually tracing back through B’s computation in order to find something that leads to the attack is strictly harder than finding the attack itself, and is certainly not something that A_{n+1} can do (since it is only slightly smarter than A_n, not smart enough to simulate many steps of A_n). You will only really have trouble if there is some special structure in B’s computation that A_{n+1} can exploit.

As long as finding the exploit becomes harder (and it is already hard enough that the only way it would arise is if someone were intentionally optimizing to find it), that satisfies my intuitive notion of capacity increase, since it shrinks the set of worlds where the exploit arises (and if you keep iterating that process you will eventually end up with no remaining exploit).

More generally, I expect to be able to “filter” A’s inputs so that it doesn’t look at any data more complex than it could have generated itself (see discussion here). This still leaves open the question of whether A could generate an attack on itself, which is very closely related to this problem.
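To make the filtering idea concrete, here is a toy sketch (my own illustration, not a proposal from the thread): treat "more complex than A could have generated itself" as "too surprising under A's own generative model", and withhold any input whose average surprisal exceeds a budget. The model, the surprisal measure, and the budget are all illustrative assumptions.

```python
import math

def surprisal_bits(model_probs, data):
    """Total negative log2-probability of `data` under a toy
    symbol-level model (a dict of per-symbol probabilities).
    Unseen symbols get a tiny floor probability."""
    return sum(-math.log2(model_probs.get(sym, 1e-9)) for sym in data)

def filter_input(model_probs, data, bits_per_symbol_budget=2.0):
    """Pass `data` through only if its average surprisal stays
    within the budget; otherwise withhold it (return None).
    The budget stands in for 'complexity A could generate itself'."""
    if not data:
        return data
    if surprisal_bits(model_probs, data) / len(data) <= bits_per_symbol_budget:
        return data
    return None

# Hypothetical model for A: strongly favors 'a' and 'b'.
probs = {"a": 0.45, "b": 0.45, "c": 0.05, "d": 0.05}

print(filter_input(probs, "abababab"))  # ordinary input passes
print(filter_input(probs, "cdcdcdcd"))  # surprising input is withheld
```

This only screens out inputs that are surprising to A's own model; as the comment notes, it does nothing about an attack string that A could itself generate, which is the harder residual question.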
