I think this is a hard case, and I’m not sure how to deal with it. (This case also seems to be a challenge for the overall control problem, not merely for security amplification.) I have become more pessimistic after thinking it through somewhat more carefully.
(ETA: obviously all cases have some of this problem, so what I really mean is that this case isolates a potentially fundamental difficulty.)
For concreteness, suppose there exist some inputs which cause a buffer-overflow analog in the human brain. Perhaps the simplest way to use this attack to get a high score is to overwrite a tiny piece of the brain’s computation with a consequentialist, which will then take over the brain and cause the human to report a high score (in order to bootstrap itself into existence). This is a kind of silly case, but it makes clear to me that there isn’t going to be a principled argument ruling out the problem: there are probably some benign brains for which this is actually the best attack.
I find it helpful to separate out two different desiderata:
- Preserve benignity
- Compete with the benchmark AI
For ALBA, the benchmark is using gradient descent to train a translator directly, with an arbitrary reward function, e.g. one determined by human evaluations of translation quality. Our goal is to remain benign while translating at least that well (assuming that the human wants things translated).
I agree that in order to compete with the benchmark we need to actually give an optimized sentence to the human brain. But that necessarily introduces malignity, and meta-execution doesn’t have any way to stop it from spreading. (Reliability amplification can stop malignity, but not if it appears with high probability.)
I had been running through other challenges for meta-execution more carefully, and so far it had seemed OK. This one seems more plausibly fatal.
I have also been thinking about IRL+metareasoning recently as an approach to capability amplification, more along the lines of what seems to be Stuart Russell’s agenda. I will think a little bit about whether that could also plausibly work for security amplification. If that doesn’t look good (and it probably won’t), I will have to step back and think about the situation more broadly.