I agree that the key feature of approval-directed agents is that the causal picture is:
effects in world ← action → evaluation → reward

rather than

action → effects in world → evaluation → reward.
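The contrast between the two causal pictures can be sketched as two reward computations. This is only an illustrative sketch; the function names (`overseer_evaluate`, `environment_step`, `score_outcome`) are hypothetical, not from any existing system.

```python
# Hypothetical sketch of the two reward pathways. All names here are
# illustrative assumptions, not an existing API.

def approval_directed_reward(state, action, overseer_evaluate):
    # Reward depends on the action itself, as judged by the overseer,
    # without waiting for the action's effects in the world.
    return overseer_evaluate(state, action)

def outcome_based_reward(state, action, environment_step, score_outcome):
    # Standard RL picture: reward flows through the action's effects
    # on the world, which are then evaluated.
    next_state = environment_step(state, action)
    return score_outcome(next_state)
```

In the first function, manipulating the world's state does not change the reward; only changing the overseer's evaluation of the action does, which is why the remaining attack surface is the evaluation process itself.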
I agree that the agent is still incentivized to try to attack the evaluation process.
The first response is that the overseer is more intelligent than the agent and hopefully has some transparency into the agent's thinking; between those two, it's easy for me to imagine hardening a system against overt attack.
I also agree that this doesn’t resolve the whole problem, since we need to get feedback from the actual environment rather than merely from the overseer’s imagining of the environment (e.g. the system needs to learn new empirical facts which the overseer does not initially know). Hopefully I’ll write about this issue sometime soon-ish.
Taking exploration as an example, the gist is: we can explore the environment by learning the overseer's preferences over exploratory actions, rather than relying on a simple exploration heuristic baked into our RL algorithm. (We still need exploration inside the RL algorithm in order to learn anything at all, but we shouldn't use that mechanism for exploring the physical environment.)
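A minimal sketch of that contrast, under the assumption that we have some learned model of the overseer's approval (`approval_model` and the other names are hypothetical):

```python
import random

# Hypothetical sketch: choosing exploratory actions in the environment by
# consulting a learned model of the overseer's preferences, rather than a
# heuristic like uniform-random action selection. The RL algorithm that
# trains approval_model can still use randomized exploration internally,
# but that randomness does not drive behavior in the physical environment.

def heuristic_explore(actions, rng=random):
    # Baked-in exploration heuristic: pick a random environment action.
    return rng.choice(actions)

def approval_directed_explore(state, actions, approval_model):
    # Let the learned model of the overseer's preferences pick among
    # candidate exploratory actions.
    return max(actions, key=lambda a: approval_model(state, a))
```

The design point is that `approval_directed_explore` makes environment exploration an object of the overseer's judgment, so an action that is informative but dangerous can be scored down, whereas `heuristic_explore` treats all untried actions alike.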