> This sounds roughly sensible. (Paul Christiano)

A couple of ideas for avoiding false positives (each with some negative tradeoffs for safety):

Limiting the advice set

  • Have the advice set R consist only of advice which the overseer B has thought about and chosen to include as relevant and safe. This could include advice for thinking about specific failure modes, sets of strings which indicate specific algorithmic procedures to try, etc. This compromises safety by potentially leaving out “unknown unknowns” (though some of the advice set could instruct B to consider things not in the advice set).
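A minimal sketch of the curated advice set, assuming the overseer can enumerate advice strings and that F is queryable. All names here (`make_curated_advice`, `evaluate_action`) and the toy F are illustrative assumptions, not anything from the original scheme:

```python
# Hypothetical sketch: the overseer curates a whitelist of advice strings,
# and only whitelisted advice is consulted when evaluating an action.

def make_curated_advice():
    """Advice the overseer B has reviewed and judged relevant and safe."""
    return {
        "check for side effects on external systems",
        "check whether the action is reversible",
    }

def evaluate_action(action, F, curated_advice):
    """Return the advice items under which F flags the action as unsafe."""
    return [r for r in curated_advice if F(action, r)]

# Toy F standing in for the real forbidding predicate:
F = lambda a, r: "delete" in a and "reversible" in r
flagged = evaluate_action("delete all logs", F, make_curated_advice())
```

The point of the restriction is simply that advice never reaches the evaluation unless B has explicitly vetted it; the "unknown unknowns" cost is that anything outside `make_curated_advice()` can never flag an action.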

Limiting false positives in large advice sets

  • Prior to running the algorithm, pick a “safe set” S of (input, action) pairs which, on some reflection, obviously should not be forbidden. Then punish only those actions for which the set of advice values r with F(a,r) true is larger than, say, the mean over all actions in S (this can be done in the scheme where you evaluate all advice; something similar can be done by setting the evaluation of the null action equal to the evaluation of a random element of S). This means that if A starts discovering false-positive reasons that actions in S are unsafe, those reasons will be discounted in its evaluation. This could compromise safety if S is chosen poorly, or if there are pathological cases where some unsafe action has fewer reasons to think it unsafe than the actions in S do (this seems unlikely at first glance).
  • Have some method for evaluating elements of R as “reasonable”, and only consider “reasonable” advice, or downweight “unreasonable” advice. This can’t involve looking at r directly (in case it is falsely convincing), but it might include looking at properties of r that a simple computer program could evaluate (e.g., is it a sensible English sentence?), or at the behavior of the function F(a,r) across actions (perhaps we should only trust reasons r which are convincing for a fraction of actions that is neither too large nor too small). This could compromise safety if the scheme labels valid reasons for being suspicious of an action as “unreasonable”.
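The safe-set baseline from the first bullet above can be sketched as follows. Everything here is an illustrative assumption: `F` is a toy lookup table standing in for the real forbidding predicate, and the threshold (mean flag count over S) is just one choice of baseline:

```python
def unsafe_advice_count(action, F, advice_set):
    """Number of advice values r for which F flags the action as unsafe."""
    return sum(1 for r in advice_set if F(action, r))

def should_punish(action, F, advice_set, safe_set):
    """Punish only actions flagged by more advice than the mean action in S."""
    baseline = sum(unsafe_advice_count(s, F, advice_set)
                   for s in safe_set) / len(safe_set)
    return unsafe_advice_count(action, F, advice_set) > baseline

# Toy F: a lookup table of which (action, advice) pairs count as unsafe reasons.
FLAGS = {("safe1", "r1"), ("risky", "r1"), ("risky", "r2")}
F = lambda a, r: (a, r) in FLAGS
advice_set = ["r1", "r2", "r3"]
safe_set = ["safe1", "safe2"]   # baseline = (1 + 0) / 2 = 0.5
```

If A manufactures spurious reasons, those reasons tend to also flag the obviously-fine actions in S, which raises the baseline and cancels the effect.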
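The flag-rate heuristic from the second bullet (trusting only advice that is convincing for a moderate fraction of actions) might look like the sketch below. The thresholds `lo`/`hi`, the integer "actions", and the toy F are all hypothetical choices, not part of the original proposal:

```python
def reasonable_advice(advice_set, sample_actions, F, lo=0.05, hi=0.5):
    """Keep advice whose flag rate over sampled actions lies in [lo, hi];
    advice that flags almost everything (or nothing) is treated as suspect."""
    keep = []
    for r in advice_set:
        rate = sum(1 for a in sample_actions if F(a, r)) / len(sample_actions)
        if lo <= rate <= hi:
            keep.append(r)
    return keep

# Toy F over integer "actions": one advice string flags everything
# (falsely convincing), one flags a moderate fraction, one flags nothing.
F = lambda a, r: r == "too_broad" or (r == "plausible" and a < 3)
actions = list(range(10))
kept = reasonable_advice(["too_broad", "plausible", "never"], actions, F)
```

Note this only inspects F's behavior, never the text of r itself, so a falsely convincing argument cannot sway the filter directly; the risk is exactly the one named above, that a valid but broadly applicable reason gets discarded as "too convincing".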