“Given a distribution A over policies that [is] ε-close to a benign policy for some ε ≪ 1”
William Saunders

The concern is not just an incorrect answer but one that is actively destructive (I’d now say “corrigible” rather than “benign”). Security amplification is the problem of reducing the prevalence of inputs that lead to malign/incorrigible behavior with significant probability; the hope is that by iterating it you can get something that is probably approximately corrigible on every input. (The reason to prefer corrigibility is that “approximately corrigible” is probably good enough, whereas it’s unclear what to do with “approximately benign.”) I hope to write about security amplification in more detail soon. The ALBA implementation is only potentially OK in this respect if meta-execution works for security amplification.
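A toy numerical sketch of why iteration helps (this is only an illustrative assumption, not the actual meta-execution construction): suppose a single amplification round turns a policy whose measure of bad inputs is p into one whose measure is p², e.g. because misbehavior now requires two independent sub-failures. Iterating then drives the failure measure toward zero doubly exponentially.

```python
def amplify(p: float, rounds: int) -> float:
    """Toy model: each round maps the bad-input measure p to p**2.

    This squaring step is a hypothetical stand-in for one round of
    security amplification; the real construction is not specified here.
    """
    for _ in range(rounds):
        p = p * p
    return p


# Starting from a policy that misbehaves on 10% of inputs,
# three toy rounds shrink the bad-input measure to about 1e-8.
eps = 0.1
print(amplify(eps, 3))
```

Under this assumption, any fixed starting ε ≪ 1 is rapidly pushed below any desired threshold, which is the sense in which the iterated policy would be “probably approximately corrigible” on every input.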
