How can a reward function be aligned if it doesn’t recognize damaging behaviour?
David Krueger

By aligned I mean something like “trying its best to do what the user wants.” It’s possible this is not what most people have in mind; if so, perhaps I should use a different term. It’s meant to be analogous to a chess AI that is trying to win but sometimes loses nonetheless.