How can a reward function be aligned if it doesn’t recognize damaging behaviour?
David Krueger

By aligned I mean something like "trying its best to do what the user wants." It's possible that this is not what most people have in mind, in which case I should use a different term. It's meant to be analogous to a chess AI that is trying to win but sometimes loses nonetheless.
