How can a reward function be aligned if it doesn’t recognize damaging behaviour?

…ion (covert malicious behavior). I am concerned about AI systems behaving in ways that are damaging yet invisible even to a sophisticated and aligned reward function. In the very long run I think this problem will arise naturally, but in the short term it can alread…

Security and AI alignment
