AI “safety” vs “control” vs “alignment”
Paul Christiano

Bostrom defines the “control problem” the way you do. Stuart Russell defines AI control as the problem of enabling “meaningful human control over an AI system after it begins to operate,” for both superhuman and subhuman AI (http://futureoflife.org/data/documents/research_priorities.pdf).

Whatever the definition, a lot of people don’t like the term “control” here because it suggests we’re trying to constrain / manipulate / enslave superintelligent systems. This framing doesn’t play well with typical techno-optimists and transhumanists, and it also risks causing people to think that ASI safety is mostly about AI-boxing, tripwires, and the like (rather than mostly about value learning, or mostly about highly reliable agent design). It also elides the importance of goal content for ASI safety, repeating the founding mistake of the field of AI itself: treating the question of what goals a system should pursue as an afterthought.

The thing you’re calling “value alignment” is what MIRI calls “value specification” or “preference specification,” and what Bostrom calls “value loading.” I think Russell uses the term “value alignment” the way you do. People at MIRI generally prefer to use “(AI) alignment” to refer to what you’re calling “control,” since we already have a wealth of OK terms for referring to value specification (plus more specific terms like “value learning,” “IRL,” “CIRL”…), but we have few if any good alternatives for picking out the field of ASI safety as a whole (given the problems with “control,” “friendliness,” etc.), and naming the field well takes priority.