Approval-directed agents
Paul Christiano

Approval-directed bootstrapping

Approval-directed behavior works best when the overseer is very smart. Where can we find a smart overseer?

One approach is bootstrapping. By thinking for a long time, a weak agent can oversee an agent (slightly) smarter than itself. Now we have a slightly smarter agent, who can oversee an agent which is (slightly) smarter still. This process can go on, until the intelligence of the resulting agent is limited by technology rather than by the capability of the overseer. At this point we have reached the limits of our technology.

This may sound exotic, but we can implement it in a surprisingly straightforward way.

Suppose that we evaluate Hugh’s approval by predicting what Hugh would say if we asked him; the rating of action a is what Hugh would say if, instead of taking action a, we asked Hugh, “How do you rate action a?”

Now we get bootstrapping almost for free. In the process of evaluating a proposed action, Hugh can consult Arthur. This new instance of Arthur will, in turn, be overseen by Hugh—and in this new role Hugh can, in turn, be assisted by Arthur. In principle we have defined the entire infinite regress before Arthur takes his first action.

We can even learn this function by examples — no elaborate definitions necessary. Each time Arthur proposes an action, we actually ask Hugh to evaluate the action with some probability, and we use our observations to train a model for Hugh’s judgments.

In practice, Arthur might not be such a useful assistant until he has acquired some training data. As Arthur acquires training data, the Hugh+Arthur system becomes more intelligent, and so Arthur acquires training data from a more intelligent overseer. The bootstrapping unfolds over time as Arthur adjusts to increasingly powerful overseers.