Approval-directed agents
Paul Christiano

Approval-directed search

Consider an approval-directed agent Arthur trying to solve a Sudoku puzzle, overseen by a human Hugh who is not smart enough to solve the puzzle himself. Suppose that Arthur is smart enough to see the solution to the puzzle, but is being asked to fill in the grid one digit at a time (recall that an approval-directed agent greedily chooses its next action in order to maximize approval).

Will Arthur solve the Sudoku puzzle? The answer, of course, depends on how Hugh assigns approval.

First, let’s be a bit more precise about what I mean by “Hugh.” We imagine Hugh sitting at his desk with a computer (the computer on which Arthur is running, say). To define approval, we imagine Hugh being interrupted by his computer and asked to evaluate a given action. To do so, he can run any program on his computer (including Arthur).

Here is a strategy Hugh can use to ensure that Arthur will solve the Sudoku puzzle. Given a proposed action a, take that action, and then ask Arthur how to fill in the rest of the grid. Assign a an approval of 1 if the grid ends up filled out correctly, and an approval of 0 otherwise. (If anything unnecessarily funny seems to be going on, just stop and assign an approval of 0.)

It’s easy to see that if a is the final move to correctly fill in the grid, it will receive approval of 1 and Arthur will take it. Moreover, if Arthur is as clever as we are (and understands Hugh’s strategy) then he will also be able to see this. As a result, if a is the second-to-last move, then it will also receive approval of 1 (since it will be followed up by the correct final move). Arthur will see this too.

Proceeding by backwards induction, every correct move receives approval of 1, and Arthur knows it. Conversely, every incorrect move clearly receives approval of 0. As a result, Arthur takes correct moves and solves the puzzle.
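This backward-induction dynamic can be sketched in code. The sketch below is a toy illustration, not a faithful implementation of the proposal: the “puzzle” is just matching a hidden list of digits, and complete_with_solution stands in for the step where Hugh re-runs Arthur to finish the grid (sidestepping actual recursion). All function names and the example solution are illustrative assumptions.

```python
# Toy sketch of Hugh's rollout-based approval rule and Arthur's greedy choice.
# The "puzzle" is filling a list of cells to match a hidden solution that
# Arthur (but not Hugh) can compute directly.

SOLUTION = [3, 1, 4, 1, 5]  # the correct grid, visible to Arthur

def arthur_fill(grid):
    """Arthur's behavior: greedily pick each next cell to maximize approval."""
    grid = list(grid)
    while None in grid:
        i = grid.index(None)
        # Arthur considers each candidate digit and takes the one Hugh
        # would approve of (approval 1 beats approval 0).
        best = max(range(10), key=lambda v: hugh_approval(grid, i, v))
        grid[i] = best
    return grid

def hugh_approval(grid, i, value):
    """Hugh's rule: take the proposed action, ask Arthur to finish the
    grid, and approve (1) iff the completed grid is correct."""
    trial = list(grid)
    trial[i] = value
    return 1 if complete_with_solution(trial) == SOLUTION else 0

def complete_with_solution(grid):
    """Stand-in for 'ask Arthur to fill in the rest': since Arthur sees
    the solution, he copies it into the remaining cells. Cells already
    filled incorrectly stay incorrect, so the rollout fails."""
    return [s if g is None else g for g, s in zip(grid, SOLUTION)]
```

Note how the induction shows up: a correct move leaves a grid whose rollout matches SOLUTION (approval 1), while an incorrect move leaves a cell that no rollout can repair (approval 0), so the greedy agent ends up executing the full solution.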

The same idea seems to apply to essentially any environment. This justifies ignoring the level of resolution at which we define “actions;” it doesn’t matter if an action is “drive home” or “turn left” or “output 0 as your next bit.” Hugh can decide what level of granularity to evaluate actions at, depending on the context.

Creeping maximization

A benefit of approval-directed behavior is that it involves greedy decisions: Arthur chooses his next action in order to maximize approval, without considering long-term plans.

If Hugh uses the strategy described above, Arthur ends up choosing a maximizing strategy for filling out the grid: he takes a long sequence of actions engineered to maximize how much Hugh approves of the resulting grid.

Hugh is free to choose the level of granularity at which he evaluates actions. But whatever granularity he chooses for “an action,” he will get maximizing behavior at that level of granularity. This is exactly what he signed up for by choosing to evaluate decisions at that level.
