The value alignment problem as an interactive game

In this post I will explain how I think the value alignment problem can be modelled as an interactive game between a human and a robot. (For anyone unfamiliar with the term, ‘value alignment’ refers to the problem of how to ensure the values of a superhuman artificial intelligence are aligned with human values.)

The paper below provides an approach to formalising the value alignment problem called ‘cooperative inverse reinforcement learning’ (CIRL), which seems to me to be very promising. This is a game-theoretic extension of an existing approach called ‘inverse reinforcement learning’, where a robot learns a reward function from observed human behaviour.

The main drawbacks of this approach are (1) that the human has to know the reward function, and (2) that you have to solve for the human’s policy. It would be more realistic to set up the model in such a way that the human and robot both learn to act optimally together, and indeed this is noted in the ‘conclusions and future work’ section at the end of the paper, where the authors say:

“An important avenue for future research will be to consider the coordination problem: the process by which two independent actors arrive at policies that are mutual best responses.”

A more general criticism of the CIRL approach is that it is not clear what assumptions can be made about the human. One approach could be to assume that the human acts optimally given their beliefs, but the problem then is that by observing the human’s behaviour the robot can only learn what the human believes to be optimal, rather than what is actually optimal.

If the robot uses a stationary policy then the human is effectively solving a standard reinforcement learning problem. I therefore think it reasonable to assume that, given sufficient time, the human will be able to find a policy which is optimal with respect to any stationary policy taken by the robot, as many standard reinforcement learning approaches have been proven to converge.
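To make the reduction concrete, here is a minimal sketch (a hypothetical toy MDP of my own, not anything from the paper): once the robot’s stationary policy is folded into the transition and reward functions, the human faces an ordinary MDP, and a standard method like tabular Q-learning applies directly.

```python
import random

random.seed(0)

N_STATES, N_ACTIONS = 2, 2
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1

def robot_policy(state):
    # Stationary robot policy: always picks action 0 (an arbitrary choice).
    return 0

def step(state, human_action):
    # Joint dynamics with the robot's fixed action folded into the environment.
    robot_action = robot_policy(state)
    reward = 1.0 if (human_action == state and robot_action == 0) else 0.0
    next_state = (state + human_action) % N_STATES
    return next_state, reward

# The human runs ordinary epsilon-greedy tabular Q-learning.
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
state = 0
for _ in range(20000):
    if random.random() < EPS:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    target = reward + GAMMA * max(Q[next_state])
    Q[state][action] += ALPHA * (target - Q[state][action])
    state = next_state

# Greedy policy recovered by the human, one action per state.
greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(greedy)
```

In this toy MDP the optimal response to the robot’s fixed policy is to match the state index, and the Q-learner finds it; the point is only that nothing about the robot’s presence changes the human’s problem once the robot’s policy stops moving.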

Furthermore, if the human uses a stationary policy then the robot is effectively solving an inverse reinforcement learning problem. I therefore think it reasonable to assume that the robot will be able to estimate the reward function associated with any stationary policy taken by the human (although convergence results for inverse reinforcement learning seem a bit thin on the ground).

The problem is that if both the human and robot are updating their policies then neither is using a stationary policy. However it seems plausible (to me at least) that convergence results for the stationary case could be used to prove convergence in the case where both the human and the robot are updating their policies over time.

I think proving convergence is particularly important in the AI safety field as we need to be sure that any proposed approach works in every possible scenario, not just in the ones we happen to have tested. However there is obviously a lot of work to do to turn the informal argument above into a formal convergence proof.

Before embarking on this work it would be good to get feedback from others on this idea, on particular on whether my the assumptions I’ve set out above seem plausible.