An idea for creating safe AI

Just over ten years ago I completed a PhD in the mathematics of artificial intelligence. For my PhD I studied a problems where there is an absorbing state, and my research involved specifying the conditions under which you can be sure that the absorbing state will be reached.

Some of the work I did is published here:

I recently came across the following paper, which studies a similar problem but in the context of universal artificial intelligence. In the conclusion it is stated that the results of this paper have implications for AI safety (and I agree).

However the results of the above paper only apply in the case where the rewards are either all positive or all negative. In the former case it seems obvious that the agent will avoid termination at all costs, whereas in the latter case it seems obvious that the agent will seek termination at all costs.

In contrast, my PhD work studied the case where rewards may be positive or negative. This seems much more applicable to real-world AI applications, where you would presumably want to give an agent a positive reward for doing something ‘good’ and a negative reward for doing something ‘bad’.

I have therefore been wondering whether my PhD work could be applicable to AI safety. If you design an agent in such a way that the conditions specified in my thesis are met, you can be sure it will terminate after a finite time and thereby limit the damage it can cause.

I’ve been imagining a toy problem where at each time step the agent can either:

1. Complete a task that has been assigned to it and then terminate (we can call this the ‘terminal action’), or

2. Take an action that will enable it to complete the task more effectively at the next time step (we can call this the ‘instrumental action’).

Examples of instrumental policies could involve the agent taking an observation, modifying its source code, or creating a ‘sub-agent’ to help it carry out the task. The concern is that the agent will continually take the instrumental action and potentially wipe out the human race in the process.

But all you have to do to ensure the agent doesn’t do this is to give it a negative reward each time step that it doesn’t complete the task. If this negative reward is greater than or equal to the reward it gets for completing the task then the AI will always have an incentive to terminate.

To see this, consider a simple model where the state space is the set of positive integers and in each state s the AI can either terminate and receive a reward of 1–1/s, or not terminate and move to state s+1 receiving a reward of -c. Then the agent will always choose to terminate provided c≥1.

This model may seem trivial but I think it demonstrates that all you need to do to stop an agent from pursuing instrumental policies is to ensure it is sufficiently ‘punished’ for doing this. (On the other hand some might consider this model too simplistic to be of any practical use.)

I am keen to get feedback both on the above model and on the idea in general, in particular to find out if there is a fatal flaw that I am not seeing.