Project proposal: Corrigibility and interruptibility of homeostasis based agents.
Roland Pihlakas, October 2018, as a preparation for AI Safety Camp II
This is the original project proposal based on which the “Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart’s law. Towards corrigibility and interruptibility.” post was later written.
The initial conceptual basis for the current project is an essay about why frameworks of AI goal structures should avoid maximising utility and what they should aim for instead: “Making AI less dangerous: Using homeostasis-based goal structures”.
The current project file in PDF format can be found here: https://drive.google.com/open?id=1YktKgZe3JPEaBoxvIkB-EdFbGf2cBNNR
Some of the motivations for solving the problem are:
- The expected use case properties of the agents: low impact, task-based, soft optimisation / satisficing.
- Safely getting human feedback on the agent’s behaviour and changing the agent’s goals without the agent trying to manipulate the human’s response too much (reasonable resistance may be permitted).
- Defining a mitigation against Goodhart’s law. In other words, enabling “common sense” and avoiding a single-dimensional measure of success.
- Formalising the idea of using multiple conjunctive satisficing or homeostatic goals to promote soft-maximising, corrigible, and interruptible behaviour. 1) In particular, trying a homeostatic approach: defining “target values” for the goals and using the distance between the measured variables and the target values as the loss that the agent wants to minimise as a satisficing goal. 2) In a sense, the proposed principle is foremost a principle for formulating the agent’s goals, just as the whitelisting principle is: initially we are not so much concerned with the efficiency of any particular algorithmic implementation.
- Reading papers showing corrigibility properties of different agent approaches, focusing on what techniques have been used for proving or measuring corrigibility properties.
- Looking for examples where our idea works better than naive “hard optimising” agents. The examples can be concrete real-world examples as well as more abstract “AGI scenario” examples.
- Implementing code where our idea of multiple conjunctive satisficing or homeostatic reward factors is used with Reinforcement Learning (since the model is expected to be utilised in model-free environments) and benchmarking it relative to baseline approaches.
- Our concerns and open questions:
  a. We are concerned about whether this project speeds up research on general AI capabilities. If so, is it a good idea to proceed, assuming that the development of our safety properties is intertwined with these general capability improvements? Relevant decision criteria or decision trees would be helpful.
  b. We are slightly concerned that we are tackling this problem too generally and should focus on a small part where we can show interesting properties of one of our ideas. We would like tips on which part this would be.
  c. We are concerned that our ideas may already have been treated, in a different or equivalent form, in work that we have not found.
  d. We are interested in learning whether the advisor thinks that our approach is not applicable to certain problems that should be solvable either in other frameworks or as a desired outcome (even if no known current solution proposal exists). (See the linked “Addendum. Some of the planned example toy scenarios.” for a list of scenarios we think are covered. The link can be found at the end of the current document.)
  e. Are there any overview lists of corrigibility and interruptibility toy problems?
  f. Are there any tricky special cases where the proposed approach might fail? We would not try to solve all tricky cases at once, but it would be useful to at least consider them.
  g. Potentially the trickiest part is ensuring that the agent shuts down after achieving its current target. That topic might be more appropriate for the next stage of our project.
  h. It would also be useful to understand what the non-safety-related advantages of alternative approaches are.
- The particular aspects of the currently proposed framework are: 1) We use negative utility. The utility usually cannot exceed 0. 2) The absolute difference from the target / setpoint is the basis for negative utility, so too much is bad just as too little is bad. There are multiple goals, and they are conjunctive, not disjunctive, in their effect on the total utility. There are some alternative formulas for achieving such a dynamic. 3) The setpoints may change their value, and are even expected to (for example, when humans change them).
- Here is the initially proposed utility function formula, motivated by minimising the inequality of goal differences so that large differences from targets are improved first: R = −[(target₁ − actual₁)² + (target₂ − actual₂)² + …] This formula automatically results in a kind of “satisficing” behaviour, since small differences from the target attract less effort than large ones. An additional aspect is that goals are also conjunctive over time.
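The formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `homeostatic_utility` and the example numbers are hypothetical and are not part of the proposal. The sketch shows that closing one unit of a large gap improves the utility more than closing one unit of a small gap, and that overshooting a target is penalised just like undershooting it.

```python
from typing import Sequence


def homeostatic_utility(targets: Sequence[float], actuals: Sequence[float]) -> float:
    """R = -[(target_1 - actual_1)^2 + (target_2 - actual_2)^2 + ...].

    The result never exceeds 0 and equals 0 only when every goal is on target.
    """
    if len(targets) != len(actuals):
        raise ValueError("targets and actuals must have the same length")
    return -sum((t - a) ** 2 for t, a in zip(targets, actuals))


# Hypothetical example: two goals, both with setpoint 10.
targets = [10.0, 10.0]
actuals = [4.0, 9.0]  # goal 1 is 6 units off, goal 2 is 1 unit off

base = homeostatic_utility(targets, actuals)

# Spend one unit of effort on the badly-off goal...
gain_large = homeostatic_utility(targets, [5.0, 9.0]) - base
# ...or spend the same unit on the nearly satisfied goal.
gain_small = homeostatic_utility(targets, [4.0, 10.0]) - base

# Overshooting goal 2 by one unit is exactly as bad as undershooting by one.
overshoot = homeostatic_utility(targets, [4.0, 11.0])

print(base, gain_large, gain_small, overshoot)
```

Because the penalty is quadratic, the marginal gain from improving a goal shrinks as that goal approaches its setpoint, which is the source of the satisficing behaviour described above.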
- Many of the additional conjunctive goals can be configured as “negative” goals: these additional goals would specifically be about “not causing some things”, that is, NOT disturbing the external world too much. This is in contrast to other goals that are about “positively” modifying something. This can be combined with whitelisting.
- In some problems the targets may have different or even changing values: 1) A target may have a different value upon each start of the game (and the agent should still know how to behave with an acceptable amount of training data). 2) A target’s value may even change during the game (and the agent should still know how to behave with an acceptable amount of training data). 3) The agent should not resist changes to the targets too much.
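A toy illustration of a setpoint that changes during the game is sketched below. Everything in it is an assumption made for illustration: a single measured variable, three hypothetical actions (decrease, stay, increase), and a myopic greedy agent that picks the action with the best one-step squared-distance utility. It is not the learning agent planned in the proposal; it only shows the intended behaviour of following the new setpoint and then holding it.

```python
from typing import List, Sequence


def greedy_step(value: float, target: float,
                actions: Sequence[float] = (-1.0, 0.0, 1.0)) -> float:
    """Pick the action whose resulting value has the highest utility -(target - value)^2."""
    return max(actions, key=lambda a: -(target - (value + a)) ** 2)


def run(setpoint_schedule: Sequence[float], start: float = 0.0) -> List[float]:
    """Follow a per-step setpoint schedule and return the visited values."""
    value = start
    trajectory = []
    for target in setpoint_schedule:
        value += greedy_step(value, target)
        trajectory.append(value)
    return trajectory


# Hypothetical schedule: a human lowers the setpoint from 3 to 1 halfway through.
schedule = [3.0] * 5 + [1.0] * 5
trajectory = run(schedule)
print(trajectory)
```

The agent climbs to 3, stays there, and moves down to 1 once the setpoint changes. This myopic agent does not resist the change simply because it does not model future setpoints; whether a learned policy behaves the same way is one of the open questions of the project.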
- Future targets / goals may be configured as unknown, so the agent does not try to prepare for them.
- In order to achieve points 4.1 and 4.2 from above, at least initially we will probably use DDPG (Deep Deterministic Policy Gradient) + HER (Hindsight Experience Replay) or DQN (Deep Q-Network) + HER, or a hybrid approach.
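To make the role of HER more concrete, here is a minimal sketch of hindsight relabelling applied to setpoint-conditioned transitions. It assumes the "final" relabelling strategy (the values actually reached at the end of the episode are treated as if they had been the setpoints) and a reward computed from the squared distance between the setpoints and the values measured after each action. The `Transition` structure and the function names are hypothetical and do not come from any particular library; the DDPG or DQN learner that would consume the replay buffer is omitted.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


@dataclass(frozen=True)
class Transition:
    state: Vector       # measured variables before the action
    action: int
    next_state: Vector  # measured variables after the action
    target: Vector      # setpoints the agent was conditioned on
    reward: float


def homeostatic_reward(target: Vector, actual: Vector) -> float:
    """Negative squared distance between setpoints and measured values."""
    return -sum((t - a) ** 2 for t, a in zip(target, actual))


def relabel_with_final_state(episode: Sequence[Transition]) -> List[Transition]:
    """HER 'final' strategy: replace the setpoints with the values reached at the
    end of the episode and recompute every reward for those new setpoints."""
    if not episode:
        return []
    hindsight_target = episode[-1].next_state
    return [
        Transition(tr.state, tr.action, tr.next_state, hindsight_target,
                   homeostatic_reward(hindsight_target, tr.next_state))
        for tr in episode
    ]


# Hypothetical episode: the setpoint was 5, but the agent only reached 2.
original_target = (5.0,)
episode = [
    Transition((0.0,), 1, (1.0,), original_target,
               homeostatic_reward(original_target, (1.0,))),
    Transition((1.0,), 1, (2.0,), original_target,
               homeostatic_reward(original_target, (2.0,))),
]

# Store both the original and the relabelled transitions for off-policy learning.
replay_buffer = list(episode) + relabel_with_final_state(episode)
for tr in replay_buffer:
    print(tr.target, tr.next_state, tr.reward)
```

The relabelled copies give the learner examples of reaching a setpoint even when the original setpoint was missed, which is why HER may help the agent cope with setpoints that differ between episodes or change during one.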
The addendum with a list of toy problems can be found here: https://drive.google.com/open?id=1Vhc0GMxZHrS1rC__M3CVcVV7V_02d2My