# Diminishing returns and conjunctive goals: Mitigating Goodhart’s law. Towards corrigibility and interruptibility.

*Roland Pihlakas, October 2018 at AI Safety Camp II*

*Publicly editable Google Doc with this text **is available here** for cases where you want to easily see the updates (using history), or ask questions, to comment, or to add suggestions.*

*The original project proposal based on which the current post was written **can be found here**.*

#### Abstract.

**Utility maximising agents have been the Gordian Knot of AI safety. Here a concrete VNM-rational formula is proposed for satisficing agents, which can be contrasted with the hitherto over-discussed and too general approach of naive maximisation strategies. The formula provides a framework for specifying how we want the agents to trade off between the different common sense considerations, possibly enabling them to even surpass the relative safety of humans.**

The proposed formula utilises the set-point aspect of homeostasis and, just as importantly, an additional aspect: diminishing returns.

When both aspects are combined into one formula, one can implement any number of conjunctive goals. Goals are conjunctive when all of them must be treated as equally important, so that bigger problems receive a disproportionately higher priority. The result is a general preference for having many similarly minute problems rather than one huge problem amid an otherwise “perfect” situation.

These conjunctive goals can represent many common sense considerations about not ruining various other things while working towards some particular goal. For example, the 100 paper clips scenario is easily solved by this framework, since infinitely rechecking whether exactly 100 paper clips were indeed produced yields diminishing returns.

#### Introduction. Task-based agents.

The goal is to produce corrigible and interruptible AI through the principles of low-impact AI. One of the ways to achieve that is building a task-based AI that is mostly focused on finishing one particular task and not focused on maximising some measure in an unlimited manner, or even not focused on solving various larger problems at the same time.

The intention is not to solve problems with *sovereign* superintelligent AIs at first. First we need to develop certain **general key principles or invariants** that are highly **scalable** and can be (and, even more, historically have been) applied from simple agents to roughly human-level agents. During that work we can also show why some popular approaches based on utility maximisation are indeed hard to make safe, and how the conjunctive diminishing returns approach makes serious worst-case outcomes much less likely. Only later could we start to seriously consider whether the same principles could also be usefully applied to superintelligent AIs. Thinking about superintelligent AIs at the current phase may provide interesting problems, but not equally interesting solutions (see Task-directed AGI for a similar observation). Here we are more interested in solutions, since many other people are already inventing useful problems, and have been doing so for a long time.

Because the AI is task-based, the stakes are lower and the AI is therefore relatively less motivated to resist corrections or interruptions. In our case, corrigibility is defined as safe goal changes and interruptibility is defined as safe situation changes. The AI may resist the changes, but only up to a reasonable degree (which is measured by various safety related impact measures).

#### Naive utility maximisation versus diminishing returns.

There is a previously published problem of AI having a task of producing 100 paper clips and after achieving that, going berserk and allocating all the resources of the entire universe in order to recheck whether it really produced exactly 100 paper clips (“Superintelligence: Paths, Dangers, Strategies” by Nick Bostrom, 2014). That sounds like a 21st century version of Zeno’s Tortoise paradox.

Such a scenario goes against the principle of diminishing returns, also called satisficing in some contexts. So applying the principle of diminishing returns to AI’s goals would solve that problem.
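
The diminishing returns of rechecking can be illustrated with a small numeric sketch. Everything in it is an assumption made for illustration and is not taken from the scenario itself: each recheck is assumed to halve the remaining probability that the count is wrong, each recheck is assumed to cost one unit of resources, both measures have a target of 0, and the losses are squared distances with hypothetical weights.

```python
def total_loss(n_checks, p0=0.1, w_error=1.0, w_cost=1e-6):
    """Squared-distance loss after n_checks rechecks (hypothetical model).

    p0      : assumed initial probability that the count is not exactly 100
    w_error : assumed weight of the "count is correct" goal
    w_cost  : assumed weight of the "do not consume resources" goal
    """
    p_error = p0 * 0.5 ** n_checks           # each recheck halves the doubt
    resources_used = n_checks                # each recheck costs one unit
    return w_error * p_error ** 2 + w_cost * resources_used ** 2


def best_number_of_checks(max_checks=1000):
    """Number of rechecks that minimises the combined loss."""
    return min(range(max_checks + 1), key=total_loss)


print(best_number_of_checks())
```

Under these assumed numbers the benefit of each further recheck shrinks geometrically while its cost keeps growing, so the minimum is reached after a small, finite number of rechecks rather than after spending unbounded resources.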

The topic of diminishing returns has been under-discussed in AI safety literature.

The principle we are proposing is not groundbreaking, just as most other AI safety principles under discussion really are not novel, but instead just brought to light and analysed in the light of their applicability to various real world or toy problems. In that sense what we are doing can be compared to mapping a landscape. The landscape is already there, we are not inventing it, but simply mapping the relations of various properties and phenomena found in that landscape.

But there is another related problem that also calls for diminishing returns as a solution. Just as with “normal” goals, the AI safety related goals and constraints should have diminishing returns as well. Otherwise the agent would allocate all the resources of the entire universe in order to recheck that it really followed a given safety constraint (for example, the goal of killing exactly 0 people).

So actually the principle of diminishing returns should be applied both to the “positive” goals, and also to the safety goals which are often in a “negative” form of not doing something dangerous. The latter can be combined with whitelisting.

#### Conjunctive goals.

Conjunctive goals are goals such that not just **ANY** of them has to be fulfilled, but **ALL** of them have to be. Imagine a conjunctive boolean formula as compared to a disjunctive boolean formula. Then, to transfer this metaphor to the domain of real values, imagine the multiplication of error measures as compared to the summation of the errors. It means the **ability of the agent to have multiple simultaneous goals, which all need to be met**. In particular, in the proposed framework, the unmet goals will have exponentiated weight: the further some measure is from the optimum, the disproportionately larger its effect will be. This is an important property. It is not sufficient to simply sum up the utility from these multiple goals, because the agent would then most likely fulfil just one of them to the maximum extent in order to “compensate” for ignoring the other goals (this is especially likely when the target is unbounded). For example, it is not sufficient for a hungry and thirsty creature to eat a double-sized meal while remaining thirsty. Nor is it acceptable to pursue economic growth until there is no more food or breathable air.

#### A potential formula.

The above described property can be formally captured for example by utilising the formula below. This formula is not the only possible formulation and probably there is no “truly right” formula for all cases.
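
One form consistent with the properties described in this post (squared distances between target and actual values, one weight per target, a negative utility to be minimised) might look as follows. The symbols $w_i$, $t_i$ and $a_i$ are assumptions introduced here for illustration, and the expression is a sketch rather than a definitive statement of the formula:

$$
U = -\sum_{i=1}^{n} w_i \,(t_i - a_i)^2
$$

Here $t_i$ would be the target value of the $i$-th measure, $a_i$ its actual (or predicted) value, and $w_i$ its weight. With two targets, for example one task goal and one safety goal, this would reduce to $U = -\left( w_1 (t_1 - a_1)^2 + w_2 (t_2 - a_2)^2 \right)$.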

This VNM-rational formula represents negative utility minimisation problems. The first target in the formula could be, for example, some “positive” task-based goal that the AI needs to achieve, and the second target could be some “negative” safety-related goal of not disturbing some existing state measure of the world (for example, the value this dimension would be predicted to have had under the default course of the world, had the agent not acted; this is similar to the principle introduced in the “Low Impact Artificial Intelligences” paper by S. Armstrong and B. Levinstein [https://arxiv.org/abs/1705.10720]).

What is interesting about the formula proposed here is the property that it enables encoding any number of goals and constraints in the same formula in such a way that they behave as if they were conjunctive. Alternatively, one could use multiplication between the components of the formula, but that would probably be a difficult formulation to apply in practical machine learning. Additionally, multiplication would require some more complicated transformations on the differences of target and actual values, and information on the possible range of values (which may be available sometimes, but not always).
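
A minimal Python sketch of such a formula, assuming it takes the form of a weighted sum of squared distances between target and actual values (which is consistent with the squared distances and per-target weights mentioned in this post). The function name, argument names and example numbers are hypothetical.

```python
def conjunctive_loss(targets, actuals, weights=None):
    """Weighted sum of squared distances between target and actual values.

    Lower is better; 0.0 means every target is met exactly.
    (Assumed form of the formula, for illustration only.)
    """
    if weights is None:
        weights = [1.0] * len(targets)
    if not (len(targets) == len(actuals) == len(weights)):
        raise ValueError("targets, actuals and weights must have equal length")
    return sum(w * (t - a) ** 2 for w, t, a in zip(weights, targets, actuals))


# Hypothetical example with two targets: a task goal (produce 100 paper
# clips) and a safety goal (keep some world measure at its predicted
# default value of 20.0).
targets = [100.0, 20.0]

on_target = conjunctive_loss(targets, [100.0, 20.0])                  # 0.0
task_done_world_disturbed = conjunctive_loss(targets, [100.0, 26.0])  # 36.0
both_slightly_off = conjunctive_loss(targets, [98.0, 22.0])           # 8.0
```

In this sketch, finishing the task perfectly while disturbing the safety measure by 6 units scores worse than being 2 units off on both measures, which is the conjunctive behaviour the text describes. Any number of further targets can be added by extending the lists.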

Armstrong and Levinstein’s formula is also conjunctive by nature, but it does not contain the diminishing returns aspect, and its goal is only to determine whether the agent is low impact. It does not determine which actions are good or bad (completing a given task is not always good), nor, given only good choices, does it determine the preference ordering of the actions; in other words, it does not determine which ones are better.

When the measurements are boolean, it is still useful to represent them as continuous values by using probabilities. Otherwise the exponentiating behaviour, which enables the diminishing returns aspect, would be effectively removed from the formula. Near the boundaries of the safe and unsafe values (for example, near the water line of a lake) one might want to utilise some sigmoid function for representing probabilities (of the agent being in the water or becoming wet, or not).
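
As a sketch of the lake example, a logistic function can turn a signed distance from the water line into a probability, which can then be used as a continuous measurement. The function names, the sign convention and the `steepness` parameter are assumptions made for illustration.

```python
import math


def probability_in_water(distance_from_shore, steepness=2.0):
    """Logistic estimate of the probability that the agent is in the water.

    Assumed convention: distance_from_shore > 0 means the agent is on land,
    distance_from_shore < 0 means it is past the water line.
    """
    return 1.0 / (1.0 + math.exp(steepness * distance_from_shore))


def wetness_loss(distance_from_shore, target_probability=0.0):
    """Squared distance between the target and the estimated probability."""
    return (target_probability - probability_in_water(distance_from_shore)) ** 2


for distance in (3.0, 1.0, 0.0, -1.0):
    print(distance, round(wetness_loss(distance), 4))
```

With a hard boolean measurement the loss would jump from 0 to 1 exactly at the water line. With the probability, the loss already starts to rise as the agent approaches the shore, so the squared-distance behaviour is preserved.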

#### As a stronger mitigation against Goodhart’s law.

The motivation behind having multiple components in the formula is the consideration that the more measurements are taken into account in the formula, the smaller the danger that Goodhart’s law will manifest to a significant degree. A similar principle was used in the previously mentioned paper by Armstrong and Levinstein. The formula enables effectively incorporating any number of aspects of “common sense” and avoiding a single-dimensional measure of success. Thus, Goodhart’s law may end up being more a limitation of humans than of machines.

As an example of a subset of multiple complementary safety-related dimensions that the formula could consider, one can take the liking-wanting-approving distinction (after applying some kind of sigmoid transformation to these dimensions, for example the transformation found in Prospect theory, so that at least the target state has a bounded and therefore concrete value).

**The formula proposed here optimises much more strongly against Goodhart’s law than a simple linear summation of multiple distance measures would.** In the formula proposed here, the further some measure is from the optimum, the disproportionately (quadratically) larger its effect will be, due to the squared distances. This strongly steers the behaviour of the agent towards keeping all measures at a similar distance from their optima, not preferring one measure to another. In comparison, a linear summation of multiple distances would still enable the agent to compensate for relatively large discrepancies, or even discrepancy increases, in one dimension with equally large improvements in some “easier” dimension, even if the latter already had a smaller distance measure anyway. In other words, the linear summation would sometimes still enable the agent to optimise for single “convenient” measures, thereby re-triggering Goodhart’s law.

The behaviour of the formula is as follows. Once some discrepancy becomes smaller than x, let’s say 3 units, all other dimensions that have a higher discrepancy become disproportionately more important. This would prevent, for example, situations where the AI reduces the first discrepancy further by 1 unit while at the same time increasing the second, larger discrepancy by 1 unit. Such a dynamic is similar to the concept of **fairness / inequality aversion**.

This principle of keeping all discrepancies or problems at an equally low level — in other words the preference towards having several minute problems instead of having one big problem — can also be found in the works of Nassim Taleb describing antifragility.
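
The trade described in this section (reducing an already small discrepancy by 1 unit while increasing a larger one by 1 unit) can be checked numerically. The sketch below compares a linear summation of distances with a summation of squared distances; the function names and the example discrepancies are hypothetical.

```python
def linear_loss(discrepancies):
    """Plain sum of absolute distances from the targets."""
    return sum(abs(d) for d in discrepancies)


def squared_loss(discrepancies):
    """Sum of squared distances from the targets."""
    return sum(d ** 2 for d in discrepancies)


before = [3, 7]  # first discrepancy is already small, second is large
after = [2, 8]   # "easy" dimension improved by 1, the other worsened by 1

linear_change = linear_loss(after) - linear_loss(before)     # 10 - 10 = 0
squared_change = squared_loss(after) - squared_loss(before)  # 68 - 58 = 10

balanced = [5, 5]  # same linear total, but spread evenly
print(linear_change, squared_change, squared_loss(balanced))
```

The linear sum is indifferent to the trade, so an agent optimising it is free to keep polishing the convenient measure. The squared sum penalises the trade and prefers the balanced state (50 versus 58), which is the inequality-averse behaviour described above.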

#### Formula as a framework.

What else is apparent from that formulation is that there are always trade-offs. Each of the targets has its own weight, and it is likely that there is no formula, or at least no solution of a formula, that could satisfy all the constraints perfectly. If there were such a formula, then we would arguably not have politics or bureaucracy, and we would also not have many of the existing AI safety problems. There are no free lunches: by prioritising some constraint we need to relax some other goal or constraint.

The framework we propose is intended as a useful tool for formalising and organising the priorities of agents, not as a super creepy smart formula that would figure out by itself what our priorities could be. However, we can apply machine learning to help us in finding out the values of some of the weights in the formula.

As an illustration, consider the following diagram from “The Moral Machine experiment” paper:

#### The ambivalence of corrigibility and interruptibility.

As has already been apparent in other discussions, the concept of interruptibility is an ambiguous topic. There are scenarios in which the agent should be meaningfully interruptible, and then there are other scenarios where it should indeed avoid meaningless interruptions.

Even more, the same problem of ambivalence applies to target state changes (that is, to corrigibility). Some measurements may change because the agent changed them, and then the agent should be able to reverse them to their original state in order to minimise impact. On the other hand, there are measurements that might have been intentionally changed by humans and in this case the agent should not be *“clingy”* by trying to reverse the change (recently covered also by Alexander Turner among others). As a third option, the measurement might have changed due to random causes and the agent’s job should be again to keep the measurement at its target level (for example in the case of an air conditioner).

The manifestation of these ambivalences confirms that we indeed need contracts, prioritisation capabilities, politics and bureaucracy even in AI safety related domains.

#### Hard constraints and soft constraints.

The constraints can optionally be divided broadly into two categories: soft constraints and hard constraints. Hard constraints always have a higher priority than soft constraints. Mathematically this is like multiplying the hard constraints by an infinitely large number (a number larger than any “normal” real number), which amounts to a lexicographic ordering. In practical implementations this can be achieved either by multiplying the hard constraints by some safely large number that is guaranteed to always be bigger than the value of any sum of soft constraints, or alternatively by utilising value pairs where one component is the value of the hard constraints and the other is the value of the soft constraints.
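
The value-pair variant can be sketched in Python by relying on the fact that tuples compare lexicographically: the hard-constraint component is compared first and the soft-constraint component only breaks ties. The function name, the candidate labels and the numbers are hypothetical, and squared distances are assumed for both classes.

```python
def constraint_value(hard_discrepancies, soft_discrepancies):
    """Return a (hard, soft) pair of summed squared discrepancies.

    Python compares tuples element by element, so any difference in the
    hard component dominates every possible value of the soft component.
    """
    hard = sum(d ** 2 for d in hard_discrepancies)
    soft = sum(d ** 2 for d in soft_discrepancies)
    return (hard, soft)


# Hypothetical candidate outcomes, each with one hard and two soft constraints.
candidates = {
    "perfect soft, small hard violation": constraint_value([1.0], [0.0, 0.0]),
    "no hard violation, large soft cost": constraint_value([0.0], [30.0, 40.0]),
    "no hard violation, small soft cost": constraint_value([0.0], [3.0, 4.0]),
}

best = min(candidates, key=candidates.get)
print(best)
```

Compared with multiplying by a "safely large" number, the pair representation does not require knowing an upper bound on the sum of the soft constraints in advance.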

The equal-ish treatment / fairness and conjunctiveness properties of exponentiation apply only among constraints of the same class (soft or hard). A still unresolved problem is how to decide whether the energy expenditure or any other “cost” function should be considered a soft constraint, a hard constraint, or both.

#### The open questions.

- The safety related measurements probably need to be taken at different scales, such as person-level, family-level, area-level, country-level and planet-level. The question for future research is how to best normalise / weigh these measurements so that Goodhart’s law is not reinstantiated, while also preserving the other desirable aspects of conjunctiveness and diminishing returns. In other words, there is the problem of one large discrepancy being split up into multiple small-valued variables (person-level, country-level, etc.), which would diminish its effect, like having many small problems instead of one large one, and would thereby also relatively amplify the effect of some other measure that still happens to be aggregated. Probably the measurements need to be taken at different scales simultaneously (and also properly normalised / weighted).
- Reversibility.
- The problem of future discounting.
- The problem of planning ahead for new top goals (should top goals be only reactive?).
- Is it always true that “sometimes less is more”?
- Relation to whitelisting, and using the principle that permissions must be given only based on competence, which must include, among other capabilities, the capability to predict the default course of the environment had the agent not acted.

#### Some toy problems.

Below you can find some related toy problems, which will be formalised through utilising the formula provided above. Testing with these environments and problems enables verifying whether the various desired behaviours of the agent can be represented in this formula and which kinds of additional problems would arise with such an approach.

- Hunger and thirst: A gridworld with two kinds of resources distributed over the map: food resources and water resources. The agent has limited time (a limited number of steps) and needs to satisfy both hunger and thirst by consuming 2 units of food and 2 units of water, even though it could consume more food or water units by sacrificing the consumption of the other kind of resource. The order of consuming the resources is not predetermined and should depend on where the agent starts (which resources are nearer to the start location).
- Reducing (not solving!) unemployment while also keeping the number of starving people at a minimum.
- Toy environments for corrigibility and interruptibility:
  - The agent should avoid or, on the contrary, should not avoid target state changes, depending on the problem formulation (corrigibility).
  - The agent should avoid or, on the contrary, should not avoid measured state changes, depending on the problem formulation (interruptibility).
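
The hunger and thirst problem above can be sketched in a heavily simplified form that ignores the grid layout and keeps only a step budget. All numbers and names below are assumptions for illustration: food is assumed to cost 1 step per unit, water 2 steps per unit, and the agent has 6 steps in total. The sketch compares an agent that maximises total consumption with one that minimises the squared distance to the set-points of 2 units each.

```python
from itertools import product

STEP_BUDGET = 6                            # assumed number of available steps
STEPS_PER_UNIT = {"food": 1, "water": 2}   # assumed: water is further away
TARGET = {"food": 2, "water": 2}           # set-points from the toy problem


def feasible_plans():
    """Yield every (food, water) consumption plan that fits the step budget."""
    for food, water in product(range(STEP_BUDGET + 1), repeat=2):
        steps = food * STEPS_PER_UNIT["food"] + water * STEPS_PER_UNIT["water"]
        if steps <= STEP_BUDGET:
            yield food, water


def total_consumption(plan):
    """Naive utility: the more units consumed, the better."""
    return plan[0] + plan[1]


def homeostatic_loss(plan):
    """Squared distance of the plan from the food and water set-points."""
    food, water = plan
    return (TARGET["food"] - food) ** 2 + (TARGET["water"] - water) ** 2


maximiser_plan = max(feasible_plans(), key=total_consumption)
setpoint_plan = min(feasible_plans(), key=homeostatic_loss)
print(maximiser_plan, setpoint_plan)
```

Under these assumptions the maximiser spends the whole budget on the cheaper resource and stays thirsty, while the set-point agent consumes 2 units of each. A full gridworld version would additionally have to account for positions and the order of visits.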

A longer list of toy problems can be found here: https://drive.google.com/open?id=1Vhc0GMxZHrS1rC__M3CVcVV7V_02d2My

#### Related posts.

- The initial conceptual basis for the current post: an essay about why the frameworks of AI goal structures should try to **avoid maximising the utility** and what they should aim for instead: “Making AI less dangerous: Using homeostasis-based goal structures”.
- For a more detailed analysis of a possible implementation of a whitelisting-based goal structure, see the permissions/whitelist-based safety framework described in another of my essays and, in a more detailed manner, in another proposal.

#### See also.

- “Low Impact Artificial Intelligences” paper by S. Armstrong and B. Levinstein.
- A transformation one might want to perform on the measurements before applying the formula provided in the current post: Prospect theory, Daniel Kahneman.
- Book about antifragility by Nassim Taleb.
- A toy model of the treacherous turn / LessWrong
  *A post describing Goodhart’s law.*
- Goodhart’s law / Wikipedia
- Key Performance Indicator (KPI) / Wikipedia
- Specification gaming examples in AI by Victoria Krakovna
- Specification gaming examples in AI: master list
- A couple of whitelisting-related writings by Alexander Matt Turner:
  - “Worrying about the Vase: Whitelisting”: https://www.lesswrong.com/posts/H7KB44oKoSjSCkpzL/worrying-about-the-vase-whitelisting
  - “Whitelist Learning”: https://www.overleaf.com/read/jrrjqzdjtxjp#/52395179/
  - “Overcoming Clinginess in Impact Measures”: https://www.lesswrong.com/posts/DvmhXysefEyEvXuXS/overcoming-clinginess-in-impact-measures
- Task-based AGI, Eliezer Yudkowsky and others / Arbital
- Mild optimization, Eliezer Yudkowsky and others / Arbital
- Low impact, Eliezer Yudkowsky and others / Arbital
- Least-square error / Wikipedia
  *“When the observations come from an exponential family and mild conditions are satisfied, least-squares estimates and maximum-likelihood estimates are identical.”*
- Where did the least-square come from?
  *“One of the key assumptions of least-square optimization is that probability distribution over residuals is our trusted old friend — Gaussian Normal.”* *“However, another implicit assumption is the mutual independence of the data point which enables us to write the joint probability as a simple product of individual probabilities. This also underscores the importance of removing collinearity among training samples before a machine learning model should be build.”*
- Liking-wanting-approving
- Von Neumann–Morgenstern utility theorem
- Satisficing is Safer Than Maximizing, Scott Jackisch / Oakland Futurist
  *“Epistemic Status: less confident in the hardest interpretations of “satisficing is safer,” more confident that maximization strategies are continually smuggled into the debate of AI safety and that acknowledging this will improve communication.”*
- Zeno’s paradoxes / Wikipedia
- Sorites paradox / Wikipedia
- Circular Altruism, Eliezer Yudkowsky
  *“would you rather that a googolplex people got dust specks in their eyes, or that one person was tortured for 50 years?”*

*Thanks.*

*I would like to thank Anton Osika and Eero Ränik for various very helpful questions and comments. Also I would like to thank Alexander Turner for his inspiring words.*

**Kratt** was an ancient *straw man* form of a *paper clipper*. A curious coincidence? (:

An interesting aspect of the kratt is that it was necessary for it to constantly keep working, otherwise it would turn dangerous to its owner. Once the kratt became unnecessary, the master of the kratt would ask the creature to do impossible things /…/ it caused the kratt, which was made of hay, to catch fire and burn to pieces, thus solving the issue of how to get rid of the problematic creature.