Making AI less dangerous: Using homeostasis-based goal structures
Roland Pihlakas, December 2017
A publicly editable Google Doc with this text is available here, for cases where you want to easily see updates (using the history), ask questions, comment, or add suggestions.
The idea itself originates from a part of a research proposal of mine from July 2007, but was unfortunately not elaborated until now.
I would like to propose a certain kind of AI goal structure as an alternative to utility-maximisation-based goal structures. The proposed alternative framework would make AI significantly safer, though it would not guarantee total safety. It can be used at the strong-AI level and also much below it, so it scales well. The main idea is to replace utility maximisation with the concept of homeostasis.
A more detailed VNM-rational formula and analysis developed based on the current post is available here: Diminishing returns and conjunctive goals: Mitigating Goodhart’s law. Towards corrigibility and interruptibility.
AI safety is usually discussed in a framework where the AI has to maximise some goal, often over an unlimited amount of time. As has already been well observed, this implies a big danger of the AI going berserk.
I think this approach is primitive, even alien, singularly focused, and actually not necessary at all.
There are alternatives, for example approaches inspired by nature and control theory. I would like to propose one such approach and explain how it differs from the currently popular ones. It would be scalable from basic AIs up to strong AIs.
The currently very popular utility-maximisation framework provides some understandable mathematical conveniences. But in any case that is even minimally more elaborate, this convenient simplification becomes a serious, many-faceted hindrance.
There are already some other proposals for escaping the utility-maximisation framework (quantilising and satisficing). Though these still seem to strive for utility maximisation, they at least do so under certain more reasonable constraints.
An alternative approach.
In nature there is the principle of homeostasis. Creatures do not try to maximise anything, instead, they try to keep things at a certain balance, at their setpoint, and then settle to rest.
In nature this is achieved through operant behaviour, which is analogous to control systems in cybernetics in that it is directed towards achieving and maintaining homeostasis, NOT towards maximising the reinforcement (as in basic reinforcement learning), as popular accounts have probably presumed due to superficial reading!
Maximisation of the reinforcement in the framework of operant behaviour occurs only in certain pathological cases like addiction, caused by certain tricky reinforcement schedules, which in turn are enabled by certain heuristic mechanisms in the brain. So it is in no way a normal state.
Of course, some needs automatically arise again and again, as is the case, for example, with curiosity. But regardless, the goal is not to maximise anything, nor even necessarily to plan far ahead about fulfilling future needs (in the current example, the future need “curiosity”).
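The settle-to-rest dynamic described above can be sketched as a minimal control loop. All names here (the setpoint, the tolerance band, the proportional correction) are illustrative choices of mine, not from the original text:

```python
# A minimal sketch of a homeostasis-based behaviour loop:
# act only while an indicator is outside its tolerance band,
# then settle to rest instead of maximising anything further.

def homeostatic_step(value, setpoint, tolerance=0.1):
    """Return a corrective action while the indicator is outside
    its tolerance band; otherwise rest (do nothing)."""
    error = setpoint - value
    if abs(error) <= tolerance:
        return 0.0          # within the band: settle to rest
    return error            # proportional corrective action

value = 0.0
for _ in range(100):
    action = homeostatic_step(value, setpoint=1.0)
    if action == 0.0:
        break               # setpoint reached: cool down
    value += 0.5 * action   # the environment responds to the action

print(round(value, 3))      # a value within 0.1 of the setpoint 1.0
```

Note that the loop terminates by itself: once the indicator is inside the band, the agent has no remaining incentive to act, which is exactly the contrast with an open-ended maximiser.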
About the safety of the alternative framework.
The agents in this framework can obviously still cause noticeable damage: all active agents can and probably will cause much damage, that is a given! This applies to the currently proposed framework as well, so the proposed framework is only a partial solution.
For further explanation of one of the fundamental causes, read my essay about self-deception. One of the main topics of that essay is the side effects of actions, and the fundamental limits to computation that follow from fundamental limits to attention-like processes.
But the damage of homeostasis-based agents will at least be much more limited once the agent settles down to rest, compared to a basic utility-maximisation framework, where the robot will go on like a restless madman or a runaway engine until the bitter end. Therefore my proposed framework is probably safer; in comparison, almost “saint-like”.
Even more, such an agent can have multiple simultaneous goals, all of which need to be met. This is an important property. In the proposed framework it is not sufficient to simply sum up the utility from these multiple goals and then fulfil just one of them to the maximum extent in order to “compensate” for ignoring the others. For example, it is not sufficient for a hungry and thirsty creature to eat a double-sized meal while remaining thirsty, or to have economic growth until there is no more food or breathable air.
Therefore — the agent cannot afford to achieve just ANY goal to a great extent, instead, it has to achieve ALL of them up to a reasonable extent.
The important result is that when the agent goes too far in achieving one of the goals, this will inevitably cause big costs with regard to the other goals. The agent will then simply abandon the costly goal and pay attention to the other goals too.
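The hungry-and-thirsty example can be made concrete by contrasting an additive aggregation, where goals compensate for each other, with a conjunctive (weakest-link) one. The goal names and values below are illustrative:

```python
# Under additive utility, a double meal with no water scores the same
# as a balanced intake; under a conjunctive (min) rule it does not.

double_meal = {"food": 2.0, "water": 0.0}
balanced    = {"food": 1.0, "water": 1.0}

def additive(state):
    return state["food"] + state["water"]         # goals compensate

def conjunctive(state):
    return min(state["food"], state["water"])     # weakest goal dominates

print(additive(double_meal), additive(balanced))        # 2.0 2.0 (a tie)
print(conjunctive(double_meal), conjunctive(balanced))  # 0.0 1.0
```

Under the conjunctive rule, overachieving on food contributes nothing as long as thirst remains unmet, so the agent is pushed to satisfy ALL goals up to a reasonable extent rather than ANY single one to a great extent.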
The other goals can for example be goals to keep some indicators at their original level (for example, as long as they are changed by the agent, not by external causes) and therefore not to cause havoc and negative externalities in the world while achieving the “main” goal.
As an implementation detail, I suggest that these additional don’t-modify-some-stuff kind of goals can be implicit — that is, not blacklist based, but instead automatically generated based on a whitelist of permissible changes.
The creator of both kinds of goals should consider the expertise, as well as the sensory, inference, and executive capabilities of the AI. In case anything goes wrong, the responsibility would lie with the creator of the goals. For a more detailed analysis of a possible implementation of such a goal structure, see the permissions/whitelist-based safety framework described in another of my essays and in more detail in another proposal.
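The whitelist-based generation of implicit don’t-modify goals might be sketched as follows. The world representation and all variable names here are hypothetical, chosen only for illustration:

```python
# Sketch: automatically generate implicit "keep at original level"
# goals for every indicator NOT on the whitelist of permissible
# changes, pinning each one to its current value.

def implicit_goals(world_state, whitelist):
    """Return setpoint goals that pin every non-whitelisted
    indicator to its original value."""
    return {key: value
            for key, value in world_state.items()
            if key not in whitelist}

world = {"room_tidiness": 0.3, "vase_intact": 1.0, "cat_health": 1.0}
whitelist = {"room_tidiness"}        # the agent may tidy the room

print(implicit_goals(world, whitelist))
# everything else is pinned: {'vase_intact': 1.0, 'cat_health': 1.0}
```

The point of the whitelist direction (as opposed to a blacklist) is that the goal creator does not have to enumerate every possible kind of havoc in advance; anything not explicitly permitted becomes a do-not-disturb setpoint by default.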
Partial corrigibility and interruptibility (at least minimally).
Based on the above, the proposed alternative approach would be to have a goal system where the AI has some specific targets to achieve regarding some indicators. After reaching the targets, it shuts down.
In reinforcement-learning terms this would mean that after a specific situation obtains, the AI gets maximum utility from doing nothing at the given moment, discounting the future. “Doing nothing (discounting the future)” to the extent that when some external agent modifies the state and thereby causes new needs to arise, or even causes the goal structure of the AI to change, the AI would not resist in the least. Up to the moment the new needs or goals arise, the AI is only concerned with doing nothing, and interfering with something definitely would not classify as doing nothing.
The AI would only be concerned with achieving the given goals, not with anything that happens in between the goals being met and new goals arising. Avoiding some new needs from arising would be a different goal structure which does not need to be activated.
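The “do the deed and cool down, then accept new goals without resistance” behaviour described above can be sketched as a small agent loop. The class structure and names are my illustrative assumptions, not a specification from the text:

```python
# Sketch: the agent acts only while some goal has an unmet deficit,
# idles once all setpoints are met, and simply resumes work (rather
# than resisting) when an external party later adds a new goal.

class HomeostaticAgent:
    def __init__(self, goals, tolerance=0.05):
        self.goals = dict(goals)      # indicator -> setpoint
        self.tolerance = tolerance

    def step(self, state):
        for key, setpoint in self.goals.items():
            error = setpoint - state[key]
            if abs(error) > self.tolerance:
                return (key, error)   # act on the first unmet goal
        return None                   # all setpoints met: do nothing

agent = HomeostaticAgent({"battery": 1.0})
state = {"battery": 1.0, "inbox": 0.0}

assert agent.step(state) is None            # satisfied: resting
agent.goals["inbox"] = 1.0                  # the master sets a new goal
assert agent.step(state) == ("inbox", 1.0)  # resumes without resistance
```

The key design point is that the resting state carries no incentive to defend itself: changing `agent.goals` from outside does not lower any value the agent is tracking, so there is nothing for it to resist.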
As an analogy: usually, in our contemporary, normally functioning world, no healthy intelligent creature would even think about killing their “master” just because the master invents new tasks to perform or causes new situations to be solved. It is normal to have a feedback loop.
One more important aspect. Since the AI has a goal to achieve something and then cool down, it will even actively avoid doing more than necessary. “Less is more.”
Additionally, it is reasonable to configure the goal system so that it has an additional conjunctive goal of avoiding damaging the master and even avoiding conflicts with the master, which means that the agent will rather yield than resist. At the same time, reasonable “civilised” resistance, like proposing a few counterarguments, should be considered permitted and even useful.
Dealing with future uncertainty.
Finally, just like in reinforcement learning, future utility discounting can be combined with the approach above.
In this way the agent knows that it does not know its future goals and will not try to plan too far ahead. Since in the proposed framework the goals may obviously change in unforeseeable ways, and there are also many of them, and the agent knows all of this well, the effect of discounting will be stronger: there really would be a negative result from planning too far ahead, for example because doing so would incur opportunity costs.
This is in contrast to a case with basic reinforcement learning, where even discounted maximised utility ultimately results, over unlimited time, in a substantial amount of benefit to pursue.
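The contrast in the last two paragraphs comes down to one line of arithmetic. With a constant per-step reward r and discount factor gamma (illustrative numbers below), a discounted maximiser still faces an unbounded-horizon value of r / (1 − gamma), so there is always substantial benefit left to pursue; a homeostatic agent whose setpoints are met has nothing left to chase:

```python
# Illustrative numbers: even with discounting, a maximiser's
# remaining value over an unlimited horizon is the geometric-series
# sum r / (1 - gamma), which stays substantial.

gamma, r = 0.99, 1.0
maximiser_value = r / (1 - gamma)   # about 100.0: plenty left to chase
homeostatic_value = 0.0             # setpoints met: nothing left to chase

print(maximiser_value, homeostatic_value)
```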
In short, the proposal currently consists of the following points:
1. Goals are specified as homeostatic setpoints. This architecture results in the following:
2. Do the deed and cool down. Damage from missing expertise, flawed reasoning, or misalignment is therefore limited.
3. Not just passively, but even more — actively avoid doing more: less is more.
4. Future needs/goals may be configured as unknown, so the agent does not try to prepare for them.
5. In certain conditions the agent does not interfere with new needs arising or with new goals being set. This counts as partial corrigibility and interruptibility (at the least).
6. Can be combined with future discounting.
7. The framework enables the agent to have multiple conjunctive goals, such that not just ANY of them has to be fulfilled, but ALL of them have to be. Many of the additional goals can be configured as “negative” goals: these additional goals would specifically be about “not causing some things”, that is, NOT disturbing the external world too much. This is in contrast to the other goals, which are about “positively” modifying something.
- These additional goals could be for example automatically generated based on a whitelist of permissible changes.
- For a more detailed analysis of one possible way of creating both kinds of goals, see the following links, which describe how the proposed framework can be combined with the permissions/whitelist-based safety framework described in another of my essays and in more detail in another proposal.
8. Bonus: I claim that the setpoint-as-a-goal architecture also enables “insight learning” (a concept in natural intelligence) which enables much more flexible learning, understanding and planning, based on much less data than classical reinforcement learning requires. But I will not elaborate on that here.
For an English-language example of a somewhat related but less powerful approach, see an article about an OpenAI algorithm (they call it “hindsight learning”). In Estonian you can additionally read more about insight learning in my thesis about modelling natural intelligence.
- A more detailed formula and analysis developed based on the current post: Diminishing returns and conjunctive goals: Mitigating Goodhart’s law. Towards corrigibility and interruptibility.
- For an additional partially related solution, see my essay about Nomenclature of AI control problem and “Reasonable AI”.
- Overfitting — Wikipedia.
- Taleb’s concepts of Mediocristan and Extremistan.
Excerpt from the text:
“Taleb’s central critique of “The Bell Curve, That Great Intellectual Fraud,” (the title of Chapter 15) is that it is often applied to areas that are subject to the dynamics of Extremistan, even though it only accurately describes Mediocristan.”
- A toy model of the treacherous turn — LessWrong.
Excerpt from the text:
“notice something interesting: The more precautions are taken, the harder it is for [a reinforcement-based agent] to misbehave, but the worse the consequences of misbehaving are.”
“while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values.”
— For me this means that instead of only looking for more precautions, we should also strive for a cybernetic / conversational / feedback-based approach.
- “Regressional Goodhart” and especially “Extremal Goodhart” and “Adversarial Goodhart” subsections in Goodhart Taxonomy — LesserWrong.
Excerpt from the same text:
Mitigation [of Extremal Goodhart]
“Quantilization and Regularization are both useful for mitigating Extremal Goodhart effects. In general, Extremal Goodhart can be mitigated by choosing an option with a high proxy value, but not so high as to take you to a domain drastically different from the one in which the proxy was learned.”
If I understand correctly, the most depressing thing about the Adversarial Goodhart case is that, unlike the name suggests, the agents who turn bad are not necessarily “adversarial” or malignant to begin with. But because of Goodhart’s law, they still become dangerous when they are put under too much precautionary control.
- Specification gaming examples in AI by Victoria Krakovna
- Specification gaming examples in AI — master list
- “The Wright brothers were first to fly because they developed a system of control that depended on feedback.
Everyone else was trying to build stable planes.
The Wright brothers built an unstable plane but developed a control system [that stabilised the plane].”
(YouTube: Norbert Wiener — Wiener Today (1981))
- The Orthogonality Thesis, Robert Miles
- Paul Pangaro — Cybernetics
- Prospect theory — Daniel Kahneman
- Task-based AGI — Eliezer Yudkowsky and others / Arbital
- Mild optimization — Eliezer Yudkowsky and others / Arbital
- Low impact — Eliezer Yudkowsky and others / Arbital
- Satisficing is Safer Than Maximizing — Scott Jackisch / Oakland Futurist.
Excerpt from the text:
“Epistemic Status: less confident in the hardest interpretations of “satisficing is safer,” more confident that maximization strategies are continually smuggled into the debate of AI safety and that acknowledging this will improve communication.”