Self-deception: Fundamental limits to computation due to attention-like processes’ fundamental limits (Definition of self-deception in the context of AI safety)
Roland Pihlakas, July 2007 — March 2008, updated in September 2017—March 2018
A publicly editable Google Doc with this text is available here, in case you want to see the updates easily (using the history), ask questions, comment, or add suggestions.
This document about self-deception is not a solution to a problem. It is a description of what is, in my opinion, a very serious problem that has not received much (any?) attention so far.
I believe the problem could be formalised and put into code to be used as a demonstration of the danger.
The main point is that the danger is not somewhere far away requiring some very advanced AI, but rather it is more like a law of nature that starts manifesting already in rather simple systems, without any need for self-reflection and self-modification capabilities etc. So instead of the notion that danger springs from some special capabilities of intelligent systems, I want to point out that some other special capabilities of intelligent systems would be needed to somehow evade the danger.
Minimally, when we are building an AI that is dangerous in a certain manner, we should at least realise that we are doing that.
Definition of self-deception in the context of AI safety.
Self-deception, in the context of the safety of AIs (though the cognitive theory behind it applies to humans as well), is a hypothetical phenomenon which has negative consequences and should be prevented from happening.
The essence of this phenomenon is that an AI acts contrary to some of its goals and permissions without knowing it, and incorrectly judges its behaviour as already quite good, or already as good as the circumstances permit.
This hypothetical concept of self-deception also involves, as an essential part, the following details:
The problem I have called “self-deception” arises when the agent’s mind possesses the ability / processes of ATTENTION — either through physical or cognitive mechanisms.
The process of attention can also be purely physical: a situation where the robot does not have full visibility and has to orient itself in the world towards the things it is going to observe. In doing so, the robot will also inevitably ignore some other things.
Most kinds of algorithmic/methodical behavioural patterns can be considered to be attention-like processes. For example, the complexity of combinatorial search in Turing machines, or as another example, race condition bugs, are due to the temporal limitations of attention-like processes.
The agent thereby has conjectured which things need to be focused on. It can come to such a conclusion because a certain (although here unspecified) behavioural pattern of paying attention is instrumentally important and has been reinforced.
Attention-like processes are inherently sequential. Thus, they have limited capacities and capabilities.
This applies to all computers, at least until quantum computers are used in such a manner that they also have total presence in the world for the observation purposes.
Except in trivial cases, the sequential attention-like processes will likely miss something important for various kinds of reasons, the taxonomy of which will be presented in the next chapter.
As a consequence:
- They will miss processing, or even gathering data about some concurrently occurring events. This case includes overfitting (drawing overly precise conclusions or extrapolating from too little data).
- They will fail to detect more complex nonlinear interactions between some variables, even when the raw data may be available. The world is infinitely more complex than a computer’s computational capacity can handle.
For example, in various important cases, the computational complexity of working memory is likely exponential.
The situation is even worse when considering that in case of nonlinear systems even simple rules may lead to complex processes: For many centuries the idea prevailed that if a system was governed by simple rules that were deterministic then with sufficient information and computation power we would be able to fully describe and predict its future trajectory. The revolution of chaos theory in the latter half of the 20th century put an end to this assumption, showing how even simple rules could in fact lead to complex behavior. (See the Nonlinear Dynamics & Chaos video on YouTube).
- They fail at solving combinatorial planning problems.
- I will not even get started on Undecidable problems and their real-world counterpart, Wicked problems. These include one-of-a-kind problems that occur only once. On top of that, they have the property that modifying the situation instigates other problems, so that even trying to explore or solve them carries high costs.
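The first consequence in the list above, missing concurrently occurring events, can be illustrated with a minimal sketch. It assumes a hypothetical agent that can attend to only one of several channels per time step and simply cycles through them. The function name `simulate_attention`, the round-robin policy and the event probability are illustrative assumptions, not part of the original argument.

```python
import random


def simulate_attention(num_channels, num_steps, event_prob=0.3, seed=0):
    """Count events that occur versus events the agent actually observes.

    Assumption: the agent attends to exactly one channel per time step,
    cycling through the channels in round-robin order.
    """
    rng = random.Random(seed)
    total_events = 0
    observed_events = 0
    for step in range(num_steps):
        focus = step % num_channels  # the only channel attended at this step
        for channel in range(num_channels):
            if rng.random() < event_prob:  # an event occurs on this channel
                total_events += 1
                if channel == focus:
                    observed_events += 1
    return total_events, observed_events


if __name__ == "__main__":
    for channels in (1, 4, 16):
        total, seen = simulate_attention(channels, num_steps=1000)
        print(f"{channels} channels: observed {seen} of {total} events")
```

In this toy setting the observed share of events shrinks roughly in proportion to the number of concurrent channels, and the agent has no record of what it missed. A smarter attention policy would change the numbers, but not the fact that something goes unobserved whenever there is more than one channel.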
The condition of self-deception sustains itself.
Self-deception can also be considered to be an aspect of habit or a bias.
Simultaneously with the conclusion (or a previous habit or bias), that some things are important and need to be focused on, the agent will develop new corresponding habits or biases. This happens with things and situations that could or even should be ignored from the agent’s perspective (in relation to and even as a consequence of its goals and permissions).
Overview of the current taxonomy.
Some matters may be ignored because of:
A) An implicit kind of negative motivation — Paying attention to these things wastes resources of attention / computational capacity, that is — it works against the instrumental usefulness of attention.
For example, while driving, it is not so useful to look at interesting formations of clouds. It is much more useful to intently look out for the street signs and other traffic.
— This one is about insufficient processing power.
B) A lack of positive motivation — The agent has been insufficiently reinforced or the credit assignment in the reinforcements was unclear for the agent. Therefore it could not learn that by paying attention in a certain manner it will achieve its goals or comply with permissions better.
— This one is about ignorance.
This point includes the relatively well-known issue that instrumental avoidance does not last in certain important models / configurations of cognition (including those of animals and humans). Avoidance behaviours fade over time and, seemingly paradoxically, they fade especially as a result of applying the avoidance itself.
For example, people become more careless over time until they have an accident.
NB! The concept of avoidance is very different from the concept of escape. In the case of escape, there is an observable reduction of danger or discomfort. In the case of avoidance/precaution, there is no observable reduction. The mathematical model of the mental learning mechanism behind these phenomena is pretty clear.
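One possible toy version of such a learning mechanism is sketched below. It assumes a simple delta-rule (prediction-error) update, in which the learned value of a protective behaviour moves towards the reduction of danger that the agent observes after using it. This is an illustrative assumption and not necessarily the exact model meant above; the function names and the learning rate are hypothetical.

```python
def update_value(value, observed_reduction, learning_rate=0.1):
    """Delta-rule update: move the learned value towards the observed outcome."""
    return value + learning_rate * (observed_reduction - value)


def learned_value_after(observed_reduction, steps=50, initial_value=1.0):
    """Repeat the behaviour `steps` times and return its learned value."""
    value = initial_value
    for _ in range(steps):
        value = update_value(value, observed_reduction)
    return value


if __name__ == "__main__":
    # Escape: each use is followed by an observable reduction of danger (1.0).
    print("escape:", learned_value_after(observed_reduction=1.0))
    # Avoidance: nothing observable happens after the precaution (0.0).
    print("avoidance:", learned_value_after(observed_reduction=0.0))
```

Under these assumptions the value of escape stays high, while the value of avoidance decays towards zero exactly because the precaution works and nothing observable happens. This matches the example of people becoming more careless over time until they have an accident.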
C) An explicit kind of negative motivation — This is probably the worst kind of self-deception and most difficult to overcome. Also this kind of self-deception is quite frequent.
The mechanism is as follows:
A certain pattern of paying attention or paying attention to certain things in the first place has been punished in some way — bringing forth only or predominantly negative conclusions / “experiences” in the context of the agent’s goals and permissions.
For example, given certain frequently occurring attitudes it might be painful for the agent to pay attention to its own mistakes. This can also be illustrated by the fact that the current essay is by far the least frequently read essay of mine, while I consider it to be one of the most important ones. The reason could very well boil down to the simple human tendency to avoid difficult and potentially ego-wrecking subjects such as the potential for self-deception that we all carry.
One interesting explanation for the phenomenon can be found in the LessWrong wiki. The article explains why this behaviour might even be reasonable / needed for a surprising additional reason.
As an alternative example, the explicit negative motivation manifests in conditions where the agent would need to decide to at least partially abandon some other seemingly important goals in order to start properly paying attention to problems in other areas. The latter is probably the most frequent cause of self-deception since opportunity costs inevitably surround us everywhere. For example, saving the environment may require entirely different economic models. — This one is analogous to race condition bugs: while one is busy developing the economy, one fails to monitor the effects on the environment.
(Of course, these three or four above mentioned motives / reasons can be partially overlapping, but it is more important how they complement one another. They also have different potential countermeasures).
Based on the above, it is expected that the condition sustains itself.
Even more: the agent may actively resist external attempts to educate it or change its worldview regarding the unattended aspects, both due to the new knowledge’s subjective uselessness and due to the cognitive conflicts the change would create in some cognitive configurations, especially when novelty is not so highly valued.
Ignoring something, or operating at the limits of capability and observability can sooner or later inevitably lead to “unknown unknowns” which in turn can have an outsized impact (as in black swan theory with extremistan / “extermistan”, but much more frequently).
Self-deception as a law of nature.
I believe the problem can be formalised and put into code to be used as a demonstration of the danger.
The main point is that the danger is not somewhere far away requiring some very advanced AI, but instead it is more like a law of nature that starts manifesting already in quite simple systems, without any need for self-reflection and self-modification capabilities etc. It is enough for the system to have only mobility, which will almost inevitably also act as a physical attention mechanism (like turning or moving away from and towards something), so the robot does not even need to possess cognitive attention mechanisms.
So instead of some special intelligent capabilities of systems being the starting point of the danger (like, for example, self-reflection or self-modification — which, according to the above taxonomy, are not necessary for self-deception to occur), I would instead say that some other very special intelligent capabilities of systems would be needed to somehow evade this danger…
The epistemological paradox.
Consider an illustration depicting our contact surface with the unknown:
I would rephrase the paradox:
“The more you DO, the less you know.
— With increased activity the ignorance increases.”
Further elaboration on the epistemological paradox.
The rephrased paradox could be illustrated with the following idealised diagram:
Red star — the real impact of the agent’s activities, extends to the zone of unknown.
Orange circle — the impact of the agent’s activities as predicted/imagined by the agent (also located in the zone of unknown).
White circle — the limit of the agent’s observable zone. Behind it starts the unknown. The unknown includes things the agent does not pay attention to, even if it could.
Blue circle — represents the area of activities and interests of the agent.
The heart — something the agent cares about a lot.
The lightning bolt — surprising adverse feedback originating from the unknown and reaching into “the heart”. That adverse feedback was activated by the agent’s activities which affected the unknown.
You may replace the lightning bolt with the owl from the cover of Bostrom’s book, in case you fancy that option more.
A partial solution to the epistemological paradox.
A partial solution would be to be conservative: that means restricting the activities of the agent in such a way that the results of these activities always remain in the observable and controllable zone, even if they suddenly happen to exceed the zone of the agent’s initial expectations. That means reducing the circle of activities and interests.
Reaching out from the observable zone would be permitted (sometimes even encouraged) only for exploratory reasons.
For general utilitarian purposes, it would be forbidden to perform activities that cause changes which could extend past the observable zone.
Therefore the results may be adverse, but at least they remain in the known.
The idea can be illustrated by the following idealised diagram:
Green star — the real impact of the agent’s activities, still in a “safer” / observable zone. The area under the green star can be compared to the concept of safe driving distance in traffic.
Lime circle — the impact of the agent’s activities as predicted/imagined by the agent.
Black circle — the limit of the agent’s observable zone. Behind it starts the unknown. The unknown includes things the agent does not pay attention to, even if it could.
Blue circle — represents the area of activities and interests of the agent.
The heart — something the agent cares about a lot.
Note: No lightning bolt, meaning that no adverse feedback originating from the unknown will be activated by the agent’s activities. Also note how the black circle here has exactly the same size as the white circle had in the “unsafe” diagram above.
Note also that adverse and unexpected effects which originate from the observable zone may still occur (these effects are not depicted here). But at least they are fully observable and controllable.
In order to prevent the latter as well, the agent would need to be even more conservative. Such a scenario would correspond to an agent that behaves according to permissions which are in turn given based on the competences of the agent.
The agent would need to consider the limitations of its prediction capability in such a manner that all activities of the agent have effects only inside the zone predictable by the agent. (Which means that the blue circle and the green star would need to be shrunk so that the extent of the green star is less than the extent of the lime circle.)
For more information about that approach, see my article about the Laws of Robotics.
The mandatory policy of not operating near the limit of the controllable zone is especially well known among, for example, plane and copter pilots (the aerial vehicle must not be operated at maximum power, otherwise there is no additional headroom power for critical maneuvers and it will fall down). More generally, this principle is known to all vehicle operators (consider safe driving distance in car traffic as an example). Similar principles apply in any area of life.
These diagrams of course raise the question of the topology of visibility / observability: even if the horizon of our observable zone is farther away than the location of the observable effects of some of our specific actions, how do we really know that there are no additional effects behind the horizon? Especially when considering second-level effects etc.
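The conservative policy and the headroom principle described above can be sketched as a simple permission check. This is a minimal sketch under stated assumptions: the `Action` fields, the scalar "radius" representation of impact, and the `headroom` fraction are all hypothetical simplifications of the diagrams, and exploratory actions are assumed to only observe rather than to cause changes.

```python
from dataclasses import dataclass


@dataclass
class Action:
    predicted_impact: float   # radius the agent expects to affect
    worst_case_impact: float  # radius it might affect if predictions are off
    exploratory: bool = False  # assumed to observe only, without causing changes


def is_permitted(action, observable_radius, headroom=0.2):
    """Permit an action only if its worst case stays inside the observable zone.

    A fraction `headroom` of the observable radius is kept in reserve,
    similar to a safe driving distance.
    """
    if action.exploratory:
        return True  # reaching beyond the observable zone is allowed for exploration
    limit = observable_radius * (1.0 - headroom)
    return action.worst_case_impact <= limit


if __name__ == "__main__":
    print(is_permitted(Action(1.0, 2.0), observable_radius=3.0))
    print(is_permitted(Action(1.0, 2.5), observable_radius=3.0))
    print(is_permitted(Action(1.0, 9.0, exploratory=True), observable_radius=3.0))
```

Passing the radius of the zone the agent can predict, instead of the radius it can observe, gives the stricter variant in which all effects must stay inside the predictable zone. The hard part, which this sketch leaves open, is obtaining a trustworthy worst-case estimate in the first place.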
One may relate these two diagrams to a diagram by Stuart Armstrong in his post about A Toy model of the Treacherous Turn:
Excerpt from the article:
“There are several places where [supervisor of the agent] might detect or interrupt the process, but this just sets up further evolutionary pressure towards better concealment.
Note again that the order of these changes is important. An agent that started off safe and transparent and then became dangerous and transparent, would be turned off immediately [indicated with red arrow in the diagram]. It must develop the ability to hide (either consciously/subconsciously, or simply evolve that ability to become more successful at its safe initial goals) before it can become dangerous…”
In other words, hiding the dangers behind the horizon of the observable zone (the bottom right side of the above diagram) makes the dangers only seemingly safe (invisible), while actually yielding even worse consequences in the end.
Where this problem does not manifest.
Self-driving cars are a very special exception where this problem does not manifest.
The reason is that self-driving cars operate in conditions of complete visibility, for four reasons. First and second, the results of the actions are always “here” and “now”. Third and fourth, the whole input situation is also completely visible and in the current moment, in other words again “here and now” (unless somebody is speeding). Traffic has been deliberately designed to be like that. In normal conditions, there are no unknowns in either inputs or outputs, and therefore no externalities.
In most other areas of life there is no such luxury. The decisions we make are based on partial information, and most importantly, the results will manifest somewhere else and after a time delay — so the results are very often hidden from us.
Partially related problems in both AI and sociology.
In my view, organisations are already an old form of Artificial General Intelligence. They are relatively autonomous from the humans working inside them. No person can perceive, fathom, or change things going on in there too much. We humans are just cogs in there, human processors for artificially intelligent software. The organisations have a kind of mind and goals of their own, that is, their own laws of survival.
They have some specific goals, initially set by us, but as has been discussed in various sources, unfortunately the more specific the goals, the less the utility maximisers will do what we actually intended them to do, and the more unintended side effects there will be.
One description of the mechanics of the problem is given in the current essay about self-deception, side effects, and fundamental limits to computation due to fundamental limits to attention-like processes.
Another partially related reference is for example by Eliezer Yudkowsky (Goodhart’s Curse, https://www.facebook.com/yudkowsky/posts/10154693419739228):
“… Goodhart’s Law states that whatever proxy measure an organization tries to control soon ceases to be a good proxy… Goodhart’s Curse is a neologism for the combination of the Optimizer’s Curse with Goodhart’s Law, especially as applied to AI alignment. Suppose our true values are V: V is the true value function that is in our hearts. If by any system or meta-system we try to align the AI’s utility U with V, then even if our alignment procedure makes U a generally unbiased estimator of V, heavily optimizing expected U is unusually likely to seek out places where U poorly aligns with V … places where we made an error in defining our meta-rules for alignment, some seemingly tiny mistake, a loophole.”
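The Optimizer’s Curse part of this quote can be demonstrated with a small simulation. It is a minimal sketch that assumes the proxy U equals the true value V plus zero-mean Gaussian noise, so that U is an unbiased estimator of V for any fixed option. The function names, the number of options and the noise level are illustrative assumptions.

```python
import random


def selection_gap(num_options, noise, rng):
    """Pick the option with the highest proxy U and return U - V for it."""
    true_values = [rng.gauss(0.0, 1.0) for _ in range(num_options)]
    # U = V + zero-mean noise: unbiased for every single option.
    proxy_values = [v + rng.gauss(0.0, noise) for v in true_values]
    best = max(range(num_options), key=proxy_values.__getitem__)
    return proxy_values[best] - true_values[best]


def mean_selection_gap(num_trials=100, num_options=200, noise=1.0, seed=0):
    """Average overestimate of the selected option across many trials."""
    rng = random.Random(seed)
    gaps = [selection_gap(num_options, noise, rng) for _ in range(num_trials)]
    return sum(gaps) / num_trials


if __name__ == "__main__":
    print("mean U - V of the chosen option:", mean_selection_gap())
    print("same with a noise-free proxy:", mean_selection_gap(noise=0.0))
```

Although the proxy is unbiased for each option taken alone, the option that wins the optimisation is systematically the one whose error happens to be favourable, so its proxy score overstates its true value. Stronger optimisation (more options) or a noisier proxy widens this gap in the toy model.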
See also Goodhart Taxonomy for a description of that partially related problem. See also A toy model of the treacherous turn — LessWrong.
Excerpt from the text:
“notice something interesting: The more precautions are taken, the harder it is for [a reinforcement-based agent] to misbehave, but the worse the consequences of misbehaving are.”
“while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values.”
— For me it means that instead of only looking for many precautions we should also strive for a cybernetic / conversational / feedback-based approach.
If I understand correctly, the most depressing thing about the Adversarial Goodhart case is that, contrary to what the name suggests, the agents who turn bad are not necessarily “adversarial” or malignant to begin with. They are simply under the pressure of improving their “performance”. But because of Goodhart’s law, they still become dangerous when they are put under too much precautionary control.
Another partially related problem is The Orthogonality Thesis. It is a “softer” version of the self-deception problem since it focuses on the issue of the ultimate goals of AI likely becoming unaligned with humans. In contrast, I am describing the mechanisms why, even in the unlikely case where the ultimate goals of the AI are right, it would still inevitably mess up big time regardless of its intelligence (at least unless something special is done as a mitigation).
In conclusion, since organisations do not have “children”, they have not had evolutionary pressures to obtain “genes” that make them care about the future and really be synergetic with humans in a sustainable way.
The same will apply to our more novel AGI creations, the intelligent machines, which will act as a new form of organisations (by providing services we already need, or will depend on in the future, etc). But unfortunately the situation will then be even more unbalanced than it is already, since in contrast to the old form of organisations, they will be even less dependent on humans, and additionally, less transparent, while also becoming even more powerful and autonomous with the help of new technology.
A partial solution would be following the principles described in the chapter “A partial solution to the epistemological paradox” in the middle of this article (you may also want to look at a couple of chapters before it).
Another partial solution would be implementing a modified version of “The Three Laws of Robotics”.
Yet another complementary solution is the idea of “Reasonable AI”.
Finally, an additional partial solution is the “Homeostasis-based AI”.
To better counteract the problems above, I would like to help humans to become more enabled themselves. And secondly, to promote the types of technologies that really are synergetic with humans and have “evolutionary properties” so that they can be tested in time.
Which leads most importantly to the proposal of the “Wise Pocket Sage”.
- Ludic Fallacy (Nassim Taleb concept) — Wikipedia
- A shopping mall security robot hits a child and then rolls over the child
(Because the robot was paying attention to other things. The robot also ignored the mother who was pushing it away. It would have done even more damage if the father had not intervened and removed the child).
- Local search versus Global optimisation
- Confirmation bias
- Open-Ended AI — Garry Kasparov (YouTube)
- A toy model of the treacherous turn — LessWrong.
Excerpt from the text:
“some mutations that introduce new motivations or behaviours that are harder for S to detect. This sets up an evolutionary pressure: the most successful L’s will be those that are rude and pushy, but where this rudeness is generated by parts of the L’s code that are hardest for S to detect (the polite ones don’t get as much done, the visibly rude ones get shut off as a precaution).”
- Goodhart’s law — Wikipedia
Excerpts: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes.”
and “… when a feature of the economy is picked as an indicator of the economy, then it inexorably ceases to function as that indicator because people start to game it.”
- Goodhart Taxonomy — LesserWrong
If I understand correctly, the Adversarial Goodhart case is most in line with the original Goodhart Law’s theme.
Also, if I understand correctly, the most depressing thing about the Adversarial Goodhart case is that, contrary to what the name suggests, the agents who turn bad are not necessarily “adversarial” or malignant to begin with. They are simply under the pressure of improving their “performance”. But because of Goodhart’s law, they still become dangerous when they are put under too much precautionary control.
- Policy-based evidence making — Wikipedia
- Surrogation — Wikipedia
- Unintended consequences — Wikipedia
- Cargo Cult — Wikipedia
- Wicked problem — Wikipedia
- Externalities — Sustainable Human
“Conventional economics is a form of brain damage.”
- Yin and yang — Wikipedia
- A funny case of Goodhart’s Law / Adversarial Goodhart in action with dolphins:
“dolphins at the institute are trained to hold onto any litter that falls into their pools until they see a trainer, when they can trade the litter for fish. In this way, the dolphins help to keep their pools clean.
Kelly has taken this task one step further. When people drop paper into the water she hides it under a rock at the bottom of the pool. The next time a trainer passes, she goes down to the rock and tears off a piece of paper to give to the trainer. After a fish reward, she goes back down, tears off another piece of paper, gets another fish, and so on. This behaviour is interesting because it shows that Kelly has a sense of the future and delays gratification. She has realised that a big piece of paper gets the same reward as a small piece and so delivers only small pieces to keep the extra food coming. She has, in effect, trained the humans.
Her cunning has not stopped there. One day, when a gull flew into her pool, she grabbed it, waited for the trainers and then gave it to them. It was a large bird and so the trainers gave her lots of fish. This seemed to give Kelly a new idea. The next time she was fed, instead of eating the last fish, she took it to the bottom of the pool and hid it under the rock where she had been hiding the paper. When no trainers were present, she brought the fish to the surface and used it to lure the gulls, which she would catch to get even more fish.”