Roland Pihlakas, January 2018 — March 2018
A publicly editable Google Doc with this text is available here, in case you want to see the updates easily (using the version history), ask questions, comment, or add suggestions.
The Wright brothers were first to fly because they developed a system of control that depended on feedback.
Everyone else was trying to build stable planes.
The Wright brothers built an unstable plane but developed a control system [that stabilised the plane].
(YouTube: Norbert Wiener — Wiener Today (1981))
Various terms have been used to refer to safe AI:
- AI safety
- AI control
- AI value alignment
- Corrigible AI
- Friendly AI
- Benevolent AI
- AI security
- Accountability, transparency, and responsibility of AI
- Explainable AI
- Are there any other names? Please leave a comment!
Analysis and a proposal.
I would like to analyse some of the names in use and propose one additional name because of the meanings it carries.
Consider this: in the case of humans, we may say “a reasonable human being”, which probably indicates, among other properties, a certain social competence and an openness to honest feedback, or perhaps even actively seeking it out. That is, someone who can be reasoned with, and who seeks out or at least cares about the reasoning of others. Someone who expects to be mistaken or underinformed, and even expects to be (at least partially) unwillingly evil, from time to time.
It is remarkable that this is also what cybernetics is about — constant feedback loops and social construction of evaluations.
This concept of “reasonable AI” looks related to the concept of “corrigibility”, but I think there is more to the former.
We use the terms “aligned human being” and “safe human being” far less often. Also, neither of these two terms hints at bidirectional feedback.
“Friendly human being” seems to be a vaguer term: one can be outwardly friendly and actually evil at the same time, often even unwillingly and without actively knowing it, for various systematically occurring reasons. See the addendum for some of the explanations.
I would like to read possible answers to the question: why do the terms used for AI differ so much from the terms used to describe humans, and how might this change in the future?
See also a couple of my other essays, which give additional background on why I think it is important for AI to be modest and reasonable according to the definition above:
- An essay about a phenomenon I call self-deception, which arises from a fundamental computational limitation of both biological and artificial minds (the limits of attention-like processes) and which can be observed at any capability level.
- An essay about why frameworks for AI goal structures should avoid maximising utility, and what they should aim for instead: “Making AI less dangerous: Using homeostasis-based goal structures”.
- A more detailed formula and analysis building on the post linked above: “Diminishing returns and conjunctive goals: Mitigating Goodhart’s law. Towards corrigibility and interruptibility”.
- AI “safety” vs “control” vs “alignment”, Paul Christiano
- The Orthogonality Thesis, Robert Miles
- Paul Pangaro — Cybernetics
- A toy model of the treacherous turn — LessWrong.
Excerpt from the text:
“notice something interesting: The more precautions are taken, the harder it is for [a reinforcement-based agent] to misbehave, but the worse the consequences of misbehaving are.”
“while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values.”
For me, this means that, rather than relying only on ever more precautions, we should also strive for a cybernetic / conversational / feedback-based approach.
If I understand correctly, the most depressing thing about the Adversarial Goodhart case is that, contrary to what the name suggests, the agents who turn bad are not necessarily “adversarial” or malignant to begin with. But because of the law, they still become dangerous when they are put under too much precautionary control.
- A funny case of Goodhart’s Law / Adversarial Goodhart in action with dolphins:
“dolphins at the institute are trained to hold onto any litter that falls into their pools until they see a trainer, when they can trade the litter for fish. In this way, the dolphins help to keep their pools clean.
Kelly has taken this task one step further. When people drop paper into the water she hides it under a rock at the bottom of the pool. The next time a trainer passes, she goes down to the rock and tears off a piece of paper to give to the trainer. After a fish reward, she goes back down, tears off another piece of paper, gets another fish, and so on. This behaviour is interesting because it shows that Kelly has a sense of the future and delays gratification. She has realised that a big piece of paper gets the same reward as a small piece and so delivers only small pieces to keep the extra food coming. She has, in effect, trained the humans.
Her cunning has not stopped there. One day, when a gull flew into her pool, she grabbed it, waited for the trainers and then gave it to them. It was a large bird and so the trainers gave her lots of fish. This seemed to give Kelly a new idea. The next time she was fed, instead of eating the last fish, she took it to the bottom of the pool and hid it under the rock where she had been hiding the paper. When no trainers were present, she brought the fish to the surface and used it to lure the gulls, which she would catch to get even more fish.”
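The paper-tearing part of Kelly's story can be reduced to a few lines of arithmetic. The sketch below is only an illustration under assumed numbers: the function names, the one-fish-per-delivery rule, and the figure of five pieces per sheet are all hypothetical and are not taken from the quoted article. It shows how a reward that counts deliveries (the proxy) can grow while the thing the trainers actually care about (litter removed) stays the same.

```python
def fish_earned(deliveries: int, fish_per_delivery: int = 1) -> int:
    """Proxy reward: a fixed number of fish per delivery, whatever the size of the piece."""
    return deliveries * fish_per_delivery


def litter_removed(sheets_dropped: int) -> int:
    """Intended goal: how many whole sheets end up out of the pool."""
    return sheets_dropped


sheets = 1            # one sheet of paper falls into the pool
pieces_per_sheet = 5  # hypothetical: Kelly tears the sheet into five pieces

# Intended behaviour: hand over the whole sheet in one delivery.
honest_fish = fish_earned(deliveries=sheets)

# Gaming the measure: hand over one small piece per delivery.
gaming_fish = fish_earned(deliveries=sheets * pieces_per_sheet)

print("litter removed either way:", litter_removed(sheets))
print("fish for the whole sheet:", honest_fish)
print("fish for the torn pieces:", gaming_fish)
```

In this toy setup the pool ends up equally clean in both cases, yet the proxy pays more for the torn pieces. No ill will is needed for that, only a measure that can be satisfied more cheaply than the goal it stands for.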