A reply to “Aligned Iterated Distillation and Amplification” — some problem points
Roland Pihlakas, 20. April 2018
A publicly editable Google Doc with this text is available here, for cases where you want to easily see updates (using the history), ask questions, comment, or add suggestions.
Posted in reply to Paul Christiano’s post:
I have probably misunderstood some of the topics; my post is also terse and does not contain any proofs, and you may already have thought about these issues. If so, sorry about that.
But the deadline is already starting to make a whooshing sound, so I had better post the initial quick notes I made while reading the announcement. If I have more time later, I can always post more… I have learned to trust my intuition, so hopefully there is at least some use in the following hints :)
Summary of potential problems spotted regarding the use of AlphaGoZero:
- Complete visibility vs Incomplete visibility.
- Almost complete experience (self-play) vs Once-only problems. Limits of attention.
- Exploitation (a game match) vs Exploration (the real world).
- Having one goal vs Having many conjunctive goals. Also, having utility maximisation goals vs Having target goals.
- Who is affected by the adverse consequences (in a game vs in the real world)? The problems of the adversarial situation, and of cost externalisation.
- The related question of different timeframes.
Summary of clarifying questions:
- Could you build a toy simulation? So we could spot assumptions and side-effects.
- In which ways does it improve on the existing social order? Will we still stay in mediocristan? The long feedback delay.
- What is the scope of application of the idea (Global and central vs Local and diverse?)
- Concrete implementation examples are needed. Any realistically imaginable practical implementation of the idea might not be so fine anymore, each time for different reasons.
Potential problems spotted regarding the use of AlphaGoZero:
- AlphaGoZero operated under complete-visibility conditions. For various reasons, conclusions from such an environment might not apply to the real world.
Some of the problems with incomplete visibility are highlighted in my essay about fundamental limits to computation due to fundamental limits to attention-like processes:
- AlphaGoZero additionally operated in a domain where it had almost complete experience. In other words, it had the privilege of having played the same situations over and over.
In the real world, many of the most important problems offer no self-play. Instead, there are many once-only problem situations.
On top of that, these situations often have the property that modifying the situation will instigate other problems, so that even trying to explore or solve them carries high costs. (Also mentioned in my essay.)
Once-only problems probably cannot be solved well by any fixed policy. In the cases where they can be solved at all (and in many cases they probably cannot be!), they need a more insightful problem-solving approach that combines pieces of existing experience into novel action plans.
- AlphaGoZero probably did not have any exploration-versus-exploitation dilemma during the competition itself.
In the real world there is often no sandbox-versus-competition distinction: you need to explore all the time.
Otherwise you may end up putting blinders on yourself and finding confirmations of existing hypotheses, even though they may be incorrect after all. In other words, there is a difference between taking the most subjectively likely steps to success (as in a greedy approach, which will most likely end up in a local optimum) on the one hand, and taking the most informative steps (as in a non-greedy approach) on the other.
Let us visualise a game where I have imagined something in my head and you have to figure out what it is by asking questions and getting yes-no answers from me. After we have found out that the imagined thing is red, it is not an advisable strategy to try to finish the game early (as an exploitation approach would) by starting to enumerate all red objects in the world. Instead, it would be a good strategy to continue the exploration by maximising the entropy: asking about anything that has about a 50% subjective chance of getting either a yes or a no answer, given the current data (which of course also means avoiding questions about totally-low-probability properties).
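The entropy-maximising strategy in this game can be sketched in a few lines of code. This is a minimal illustration under assumed toy data: the objects and their attributes are hypothetical, chosen only to show why a near-50/50 question beats an already-answered one.

```python
import math

# Hypothetical objects and boolean attributes (illustrative only).
OBJECTS = {
    "apple":     {"red": True,  "edible": True,  "large": False},
    "cherry":    {"red": True,  "edible": True,  "large": False},
    "firetruck": {"red": True,  "edible": False, "large": True},
    "rose":      {"red": True,  "edible": False, "large": False},
    "banana":    {"red": False, "edible": True,  "large": False},
    "whale":     {"red": False, "edible": False, "large": True},
}

def entropy_of_question(candidates, attribute):
    """Expected information (in bits) from asking a yes/no question."""
    yes = sum(1 for obj in candidates if OBJECTS[obj][attribute])
    p = yes / len(candidates)
    if p in (0.0, 1.0):
        return 0.0  # answer already known: zero information gained
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def best_question(candidates, attributes):
    """Pick the question whose split is closest to 50/50 (maximum entropy)."""
    return max(attributes, key=lambda a: entropy_of_question(candidates, a))

# After learning the thing is red, four candidates remain.
remaining = [obj for obj in OBJECTS if OBJECTS[obj]["red"]]
# "edible" splits the candidates 2/2 (1 bit), "large" splits them 1/3
# (about 0.81 bits), and "red" gives 0 bits, so "edible" is chosen.
print(best_question(remaining, ["edible", "large"]))  # -> edible
```

The greedy alternative would instead guess candidates one by one, which on average takes more questions than repeatedly halving the hypothesis space.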
- AlphaGoZero had only one goal: to maximise the game points and then win.
In a real-world scenario there are many goals which cannot simply be summed up, since they all need to be met simultaneously, to a reasonable degree (both over space and time). Note also that these many goals often need to be met, not maximised. How well does the proposed framework handle such cases?
This topic is slightly elaborated in another of my essays: https://medium.com/threelaws/making-ai-less-dangerous-2742e29797bd and is developed in much more detail here: “Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility.”
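The difference between summing goals and meeting them all can be sketched as follows. This is an illustrative toy formula in the spirit of the diminishing-returns essay above, not the exact formula from it: each goal contributes `1 - exp(-k * level)`, so effort on a neglected goal is worth far more than extra effort on an already-satisfied one.

```python
import math

def linear_utility(levels):
    """Plain sum: indifferent between balanced and lopsided outcomes."""
    return sum(levels)

def diminishing_utility(levels, k=1.0):
    """Each goal contributes 1 - exp(-k * level), so returns diminish
    exponentially and every goal must be met to a reasonable degree."""
    return sum(1 - math.exp(-k * x) for x in levels)

balanced = [1.0, 1.0, 1.0]   # every goal met to a reasonable degree
lopsided = [3.0, 0.0, 0.0]   # one goal maximised, two goals neglected

print(linear_utility(balanced), linear_utility(lopsided))
print(diminishing_utility(balanced) > diminishing_utility(lopsided))  # True
```

Under the linear sum both allocations score 3.0, so a maximiser may happily sacrifice two goals entirely; under diminishing returns the balanced allocation wins (about 1.90 vs 0.95 bits-of-satisfaction here).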
- Finally, in the case of board games, the machine will be very interested in winning without cheating, because if it causes any adverse side effects, it will be the sole sufferer.
This is again totally not so in real life. In real life the machine will have to serve someone else’s goals, which is a kind of adversarial situation to begin with. (See “A toy model of the treacherous turn” and also the Adversarial Goodhart chapter in this article. This problem and many other related themes are well known in sociology.) It would be interesting to hear the considerations for why that law would not apply this time.
But on top of that, there is the problem that the affected party may always be somebody who is left out of the equation, located on the other side of the world or living in the future. In other words, the problem of externalisation of costs.
How or why would the proposed algorithm solve these issues?
- At which timeframes does the algorithm expect to get its feedback? Maybe it would require a really long time of gathering feedback in order to settle on sustainable strategies, as opposed to settling on greedy strategies? I guess this problem is essentially a special form of the Adversarial Goodhart problem.
Perhaps more importantly, I would like to offer a few proposals for future elaboration of your ideas, and for improving the discussion.
- A proposal: could you build a simulation of how you think your idea would work? It would probably be easier to highlight various problems with your assumptions, or side effects, once we can see your (evolving) simulation.
- An additional question: could you explain in more detail why the proposed AI system would provide something better than the existing social order?
One of my motivations behind this question is that humans are relatively slow. One of the goals of building a strong AI seems to be to speed everything up. But perhaps humans being slow is a good thing after all, since it enables going through various social processes at a lower “temperature”, that is, slower, with less momentum and more diversity, and on top of that, staying in a “mediocristan” domain. That in turn enables having more useful time-tested past experience and more linear-ish rules, gathering more feedback (since many important real-world phenomena have a long feedback delay), and also having less inertia and more diversity in case something needs to be changed or adapted to.
- Yet another question is: what is the intended application of the proposed idea? Do you intend to build some central governing structure, or to have many small AIs that support humans in a more distributed and diverse manner?
- The general motivation behind these additional questions is that in order to cut through to the foundations of your proposal, the proposal has to be grounded in some specific manner. If it is a very general pie in the sky, then there is nothing concrete to attack.
It may turn out that your idea is nice after all, but any realistically imaginable practical implementation of it might not be so fine anymore, each time for different reasons.
- Essay about a phenomenon I called self-deception, which arises from a fundamental computational limitation of both biological and artificial minds due to fundamental limits to attention-like processes and which can be observed on any capability level.
- Essay about why frameworks for AI goal structures should try to avoid maximising utility, and what they should aim for instead.
- Exponentially diminishing returns and conjunctive goals: Mitigating Goodhart’s law with common sense. Towards corrigibility and interruptibility. — A more detailed formula and analysis developed based on the above linked post.