Previously, I explained the difference between biological evolutionary designs and human technological designs. Organic designs are grown and prioritize robustness. In contrast, human designs are composed and prioritize correctness. The former represents a bottom-up connectionist viewpoint, and the latter represents a top-down symbolist viewpoint.
I’ve advocated that to achieve Artificial General Intelligence (AGI) one would need to embrace the bottom-up biological approach. The only evidence we have of general intelligence in the universe is one that originates from organic design. To achieve AGI, you necessarily have to grow intelligence. You need to begin with an unintelligent but autonomous collection of agents, embed these agents in environments, and provide the appropriate challenges to allow these agents to develop higher cognitive skills. Growing solutions is how evolution does its work; it matches agents with environments in a dynamic competitive and cooperative ecosystem.
Furthermore, agents in biology have an intentional stance. This framework bootstraps itself from the most primitive bodily self (i.e., nano-intentionality) and grows towards a sophisticated social self capable of conversational cognition. This process mirrors human cognitive evolution but in silico. The layers of intentional stances are depicted as follows:
Evolution occurs in parallel with cognitive development, illustrated in the following progressive layers:
There are plenty of paths of exploration that relate to this framework; I’ve covered some of them in “12 Blind Spots in AI Research.” The paths that I see as extremely difficult (and this relates to Moravec’s paradox) are the following: nano-intentionality, embodied learning, and non-stationary cognitive models.
Nano-intentionality implies that AGI requires truly massive amounts of computation, orders of magnitude beyond what will be available in the near future. Embodied learning also requires massive amounts of resources, more specifically sensory resources and adaptive bodies. History has shown that the lack of computational capability hindered research in neural networks for over 60 years. Norbert Wiener proposed Cybernetics in 1948, and it was only in 2012 that graphics processing hardware provided enough capability to demonstrate the value of neural networks. Recent discoveries continue to reveal the increasing complexity of individual biological neurons. If nano-intentionality is an absolute requirement, then one should not hold one’s breath for AGI to arrive. One need only look at the rigid robotics that we have today. Have you seen any robots with dexterity and flexibility comparable to those of organic life?
Non-stationary models, specifically conversational cognition, are by contrast with the previous obstacles more of a conceptual problem. There is insufficient theory of how a society of mind, that is, a collection of nano-intentional agents, can emerge into a coherent collective individual agent. Numenta’s thousand brains model of cognition hints at this. We don’t know how so many agents can coherently coordinate together. Understanding how nano-intentional agents can massively coordinate their actions and give rise to a more complex agent is a key research area in complexity science.
For purposes of argument, let’s assume here that nano-intentionality and embodied learning are options that we can avoid in building AGI. These two concepts are what most skeptics would cite as reasons why AGI isn’t achievable in the short term. Can we find a feasible path to AGI that circumvents these two problems? I’m not arguing that these two obstacles are unnecessary; I am asking here whether they can be avoided. I am explicitly pointing out the assumptions made by many AI researchers. The idea of nano-intentionality and embodied learning is not in the minds of most. It is no surprise that many AI proponents have overly ambitious expectations.
Let’s now look in detail at the current explosive progress in Natural Language Processing (NLP) to see if there is a “rushing strategy” that can lead to a lightning breakthrough in AGI. Can we discover a shortcut that biology has not discovered? Can we cheat nature? Can we invent the equivalent of the wheel for intelligence?
The problem with working with language, and thus symbols, is the lack of symbol grounding. I’ve explored this earlier in “Semiotics and Why Not Symbols.” Going back to the proposed layers of intentional stances described above, can we learn higher-level skills with missing foundational layers? Can we ignore the layers that express nano-intentionality and embodiment? Can we ignore the bodily, perspective, and volitional selves? In short, can we build AGI without an interactive body?
The Good Old-Fashioned AI (GOFAI) folks have always argued that this is possible. This is because most Western thinking has subscribed to Descartes’ dualism, that is, the view that the mind is separate from the body. This kind of thinking is why consciousness is described as a hard problem when in reality it is not the real problem. GOFAI is based on the belief that abstract symbolic thought is all you need. Decades of research in GOFAI have not moved the needle in achieving AGI.
Can we build AGI without agents able to experience an environment? Can we build AGI exclusively from language? Can a narrative and social self emerge independently of bodily experience? Answering these questions can lead you to a shortcut to AGI.
Deep Learning based Natural Language Processing (NLP) is very different from GOFAI. Many have argued that NLP should be impossible using Deep Learning. However, GOFAI and Deep Learning NLP share a commonality in that there appears to be no symbol grounding.
Let’s begin with the unsupervised language models that have been the basis of the latest impressive developments. Does the exploration of language give an agent an understanding of the world? What we can be sure of is that an agent can gain an understanding of the syntax of the language. OpenAI’s GPT-2 would not be able to generate impressive examples of text without knowledge of syntax. But these systems go beyond understanding syntax; they have an understanding of the relationships between words, the kind of signs that we would call indexical in semiotics. Just knowing syntax doesn’t get you to the proper construction of sensible sentences. This is because human language is ambiguous in many ways.
At the very least, these systems can recognize the correlations between entities as well as the correlations between entities and their actions. However, there are many things that are not expressed in the language itself but are rather implicitly assumed by the speaker and listener. For example, “he put a glass on the table” and “he put his hands up” use the word “put” with different meanings. Understanding the difference requires some mental image of what they represent. Can unsupervised learning without any vision uncover these nuances?
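To make the idea of “correlations between entities and their actions” concrete, here is a minimal sketch of the kind of statistic a text-only learner has access to. It is a toy word co-occurrence count, not how GPT-2 or any real language model works; the function name `cooccurrence` and the `window` parameter are my own illustrative choices.

```python
from collections import Counter

def cooccurrence(sentences, target, window=3):
    """Count the words that appear within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Count every neighbor in the window except the target itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts

corpus = [
    "he put a glass on the table",
    "he put his hands up",
]
neighbors = cooccurrence(corpus, "put")
print(neighbors)
```

The counts show that “put” keeps company with “glass” and “on” in one sentence and with “hands” and “up” in the other. That is a correlation, and it separates the two contexts, but nothing in it says where the glass or the hands end up.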
The key to an NLP approach is to seek out ways to rediscover the kinds of representations that are learned through embodiment without actually being embodied. It seems unavoidable that some kind of learning through interaction is necessary, so that an agent can explore the differences in meaning. For instance, an agent could ask where the glass or the hands end up in the examples above. However, even if it could ask questions, would it be able to understand the explanations? Can the explanations be expressed in language?
If specific semantics cannot be expressed in language, can we invent another language that does express these semantics? If we can design such a language, can it serve as a substitute for the lack of symbol grounding? What would such a language look like? Perhaps it is not a linear language like text, but rather a more complex one that has many dimensions. Maybe a graph is general enough to capture this language. Perhaps semantics can be captured by a relational neural network. Is this in fact what a Transformer already captures?
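As a rough illustration of what a graph-shaped “language” might look like, the sketch below stores the two “put” sentences as labeled edges rather than as linear text. Everything here is hypothetical: the `RelationGraph` class, the event identifiers `put#1` and `put#2`, and the relation labels (`agent`, `theme`, `destination`, `direction`) are assumptions made for the example, not an established representation.

```python
from collections import defaultdict

class RelationGraph:
    """A tiny labeled, directed graph: subject -> [(relation, object), ...]."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def related(self, subject, relation):
        """Return every object linked to `subject` by `relation`."""
        return [o for r, o in self.edges[subject] if r == relation]

g = RelationGraph()

# "he put a glass on the table": an object is moved to a destination.
g.add("put#1", "agent", "he")
g.add("put#1", "theme", "glass")
g.add("put#1", "destination", "table")

# "he put his hands up": a body part is moved in a direction.
g.add("put#2", "agent", "he")
g.add("put#2", "theme", "hands")
g.add("put#2", "direction", "up")

print(g.related("put#1", "destination"))
print(g.related("put#2", "destination"))
```

In this form the two senses of “put” differ structurally: one event has a destination, the other only a direction. Whether such structure can be learned from text alone, rather than written by hand as it is here, is exactly the open question.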
So we have this AGI that has no body, no vision, and no way to interact with the world. This AGI lives only in a virtual world that is simulated to approximate the real world. It sees a world that is a rough approximation of ours. So when it reads human language, certain words have no equivalent. They are just abstract notions. Just as humans know words for things that they have never physically experienced, this AGI would have enough common sense to be of value. The terrain that it navigates is the terrain of human language and internet interaction. These beings are embodied in a virtual world, like in the movie “Her”. Such a being cannot sense its own body; it can only imagine a body based on what can be described in language. It cannot actively see the outside world as our eyes do (see: Touch is the foundation of cognition). Rather, images and videos are captured in a passive, non-interactive way, like watching a movie. It cannot manipulate artifacts in the real world and only knows of interaction with the virtual.