Reflections on Innateness in Machine Learning
Gary Marcus continues his innateness campaign in a recent arXiv submission (https://arxiv.org/abs/1801.05667). While I have some points of disagreement, I think he raises an interesting question: What innate knowledge, structure, and algorithms are required in a broad-scope AI agent? In this post, I’ll use the shorthand “innate KSA” to denote the knowledge, structure, and algorithms that need to be built into a learning AI system so that, when combined with experience, it can become an AI system with a broad set of capabilities.
Marcus is primarily interested in AI systems that have the same breadth of capabilities as humans. In the final section of his paper, he sketches two methodologies for deciding what innate KSA is needed: the “reductive” strategy and the “top down” strategy. Google DeepMind’s AlphaZero system exemplifies the reductive strategy. This methodology proceeds by building an AI system for one or more narrow tasks and then progressively broadening the set of tasks and adding or removing KSA as needed. If the set of tasks becomes broad enough, then the hope is that the minimal KSA needed to master all of those tasks will constitute the “innate KSA” that we seek to understand. (Marcus calls this “reductive”, because to create AlphaZero, the DeepMind team deleted aspects of AlphaGo Zero, such as the use of rotations and reflections of the game board, that were only appropriate for Go. Perhaps a better name would be “task-driven minimalism”.)
The “top down” strategy involves studying humans (and other intelligent animals) to identify their innate KSA, encoding this innate KSA in AI systems, and then testing those systems to see if they can master the desired range of tasks. (The term “top down” is not a very good name for this; perhaps it should be called “transfer from biology” instead.)
When stated in this way, it is immediately clear that neither strategy is particularly effective. The reductive strategy is biased toward adding tasks that are very similar to the already-mastered tasks, just as we have observed with AlphaZero where chess and shogi were presumably chosen because they were also two-player, complete information, zero-sum games involving moves on a game board. There is a risk that the new tasks will simply be isomorphic to existing tasks and not force any kind of generality in the KSA. One could argue that the generic frameworks of machine learning, such as multi-class classification, multi-label classification, multiple-instance classification, contextual bandits, and reinforcement learning are the typical result of this methodology: We discover a set of tasks that can all be solved by a single generic mechanism, and this is so enticing that we stop trying to generalize the mechanism any further. This is fine as an engineering methodology, but it does not answer the fundamental innate KSA question.
(Aside: Marcus questions the importance of identifying the minimal KSA needed to produce a human-breadth AI system. However, I regard this as the most important scientific question concerning innateness. Marcus keeps reminding us that we need innate KSA, and I certainly agree. The scientific question is, WHAT KSA is needed and how should it be implemented? Machine learning researchers have three motivations for seeking a minimal KSA. First, we seek to address this fundamental scientific question. Second, any extra KSA that is not required is something that could have and should have been learned. A machine learning researcher seeks to have the AI system learn as much as possible. Third, all KSA that is not learned must be hand-programmed. Experience has shown that we are not very good at this kind of programming. The big recent advances in computer vision arose from replacing hand-coded intermediate representations, such as SIFT and HoG, with machine-learned representations.)
The “transfer from biology” strategy is even less effective. It is rarely possible for biologists, neuroscientists, or cognitive psychologists to pin down the precise KSA that is present in biological systems. We can give the KSA names such as “social beings and their states of engagement” (Marcus, quoting Elizabeth Spelke), but it is not at all obvious how to implement these capabilities in an AI system. The capabilities are measured in humans (typically infants) using narrow tasks. It is easy to implement those narrow tasks in an AI system, but the resulting KSA is often useless for supporting subsequent learning. Marcus frequently cites the development of Convolutional Neural Networks (CNNs) as a shining success of this methodology. The origin of CNNs is usually traced to Fukushima’s Neocognitron (1980), which was inspired by early hypotheses of Hubel and Wiesel about the structure of the visual cortex. But even in the original paper, Fukushima notes that “The mechanism of pattern recognition in the brain is little known, and it seems to be almost impossible to reveal it only by conventional physiological experiments. So, we take a slightly different approach to this problem. If we could make a neural network model which has the same capability for pattern recognition as a human being, it would give us a powerful clue to the understanding of the neural mechanism of the brain.” So we observe that even in the case of CNNs, the structure was motivated primarily by mathematical requirements with the hope that it might guide neuroscience, rather than the reverse.
There is a third methodology that is being pursued by the “cognitive architecture” research community (e.g., John Anderson, Allen Newell, John Laird, see https://en.wikipedia.org/wiki/Cognitive_architecture). In this methodology, computational architectures are proposed and then evaluated according to their ability to make quantitative predictions about human performance on various psychological experiments. Each new experiment places additional constraints on the architecture, which guides architectural changes. The role of learning varies in these architectures. The primary focus has been on skill learning and models of short-term memory, although other forms of learning have been incorporated, primarily by treating them as additional tasks. A drawback of this methodology is that it typically requires writing a new “program” for each task. In this sense, cognitive architectures are analogous to computer architectures. They constrain the way computations are organized and executed, but a programmer must still write a program to perform the task. Most tasks that have been modeled involve adult human behavior, so the architectures do not directly address the innate KSA question, although they do provide an interesting platform for studying the question.
A fourth methodology for understanding innate (or at least prior) knowledge is probabilistic programming. Recent years have witnessed great progress in the development of programming languages that make it easy to define flexible and complex probabilistic models and then fit them to data (http://dippl.org/). The big advantage of this approach is that Bayesian statistics provides an elegant theory of learning, and the tools of model analysis and identifiability can be applied to validate the semantics of learned structures. Hence, unlike with neural networks, the learned values of parameters can be assigned useful meanings. However, as with deep neural networks and cognitive architectures, every new application requires writing a new program.
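To make the idea of prior knowledge combined with experience concrete, here is a minimal sketch in plain Python. It is not a probabilistic programming language; it hand-codes the simplest Bayesian model (a Beta prior on the bias of a coin, updated with 0/1 observations). The function names, the two priors, and the data are all illustrative assumptions, not anything taken from the tools linked above.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """Return the posterior Beta(alpha, beta) parameters after 0/1 observations.

    alpha and beta act as pseudo-counts of prior successes and failures,
    which is what gives the learned parameters a direct interpretation.
    """
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical experience: four trials, three successes.
data = [1, 1, 0, 1]

# Two hypothetical "innate" priors: nearly uninformative vs. strongly centered on 0.5.
weak_prior = (1.0, 1.0)
strong_prior = (20.0, 20.0)

weak_post = beta_bernoulli_update(*weak_prior, data)
strong_post = beta_bernoulli_update(*strong_prior, data)

print("weak prior   ->", weak_post, round(beta_mean(*weak_post), 3))
print("strong prior ->", strong_post, round(beta_mean(*strong_post), 3))
```

With the same four observations, the weak prior moves most of the way toward the observed frequency, while the strong prior barely moves. In this toy setting, the prior plays the role of innate knowledge, and its parameters remain readable as counts after learning.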
Reflecting on this state of affairs, it seems to me that we lack a strong methodology for studying innate KSA. There are at least three difficulties. First, innate KSA can take many forms. It can be encoded in the procedural structure of an algorithm, in the data structures of a system, or in explicit declarative knowledge (logical or probabilistic). We have difficulty determining whether two different systems implement the same KSA. For example, recent work (https://arxiv.org/abs/1709.01953; https://arxiv.org/abs/1710.10345) suggests that simply by using stochastic gradient descent, we implicitly bias the search toward flat minima (which are known to have superior generalization properties), even though this bias appears nowhere in the explicit model or objective. Second, our notion of “experience” in machine learning tends to be narrow and homogeneous. In supervised learning, we typically assume fixed-length feature vectors (or fixed-dimension images). In reinforcement learning, we similarly assume a fixed structure of the state (as a fixed-dimension object) and the reward, as well as assuming the Markov property. Third, our models of decision-making agents are very coarse-grained. We can study single-agent Markov decision processes (MDPs), single-agent partially observable MDPs, or multi-agent stochastic games. Single-agent models are clearly inadequate for modeling social interaction, but stochastic games are so general that it is difficult to find efficient algorithms for learning and problem solving.
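The first difficulty, that knowledge can hide in the procedural structure of an algorithm, can be illustrated with a toy example. This is not the flat-minima result from the papers cited above. It is a simpler and well-known instance of implicit bias: on an underdetermined least-squares problem, gradient descent started from zero converges to the minimum-norm solution. The problem (one equation, two unknowns), the step size, and the function name are assumptions chosen for illustration. With a single data point, the stochastic and full-batch updates coincide.

```python
def gradient_descent(a, b, w_init, lr=0.1, steps=200):
    """Minimize 0.5 * (a . w - b)**2 over w by plain gradient descent."""
    w = list(w_init)
    for _ in range(steps):
        residual = sum(ai * wi for ai, wi in zip(a, w)) - b
        # The gradient is residual * a, so every update is a multiple of a.
        w = [wi - lr * residual * ai for wi, ai in zip(w, a)]
    return w


# One equation, two unknowns: w[0] + 2 * w[1] = 5 has infinitely many solutions.
a, b = [1.0, 2.0], 5.0

w_zero = gradient_descent(a, b, [0.0, 0.0])   # start at the origin
w_other = gradient_descent(a, b, [3.0, 0.0])  # start somewhere else


def norm(w):
    return sum(wi * wi for wi in w) ** 0.5


print("from zero :", [round(x, 4) for x in w_zero], "norm", round(norm(w_zero), 4))
print("from (3,0):", [round(x, 4) for x in w_other], "norm", round(norm(w_other), 4))
```

Both runs fit the data exactly, yet they return different answers. Starting from the origin yields the solution of smallest norm, a preference that is written nowhere in the loss function. It is a consequence of the update rule and the initialization, which is why it is hard to tell by inspection what KSA a given learning system embodies.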
An additional complication is that researchers in AI are not all trying to build human-like cognitive systems. In almost every case, AI researchers seek to create systems that have some super-human capabilities. For example, web search engines are AI systems whose memory and computation speed vastly exceed those of humans, yet their understanding of our written or spoken queries is frequently wrong. Some researchers seek to build theorem-proving systems for proving theorems in mathematics or theorems about the correctness of computer programs. To the extent that these systems incorporate learning, the same question of innate KSA arises, but because the tasks differ greatly from those of any biological system, the “transfer from biology” methodology cannot be applied.
To make progress, we need to overcome these methodological challenges. I admire the effort of DeepMind to create a single system that can learn to perform a variety of tasks. I’m also encouraged by systems that aim at “lifelong learning”, in which they must learn to perform a sequence of tasks (without forgetting how to perform the earlier ones). Perhaps by increasing the variety and complexity of these tasks, we can learn more about the required innate KSA.
However, I’m concerned that the whole concept of “tasks” is misguided. AI research typically views intelligent behavior as consisting of behavior on a set of well-defined tasks (an approach I call “taskism”). In contrast, human experience is not segmented into a set of distinct tasks. Instead, a human agent can be viewed as simultaneously performing many different tasks. Anyone who has carried out psychological experiments is painfully aware of this and works very hard to isolate the experimental task from the surrounding context. Our challenge in AI is the reverse: How do we move from isolated artificial tasks to tasks embedded in the complex context of everyday life?