Tassilo Klein and Moin Nabi (SAP AI Research)
Deep learning has heralded a new era in artificial intelligence, establishing itself in integral parts of today's world within a short time. Despite its immense power, often achieving super-human performance at specific tasks, modern AI suffers from numerous shortcomings and is still far away from what is known as artificial general intelligence. These shortcomings become particularly prominent in AI's limited capability to understand human language. Everyone who has interacted with a chatbot or text generation engine (e.g., OpenAI's GPT-3) may have noticed that the longer the interaction with the machine goes on, the staler it gets. When generating long passages of text, for instance, a lack of consistency and human feel can be observed. Essentially, this highlights that the underlying model does not really understand what it says and does. Rather, it more or less walks along paths of statistical patterns of word usage and argument structure, which it acquired during training by perusing huge text corpora. This rote behavior of replicating statistical patterns reveals the absence of a crucial component: common sense.
But what exactly is common sense? Actually, there exists no clear definition of it. It is one of those things we often take for granted and only notice when it is missing. Basically, common sense incorporates aspects of literally everything we deal with, ranging from natural laws and social conventions to unwritten rules. Consequently, the spectrum covered by the concept of common sense is quite broad, explaining the fuzzy nature of its definition. Even though common sense is quite generic and applies to all kinds of domains, one particular medium stands out as a popular testbed: natural language. Hence it is no big surprise that injecting common sense into NLP is a fundamental research challenge. And because text processing applications have far-reaching practical implications for consumers, common sense in AI is more than just an academic gimmick.
To better understand why this is the case, let us first look at the shortcomings of current models in more detail.
Why Is Deep Learning Struggling with Common Sense?
Among the most significant shortcomings of neural networks is the lack of interpretable behavior in the sense of human-like reasoning paths. This can be attributed mainly to how machines are trained. In the standard supervised learning paradigm, the model is provided with input data and target labels. Then, during training with the conventional backpropagation method, the model's weights are tweaked step by step to reach a state that, to some degree, establishes a mapping from input to desired output target. As this learning procedure is purely goal-oriented, the resulting model has a tendency to resort to some higher level of pattern matching. However, these patterns can be quite complex, and without taking extra precautions, the model is free to choose any solution that achieves the goal mathematically. Unsurprisingly, it is more often than not prone to finding shortcuts that do not emulate human-like reasoning paths. Human-like reasoning is extremely complex, and its inner workings are far from fully understood. What is known, however, is its heavy reliance on mechanisms such as conceptualization and compositionality, which are extremely difficult to replicate within a machine. Concepts are mental representations of objects and categories, which, according to Murphy (2002), are "the glue that holds our mental world together" and help us understand and respond appropriately to new entities of previously seen categories. This is tightly connected to what is known as compositionality, yet another capability considered key to the human capacity for generalization: the ability to understand and produce novel combinations from known components.
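The goal-oriented training loop described above can be sketched in a few lines. The toy model below, a logistic regression trained with gradient descent, is purely illustrative; the data, learning rate, and step count are arbitrary choices, not from any model in the text:

```python
import math

# Toy supervised learning: weights are tweaked step by step via gradient
# descent until the model maps inputs to target labels. Purely illustrative.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny labeled dataset: inputs x with binary targets y (roughly, x > 2.5 -> 1).
data = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1)]

w, b = 0.0, 0.0   # the model's weights
lr = 0.5          # learning rate

def loss():
    # Average cross-entropy between predictions and target labels.
    eps = 1e-9
    return -sum(y * math.log(sigmoid(w * x + b) + eps)
                + (1 - y) * math.log(1.0 - sigmoid(w * x + b) + eps)
                for x, y in data) / len(data)

loss_before = loss()
for _ in range(200):
    # Gradients of the loss with respect to w and b (backpropagation
    # computes exactly these, just through many more layers).
    gw = sum((sigmoid(w * x + b) - y) * x for x, y in data) / len(data)
    gb = sum((sigmoid(w * x + b) - y) for x, y in data) / len(data)
    w -= lr * gw
    b -= lr * gb
loss_after = loss()

print(loss_after < loss_before)  # True: the mapping to the targets improved
```

Note that nothing in this loop rewards human-like reasoning: any setting of `w` and `b` that lowers the loss is equally acceptable to the optimizer, which is exactly the shortcut problem described above.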
The absence of these human reasoning capabilities is precisely what makes machine learning models take shortcuts, with their seemingly non-intuitive behavior. This problem becomes particularly prominent in the presence of infrequent but significant events, for which machines lack generalization schemes. For that reason, such events are also referred to as "black swans", which captures the essence of the issue in a more figurative fashion. This quaint metaphor has its origin in the long-prevailing assumption in Europe that all swans are white. A system such as a self-driving car AI might only have been exposed to white swans during training. In the absence of sophisticated reasoning mechanisms, the car's control system might react in a rather unpredictable way when confronted with something new. Given the sheer infinite combinatorial space of concepts in the real world, mastering black swans requires that a model possess a notion of transfer in terms of concepts. Knowing the concept of "animal" with the subgroup of "swan" and the concept of color, it should be able to connect the two without having seen this combination before. That is why mastering black swans entails acquiring the capability to conceptualize during training so as to facilitate a transfer of concepts. However, as the space of combinations is huge, gauging plausibility at inference time is crucial, which directly connects the problem to common sense. Commonsense reasoning, with its inherent ambiguity in terms of concepts and their relationships, constitutes a case in point in this regard. To truly reason about common sense, a model has to come up with a process of concept disentanglement and compositional inference. Now that we know a bit more about common sense and its importance and have touched on its intersection with AI, how is common sense actually defined in the AI space? If you expect a crisp definition, you might be disappointed again.
However, one of the first definitions of common sense in AI was put forward by AI pioneer John McCarthy, who actually coined the term "artificial intelligence." In his seminal work "Programs with Common Sense" (1958) he wrote:
“We shall therefore say that a program has common sense if it automatically deduces for itself a sufficiently wide class of immediate consequences of anything it is told and what it already knows.
Our ultimate objective is to make programs that learn from their experience as effectively as humans do.”
Assessing Commonsense Reasoning
Given the vagueness of commonsense reasoning, we need a somewhat objective measure to assess the commonsense reasoning capabilities of programs and check such claims. However, as you might have already guessed, this is anything but a trivial endeavor. One of the most well-known challenges in this regard is the Winograd Schema Challenge (WSC), which was devised as an alternative to the famous Turing Test. A Winograd schema is a pair of sentences containing two nouns and an ambiguous pronoun, with the sentences differing in as little as one word. The challenge involves resolving the pronoun correctly to one of the nouns. The difference between the sentences flips the solution from one sentence of the pair to the other. A key characteristic of the test is that humans are able to resolve the pronouns with no difficulty, whereas an AI without commonsense reasoning cannot distinguish the candidates. Therefore, human experts created the set of challenge tasks, incorporating different kinds of commonsense entities.
To make things a bit more concrete, let us look at a very popular example of WSC:
1) The trophy doesn’t fit in the suitcase because it is too small.
2) The trophy doesn’t fit in the suitcase because it is too big.
Answer Candidates: A) the trophy B) the suitcase
In this example, the nouns are "the trophy" and "the suitcase," with the ambiguous pronoun being "it." As can be seen, changing the adjective from "too small" to "too big" changes the direction of the relationship, which makes the task extremely hard. Thus, resolving it entails the conceptualization of an item (trophy) and a container (suitcase) via the relation (fitting). It should therefore be clear that understanding the underlying high-level concepts allows resolving all kinds of combinations; i.e., replacing the suitcase with some other container, the AI system should still come to the same conclusion.
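A common way to attack a Winograd schema mechanically is candidate substitution: replace the ambiguous pronoun with each candidate noun and keep the variant that a language model deems more plausible. The sketch below makes these mechanics runnable; the `plausibility` function is a hand-crafted stand-in that hard-codes one commonsense fact, where a real system would query a language model:

```python
# Candidate-substitution sketch for the Winograd Schema Challenge.
# `plausibility` is a toy stand-in for a language-model score.

def substitute(sentence, pronoun, candidate):
    # Replace the last standalone occurrence of the pronoun.
    head, _, tail = sentence.rpartition(" " + pronoun + " ")
    return head + " " + candidate + " " + tail

def plausibility(sentence):
    # Stand-in scorer encoding a single commonsense fact:
    # containers must be big enough, contents small enough.
    if "suitcase is too small" in sentence or "trophy is too big" in sentence:
        return 1.0
    return 0.0

def resolve(sentence, pronoun, candidates):
    # Pick the candidate whose substituted sentence scores highest.
    return max(candidates,
               key=lambda c: plausibility(substitute(sentence, pronoun, c)))

candidates = ["the trophy", "the suitcase"]
s1 = "The trophy doesn't fit in the suitcase because it is too small."
s2 = "The trophy doesn't fit in the suitcase because it is too big."

print(resolve(s1, "it", candidates))  # the suitcase
print(resolve(s2, "it", candidates))  # the trophy
```

The single adjective swap flips the answer, while the toy scorer only works because the relevant fact was written in by hand; acquiring such facts automatically, for any item and any container, is exactly the open problem.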
Now that you are familiar with common sense and a way to test it, let us discuss how commonsense reasoning has been approached technically.
Commonsense Reasoning in AI
A lot of time has passed since John McCarthy put forward his definition of common sense in AI in the 1950s. Despite the advances in machine learning since then, not much changed for a long time in terms of true commonsense reasoning capabilities. Recently, however, the topic has regained popularity, which can be attributed to the progress in NLP and the importance of the task. Unsurprisingly, there exists a plethora of approaches to tackling commonsense reasoning, which can roughly be clustered into three groups:
- Rule and knowledge-based approaches
- Generic AI approaches
- AI language model approaches
Current best-performing approaches come from the latter category. The underlying assumption of these methods is that their training corpora, such as encyclopedias, implicitly contain some commonsense knowledge that the model can absorb. However, this assumption is problematic because such texts rarely spell out common sense, precisely because it is assumed to be trivial. These methods usually follow a two-stage learning pipeline. Starting from an initial self-supervised model, commonsense-aware word embeddings are obtained in a subsequent fine-tuning phase. Fine-tuning forces the learned embeddings to solve the downstream WSC task merely as a plain coreference resolution task. Additionally, to fully utilize the power of language models, conventional approaches require training data annotated in terms of what is right and wrong. However, the creation of large labeled datasets and knowledge bases is cumbersome and expensive, as it is done manually by experts. This applies particularly to commonsense reasoning, where compiling the complete set of commonsense entities of the world is intractable, due to the potentially infinite number of concepts and combinations.
Language models capture the probabilities of word occurrence based on the text they are exposed to during training. Apart from capturing word statistics, neural language models also learn word embeddings, i.e., vector representations, from raw text data. The recently proposed BERT picks up the notion of language modeling in a slightly different way. Instead of optimizing a standard language model objective, modeling the probability of a word given its preceding context, BERT uses a pseudo language model objective. Specifically, BERT leverages what is known as a masked language model, which tries to complete sentences in which words were randomly replaced by a mask ("_____"):
“The trophy does not fit into the suitcase, because ____ is too big.”
To solve this task, the model trains a so-called attention mechanism. It provides cues as to which words the model should pay more attention to when solving a task. In the preceding example, more attention should go to the word "trophy" than to "suitcase", because knowing that the subject is the trophy makes the completion more plausible. However, as we will see shortly, filling in words like this is particularly challenging due to the inherent ambiguity, and it effectively requires a notion of common sense. Apart from improving model performance, self-attention also promises insights into a model's inner workings. This is quite a desirable property, as deep learning is often derided as a black box. In addition to masked word prediction, training BERT entails another auxiliary classification task: a binary objective predicting whether two sentences are successive. Taken together, this yielded embeddings that could easily be transferred by fine-tuning to a wide range of downstream tasks, which propelled the domain of NLP into a new era.
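The self-attention computation at the heart of BERT can be sketched compactly. The snippet below implements scaled dot-product attention over a toy sentence; the embedding size and the random projection matrices are illustrative stand-ins for what a trained model would learn:

```python
import numpy as np

# Scaled dot-product self-attention: each token computes a weight over all
# tokens (including itself) and outputs a weighted mixture of their values.
# Embeddings and projections are random stand-ins; BERT learns them.

rng = np.random.default_rng(0)

tokens = ["the", "trophy", "is", "too", "big"]
d = 8                                   # embedding size (illustrative)
X = rng.normal(size=(len(tokens), d))   # token embeddings

# Query/key/value projections (learned in a real model, random here).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)           # pairwise token-to-token affinities
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
output = weights @ V                    # attention-weighted mixture of values

print(weights.shape)  # (5, 5): one attention distribution per token
```

Each row of `weights` is exactly the kind of distribution one can inspect to see which words the model "paid attention to", which is what makes self-attention a window into the model's behavior.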
What’s up next?
In the next blog post, to be published soon, we will discuss two approaches to commonsense reasoning developed at SAP AI Research that leverage the BERT language model and have recently been published at ACL (the Annual Meeting of the Association for Computational Linguistics), the premier conference in the field of computational linguistics. Our research has focused on algorithms with minimal supervision, so as not to establish shortcuts for shallow task solving. Thus, we will start with an unsupervised approach that directly exploits the self-attention of the BERT language model without any further fine-tuning. Afterward, we will present a more powerful approach that operates in a self-supervised fashion and outperforms supervised methods despite being only weakly supervised.