Two minutes NLP — How to measure Commonsense Reasoning

Winograd, OpenBookQA, Textual Entailment, Plausible Inference, and Intuitive Psychology

Fabio Chiusano
NLPlanet
5 min read · Feb 16, 2022


Examples of commonsense tasks in NLP. Image by the author.

Hello fellow NLP enthusiasts! Today I did some research on a very interesting area of NLP: commonsense reasoning. In particular, I explored the approaches used today to measure how well models can use commonsense knowledge. Enjoy! 😄

What is Commonsense Reasoning

To understand language, humans use a variety of knowledge and reasoning. For example, consider these sentences:

Jack needed some money, so he went and shook his piggy bank. He was disappointed when it made no sound.

From this, we can deduce that Jack did not find any money, and because of that, he felt negative emotions. Our reasoning process, which is often called commonsense reasoning, enabled us to connect pieces of knowledge so that we could reach this conclusion, which was not stated explicitly in the passage. Piggy banks hold coins (not pigs) and coins are metal money. If a container such as a piggy bank is shaken, the coins will make a sound, because metal is a hard solid. If there is no sound, then there are no coins. As piggy banks are typically owned by children, it is likely that Jack is a child.

How we all probably imagine the scene.

We can make these inferences based on similar experiences from our own childhood, reaching similar conclusions by analogy. The ability to understand and reason this way comes naturally to humans, but it is much more challenging for machines. Even though natural language processing has made significant advances in the last few decades, machines are still a long way from being able to perform this kind of Natural Language Inference (NLI).

Early attempts to acquire and represent commonsense knowledge relied on information extraction or crowdsourcing to build large knowledge graphs. Advances in pre-trained language models, however, have pushed machines closer to human-like understanding capabilities, which raises the question of whether commonsense still needs to be injected explicitly through symbolic knowledge integration. Although these models show impressive performance improvements on a variety of NLP tasks, it remains unclear whether they are performing complex reasoning or simply learning complex surface correlation patterns.

Because downstream tasks make it difficult to measure progress in commonsense reasoning directly, efforts to develop robust, dedicated benchmarks have increased.

How to measure Commonsense Reasoning

Turing may have been the first to propose an experiment, widely known as the Turing test, to determine whether a machine is intelligent. However, the Turing test has been criticized by some for encouraging machines to deceive humans, for failing to provide the continuous feedback needed for incremental development, and for being impractical in several ways. It has therefore been replaced by benchmark datasets, which offer training and testing data, evaluation frameworks, and continuous numerical feedback on various language tasks.

What follows is a list of the most widely used benchmarks for commonsense reasoning.

Reference Resolution

Reference resolution is the process of determining which entity, typically mentioned earlier in the text, a particular expression, e.g., a pronoun or noun phrase, refers to. When a sentence contains multiple entities and pronouns, this process can become significantly more complicated, and external knowledge, such as commonsense knowledge, is needed to inform the decision.

A benchmark for Reference Resolution with commonsense reasoning is the Winograd Schema Challenge. In this challenge, systems are presented with questions about sentences known as Winograd schemas. Answering requires disambiguating a pronoun whose referent may be either of two entities, and whose correct referent flips when just one word in the sentence is changed. Because of this, the linguistic context alone is rarely enough to disambiguate the pronoun, and outside knowledge is required.

Example data from the Winograd Schema Challenge. Image from https://arxiv.org/pdf/1904.01172.pdf.
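To make the task concrete, here is a minimal sketch (not from the challenge's official tooling) of a Winograd schema pair written as plain Python data, with a placeholder resolver standing in for a real model. It shows how changing a single word flips the correct referent and how accuracy would be computed.

```python
# A Winograd schema pair: changing one word ("large" -> "small") flips
# which entity the pronoun "it" refers to. The resolver is a placeholder;
# a real system would rank the candidates using commonsense knowledge.

schema_pair = [
    {
        "text": "The trophy doesn't fit into the suitcase because it is too large.",
        "pronoun": "it",
        "candidates": ["the trophy", "the suitcase"],
        "answer": "the trophy",
    },
    {
        "text": "The trophy doesn't fit into the suitcase because it is too small.",
        "pronoun": "it",
        "candidates": ["the trophy", "the suitcase"],
        "answer": "the suitcase",
    },
]

def resolve_pronoun(example):
    """Placeholder resolver: always picks the first candidate."""
    return example["candidates"][0]

correct = sum(resolve_pronoun(ex) == ex["answer"] for ex in schema_pair)
print(f"Accuracy: {correct / len(schema_pair):.2f}")  # 0.50, i.e., chance level
```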

Question Answering

Most benchmarks don’t provide a focused language processing task like reference resolution; instead, they test a combination of language processing and reasoning skills within one task. Question answering (QA) is such a comprehensive task, especially in its recent reading-comprehension form, in which a system is given a passage and asked questions about it to demonstrate its understanding of the passage.

A benchmark for Question Answering with commonsense reasoning is the OpenBookQA dataset. It contains about 6,000 four-way multiple-choice science questions, which may require elementary science facts (provided in an accompanying “book” of facts) as well as broader commonsense knowledge.

Example data from the OpenBookQA dataset. Image from https://arxiv.org/pdf/1904.01172.pdf.
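As a rough illustration, here is a short sketch of how one might inspect the four-way multiple-choice format and compute a random-guess baseline. It assumes the Hugging Face `datasets` library and the “openbookqa” dataset identifier on the Hub; the field names follow the schema exposed there and may differ in other distributions of the data.

```python
import random
from datasets import load_dataset

# Assumes the "openbookqa" dataset card on the Hugging Face Hub.
data = load_dataset("openbookqa", "main", split="validation")

example = data[0]
print(example["question_stem"])       # the question
print(example["choices"]["text"])     # the four answer options
print(example["answerKey"])           # the gold label, e.g. "B"

# Random-guess baseline: expected accuracy is ~25% on 4-way questions.
hits = sum(random.choice(ex["choices"]["label"]) == ex["answerKey"] for ex in data)
print(f"Random-guess accuracy: {hits / len(data):.2%}")
```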

Textual Entailment

The goal of Textual Entailment is to recognize a directional relationship between a text and a hypothesis: the text entails the hypothesis if a typical person would infer that the hypothesis is true given the text.

A benchmark for Textual Entailment with commonsense reasoning is the series of RTE Challenges. The Recognizing Textual Entailment (RTE) Challenge aims to evaluate a machine’s acquisition of the common background knowledge and reasoning capabilities that a typical human needs to determine whether one text entails another.

Example data from the RTE Challenge. Image from https://arxiv.org/pdf/1904.01172.pdf.
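As a quick illustration of how entailment can be scored in practice, here is a sketch that runs an off-the-shelf NLI model on a text/hypothesis pair. The “roberta-large-mnli” checkpoint is used purely as an example (it is trained on MNLI, a three-way variant of the task, not on the RTE data itself), and the label order should be checked against the model card.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Illustrative checkpoint; any NLI model fine-tuned for entailment would do.
model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

text = "Jack shook his piggy bank and heard nothing."
hypothesis = "The piggy bank was empty."

inputs = tokenizer(text, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

# Label order assumed from the model's config (0: contradiction,
# 1: neutral, 2: entailment); verify via model.config.id2label.
for label, p in zip(["contradiction", "neutral", "entailment"], probs):
    print(f"{label}: {p:.3f}")
```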

Plausible Inference

While textual entailment benchmarks require drawing firm conclusions, other benchmarks require inferring hypothetical, intermediate, or uncertain conclusions from a limited context; this is referred to as plausible inference.

A benchmark for Plausible Inference with commonsense reasoning is COPA. In the Choice of Plausible Alternatives (COPA) task, a system is given a premise and must choose which of two alternatives is the more plausible cause or effect, which requires commonsense knowledge about what usually takes place in the world.

Example data from the COPA dataset. Image from https://arxiv.org/pdf/1904.01172.pdf.
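One common (though by no means the only) way to approach COPA is to score each alternative as a continuation of the premise with a language model and pick the more likely one. The sketch below uses GPT-2 as an illustrative model and a COPA-style item written out by hand.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

premise = "The man broke his toe. What was the cause?"
alternatives = ["He got a hole in his sock.", "He dropped a hammer on his foot."]

def sentence_loss(text):
    """Average token-level cross-entropy under the language model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Lower loss means the continuation is more likely under the model.
scores = [sentence_loss(f"{premise} {alt}") for alt in alternatives]
print("More plausible:", alternatives[scores.index(min(scores))])
```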

Intuitive Psychology

Among the most important domains for plausible inference is intuitive psychology, since inferring emotions and intentions from behavior is a fundamental human skill.

A benchmark for Intuitive Psychology with commonsense reasoning is SocialIQA. Social Intelligence QA (SocialIQA) is a multiple-choice question-answering benchmark containing 45,000 questions, each with three candidate answers, that require intuitive psychology and commonsense knowledge of social interactions.

Example data from the SocialIQA dataset. Image from https://arxiv.org/pdf/1904.01172.pdf.
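To give a feel for the format, here are two made-up SocialIQA-style items written as plain Python data (the field names are illustrative, not the dataset’s exact schema); the same language-model scoring idea sketched above for COPA could be applied to rank the three candidate answers.

```python
# Made-up SocialIQA-style items: questions about motivations, emotional
# reactions, and likely next actions in everyday social situations.
items = [
    {
        "context": "Jordan stayed up all night helping a friend study for an exam.",
        "question": "How would the friend feel afterwards?",
        "choices": ["grateful", "annoyed", "indifferent"],
        "answer": 0,
    },
    {
        "context": "Remy spilled coffee on a stranger's laptop.",
        "question": "What will Remy most likely do next?",
        "choices": ["walk away laughing", "apologize and offer to help", "order more coffee"],
        "answer": 1,
    },
]

# Trivial "always pick the first choice" baseline, for comparison with a real model.
baseline_accuracy = sum(item["answer"] == 0 for item in items) / len(items)
print(f"First-choice baseline accuracy: {baseline_accuracy:.2f}")
```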
