How Natural Language Inference Models “Game” the Task of Learning

Sam Bowman, Assistant Professor of Linguistics and Data Science, explores how state-of-the-art NLI models rely on annotation artifacts

The goal of natural language inference (NLI), a widely studied natural language processing task, is to determine whether one given statement (a premise) semantically entails another given statement (a hypothesis). To generate data for this task, CDS Faculty Member Sam Bowman previously orchestrated the crowdsourcing of two large-scale datasets of human-annotated inferences, known as SNLI and MultiNLI. Bowman directed annotators to respond to a given premise with three hypotheses: one entailment, one neutral, and one contradiction.

An example of annotations from Bowman’s SNLI dataset:

A woman selling bamboo sticks talking to two men on a loading dock.
Entailment: There are at least three people on a loading dock.
Neutral: A woman is selling bamboo sticks to help provide for her family.
Contradiction: A woman is not taking money for any of her sticks.

Typically, a text classification model would be given the premise and asked to classify these crowdsourced hypotheses as entailment, neutral, or contradiction. But in a new study, Bowman, with a team of researchers, explored whether a text classification model can classify human-generated hypotheses without being given the premise by relying on annotation artifacts instead.

Annotation artifacts are unintentional patterns that human crowd workers leave in the dataset, and these patterns can signal a hypothesis's label. Through a statistical analysis, the researchers found that generic words (animal, instrument, outdoors) were associated with entailed hypotheses; modifiers (tall, sad, popular) and superlatives (first, favorite, most) with neutral hypotheses; and negation words (no, nobody, never, nothing) with contradictory hypotheses. Sentence length also proved to be an important artifact: entailments often contained fewer words than neutral hypotheses.
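The kind of word–class association the researchers measured can be illustrated with pointwise mutual information (PMI), a standard statistic for this purpose. The sketch below uses a tiny set of hypothetical labeled hypotheses (not drawn from SNLI) to show how negation words surface as strongly associated with the contradiction class:

```python
import math
from collections import Counter

# Toy labeled hypotheses (hypothetical examples, not from SNLI itself).
data = [
    ("there are people outdoors", "entailment"),
    ("an animal is near a person", "entailment"),
    ("a tall woman sells her favorite sticks", "neutral"),
    ("the most popular vendor is sad", "neutral"),
    ("nobody is on the dock", "contradiction"),
    ("there is no woman and nothing for sale", "contradiction"),
]

def pmi(examples):
    """Pointwise mutual information between each word and each class:
    PMI(w, c) = log[ p(w, c) / (p(w) * p(c)) ]."""
    word_class, word, cls = Counter(), Counter(), Counter()
    total = 0
    for text, label in examples:
        for w in set(text.split()):  # count each word once per example
            word_class[(w, label)] += 1
            word[w] += 1
            cls[label] += 1
            total += 1
    return {
        (w, c): math.log(word_class[(w, c)] * total / (word[w] * cls[c]))
        for (w, c) in word_class
    }

scores = pmi(data)
# Negation words are positively associated with contradiction in this toy set.
print(scores[("no", "contradiction")] > 0)   # → True
print(scores[("nobody", "contradiction")] > 0)  # → True
```

On real data, smoothed PMI computed over the full vocabulary surfaces exactly the word lists described above.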

Bowman and collaborators used a text classifier called fastText to measure the effect of these artifacts. They found that, without access to the premise, fastText correctly classified 67% of the hypotheses in the SNLI dataset and over 50% of the hypotheses in the MultiNLI dataset — well above the roughly 33% expected from guessing among the three balanced classes. These results demonstrate that the hypotheses alone, via their annotation artifacts, carry substantial signal about the correct label.

The researchers note, “This raises an important question about state-of-the-art NLI models: to what extent are they ‘gaming’ the task by learning to detect annotation artifacts?” They addressed this question by training three high-performing NLI models on the SNLI and MultiNLI datasets. The three models were then evaluated on the full datasets, hard subsets (the examples that fastText classified incorrectly), and easy subsets (the examples that fastText classified correctly).
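The easy/hard partition is simple to express in code. This sketch uses a toy negation-word rule as a stand-in for the trained fastText baseline:

```python
def split_easy_hard(examples, baseline_predict):
    """Partition (hypothesis, gold_label) pairs by whether a premise-blind
    baseline gets them right: 'easy' examples are solvable from artifacts
    alone, 'hard' examples presumably require the premise."""
    easy, hard = [], []
    for hypothesis, gold_label in examples:
        if baseline_predict(hypothesis) == gold_label:
            easy.append((hypothesis, gold_label))
        else:
            hard.append((hypothesis, gold_label))
    return easy, hard

# Toy baseline standing in for fastText: negation words signal contradiction.
def toy_baseline(hypothesis):
    negations = {"no", "not", "nobody", "never", "nothing"}
    return "contradiction" if negations & set(hypothesis.split()) else "entailment"

examples = [
    ("there are people on a dock", "entailment"),
    ("nobody is on the dock", "contradiction"),
    ("the woman is selling sticks to feed her family", "neutral"),
]
easy, hard = split_easy_hard(examples, toy_baseline)
print(len(easy), len(hard))  # → 2 1
```

Comparing a model's accuracy on the two subsets, as the researchers did, reveals how much of its overall score comes from the artifact-solvable portion of the data.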

These NLI models performed significantly better on the easy datasets, exposing the extent to which they leveraged annotation artifacts. Based on these results, Bowman and his collaborators concluded that annotation artifacts inflate model performance. To mitigate models’ reliance on annotation artifacts in the future, Bowman suggests artifacts be balanced across classes. His work reveals that, despite recent reported progress, natural language inference is still very much an open problem.

By Paul Oliver
