Dialogue NLI: New Training Dataset for NLI Models

Labeled sentence pairs improve performance of NLI models

Human: “Do you have any pets?”

Model: “Yes, I have two cats.”

Human: “Oh cool, what are your cats’ names?”

Model: “I don’t have any pets!”

This is an example of the kind of interaction between a human user and a dialogue model that current research is working to prevent. As you can see, the model’s second response is confusing and contradictory. Historically, dialogue models have struggled to produce consistent responses in conversation, which can result in a jarring user experience. Each utterance the model produces still makes sense on its own, just not in conversational context, and that is precisely what makes these inconsistencies difficult to prevent.

Because dialogue models aim to converse fluently with a real person, consistency is essential. Consistency errors occur when a model contradicts itself; such responses come across as failing to “remember” preferences or statements expressed earlier in the conversation. Persona consistency errors are similar, but occur when a dialogue model expresses a sentiment that contradicts an element of its assigned persona. To combat persona consistency issues, related work has attempted to train models by providing a list of personality traits to learn and adhere to, much like a spy learning a cover story before going undercover. Unfortunately, the resulting conversations still suffered from consistency errors.

Kyunghyun Cho, Assistant Professor of Computer Science and Data Science, Sean Welleck, NYU Department of Computer Science, Jason Weston, NYU and Facebook AI Research, and Arthur Szlam, Facebook AI Research, recently published research for which they created a Natural Language Inference (NLI) dataset. The dataset, Dialogue NLI, contains labeled sentence pairs for training a natural language inference model, which can then be used to improve a dialogue model. The researchers noted, “Dialogue generation can be framed as next utterance prediction.” Each example takes the form (premise, hypothesis), along with a human-annotated label describing how the two sentences relate: the premise entails the hypothesis, the premise is neutral with respect to the hypothesis, or the premise contradicts the hypothesis. The pairs were thus labeled “entailment,” “neutral,” or “contradiction.”
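
To make the labeled-pair format concrete, here is a minimal Python sketch of what one training example might look like. The class name, field names, and example sentences are illustrative assumptions, not the dataset’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class NLIPair:
    """One labeled sentence pair (schema is hypothetical)."""
    premise: str     # e.g., a persona sentence or earlier utterance
    hypothesis: str  # e.g., a candidate next utterance
    label: str       # "entailment", "neutral", or "contradiction"

# Illustrative examples based on the conversation above.
examples = [
    NLIPair("I have two cats.", "I own more than one pet.", "entailment"),
    NLIPair("I have two cats.", "My favorite color is blue.", "neutral"),
    NLIPair("I have two cats.", "I don't have any pets.", "contradiction"),
]

for pair in examples:
    print(f"({pair.premise!r}, {pair.hypothesis!r}) -> {pair.label}")
```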

An NLI model trained on this dataset helped a dialogue model re-rank candidate utterances, promoting responses that are consistent with what has already been said. The re-ranked responses were clearly more consistent with the description of the model’s persona. The results demonstrate that NLI improves performance on a downstream dialogue task. The researchers also discuss this new application of NLI, noting that there are opportunities to extend the approach to include even more functionality.
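
The exact re-ranking procedure is not detailed here, but the idea can be sketched as follows: score each candidate utterance with the dialogue model, then penalize candidates the NLI model judges to contradict a persona sentence. In this hedged sketch, the `rerank` function, the scoring callables, and the penalty-based rule are assumptions for illustration, not the paper’s precise method.

```python
def rerank(candidates, persona_sentences, dialogue_score,
           contradiction_prob, penalty=1.0):
    """Re-rank candidate utterances, demoting those the NLI model
    judges to contradict any persona sentence (illustrative rule)."""
    def score(utterance):
        # Worst-case contradiction against any persona sentence.
        max_contra = max(
            contradiction_prob(premise, utterance)
            for premise in persona_sentences
        )
        return dialogue_score(utterance) - penalty * max_contra

    return sorted(candidates, key=score, reverse=True)

# Demo with stub scoring functions (a real system would use a trained
# dialogue model and a trained NLI model):
if __name__ == "__main__":
    persona = ["I have two cats."]
    candidates = ["Their names are Mia and Max.", "I don't have any pets!"]
    print(rerank(
        candidates,
        persona,
        dialogue_score=lambda utt: 1.0,  # stub fluency/relevance score
        contradiction_prob=lambda p, utt: 0.9 if "any pets" in utt else 0.1,
    ))
```

The key design point is that the NLI model acts as an external consistency check on the dialogue model’s candidate responses, so the dialogue model itself does not need to be retrained.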

By Sabrina de Silva