Does Utterance Entail Intent?: Evaluating Natural Language Inference Based Setup for Few-Shot Intent Detection

Ayush Kumar
The Observe.AI Tech Blog
4 min readSep 15, 2022

This blog refers to our technical paper accepted at Interspeech 2022, Incheon, South Korea

The problem at hand

Intent detection is one of the core tasks in the spoken language understanding (SLU) domain. In practical settings, we only have sufficient labeled examples for a small set of intents; for the remaining intents, we have access to only a handful of labeled data points per class. The set with sufficient labeled examples is referred to as the base (seen) set, while the set containing a handful of labeled examples per class is referred to as the support (novel) set. The machine learning task is to classify an input into one of these support classes. This setup is generally referred to as few-shot intent detection. In a more practical setup, the task is to classify an input into the combined intent space of base and support set intents. This setup is referred to as generalized few-shot intent detection.

The problem at hand is to train better few-shot and generalized few-shot intent detection systems.

Proposed Setup: NLI-FSL

  • We utilize an entailment-based approach named NLI-FSL, as presented in our work. We draw inspiration from Natural Language Inference (NLI) and convert the intent detection task into a textual entailment task that measures the truth value of a hypothesis (class-label) given a premise (utterance). We conjecture that intent class-label names such as add-to-playlist and rate-book carry semantics that reflect the meaning of utterances in their respective intents. Thus, we transform intent detection into textual entailment by setting the input utterance as the premise and the intent's class-label name as the hypothesis.
  • NLI-FSL is a two-step process: we first transform the raw dataset into an NLI-formatted dataset by sampling both positive (entailed) and negative (not-entailed) examples. In the next step, we fine-tune a pre-trained language model (PLM) on the transformed dataset and then use it for inference. We utilize the BERT model as the PLM in all our experiments.
Architecture overview of the proposed method: NLI-FSL
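The first step described above can be sketched as follows. This is a minimal illustration; the negative-sampling scheme here (a fixed number of randomly drawn wrong labels per utterance) is an assumption for demonstration, not necessarily the exact scheme used in the paper.

```python
import random

def to_nli_examples(dataset, label_names, neg_per_pos=1, seed=0):
    """Convert (utterance, intent) pairs into NLI-style
    (premise, hypothesis, label) triples.

    Positive: utterance paired with its own intent name -> "entailed".
    Negative: utterance paired with a different intent name -> "not-entailed".
    """
    rng = random.Random(seed)
    nli = []
    for utterance, intent in dataset:
        # Positive (entailed) example: premise = utterance, hypothesis = true label name.
        nli.append((utterance, intent, "entailed"))
        # Negative (not-entailed) examples: sample from the other label names.
        negatives = [l for l in label_names if l != intent]
        for neg in rng.sample(negatives, min(neg_per_pos, len(negatives))):
            nli.append((utterance, neg, "not-entailed"))
    return nli

# Toy usage with hypothetical utterances and SNIPS-style label names.
data = [("play some jazz on my party list", "add-to-playlist"),
        ("give this novel five stars", "rate-book")]
labels = ["add-to-playlist", "rate-book", "book-restaurant"]
examples = to_nli_examples(data, labels)
```

The resulting triples can be fed directly into standard sentence-pair fine-tuning of a PLM such as BERT.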

During inference, we transform the query/test set into the NLI form, similar to the training dataset. For each test utterance u and the label space Y, the predicted label is the intent whose hypothesis receives the highest probability for the entailment label e.
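The inference rule above reduces to an argmax over per-label entailment scores. The sketch below uses a toy token-overlap scorer as a stand-in; in the actual method, the scorer would be the fine-tuned BERT NLI model's probability for the entailed class.

```python
def predict_intent(utterance, label_space, entail_prob):
    """Return the label whose hypothesis gets the highest entailment probability."""
    scores = {label: entail_prob(utterance, label) for label in label_space}
    return max(scores, key=scores.get)

# Stand-in scorer: fraction of label-name tokens appearing in the utterance.
# This only illustrates the interface; the real scorer is the fine-tuned PLM.
def overlap_scorer(premise, hypothesis):
    p = set(premise.lower().split())
    h = set(hypothesis.lower().replace("-", " ").split())
    return len(p & h) / max(len(h), 1)

label_space = ["add-to-playlist", "rate-book", "book-restaurant"]
pred = predict_intent("add this song to my playlist", label_space, overlap_scorer)
```

Because every candidate label is scored independently, new (novel) intents can be added to the label space at inference time without retraining, which is what makes the setup attractive for few-shot settings.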

Datasets

We evaluate our proposed methodology on four benchmark Intent Detection (ID) datasets for the FSID/GFSID tasks. These datasets represent real-world queries in different domains such as banking, airline travel, and customer service:

  • SNIPS is a dataset of crowd-sourced queries distributed among 7 user intents of diverse complexity [19]. We use three intents in the novel set and four in the seen set.
  • Airline Travel Information System (ATIS) is a domain-specific intent detection benchmark for flight reservation queries [20]. In line with previous works, we select 4 classes as novel classes and the remaining 12 as seen classes.
  • BANKING77 is a single-domain dataset composed of online banking queries annotated with their corresponding intents. We use 27 of the 77 classes as novel classes and take the remaining 50 as seen classes.
  • In order to compare the method on a dataset containing a higher number of labels, we utilize the CLINC150 dataset [22]. It is a multi-domain intent detection dataset containing a total of 150 classes. We take 50 classes as novel and the remaining 100 as seen classes.

Results

  • The results show that the NLI-FSL approach outperforms all methods in both 1-shot and 5-shot FSID and GFSID settings. Our method also produces the best harmonic F1 score, suggesting its effectiveness in balancing performance between the FSID and GFSID settings.
  • NLI-FSL is especially strong in GFSID settings, where it outperforms the other baselines by a minimum of 5% in F1 score. We obtain distinctively better results in the 1-shot GFSID setup, with gains of up to 20% over the baselines. This shows that the proposed method is more effective than the baselines in a challenging yet more practical setup.
  • Baseline methods also show a larger drop in performance when moving from the 5-shot setup to the more challenging 1-shot setup than NLI-FSL does. For example, in 1-shot FSID on the SNIPS dataset, the drop is 11.83% for DNNC but only 5.67% for NLI-FSL (a similar trend can be observed on the other datasets).
  • Further analysis shows that the NLI-FSL method is particularly effective on large label-space datasets (BANKING77 and CLINC150), outperforming the most competitive baseline, DNNC, by a significant margin.
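The harmonic F1 score mentioned above is, as commonly defined in generalized few-shot evaluation, the harmonic mean of the two F1 scores being balanced; the sketch below assumes that definition.

```python
def harmonic_f1(f1_a, f1_b):
    """Harmonic mean of two F1 scores (e.g., seen-class and novel-class F1).

    The harmonic mean is dominated by the smaller of the two values,
    so a high score requires doing well on both sides at once.
    """
    if f1_a + f1_b == 0:
        return 0.0
    return 2 * f1_a * f1_b / (f1_a + f1_b)
```

For instance, a method scoring 0.8 on one side but only 0.6 on the other gets a harmonic F1 of about 0.686, noticeably below the arithmetic mean of 0.7.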

Conclusions and Takeaways

  • We propose NLI-FSL, a method for few-shot intent detection that utilizes the semantics of label names in an NLI-based entailment task.
  • The proposed setup outperforms competitive approaches in both 1-shot and 5-shot FSID and GFSID settings by >7–8% F1 score across datasets.
  • The method performs especially well under large label spaces, as evidenced by its performance on CLINC150, a dataset with 150 intent classes.

Learn more about how we’re changing conversation intelligence for contact centers around the world at Observe.AI.
