Early intent detection using n-gram language models

Ivan Kunyankin
Devexperts
Aug 22, 2022 · 5 min read

Offer relevant solutions to your chatbot users while they are still typing

Intent suggestions. Image by Tanju Sami from Devexperts

In this post, we’ll discuss how to use n-gram language models to detect user intent early, while the user is still typing. The actions corresponding to the top-K most probable intents can then be shown to the user. We call them intent suggestions.

The idea behind intent suggestions is similar to autocomplete, which uses the words the user has already entered to make predictions. But instead of predicting the next word, we try to deduce what the user actually wants and serve them a list of corresponding actions.

The purpose of this feature is to save our chatbot users time and make sure they end up getting the help they need. We use a similar approach in Devexa, our smart assistant for traders.

Table of contents

  1. Problem formulation
  2. N-gram language models
  3. Intent detection using n-gram language models

Problem formulation

Now that we understand the idea and motivation behind intent suggestions, let’s formulate a list of desired characteristics of our future solution:

  1. The solution needs to be fast, with low resource consumption: it will typically have to make predictions several times for a single user query.
  2. It must operate on partial phrases and even separate words.
  3. It needs to reflect the statistical probability of each intent being implied. We will often have to make predictions from phrases that don’t carry enough context to derive the intent (e.g., “I can’t” may imply multiple intents). For this reason, we want the intents whose training data most often contain the input phrase to be ranked as the most probable.

The established list of characteristics, or restrictions, narrows down a set of possible solutions for us. Instead of using a neural network (which is often a go-to option), we’ll be using a simple yet powerful concept of n-gram language models.

N-gram language models

We will briefly go over the main idea, look at the training and inference processes, and then move on to the proposed approach.

In general, an n-gram language model calculates relative frequency counts of the words that follow every (n-1)-gram found in the training data. In other words, during training the model slides a window of n-1 words over the texts and records which words it saw following that window and how often.

These frequency counts can later be used to calculate probabilities of different words being the next word in the input sequence.

N-gram language model training. Image by author

As a result of training, we get a list of all the n-grams found in the data together with their frequency counts, or, more conveniently, a dictionary with (n-1)-grams as keys and the observed n-th words with their frequency counts as values.
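As an illustration, here is a minimal sketch of how such counts might be collected, assuming the usual dictionary-of-counters representation (the function name and the toy corpus are invented for the example):

```python
from collections import defaultdict, Counter

def train_ngram_lm(texts, n=3):
    """For every (n-1)-gram, count which words follow it and how often."""
    counts = defaultdict(Counter)
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - n + 1):
            context = tuple(words[i:i + n - 1])   # the (n-1)-gram
            next_word = words[i + n - 1]          # the word that follows it
            counts[context][next_word] += 1
    return counts

counts = train_ngram_lm(["i can't log in to my account",
                         "i can't close my position"], n=3)
print(counts[("i", "can't")])   # Counter({'log': 1, 'close': 1})
```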

During inference, we use the last n-1 words of the input sequence to look for the most frequent next word and return it as our prediction.

N-gram language model prediction. Image by author

The probability of a specific word being the next word is equal to the frequency count of the corresponding word divided by the sum of frequency counts of all the words found after the given (n-1)-gram.

Probability calculation. Image by author
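Continuing the toy sketch above, the follower counts of the last (n-1) words can be normalized into a probability distribution over candidate next words:

```python
def next_word_probabilities(counts, context_words, n=3):
    """Normalize the follower counts of the last (n-1) words into probabilities."""
    context = tuple(w.lower() for w in context_words[-(n - 1):])
    followers = counts.get(context)
    if not followers:
        return {}                                  # unseen (n-1)-gram
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}

print(next_word_probabilities(counts, ["i", "can't"]))
# {'log': 0.5, 'close': 0.5}
```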

The beauty of n-gram models for some applications is that they don’t try to interpret the meaning of the input sequence to predict the next word; they rely solely on the statistics of the training data. This is their main advantage for our purpose.

Intent detection using n-gram language models

Let’s discuss the actual code now. We’ll take a look at the core pieces. The full version can be found here.

Notation. Given that we are trying to detect a user’s intent rather than the next word, we’ll use a slightly different notation. N in our case will represent the number of words used to predict intent probabilities. That is, a 3-gram (or trigram) model will use three words to make predictions.

The first step is model initialization. Notice that we also initialize a child model with n equal to n-1. Once initialized, it will create its own child model — and so forth until n equals 1.

This recursive approach allows us to take into consideration frequency counts from smaller n-grams in case there is no match for the parent model.

The recursive parameter represents the weight applied to the child models’ frequency counts. The smoothing parameter prevents division by zero and performs label smoothing.
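Since the original snippet isn’t reproduced here, below is a minimal sketch of what such an initializer could look like. The names NgramIntentModel, recursive_weight and smoothing are illustrative and follow the parameters described above; the actual implementation in the repo may differ.

```python
from collections import defaultdict, Counter

class NgramIntentModel:
    def __init__(self, n, recursive_weight=0.3, smoothing=1e-3):
        self.n = n
        self.recursive_weight = recursive_weight   # weight applied to the child model's counts
        self.smoothing = smoothing                 # keeps every intent's count non-zero
        self.labels = set()                        # intents seen during training
        self.counts = defaultdict(Counter)         # n-gram -> Counter of intent labels
        # recursively create child models down to n = 1
        self.child = NgramIntentModel(n - 1, recursive_weight, smoothing) if n > 1 else None
```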

The overall training process comes down to repeating the following step for every (text, label) pair in the data: we slide a window of n words over the text and, for each n-gram, increment its counter for the corresponding intent. We then run the same procedure on our child model.
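A sketch of the corresponding training step, continuing the hypothetical NgramIntentModel class from above:

```python
    # continuation of the NgramIntentModel sketch above
    def fit(self, texts, labels):
        """Slide a window of n words over each text and count n-gram / intent co-occurrences."""
        for text, label in zip(texts, labels):
            self.labels.add(label)
            words = text.split()
            for i in range(len(words) - self.n + 1):
                ngram = tuple(words[i:i + self.n])
                self.counts[ngram][label] += 1
        if self.child is not None:
            self.child.fit(texts, labels)   # the child model counts its own, shorter n-grams
        return self
```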

That’s it for training. Let’s now explore the inference essentials. Our goal is to gather frequency counts for the last n-gram of the input context from both the parent and child models. We start with a non-empty counter (to prevent division by zero) and update it with the parent model’s frequency counts (if any) and the child model’s weighted frequency counts.
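A sketch of that count-gathering step, again continuing the hypothetical class above:

```python
    # continuation of the NgramIntentModel sketch above
    def _get_counts(self, words):
        """Collect intent counts for the last n words, backed off to the child model."""
        # start from a non-empty counter so we never divide by zero later
        counts = Counter({label: self.smoothing for label in self.labels})
        ngram = tuple(words[-self.n:])
        counts.update(self.counts.get(ngram, {}))        # parent model's counts, if any
        if self.child is not None:
            for label, count in self.child._get_counts(words).items():
                counts[label] += count * self.recursive_weight
        return counts
```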

Once we have the frequency counts, we can calculate intent probabilities by dividing each intent’s count by the sum of all intents’ counts.
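And the final normalization step, together with a small usage example on toy data, still continuing the same sketch:

```python
    # continuation of the NgramIntentModel sketch above
    def predict(self, text):
        """Return a probability for each intent given a (possibly partial) input phrase."""
        counts = self._get_counts(text.split())
        total = sum(counts.values())
        return {intent: count / total for intent, count in counts.items()}


model = NgramIntentModel(n=3).fit(
    ["i can't log in to my account", "i can't close my position"],
    ["login_issue", "close_position"],
)
print(model.predict("i can't log"))   # login_issue gets the highest probability
```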

The last thing to keep in mind is that you should probably pre-process your texts before both training and inference so that there are no redundant variations of the same phrase in the counters.
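As a rough illustration (the actual pre-processing may well differ), even lower-casing and stripping punctuation removes most redundant variations:

```python
import re

def preprocess(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())

print(preprocess("I CAN'T log-in!"))   # "i can't log in"
```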

That’ll be it for intent suggestions. A little reminder: the full version of the code is available here. If you find it useful, consider giving it a ⭐️.

Closing thoughts

Let me know if you have any questions. You can also reach out to me via LinkedIn.

To learn more about Devexperts’ products and AI development services, visit our website.
