Cracking the Code: Understanding Customer Intent in Multilingual Chatbot Conversations

Ankit Sharma
Published in Airtel Digital · 9 min read · Aug 7, 2023

1. Introduction

At Airtel, we believe that intent detection forms the foundation of any advanced conversational AI bot. In this blog, we will share our comprehensive approach to understanding, from conversational data, the specific reasons behind our customers reaching out to us. The Help section on the Airtel Thanks app effectively addresses over 200K queries per day, delivering prompt assistance in seven different languages to ensure our customers’ concerns are resolved efficiently. This blog post aims to shed light on the challenges we encountered and the ingenious methods we employed to consistently deliver a best-in-class customer experience.

Airtel Thanks App VoiceBot

What is the customer doing on the Help chatbot?

What is an intent in a conversation engine? In simple words, an intent refers to the purpose or goal behind a user’s query or action. It’s the underlying reason or objective that the user is trying to communicate or achieve.

Here are a few examples –

  • If a user writes the query “I am not able to hear someone’s voice while talking”, the user’s intent is to complain about network quality. Hence, the intent is “network issue”.
  • Similarly, if a user writes “How much recharge is enough to keep my number active?”, the user wants to know the minimum recharge amount that keeps the service from being deactivated. Here, the intent is “minimum recharge information”.

In this way, we have identified dozens of relevant issues over time across all the products (DTH, Broadband, Prepaid & Postpaid) served on the Airtel Thanks App. A few more examples of such intents are “recharge issues”, “payment issues”, “ask balance”, “bill issues”, “change language”, “customer care”, “missed call alert” and “sms not working”.

Once you have the set of pre-defined intents ready, what next?

2. Moving to model development

With a set of pre-defined intents ready, we now require a model to map a user’s input query to one of them. We aim to achieve this via a deep-learning multi-class classification model, but first let’s look at how we prepare the data for it.

2.1 Data preparation

Initially, we started with a training set consisting of a few thousand real user queries that were randomly selected and internally labelled by domain experts. To continuously enhance the model’s performance, we analysed its predictions in production. We randomly selected query sets, identified influential examples, and labelled them. These labelled examples were regularly added to the model’s training set in each sprint.

Recognising the limitations of this manual approach, we sought a more effective solution. As a result, we automated this task using active learning techniques to select the most influential examples. This automation significantly reduced the time needed for selecting such examples and, at the same time, improved the accuracy of our model.

Flowchart of Active learning framework
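The blog does not spell out the exact selection criterion, but a common active learning strategy that fits this description is margin-based uncertainty sampling. Here is a minimal sketch, assuming the current model’s softmax outputs are available for a pool of unlabelled production queries:

```python
import numpy as np

def select_influential_examples(probs: np.ndarray, k: int = 500) -> np.ndarray:
    """Pick the k queries the model is least sure about (uncertainty sampling).

    probs: (n_queries, n_intents) softmax outputs from the current model.
    Returns indices of queries to send for labelling.
    """
    # Margin = gap between the top-2 intent probabilities; a small
    # margin means the model is torn between two intents.
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return np.argsort(margin)[:k]
```

Queries returned by such a function would then go to the labellers, replacing the earlier random selection.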

2.2 Data augmentation

Remember, we also need to support Indian vernacular languages without relying on any external translation or transliteration service, as that could increase response time during inference. For instance, users may write Hindi using the Roman script, such as “Mera internet nahi chal raha hai.”

To achieve this, we implemented the following steps -

  • We performed translation and transliteration of all the English-language examples in our train set into six different vernacular languages, and then augmented them back into the train set.
  • To maximise the benefits of the augmentation process, we employed a technique called back translation. This involved translating a query from English to, let’s say, Hindi, then translating the obtained Hindi query back to English, and subsequently back to Hindi again. This iterative process helped improve the quality and diversity of the augmented data.
Data augmentation using back translation
  • Although one might expect this process to yield the same query as before, that is not the case. When converting the previously translated Hindi text back to English, the translation system frequently produces different words with meanings similar to the original ones. Consequently, we acquire an entirely distinct query to incorporate into our dataset.
Example of Back translation producing two distinct Hindi queries from a single English source

We took the same round-trip approach for transliteration as well. However, for this task we used the IndicBERT library from HuggingFace.
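For illustration, here is a minimal back-translation sketch using the HuggingFace transformers pipeline. The Helsinki-NLP MarianMT checkpoints below are assumptions for the example; the blog does not name the actual translation system used:

```python
from transformers import pipeline

# Hypothetical model choices; the blog does not name its translation system.
en_to_hi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
hi_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")

def back_translate(query: str) -> tuple[str, str]:
    """English -> Hindi -> English -> Hindi, yielding two distinct Hindi variants."""
    hi_1 = en_to_hi(query)[0]["translation_text"]   # first Hindi version
    en_2 = hi_to_en(hi_1)[0]["translation_text"]    # paraphrased English
    hi_2 = en_to_hi(en_2)[0]["translation_text"]    # second, usually different, Hindi version
    return hi_1, hi_2

print(back_translate("My internet is not working"))
```

Both Hindi variants, plus the paraphrased English query, can be added to the train set under the original query’s intent label.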

2.3 Model architecture

Initially, we relied on the StarSpace model provided by RASA NLU. However, we encountered a limitation with this model, as it could only operate with a single language at a time. This led us to realise the importance of a model architecture that can support multiple languages simultaneously, without requiring explicit language specification. Hence, we utilised XLM-RoBERTa, a multilingual model developed by Facebook that had been pre-trained on text from a diverse range of 100 languages, including all the Indian vernacular languages we aimed to support.
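As a sketch of this setup, the multilingual classifier can be instantiated from the public xlm-roberta-base checkpoint with one output per intent (the label count below is an assumption based on the “60+” intents mentioned later):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

NUM_INTENTS = 60  # assumption: the blog mentions "60+" intents

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=NUM_INTENTS
)
# The classification head is freshly initialised here; it would be
# fine-tuned on the labelled intent dataset before serving.

# The same model handles English, Devanagari Hindi and romanised Hindi,
# with no language tag needed at inference time.
inputs = tokenizer("Mera internet nahi chal raha hai", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs.argmax(dim=-1).item(), probs.max().item())
```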

3. How do we measure the performance of model?

We established standard data science metrics to evaluate the performance of our intent model, including accuracy and F1-score. We monitored these metrics at different granular levels, such as the language level and the intent level. All these metrics were derived on an offline evaluation set.

In addition to the offline metrics above, we also defined an online production metric called “intent prediction %”, which can be computed without any labels for the queries. It is defined as follows:
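(The original definition appears as an image in the post; since the metric is computed without any labels, a plausible reconstruction is:)

intent prediction % = (number of queries whose predicted intent crosses the confidence threshold ÷ total number of queries) × 100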

This was also tracked across different granular levels: language level, text/voice input level, and intent level.
The reason for using this metric is not only to determine whether the model identifies an intent but also to ensure it does so with a high level of confidence. This is crucial because the model’s confidence directly impacts the user’s experience, as explained below:

  • When the model’s confidence falls in HIGH bucket, we proceed confidently to address the user’s issue.
  • If the confidence is in MEDIUM bucket, we seek user affirmation by presenting the predicted intent and asking, “Did you mean…?”
  • If confidence is in LOW bucket, we identify emerging intents and update the model accordingly.

By employing these confidence thresholds, we aim to provide a more reliable and intuitive user experience while also ensuring the accuracy of intent identification.
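As an illustration, this bucket logic can be expressed as a simple routing function. Only the ≥90% HIGH threshold is stated later in the blog (section 4.4); the MEDIUM boundary below is a placeholder assumption:

```python
# Illustrative routing logic; the actual bucket boundaries are Airtel-internal.
HIGH, MEDIUM = 0.90, 0.60  # MEDIUM is an assumed placeholder value

def route(intent: str, confidence: float) -> str:
    if confidence >= HIGH:
        return f"resolve:{intent}"                  # proceed with the user's journey
    if confidence >= MEDIUM:
        return f"confirm:Did you mean {intent}?"    # ask for user affirmation
    return "log_for_intent_mining"                  # feed emerging-intent discovery
```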

4. Problems/challenges

Below are some of the challenges we encountered while building the intent detection system, along with their solutions.

4.1 Challenges in Preparing Labelled Data for Numerous Intents

With a large number of intents (60+), we encountered difficulties in educating labellers about each intent’s definition, which could lead to mis-labelling and bias.

To overcome this, we adopted a labelling strategy involving three labellers per query, with majority voting for the final label to enhance accuracy. Some intents had significant overlap, necessitating separate downstream journeys for user-specific issues. To minimise labelling ambiguities, we provided example queries for each intent, especially those with significant overlap.

These measures ensured the accuracy and consistency of the labelled dataset despite the complexities of numerous intents and potential ambiguities.
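A minimal sketch of the majority-vote step, with an assumed tie-break policy of sending disagreements back for review:

```python
from collections import Counter

def final_label(votes: list[str]):
    """Majority vote over three labellers. Returns None when all three
    disagree, so the query goes back for review (an assumed policy)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None

print(final_label(["network issue", "network issue", "recharge issues"]))  # network issue
```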

4.2. Automatically discovering emerging intents

New products and services continually give rise to emerging customer issues. Manually analysing low-confidence queries to detect them is time-consuming, especially with a high volume of approximately 200K queries per day.

To streamline this process, we employed intent mining techniques inspired by the research paper “Intent Mining from past conversations for Conversational Agent”. These techniques involve representing queries in an embedding space using sentence encoders and clustering them using density-based methods.

This approach is effective because queries related to new topics form smaller clusters with low volumes, easily identifiable by their size. Each cluster is assigned a confidence score derived from the mean of the intent model confidence scores for its queries. By ranking these clusters based on the average confidence score, we can efficiently identify potential new topics or issues without extensive manual analysis. Random or irrelevant queries, being distinct, do not form clusters and are accumulated into the noise cluster by the clustering algorithm.
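A minimal sketch of this pipeline, assuming a multilingual sentence-transformers encoder and scikit-learn’s DBSCAN as the density-based clusterer (the blog does not name the specific encoder or clustering algorithm):

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN

# Assumed encoder choice; the blog only says "sentence encoders".
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def mine_emerging_intents(queries, model_confidences, eps=0.3):
    embeddings = encoder.encode(queries, normalize_embeddings=True)
    # Density-based clustering; label -1 is DBSCAN's noise cluster,
    # which absorbs random or irrelevant queries.
    labels = DBSCAN(eps=eps, min_samples=5, metric="cosine").fit_predict(embeddings)
    clusters = []
    for cid in set(labels) - {-1}:
        idx = np.where(labels == cid)[0]
        clusters.append({
            "size": int(len(idx)),
            "avg_confidence": float(np.mean([model_confidences[i] for i in idx])),
            "examples": [queries[i] for i in idx[:3]],
        })
    # Low average intent-model confidence ranks a cluster as a likely new topic.
    return sorted(clusters, key=lambda c: c["avg_confidence"])
```

Small clusters with low average confidence surface at the top of this ranking and can be reviewed as candidate new intents.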

4.3. Mitigating Noise and Bias through Consensus Modelling

Despite taking steps to minimise mis-labels and bias, instances of mis-labels in our labelled dataset were still present. To address this “garbage in, garbage out” phenomenon in machine learning, we adopted consensus modelling. Consensus modelling aims to eliminate noise and bias from the dataset, as noisy samples can lead to fluctuating predictions when exposed to models trained on different distributions.

The image depicts the probability distribution from the consensus model for a noisy-label sample: as soon as the sample is held out of the training set, the models assign significantly high probability to a class other than its given label.

Distribution of consensus model predictions for a noisy sample

By leveraging consensus modelling techniques, we sought to improve the model’s robustness and reliability by reducing the impact of noisy and biased samples.
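One simple way to realise this idea, sketched here with a linear model and scikit-learn’s out-of-fold predictions (the production setup uses transformer models and is likely more involved): each sample is predicted by a model that never saw it during training, and a confident disagreement with the given label flags it as potentially noisy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_noisy_labels(X, y, n_folds=5, threshold=0.9):
    """X: feature matrix (e.g. sentence-encoder embeddings of the queries).
    y: given labels, assumed encoded as integers 0..K-1.
    Returns indices of samples whose labels should be re-reviewed."""
    # Out-of-fold probabilities: each row is predicted by a model
    # trained on the other folds, never on the sample itself.
    probs = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=n_folds, method="predict_proba"
    )
    predicted = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    # Confident disagreement with the given label => suspected noisy label.
    return np.where(confident & (predicted != np.asarray(y)))[0]
```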

4.4. Boosting Confidence through Consensus Modelling

As mentioned earlier, we expect the intent model to identify intents with high confidence (≥90%) in order to address user issues effectively. However, we noticed that some queries, while correctly predicted, fell just below the 90% confidence threshold. To boost confidence, we employed consensus modelling to identify confidence-boosting queries.

To achieve this, we randomly selected queries from production and ran them through our ensemble of models. Queries with confidence levels exceeding 90% in every model were chosen, and the label for these queries was determined by the intent ranked highest across the models. By adding these selected queries to our training set, we observed increased confidence for specific query types, even in challenging scenarios with limited labelled examples. Leveraging consensus modelling in this manner improved the model’s performance and confidence for the relevant query types.

Confidence boosting consensus model pipeline
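A sketch of the selection step, assuming the ensemble’s softmax outputs are stacked into a single array; since the blog’s exact “maximum rank” rule isn’t spelled out, the consensus label below is approximated by the summed probability across models:

```python
import numpy as np

def select_boosting_queries(ensemble_probs: np.ndarray, threshold: float = 0.9):
    """ensemble_probs: (n_models, n_queries, n_intents) softmax outputs.
    Returns indices of queries where every model is >= threshold confident,
    along with a consensus label for each (approximation of the blog's rule)."""
    max_conf = ensemble_probs.max(axis=2)               # (n_models, n_queries)
    keep = (max_conf >= threshold).all(axis=0)          # confident in every model
    labels = ensemble_probs.sum(axis=0).argmax(axis=1)  # intent ranked highest overall
    return np.where(keep)[0], labels[keep]
```

The selected (query, label) pairs are then appended to the training set for the next fine-tuning run.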

4.5. Addressing Domain-Specific Terminology and Vocabulary

Chatbot conversations can contain abbreviations and Airtel product-specific terms like Xstream Fiber, DTH, FASTag, Hero Recharge, Thanks app, etc. These terms have context-specific meanings rather than their conventional ones. Although the intent model correctly identified these terms in queries, its confidence level was relatively low.

To address this, we improved the model’s contextual understanding by training the XLM-RoBERTa base model from scratch on a random dump of approximately 2 million production queries, after necessary cleaning. This domain pretraining significantly impacted the model’s performance, as seen in the screenshots of the masked-word prediction task. The model demonstrated a robust understanding of domain-specific knowledge, making it more resilient to typographical errors and enhancing overall accuracy and performance.

Two examples demonstrating the impact of domain pre-training on the masked-word prediction task
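A sketch of the domain pretraining step with the HuggingFace Trainer and a masked-language-modelling objective. For brevity it continues from the public xlm-roberta-base checkpoint (a true from-scratch run would initialise the model from a config instead), and “queries.txt” is a hypothetical file of cleaned production queries, one per line:

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# "queries.txt": cleaned production queries (hypothetical file name).
dataset = load_dataset("text", data_files={"train": "queries.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="airtel-domain-xlmr", num_train_epochs=3),
    train_dataset=dataset,
    # Masks 15% of tokens so the model learns terms like "Xstream" or "DTH" in context.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```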

4.6. Optimising Inference Time to Handle Scalability

With a volume of approximately 200K queries per day, minimising the model’s inference time is crucial. To achieve this, we utilised the Optimum library, which offers a HuggingFace compatible API for quantising and optimising the model to enable accelerated inference. Through this process, we successfully reduced the model’s inference time by roughly half, ensuring efficient handling of the large query load.
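A sketch of this optimisation with the Optimum library: export the fine-tuned model to ONNX, then apply dynamic int8 quantisation (“intent-model” is a placeholder path, and the AVX2 configuration is an assumption about the serving hardware):

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export the fine-tuned intent model to ONNX ("intent-model" is a placeholder path).
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "intent-model", export=True
)
ort_model.save_pretrained("intent-model-onnx")

# Dynamic int8 quantisation; the AVX2 target is an assumed CPU configuration.
quantizer = ORTQuantizer.from_pretrained("intent-model-onnx")
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir="intent-model-quantised", quantization_config=qconfig)
```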

5. Summary

In this article, we explored the significance of intent detection, a vital component of the Airtel Thanks App Help and Support VoiceBot, which handles over 200K queries daily in multiple languages. We highlighted the challenges faced and the innovative solutions employed in data preparation, model development, and handling domain-specific terminology. The adoption of consensus modelling to reduce noise and bias, along with confidence-boosting techniques, improved predictions with higher confidence levels.

As a result, the intent detection system has become highly accurate and efficient, contributing to an enhanced customer experience for Airtel users. Looking ahead, while conversational AI and large language models show promise, the company remains cautious due to telecom data privacy and regulation guidelines.

Authors: Ankit Sharma, Devi Prasad Khatua, Aditya Jain

Acknowledgement

Thanks to Abhishek Singh, Dipyaman Banerjee, and the entire product and business team for their guidance and support in this endeavour.

References

Chatterjee, A., & Sengupta, S. (2020). Intent Mining from Past Conversations for a Conversational Agent. In Proceedings of the 28th International Conference on Computational Linguistics (COLING).