Using Semantic Search to Drive Smart Annotations for Chatbot Models

Samarth Agarwal
Published in DBS Tech Blog · 11 min read · Jan 21, 2022

Following developments in natural language processing, chatbots have become an integral part of customer service. However, the use of chatbots is not just limited to the consumer space.

At DBS, an internal unified chatbot has been designed to enhance the employee experience by answering employees' queries across several domains, including HR, IT, risk management, communication, utility and support. Employees may ask questions or submit requests to apply for leave, check legal documents, find information, or connect to IT services.

DBS’ Employee Chatbot Model

The internal chatbot uses an intent classification model that detects a user's intent from their input, in the form of a question or utterance, and responds accordingly, either providing relevant information or starting a guided conversation (GC) to address the user's request.

Each response by the model is tagged with a model confidence score. Based on internal validation, a threshold of 0.4 has been set for low confidence; questions below this threshold are tagged as "unknowns". These conversations are passed to a content management system (CMS), where they are available to annotators from different business units. The annotated conversations are then fed back into the training data and the model is re-trained on the improved data.
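As a rough sketch of this routing step (the classify_intent stub and its dummy prediction are placeholders for the production intent model, not the actual implementation):

```python
from typing import Tuple

LOW_CONFIDENCE_THRESHOLD = 0.4  # threshold chosen via internal validation

def classify_intent(utterance: str) -> Tuple[str, float]:
    """Placeholder for the production intent classification model."""
    return "leave_apply", 0.35  # dummy prediction for illustration only

def route_utterance(utterance: str) -> str:
    intent, confidence = classify_intent(utterance)
    if confidence < LOW_CONFIDENCE_THRESHOLD:
        # Low-confidence conversations are tagged "unknown" and surfaced in the
        # CMS for annotators from the relevant business units.
        return "unknown"
    return intent

print(route_utterance("how do I apply for annual leave"))  # -> "unknown" with the dummy score
```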

The Problem

During annotation, business units need to go through all the unknown sentences in order to provide annotations. This creates unnecessary and repetitive work. In addition, annotations are currently done manually: if an annotator knows the relevant intent well, the utterance can be tagged quickly and added back to the training data; otherwise, they have to go through a reference file with a long list of intents to find the best one.

As such, in a recent paper (Smart Annotation using Semantic Search for Employee Chatbot Platform in Financial Services) prepared for the KDD 2021 — MLF workshop, my colleagues from the DBS Analytics Centre of Excellence, John Jianan Lu, Steven Yang-Yu Tseng, Ying Yang Lee, Xuejie Zhang, and I outlined the successful application of a smart-annotation solution to ease the process of scanning through a plethora of “unknown” conversations during annotation.

Designing a Solution for Smart Annotation

With these issues in mind, the proposed smart annotation solution leverages semantic search techniques to give users annotation suggestions, with the aim of reducing the time users spend checking and searching for matching labels. The solution had two objectives:

  1. Split the unknown sentences in the chatbot into relevant domains and send the relevant sentences to the different domain teams, reducing the workload by a factor of 3 to 4 (provided the domain prediction accuracy is high)
  2. Provide 3 intent suggestions for annotators — reducing the lookup time for potential intents

In our experiments, we used utterance data (as of December 2020) as training data, and for our final validation, we used actual annotated records from January 2021. The chatbot data was organised in the following format:

Utterance data for training

Given an input conversation, the semantic search model finds the most similar utterances in the training data. In the utterance data, each utterance is tagged with a single intent, topic and domain; hence, these are extracted for each of the similar utterances and provided as model suggestions.

When providing the top 3 intent suggestions for each sentence, results are de-duplicated to give the top 3 unique intents. For example, for the query "staff id card", the top matching utterances might all come from the same intent, "new_staff_id"; we count them as a single suggested intent. The de-duplication step allowed us to provide 3 unique intents, selected from the top to the bottom of the query results.
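A minimal sketch of this de-duplication step follows; the example matches and intent names other than "new_staff_id" are made up for illustration.

```python
def top_k_unique_intents(ranked_matches, k: int = 3):
    """ranked_matches: (utterance, intent) pairs sorted by similarity, best first.
    Returns up to k unique intents, preserving rank order."""
    suggestions = []
    for _, intent in ranked_matches:
        if intent not in suggestions:
            suggestions.append(intent)
        if len(suggestions) == k:
            break
    return suggestions

# Two of the top matches share the intent "new_staff_id", so it is suggested only once.
matches = [("replace my staff id card", "new_staff_id"),
           ("apply for a new staff id", "new_staff_id"),
           ("update my staff particulars", "update_profile"),
           ("staff id card access issue", "id_access")]
print(top_k_unique_intents(matches))  # ['new_staff_id', 'update_profile', 'id_access']
```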

For domain prediction, the domain of the most similar training utterance is provided as the domain suggestion. In production, if the top 3 intents come from different domains, annotators from all of these business units can view the input conversation.

Performance measurement

Performance of the solution was measured on a set of annotations using the following metrics (a sketch of the top-K computation follows this list):

  1. Average domain prediction accuracy
  2. Top K topic accuracy: If any of the top K topic suggestion(s) is the same as ground truth, it is considered correct
  3. Top K intent accuracy: If any of the top K intent suggestion(s) is the same as ground truth, it is considered correct
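The top-K accuracy computation can be sketched as follows; the intent labels and predictions in the example are made up for illustration.

```python
def top_k_accuracy(suggestions_per_query, ground_truths, k: int = 3) -> float:
    """suggestions_per_query: ranked label lists, one per query; ground_truths: gold labels.
    A query counts as correct if its gold label appears anywhere in the top-k suggestions."""
    correct = sum(
        gold in suggested[:k]
        for suggested, gold in zip(suggestions_per_query, ground_truths)
    )
    return correct / len(ground_truths)

# Example with K = 1 and K = 3 on three annotated queries.
preds = [["new_staff_id", "update_profile", "id_access"],
         ["leave_apply", "leave_balance", "leave_cancel"],
         ["vpn_setup", "wifi_access", "laptop_request"]]
gold = ["update_profile", "leave_apply", "laptop_request"]
print(top_k_accuracy(preds, gold, k=1))  # 0.33...
print(top_k_accuracy(preds, gold, k=3))  # 1.0
```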

For our evaluation, we set K as 1 and 3. Ideally, Top 1 intent accuracy would be very high so that the whole process could be automated. However, due to the high number of intents (1,000+), many with only fine differences between them, we evaluated Top 3 accuracy as well.

As such, for each annotation input, the algorithm searches for the top 3 most similar utterances and retrieves the corresponding intents, which is a k-nearest-neighbour search problem. To optimise query speed, we explored both exact search and approximate nearest neighbour (ANN) search methods for retrieving matching candidates. Under the ANN method, the data is partitioned into smaller fractions of similar embeddings, so the index can be searched efficiently even with millions of points in the vector space. It is worth noting that ANN results are not necessarily exact; vectors with high similarity can sometimes be missed. It is a trade-off between speed and accuracy, and since we were not dealing with millions of data points, we used exact search.
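A minimal sketch of exact cosine k-nearest-neighbour search over precomputed embeddings (the random vectors below stand in for real utterance embeddings):

```python
import numpy as np

def exact_knn_cosine(query_vec: np.ndarray, corpus: np.ndarray, k: int = 3):
    """Exact nearest-neighbour search by cosine similarity.
    query_vec: (d,) query embedding; corpus: (n, d) training utterance embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = c @ q                     # cosine similarity to every training utterance
    top_idx = np.argsort(-sims)[:k]  # indices of the k most similar utterances
    return top_idx, sims[top_idx]

# Random vectors stand in for real sentence embeddings.
rng = np.random.default_rng(0)
idx, scores = exact_knn_cosine(rng.normal(size=768), rng.normal(size=(1000, 768)), k=3)
```

An ANN library such as Faiss or Annoy would replace the full matrix product with an index lookup, at the cost of possibly missing some true neighbours.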

Experimental Results

A Term Frequency-Inverse Document Frequency (TF-IDF) based lexical search approach was used as a baseline for the semantic search solution. We found that considering only nouns as vocabulary gave better accuracy. Part-of-speech tagging was applied with the Penn Treebank tag set using the Natural Language Toolkit (NLTK). This was applied to the training utterances, and tokens were further filtered by document frequency to form the final vocabulary for unigram TF-IDF vectorisation. The fixed vocabulary and IDF weights were then used to transform incoming data. Cosine distance was used to determine whether utterances were similar.
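A sketch of this noun-only TF-IDF baseline using NLTK and scikit-learn; the sample utterances and the min_df setting are illustrative placeholders, not the production values.

```python
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def noun_tokens(text: str):
    """Keep only noun tokens (Penn Treebank tags NN, NNS, NNP, NNPS)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    return [tok for tok, tag in tagged if tag.startswith("NN")]

# Illustrative training utterances; the production vocabulary also applied a
# document-frequency cut-off (min_df here is a placeholder value).
train_utterances = ["apply for a new staff id card",
                    "check my annual leave balance",
                    "how do i claim my flight tickets"]

vectorizer = TfidfVectorizer(tokenizer=noun_tokens, lowercase=False,
                             ngram_range=(1, 1), min_df=1)
train_matrix = vectorizer.fit_transform(train_utterances)  # fixed vocabulary + IDF weights
query_matrix = vectorizer.transform(["staff id card"])     # incoming data reuses the same fit
```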

This baseline method was able to match the domain of validation utterances with good accuracy. However, it performed poorly on intent suggestions.

The bag-of-words approach ignores the overall context of utterances and fails where overlapping keywords are used for different intents or topics. An example that demonstrates this shortfall is a query asking about flight risk, i.e. the risk of staff leaving the organisation. Without considering semantic meaning, the tokens "flight" and "risk" instead match most strongly to the more frequent utterances about claiming flight tickets and about business operational risks.

In addition, we evaluated various pre-trained sentence-BERT models, such as stsb-distilbert-base, stsb-bert-base and stsb-roberta-base, on the validation dataset. We found that stsb-roberta-base performed the best among them; even without any additional training, it outperformed the baseline TF-IDF model.
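A sketch of how a pre-trained model can be queried against the training utterances using the sentence-transformers library; the corpus here is a made-up stand-in for the real utterance data.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("stsb-roberta-base")  # pre-trained sentence-BERT model

# Illustrative stand-in for the training utterances.
train_utterances = ["apply for a new staff id card",
                    "check my annual leave balance",
                    "how do i claim my flight tickets"]

corpus_emb = model.encode(train_utterances, convert_to_tensor=True)
query_emb = model.encode(["staff id card"], convert_to_tensor=True)

# Exact cosine-similarity search over the encoded training utterances.
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(train_utterances[hit["corpus_id"]], round(hit["score"], 3))
```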

However, after analysing misclassifications from the pre-trained models, we found that they performed poorly on domain-specific conversations. As the next step, we used the chatbot data to fine-tune the stsb-roberta-base model so that it became more familiar with our specific dataset. To fine-tune the model, sentence pairs needed to be created with a similarity score. The table below shows the data creation logic.

Similarity data created with utterance pairs

Because each utterance is tagged with an intent, topic and domain, sentence pairs can be grouped accordingly. The similarity is ordinally defined and is not exact.

The amount of data generated can be huge. For example, the similarity=1 scenario alone would generate 435,000 sentence pairs for our data. By sampling to reduce the data for the similarity<1 scenarios, the final generated data contained up to 2 million training examples.

For training, datasets of 300k and 500k pairs were used, where pairs were prioritised (to surface difficult sentence pairs) using the following conditions; a sketch of this prioritisation follows the list:

  • for similarity=1, sentence pairs with fewer overlapping words are preferred and,
  • for similarity<1, sentence pairs with more overlapping words are preferred.
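The sketch below illustrates this overlap-based prioritisation; the word_overlap measure and the even split between positive and other pairs are simplifying assumptions, not the exact production logic.

```python
def word_overlap(a: str, b: str) -> int:
    """Count of shared tokens between two utterances (simple whitespace split)."""
    return len(set(a.lower().split()) & set(b.lower().split()))

def prioritise_pairs(pairs, n):
    """pairs: list of (sentence_a, sentence_b, similarity) tuples.
    For similarity = 1, keep pairs with FEW overlapping words (hard positives);
    for similarity < 1, keep pairs with MANY overlapping words (hard negatives)."""
    positives = sorted((p for p in pairs if p[2] == 1),
                       key=lambda p: word_overlap(p[0], p[1]))   # ascending overlap
    others = sorted((p for p in pairs if p[2] < 1),
                    key=lambda p: -word_overlap(p[0], p[1]))     # descending overlap
    return positives[: n // 2] + others[: n // 2]
```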

Siamese networks were used to fine-tune the pre-trained stsb-roberta-base model with Cosine Similarity Loss, which minimises the difference between the label and the cosine similarity of the two sentence vectors. Below are the training details and parameters, followed by a sketch of the fine-tuning step:

  • Data size: 300,000 sentence pairs
  • Number of epochs: 2
  • Batch size: 8
  • Warm up steps: 10% of total steps
  • Learning rate: 2e-5
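A sketch of the fine-tuning step with the parameters above, using the sentence-transformers training API; train_pairs and its similarity scores are illustrative stand-ins for the generated similarity data.

```python
import math
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("stsb-roberta-base")

# Stand-in for the generated similarity data: (sentence_a, sentence_b, score).
train_pairs = [
    ("apply for annual leave", "how do i submit a leave request", 1.0),
    ("apply for annual leave", "check my leave balance", 0.75),
    ("apply for annual leave", "reset my laptop password", 0.0),
]
train_examples = [InputExample(texts=[a, b], label=float(s)) for a, b, s in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=8)

# CosineSimilarityLoss feeds both sentences through the same (Siamese) encoder and
# minimises the gap between the label and the cosine similarity of the two embeddings.
train_loss = losses.CosineSimilarityLoss(model)

epochs = 2
warmup_steps = math.ceil(len(train_dataloader) * epochs * 0.1)  # 10% of total steps
model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=epochs,
          warmup_steps=warmup_steps,
          optimizer_params={"lr": 2e-5})
```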

Spearman correlation between the similarity labels and the cosine similarity of the sentence embeddings was used to compare the model before and after fine-tuning, and a significant improvement was observed after fine-tuning the stsb-roberta-base model.
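A sketch of this evaluation using scipy's Spearman correlation (sentence-transformers also ships an EmbeddingSimilarityEvaluator that reports the same metric); the held-out sentence pairs and gold scores are assumed to be available as parallel lists.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

def embedding_spearman(model: SentenceTransformer, sents_a, sents_b, gold_scores) -> float:
    """Spearman correlation between gold similarity labels and the cosine
    similarity of the model's embeddings for each sentence pair."""
    emb_a = model.encode(sents_a, convert_to_tensor=True)
    emb_b = model.encode(sents_b, convert_to_tensor=True)
    cos = util.cos_sim(emb_a, emb_b).diagonal().cpu().numpy()
    return spearmanr(gold_scores, cos).correlation

# Compare the pre-trained and fine-tuned encoders on the same held-out pairs, e.g.:
# embedding_spearman(pretrained_model, a, b, gold) vs embedding_spearman(finetuned_model, a, b, gold)
```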

For the final search solution, we combined the TF-IDF vectors and the sentence embedding vectors (from the fine-tuned model), as this captures both token-level and semantic information in the data.

Since the embedding spaces of TF-IDF and the sentence embeddings are not aligned, we took the following approach:

First, two sets of nearest neighbours by cosine distance were retrieved from the training data for the TFIDF vector and the Sentence Embedding vector of search queries respectively. Two variable weighting parameters were then multiplied by the resulting two cosine distances of the two encoding approaches, before being summed up as the final distance metric.

FinalDistance = α × CosineDistance_SE + β × CosineDistance_TFIDF,
where α + β = 1 and 0 ≤ α, β ≤ 1

The set of nearest neighbours under this final distance metric was then used to match the top topics and suggested intents. In our implementation, α was 0.66 and β was 0.34, giving almost double the weightage to the sentence embeddings over TF-IDF. The noun-only TF-IDF component allows the algorithm to focus more on the object or product of interest.
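A sketch of the weighted combination, assuming the two cosine-distance arrays over the training utterances have already been computed for a query:

```python
import numpy as np

ALPHA, BETA = 0.66, 0.34  # weights from our implementation; ALPHA + BETA = 1

def combined_top_k(dist_se: np.ndarray, dist_tfidf: np.ndarray, k: int = 3) -> np.ndarray:
    """dist_se / dist_tfidf: cosine distances from the query to every training
    utterance under the sentence-embedding and TF-IDF encodings respectively.
    Returns the indices of the k nearest neighbours under the combined distance."""
    final_distance = ALPHA * dist_se + BETA * dist_tfidf
    return np.argsort(final_distance)[:k]
```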

This combined approach slightly improved the performance over using Sentence Embedding alone for matching.

Experimental results

Analysing Errors

With many intents and utterances being added continuously by different teams, it is difficult to maintain a mutually exclusive and collectively exhaustive list of intents and utterances, even with guidelines in place. There have also been recent discussions about "DataOps", where the focus is on improving the quality of the data, which in turn improves the model. Once the model is developed, predictions are analysed to find cases where the model under-performs, and strategies are developed to address these. To address the errors in our model, the following analysis was done:

  1. Topic- or intent-level accuracy with the top 3 predicted intents: identify intents with poor performance and analyse which intents the model predicts instead of the correct one.
  2. Word-level accuracy: check whether certain words perform poorly and how many training examples contain the word.

Based on this analysis, we identified the issues below and worked with the business teams to improve the data and, subsequently, the model itself.

  1. Overlapping data in different intents: Multiple intents with very similar utterances that lead to misclassifications.
  2. Escalation cases: these are cases that cannot be handled by the model and might need oversight from a manager or HR. Because this intent covers such a broad range of possibilities, the model underperforms when predicting Escalation.
  3. Novel terms: Certain domain specific words, like system names which are under-represented in the data, were not predicted well.
  4. Bias in transformer models: we found that the language models were sensitive to numbers and dates. For example, for the query "Apply leave on 2nd September", the similarity algorithm ranks utterances containing dates higher even when they are not related to leave. This could be due to inherent data bias in the training of the base models.

Impact

By employing the smart annotation solution, we have observed a 40-60% annotation workload reduction for domains with larger shares of user queries, such as HR and IT. Furthermore, there was a 90% workload reduction for smaller-share domains like Compliance, while maintaining the number of annotations. Our top 3 suggestion accuracies on annotated cases were also in line with the accuracy numbers seen during development.

The Way Forward

Through this work, we showed that a semantic search based smart annotation solution works well for chatbots. Combining lexical and dense embeddings (fine-tuned on the domain) outperforms the other approaches. With large text datasets, DataOps strategies are recommended to improve data quality, which leads to higher quality models. By acting on the data quality suggestions, we estimate that model accuracies could improve by more than 10%.

Overall, the advantage of using a semantic search solution over text classification is 3-fold:

  1. It can work with intents that have a low utterance count (intents left out of the text classification model were excluded precisely because of low utterance count).
  2. It is still able to perform decently well with newly added utterances and intents without further training.
  3. It is more interpretable, as top similar matches are available for observation, giving us more insights on where the model might be failing.

Following this, here are some of the next steps to further improve and scale the existing solution:

  1. Multi-lingual models: Develop multilingual models across other languages like Chinese and Bahasa. This will help us scale the solution for the bank for use in other countries or regions.
  2. Adding a cross-encoder layer for re-ranking: literature surveys have shown that cross-encoder models perform well as a re-ranking step on top of the sentences retrieved in the first stage. This is because a cross-encoder looks at the query and the candidate sentence together in the same pass when predicting similarity.

Link to the Traditional Chinese version of the article.

REFERENCES

  1. Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), (Jun 2019), 4171–4186.
  2. Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi and Jacky Montmain, 2015. Semantic Similarity from Natural Language and Ontology Analysis. Synthesis Lectures on Human Language Technologies, (May 2015).
  3. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer and Veselin Stoyanov, 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
  4. Nils Reimers and Iryna Gurevych, 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. arXiv:1908.10084.
  5. James Pustejovsky and Amber Stubbs, 2012. Natural Language Annotation for Machine Learning. O’Reilly Media, Inc, (Oct 2012).
  6. Ting Liu, Andrew W. Moore, Alexander Gray and Ke Yang, 2004. An Investigation of Practical Approximate Nearest Neighbor Algorithms. Neural Information Processing Systems, (Dec 2004).
  7. Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme and Jamie Callan, 2021. Complementing Lexical Retrieval with Semantic Residual Embedding. arXiv:2004.13969, (Apr 2021).
  8. Andrew Ng. Machine Learning Yearning (Draft Version). https://d2wvfoqc9gyqzf.cloudfront.net/content/uploads/2018/09/Ng-MLY01-13.pdf
  9. Nils Reimers, 2021. Sentence Embedding Models. Retrieved From https://www.sbert.net/docs/pretrained_models.html#sentence-embedding-models
