Using BERT to Build a Whole-Of-Government Chatbot

If you haven’t used it before, Ask Jamie is a virtual assistant (also commonly known as a chatbot) implemented on government agencies’ websites to answer questions within the agencies’ areas of responsibility. Currently, each agency has its own Jamie whom you go to for questions related to that agency. So if you have a question about Primary One registration, you will ask MOE’s Jamie, and if you have a question on COVID-19, you will ask MOH’s Jamie.

This is fine if you are a public servant like me, or a well-informed citizen who is (somewhat) familiar with the responsibilities of the different agencies in Singapore. But what if someone who does not know the differences between the agencies so well, such as my elderly parents or a foreigner, wants to ask “the government” some questions? With 16 Ministries and more than 50 Statutory Boards in the Public Service, would they know which bot to go to? Is there something we can do to help them? 🤔

My colleagues in the Virtual Assistant team of the Moments of Life (MOL) Division and I thought of an idea. Why not use the wisdom of the crowd to help them? Since Ask Jamie was conceptualised in 2014, many citizens have been asking its chatbots questions. Assuming that these citizens know which agencies to direct their questions to (probably not always, but most of the time), we can use this knowledge to direct a new question to the correct agency, based on its similarity with existing questions posed to that agency!

With that, I did a Proof-Of-Concept (POC) for a “Super Jamie”, the one Jamie who knows it all, across the whole of government. Super Jamie will take any question from the public, and direct it to the correct agency’s chatbot to be answered.

Super Jamie knows it all

A Text Classification Problem

The key component in Super Jamie is a redirection engine, which is actually a text classifier, where:

  • the questions asked are the texts to be classified, and
  • the agencies receiving the questions are the class labels

Therefore, this problem can be solved using supervised machine learning! In order to use machine learning on our data, we had to do the following:

  1. Preprocess the text
  2. Organise the data
  3. Represent the text as numbers
  4. Use the data to train a machine learning model

Text Preprocessing

Preprocessing is an important step in text analytics in order to normalise the text such that minor syntactic variations will not affect its numerical representation. The Gensim library provides useful functions for text preprocessing.

We preprocessed the text using the following steps:

  1. Fix encoding issues using ftfy
  2. Remove URLs
  3. Apply lowercasing
  4. Remove punctuation
  5. Remove multiple whitespaces
  6. Remove HTML and XML tags

We then did the following to clean up the data (a sketch of the full preprocessing and clean-up pipeline is shown after the list below):

  1. Remove questions which are empty after preprocessing
  2. Remove duplicate questions from the same agency
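
Putting these steps together, a minimal sketch of the preprocessing and clean-up pipeline might look like this (assuming the raw data comes as (question, agency) pairs; this is an illustration, not the exact POC code):

```python
import re

import ftfy
from gensim.parsing.preprocessing import (
    strip_multiple_whitespaces,
    strip_punctuation,
    strip_tags,
)

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")


def preprocess_question(text: str) -> str:
    """Normalise a raw chat question before deduplication and embedding."""
    text = ftfy.fix_text(text)                # 1. fix encoding issues
    text = URL_PATTERN.sub(" ", text)         # 2. remove URLs
    text = text.lower()                       # 3. lowercase
    text = strip_tags(text)                   # 6. remove HTML/XML tags
    text = strip_punctuation(text)            # 4. remove punctuation
    text = strip_multiple_whitespaces(text)   # 5. collapse repeated whitespace
    return text.strip()


def clean_dataset(rows):
    """Drop empty questions and per-agency duplicates.

    rows: iterable of (question, agency) pairs -- an assumed format.
    """
    seen = set()
    for question, agency in rows:
        processed = preprocess_question(question)
        if not processed:                      # remove questions empty after preprocessing
            continue
        if (agency, processed) in seen:        # remove duplicates from the same agency
            continue
        seen.add((agency, processed))
        yield processed, agency
```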

Organising the Data

Removing Irrelevant Data

Getting the labels for the questions was relatively straightforward: we just had to use the agencies which received the questions as the labels. However, remember our assumption that citizens know which agencies to direct their questions to most of the time, but not all the time? This means that each agency’s chatbot could receive questions not meant for that agency (from users who did not know the correct agency to approach). We had to remove such data to increase the redirection accuracy, but how could we do so?

Fortunately for us, the current Ask Jamie chatbots were designed to answer only questions within each agency’s purview (or more accurately, within the agency’s knowledge base). Questions outside of this are answered with low confidence scores. In addition, public officers in the agencies had painstakingly tagged questions which were answered with low confidence scores but were still relevant to their agencies as “Knowledge Gaps”, so that they could add the answers to their knowledge bases. Therefore, to remove the irrelevant data, we just had to drop questions which were answered with low confidence scores and were not tagged as “Knowledge Gaps”.
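
As a rough sketch (the column names and the confidence threshold below are assumptions for illustration, not the actual Ask Jamie schema), this filtering step could look like:

```python
import pandas as pd

CONFIDENCE_THRESHOLD = 0.5  # assumed cut-off below which an answer counts as "low confidence"


def remove_irrelevant(df: pd.DataFrame) -> pd.DataFrame:
    """Keep questions answered confidently, plus low-confidence ones tagged as Knowledge Gaps."""
    answered_confidently = df["confidence_score"] >= CONFIDENCE_THRESHOLD
    tagged_knowledge_gap = df["knowledge_gap"]          # assumed boolean column
    return df[answered_confidently | tagged_knowledge_gap].reset_index(drop=True)
```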

Here are some examples of questions removed from GovTech’s chatbot dataset:

Adding a New Class for Common Questions Asked Across Multiple Chatbots

When users go to an agency’s chatbot, we know that their question is specific to that agency. However, if the same question is being asked to Super Jamie, that context is lost. For example, if a citizen asks “what is your email address?” to the GovTech chatbot, it will reply “info@tech.gov.sg”. If the citizen asks the same question to Super Jamie, what is she supposed to reply? She will need to follow up with more questions in order to ascertain which agency to direct the question to.

Since our objective was to predict the agency given the question by learning from existing data, common questions which can be asked across different agencies’ chatbots with different answers posed a big problem because we did not know which agency the user was looking for. To deal with this, we grouped the questions which had been asked to at least three agencies’ chatbots into a new class called “Shared Intent”. If the predicted class for a new question is the Shared Intent class, Super Jamie will ask follow-up questions to decide which is the right agency.
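
A minimal sketch of this re-labelling step, assuming the cleaned data is a list of (question, agency) pairs:

```python
from collections import defaultdict

SHARED_INTENT_LABEL = "Shared Intent"
MIN_AGENCIES = 3  # a question asked to at least three agencies becomes a shared intent


def relabel_shared_intents(rows):
    """rows: list of (preprocessed_question, agency) pairs (assumed format)."""
    agencies_per_question = defaultdict(set)
    for question, agency in rows:
        agencies_per_question[question].add(agency)

    relabelled = []
    for question, agency in rows:
        if len(agencies_per_question[question]) >= MIN_AGENCIES:
            relabelled.append((question, SHARED_INTENT_LABEL))
        else:
            relabelled.append((question, agency))
    # Note: duplicates of the same shared-intent question (one per agency) could be
    # deduplicated here as well, depending on how the training set is built.
    return relabelled
```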

Here are some of the popular Shared Intent questions:

For the POC, we used one year’s worth of data, which consisted of 1.32 million questions from chatbots on 41 government websites. Most of these are ministries’ or agencies’ websites, but some are not: they are either sites for specific government platforms (e.g. GeBiz, Business Grant Portal, Charity Portal) or micro-sites of an agency dealing with specific subdomains (e.g. IRAS-Individual Income Tax, IRAS-Corporate Tax, IRAS-EPES). For simplicity and brevity, I will loosely use the term “agency” to refer to the site owner of the chatbot in the rest of this post, since most of them are agencies.

After cleaning up and re-organising the data, the data distribution across the 42 classes (which include the Shared Intent class) is shown in the figure below. The agency names have been redacted and replaced with Ak, where k is an agency’s rank in the dataset by descending order of size. The data is severely imbalanced because some of the websites were not as popular as the others. In addition, some of the Ask Jamie bots were newly deployed at the time of collection, so we did not have a full year’s worth of data for them.

Distribution of data across classes

Representing Text as Numbers Using BERT

After organising and labelling the data with the correct class label, we had to convert text to numbers in order to use them in machine learning models. This was done using the BERT (Bidirectional Encoder Representations from Transformers) model, which is the state-of-the-art in Natural Language Processing (NLP).

What is BERT?

BERT has been explained in many places. I highly recommend reading Jay Alammar’s The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) and The Illustrated Transformer, as he explains very clearly, with illustrations, how BERT works. Other good explanations can be found at:

You can also read the original paper here.

To briefly summarise, BERT makes use of the Transformer architecture to learn how to represent text segments (e.g. sentences) as numbers. In this architecture, words are represented as embeddings (vectors), and the values in the embedding of a word change depending on what the other words in the sentence are (the concept of “self-attention”). Therefore, “apple” in these two sentences:

  • I bought apples from the market to make apple juice.
  • Google and Apple are fighting to be the leader in the smartphone market.

have different representations due to their surrounding words. The representations are further fed into a feed-forward neural network to form an “encoder” block, and multiple encoder blocks are stacked to give the sentence a “deep” representation.
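
To make this concrete, here is a small sketch using the Hugging Face Transformers library (an ALBERT checkpoint is used to match the later sections; the post itself does not show this code). It extracts the contextual vector of the “apple” sub-token in each sentence, and the two vectors come out different:

```python
import torch
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
model.eval()

sentences = [
    "I bought apples from the market to make apple juice.",
    "Google and Apple are fighting to be the leader in the smartphone market.",
]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt")
        tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        hidden_states = model(**inputs).last_hidden_state[0]
        # Find the first sub-token containing "apple" (ALBERT lowercases its input).
        position = next(i for i, tok in enumerate(tokens) if "apple" in tok)
        print(sentence)
        print(hidden_states[position][:5])  # first few dimensions of the contextual vector
```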

One encoding block — Image taken from Jay Alammar’s The Illustrated Transformer
Stacking encoders to form BERT — Image taken from Jay Alammar’s The Illustrated BERT, ELMo, and co.

BERT was taught how to “understand” English by making it read the whole English Wikipedia corpus and other big corpora crawled from the internet. It shows its understanding by being able to represent text using embeddings which have mathematical properties similar to the properties of the actual texts in the English language. For example, the BERT embeddings for sentences 1 and 2 have a higher cosine similarity than those for sentences 1 and 3:

  1. I live in London with my family.
  2. New York is my hometown.
  3. Sharks are endangered because they are widely hunted for their fins.

Getting the Embeddings

Using the excellent Hugging Face Transformers library, we fed the questions into a pre-trained model to get their embeddings. I used ALBERT (“A Lite BERT”) instead of the original BERT because experimental results showed that ALBERT was able to achieve better results with less memory consumption and run-time, due to a huge reduction in the number of parameters. The following diagram shows how a sentence is converted to an ALBERT embedding:

Getting an ALBERT embedding of a sentence

The [CLS] token is a special token whose embedding represents the whole sentence’s embedding after the model is fine-tuned (explained in the next section). The alternative to using the [CLS] token for the sentence representation is to take the average, maximum, or minimum of each dimension across all the word embeddings in the sentence. Conceptually, during the fine-tuning stage, some sort of pooling (e.g. max, min, average, or a combination) is learnt, through multi-headed self-attention and updates to the feed-forward neural network weights, to make the [CLS] embedding a sentence vector. You can think of it (loosely) as a weighted combination of all the tokens’ word embeddings, with the weights derived from how similar the tokens are to one another.
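
As a sketch of how such an embedding can be pulled out of a pre-trained checkpoint with the Transformers library (not the exact POC code; albert-base-v2 is assumed):

```python
import torch
from transformers import AlbertModel, AlbertTokenizer

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertModel.from_pretrained("albert-base-v2")
model.eval()

question = "register p1 go where"
inputs = tokenizer(question, return_tensors="pt")  # adds [CLS] at the start and [SEP] at the end

with torch.no_grad():
    outputs = model(**inputs)

# The [CLS] token always sits at position 0; its hidden state is taken as the
# sentence embedding (most meaningful after fine-tuning, as explained above).
cls_embedding = outputs.last_hidden_state[:, 0, :]   # shape: (1, hidden_size)

# The pooling alternative: average all token embeddings instead of using [CLS].
mean_embedding = outputs.last_hidden_state.mean(dim=1)
```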

Training the Model

Before we could actually train a classifier using the embeddings, there was one more step which we had to do. Remember what I said about the BERT models being pre-trained on generic corpora like Wikipedia? This pre-trained ALBERT model understood generic English text but it did not necessarily understand our chat data very well. For example, it might not understand what “register p1 go where” means because such a sentence structure has never appeared in the corpora it was trained on.

This situation is analogous to making a kid read all the English books and articles in the world. After reading everything, he has an almost perfect understanding of the English language. With this understanding, he might be able to make sense of our chat data because it uses English words, and many of the questions are grammatically correct. However, for him to better understand those parts which are not as similar to the literature he has read (e.g. broken English, short forms, jargon), we need to let him read our data and try to understand it too.

Similarly, for the ALBERT model, we had to let it understand our data to get more accurate embeddings, because the current embeddings were produced in the context of generic English. This step of getting it to understand our data is known as fine-tuning, where we use the model to perform a task, and the model’s weights get updated by the data through the task.

In the transfer learning stage for the pre-trained (AL)BERT models, “fake tasks” like masked word prediction and next sentence prediction were given to the models to update the weights in an unsupervised manner. However, in our case, we had a real task on hand, which was to learn how to predict the agency given a question. Using Hugging Face Transformers’ AlbertForSequenceClassification model, we were able to do both simultaneously by attaching a sequence (sentence) classification head, which is a single-layer feed-forward neural network, to the pre-trained ALBERT model. During training, both the weights of the feed-forward neural network and the weights of the ALBERT model are updated, so that the model learns to understand and classify the data at the same time. Diagrammatically, the training looks something like this:

Training a classifier using ALBERT and a single layer feed-forward neural network

We then trained the classifier in a typical multi-class classification fashion in PyTorch, using the Adam optimiser to minimise the cross-entropy loss.
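
A simplified sketch of this training setup (the data loader yielding batches of questions and integer labels is an assumption for illustration):

```python
import torch
from transformers import AlbertForSequenceClassification, AlbertTokenizer

NUM_CLASSES = 42  # 41 agencies plus the Shared Intent class

tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForSequenceClassification.from_pretrained(
    "albert-base-v2", num_labels=NUM_CLASSES
)
optimizer = torch.optim.Adam(model.parameters(), lr=3.97e-5)


def train_one_epoch(model, loader, optimizer, device="cuda"):
    """loader: yields (list_of_questions, tensor_of_labels) batches -- an assumed pipeline."""
    model.to(device)
    model.train()
    for questions, labels in loader:
        encoded = tokenizer(
            list(questions), padding=True, truncation=True, return_tensors="pt"
        ).to(device)
        # When labels are supplied, the model computes the cross-entropy loss itself.
        outputs = model(**encoded, labels=labels.to(device))
        optimizer.zero_grad()
        outputs.loss.backward()
        optimizer.step()
```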

The dataset was split as follows, stratified by the class labels, for training, evaluation, and hyperparameter selection.

Splitting the dataset for training, evaluation, and hyperparameter tuning
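
A stratified split can be done with scikit-learn along these lines (the exact proportions used in the POC are not shown here, so the ones below are illustrative):

```python
from sklearn.model_selection import train_test_split

# questions, labels: the cleaned texts and their agency labels (assumed available)
train_texts, test_texts, train_labels, test_labels = train_test_split(
    questions, labels, test_size=0.2, stratify=labels, random_state=42
)
# Carve a validation set out of the training portion for hyperparameter tuning.
train_texts, val_texts, train_labels, val_labels = train_test_split(
    train_texts, train_labels, test_size=0.1, stratify=train_labels, random_state=42
)
```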

Twenty trials were run using Optuna, and the following parameters gave the best validation accuracy:

  1. Adam optimiser learning rate: 3.97e-5
  2. Number of training epochs: 10
  3. Batch size: 16

We used this set of parameters to train on the original training set, and evaluated the results on the test set.
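
The hyperparameter search itself might look roughly like this with Optuna (build_model, train, and evaluate_accuracy are hypothetical helpers standing in for the actual training code; the search ranges are illustrative):

```python
import optuna


def objective(trial):
    """Sample training hyperparameters and return the validation accuracy."""
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 1e-4, log=True)
    num_epochs = trial.suggest_int("num_epochs", 2, 10)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])

    model = build_model()                                   # hypothetical: fresh ALBERT classifier
    train(model, train_texts, train_labels,                 # hypothetical training loop
          learning_rate, num_epochs, batch_size)
    return evaluate_accuracy(model, val_texts, val_labels)  # hypothetical validation accuracy


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```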

Results

The model achieved an overall accuracy of 85% on the test set. This seems like a good result!

However, is there more to it than meets the eye? Let’s take a look at the individual agencies’ F1 scores (blue bar) in the chart below, where the red line shows the size of the data an agency has, as a proportion of the whole dataset (axis on the right).

F1 score and class proportion for each agency

On one hand, we have 17 agencies which have good F1 scores of more than 80%, one of which is the Shared Intent class which has a very good F1 score of 88%. On the other hand, we have 3 agencies whose F1 scores fall below 50%. We can see that in general, the results are better for agencies which have more data. The notable exceptions are A24, A28, and A29. One reason for this is that the questions posed to these agencies are very specific and constrained to a pretty narrow scope which does not overlap much with those asked to other agencies (e.g. water issues, court matters). Therefore, we do not need a lot of data to be able to tell which questions should be directed to them.

The overall accuracy is high because the model classifies the classes with larger samples better, and the test set also has more of such samples.
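
This is why per-class metrics tell a fuller story than overall accuracy. With scikit-learn, they are straightforward to compute (y_true and y_pred here are the test-set labels and the model’s predictions):

```python
from sklearn.metrics import classification_report, f1_score

# Per-class precision, recall, and F1, plus macro/weighted averages.
print(classification_report(y_true, y_pred, digits=3))

print("Macro F1:   ", f1_score(y_true, y_pred, average="macro"))     # treats every agency equally
print("Weighted F1:", f1_score(y_true, y_pred, average="weighted"))  # dominated by the large classes
```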

Classes with low accuracy are a problem because we want each agency to have a decent baseline accuracy; otherwise, questions asked to Super Jamie which are relevant to that agency will seldom be directed correctly. We tried balanced sampling, using different sampling weights for different classes based on their sizes, but did not get much success because the dataset is severely imbalanced, with the smallest class having only 0.08% of the samples of the largest class. This is a problem which we are currently dealing with, and possible solutions include waiting for more data from these agencies’ chatbots to come in, replicating the samples in them, or supplementing the data for these agencies with questions from other sources like feedback forms, emails, and transcripts of phone calls.
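
For reference, this kind of class-balanced sampling can be expressed with PyTorch’s WeightedRandomSampler; a sketch with inverse-frequency weights (not necessarily the exact weighting scheme we tried):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

# train_labels: integer class ids for the training set; train_dataset: the matching Dataset
class_counts = np.bincount(train_labels)
class_weights = 1.0 / class_counts               # rarer classes get larger weights
sample_weights = class_weights[train_labels]     # one weight per training example

sampler = WeightedRandomSampler(
    weights=torch.as_tensor(sample_weights, dtype=torch.double),
    num_samples=len(sample_weights),
    replacement=True,
)
train_loader = DataLoader(train_dataset, batch_size=16, sampler=sampler)
```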

Future Work

Better Identification of Shared Intents

We identified shared intents by finding questions which are exactly the same after preprocessing and are repeated across multiple agencies. However, some questions have the same meaning but do not have exactly the same syntactic form. For example, consider the following scenario:

Question 2 has the same meaning as Question 1 and should also be considered a Shared Intent question. However, because nobody else asked the question in the same way as the person who asked Question 2, it did not get picked up.

To identify such question pairs, we need to apply syntactic and semantic matching to the questions and highlight the closely matching ones for a human to validate whether they are indeed Shared Intent questions. This is a work in progress.
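
One possible sketch of this matching step uses sentence embeddings and cosine similarity to surface candidate pairs for human validation (the threshold and the embed_fn helper are assumptions; the real pipeline is still being worked out):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

SIMILARITY_THRESHOLD = 0.9  # assumed cut-off for "closely matching" questions


def candidate_shared_intents(questions_a, questions_b, embed_fn):
    """Return cross-agency question pairs that look semantically similar.

    embed_fn: any sentence-embedding function, e.g. the [CLS] / mean-pooling sketch earlier.
    """
    emb_a = np.vstack([embed_fn(q) for q in questions_a])
    emb_b = np.vstack([embed_fn(q) for q in questions_b])
    similarities = cosine_similarity(emb_a, emb_b)

    pairs = []
    for i, j in zip(*np.where(similarities >= SIMILARITY_THRESHOLD)):
        pairs.append((questions_a[i], questions_b[j], float(similarities[i, j])))
    return sorted(pairs, key=lambda pair: -pair[2])  # most similar pairs first, for review
```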

Fine-tuning ALBERT Using Pre-Training

While the fine-tuning which we had done modified the model’s weights to make it understand our chat data better, pre-training can be done on our data to adapt it even further. BERT models have a fixed vocabulary, and they handle Out-Of-Vocabulary words by breaking them up into sub-words. For example, the word iras is not in ALBERT’s vocabulary, so ALBERT represents it using two subwords, ira and s. Neither subword is exclusive to iras, and their weights can be changed through self-attention with other words during both the transfer learning phase and our fine-tuning phase. What this means is that iras’s representation is also affected by the representations of ira and s across the different corpora the model is trained on. By doing pre-training and adding new words from our data to the vocabulary, we can represent them better, as they will no longer depend on other words’ sub-word representations. Hopefully, this will lead to better classification results.
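
A rough sketch of what this further pre-training could look like with the Transformers library (the added tokens, the dataset, and the training arguments are illustrative assumptions):

```python
from transformers import (
    AlbertForMaskedLM,
    AlbertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AlbertTokenizerFast.from_pretrained("albert-base-v2")
model = AlbertForMaskedLM.from_pretrained("albert-base-v2")

# Add domain-specific terms so they are no longer split into sub-words.
new_tokens = ["iras", "gebiz"]            # illustrative examples only
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# Masked-language-model pre-training on our own chat questions.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="albert-chat-mlm", num_train_epochs=1),
    data_collator=collator,
    train_dataset=chat_question_dataset,  # hypothetical tokenised dataset of chat questions
)
trainer.train()
```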

Conclusion

With an overall accuracy of 85%, I would say that our POC to build Super Jamie, a Whole-Of-Government chatbot, is pretty successful. MOL’s Virtual Assistant team is currently making plans to roll it out (check out the last section of this interview with Bertrand, the team’s Smart Nation Fellow, here). Besides the software engineering work required to build the bot, I have listed some issues which we have to resolve before we can put it in production. If you have any suggestions on how to improve the work further, please reach out to us! For those of you who are interested in NLP, do reach out to us too! At GovTech’s Data Science and Artificial Intelligence Division, we have many interesting NLP projects to tackle use cases from many different agencies!


Watson Chua
AI Practice and Data Engineering Practice, GovTech

I'm a Lead Data Scientist at GovTech, specialising in NLP and Generative AI