Training the OneService Chatbot to Analyse Feedback on Municipal Issues Using Natural Language Processing and Deep Learning

Watson Chua
DSAID GovTech
Apr 5, 2021

In order to provide a quality living environment for residents, the Municipal Services Office (MSO) needs to tackle municipal issues efficiently and effectively. However, this is often hampered by incomplete information from the residents reporting the issues, and the laborious effort required to manually analyse hundreds of cases each day to decide which agencies should handle them. In this article, we share how GovTech is working with MSO to alleviate these problems using Artificial Intelligence: we built a conversational chatbot to which residents can report their issues, and we use Natural Language Processing and Deep Learning to automatically analyse the feedback. This enables the chatbot to request from the user any missing information required to resolve the issue, and then route the case to the agency our Deep Learning model deems best placed to handle it.

Introduction

Artificial Intelligence on The National Agenda

In November 2019, DPM Heng Swee Keat announced a “National AI Strategy” which mapped out how Singapore would develop and use Artificial Intelligence (AI) to transform the economy and improve people’s lives. The strategy would start with five projects, one of which is an AI-powered chatbot for residents to report municipal issues to. Fast forward 16 months to March 2021: Senior Minister of State for National Development Sim Ann announced in Parliament that the OneService Chatbot is currently on trial and will be rolled out in the latter half of the year!

Why Chatbot?

In Singapore, if you have feedback (or a complaint) about a municipal issue (e.g., dirty lift lobbies, high-rise littering, illegal parking), you would want to bring it to the attention of the Municipal Services Office (MSO). An officer will log your case, get the necessary information pertaining to it from you, and then route it to one of the government agencies or Town Councils (which are, strictly speaking, not government agencies, but which we treat as “agencies” conceptually to simplify things) to handle your case.

To let residents submit feedback more quickly and conveniently (without having to wait on the line to speak to an officer), MSO launched the OneService App in 2015 to allow residents to do the submission digitally. However, this non-conversational way of submitting feedback has one drawback: the app cannot ask clarifying questions when essential information required to handle the case is missing. Also, the majority of residents submit feedback only once in a while and are not frequent complainers. Wouldn’t it be nice if they could give their feedback using their favourite messaging app instead of having to download another app to do so?

Thus, to make it even easier and more convenient for residents to submit feedback, MSO decided to work with GovTech to build the OneService chatbot for WhatsApp and Telegram!

The chatbot will take in the feedback and do three things:

  1. Predict the case type from the feedback, and select a template for the information required
  2. Extract required information from the feedback, use it to fill the template, and prompt the user to fill in the missing information
  3. Predict the agency which should handle the case, using the feedback and attachments (geolocation, images) submitted by the user

Different templates are used because different case types require different information. As shown below, a Potential Killer Litter case requires the Address of Incident with specific details like level and unit information, and the Type of Litter Thrown. An Illegal Parking case requires the Address of Incident information but does not need the level and unit information. Instead, it requires a Landmark (an object located nearby, used to identify the incident location, e.g., lift lobby, staircase landing, overhead bridge) and the Vehicle Number. By predicting the case type, selecting the corresponding template, and auto-filling it, the chatbot knows what missing information to prompt for.

Templates for different case types
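To make this concrete, the templates can be thought of as lists of required fields keyed by case type. The sketch below is illustrative only (the field names are taken from the examples above; the actual templates are defined by MSO):

```python
# Illustrative sketch: templates as required-field lists keyed by case type.
CASE_TEMPLATES = {
    "Potential Killer Litter": [
        "Address of Incident (including level and unit)",
        "Type of Litter Thrown",
    ],
    "Illegal Parking": [
        "Address of Incident",
        "Landmark",
        "Vehicle Number",
    ],
}

def missing_fields(case_type, extracted_fields):
    """Fields the chatbot still needs to prompt the user for."""
    return [f for f in CASE_TEMPLATES[case_type] if f not in extracted_fields]
```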

The resulting conversation will look something like this:

Actual conversation with chatbot in WhatsApp (Image provided by MSO)

What’s interesting about this chatbot is that Machine Learning is used to predict the case types and handling agencies, and to extract relevant information from the feedback. This project is a collaboration between GovTech’s Virtual Intelligent Chat Assistant (VICA) Team and the Data Science and Artificial Intelligence Division’s GovText team. The VICA team, being the chatbot experts, handles the conversations with the users, while we (the GovText team, being the domain experts in Natural Language Processing) perform the analytics and make predictions. In this article, I will explain how we built the analytics engine.

Interaction between feedback provider, chatbot, analytics engine, and case router

Building the Analytics Engine

Case Type Classifier

The first thing we had to build in the analytics engine was the Case Type classifier, which takes the user’s feedback as input and predicts the case type. With the OneService App in operation since 2015, MSO has collected a substantial amount of feedback from the public. Every time a case is looked into, an officer tags it with its corresponding case type, and this information is stored in the database. Since the feedback submitted through the chatbot will be similar to that submitted through the app, we can use the app’s data to train models for the chatbot.

In an earlier post, I wrote about how I used ALBERT to train a text classifier for Ask Jamie chatbots. This time round, we also used ALBERT, but did not just stop at that. While the Ask Jamie work was a Proof-of-Concept, this MSO analytics engine is going live into production for public usage. Other than accuracy, we also need to consider the models’ inference (prediction) time. This is especially so when we have a user waiting to validate the results. We can’t just keep him or her waiting or else the complaint will no longer be just about municipal issues. 😫

Thus, we trained classifiers using the following model architectures (consisting of a text transformation and a neural network) and compared their F1 scores, inference time, and training time:

  1. Term Frequency-Inverse Document Frequency (TF-IDF) Transformation + Linear Layer
  2. TF-IDF Transformation + 1 Hidden-Layer + Linear Layer
  3. GloVe Embeddings + Bidirectional Long Short-Term Memory (BiLSTM) + Convolutional Neural Network (CNN) + Linear Layer
  4. ALBERT + Linear Layer

Training Data

We used a selected subset of 166,575 cases from two years’ worth of data in the OneService App. After doing an 80–20 train-test split for training and evaluation, we trained 62-class classifiers with each of the above-mentioned architectures on the training data.

Model Building

The details of the different model architectures, and how we built them for the experiment, are described in a separate blog post here, for advanced readers. Do check it out if you are interested in NLP and deep learning!

Results

The overall accuracies of the different models are shown below:

Accuracies of different text classification models

Accuracy-wise, we can see that the baseline TF-IDF model without any hidden layer could already achieve an accuracy of 71.9%. Adding a hidden layer boosted the accuracy to 76.7%, showing that abstraction of word features works. Both ALBERT and the BiLSTM-CNN model improved the accuracy further but what is interesting here is that the BiLSTM-CNN had a slightly better accuracy than the ALBERT model, despite the latter’s sophistication. One possible explanation is that because the characteristics of the feedback sentences (mainly broken English) were different from those in the external corpora used in transfer learning (grammatically correct sentences from articles), we were not able to take full advantage of ALBERT’s sequence level transfer learning capability. Instead, using word embeddings from a larger vocabulary and augmenting them with local context worked better.

F1 scores for the individual case types using the best model, the BiLSTM-CNN, are shown below, where the blue bars show the F1 scores for each case type (scale on the left axis) and the red dots show the number of samples for that case type in the test set (scale on the right axis):

F1 scores of individual case types (BiLSTM-CNN Model)

By further discarding 10% of the predictions with the lowest confidence scores (which will be routed to human agents for verification), we were able to get an overall accuracy of 83.1%!
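As an illustration of this thresholding step, here is a minimal sketch (my own simplification, assuming the classifier outputs a probability per class) of how the lowest-confidence 10% of predictions can be set aside for human verification:

```python
import numpy as np

def accuracy_after_discarding(probs, y_true, discard_frac=0.10):
    """Drop the `discard_frac` least-confident predictions (to be routed to
    human agents) and report accuracy on the remaining cases."""
    confidences = probs.max(axis=1)            # top-class probability per case
    preds = probs.argmax(axis=1)
    cutoff = np.quantile(confidences, discard_frac)
    keep = confidences >= cutoff               # the ~90% most confident cases
    return (preds[keep] == y_true[keep]).mean()
```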

The time taken to train the models and do inference on the test set is shown below:

Total time taken (in seconds) to train a model
Total time taken (in seconds) to do inference on the test set

The models which used TF-IDF trained and inferred very fast. ALBERT took the longest for both, due to the complexity of the transformer models. Though the BiLSTM-CNN model was not the fastest, the inference time of 0.006 seconds per prediction on average was fast enough for us to make a prediction and get back to the user quickly. In this case, the 0.03 seconds per prediction taken by ALBERT was also fast enough, but we decided to go with the BiLSTM-CNN model because it had the highest accuracy.
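For readers who want a concrete picture of the model we chose, below is a minimal PyTorch sketch of a BiLSTM-CNN text classifier along the lines of architecture 3. The layer sizes, pooling choice, and other hyperparameters are illustrative assumptions, not our production configuration (those details are in the companion blog post):

```python
import torch
import torch.nn as nn

class BiLSTMCNNClassifier(nn.Module):
    """Sketch: GloVe embeddings -> BiLSTM -> 1-D CNN -> linear layer."""

    def __init__(self, glove_weights, num_classes=62, lstm_hidden=128,
                 conv_channels=100, kernel_size=3):
        super().__init__()
        # Pre-trained GloVe word embeddings (vocab_size x embed_dim), kept frozen here
        self.embedding = nn.Embedding.from_pretrained(glove_weights, freeze=True)
        embed_dim = glove_weights.shape[1]
        # Bidirectional LSTM over the word embeddings
        self.bilstm = nn.LSTM(embed_dim, lstm_hidden, batch_first=True,
                              bidirectional=True)
        # 1-D convolution over the BiLSTM outputs to capture local context
        self.conv = nn.Conv1d(2 * lstm_hidden, conv_channels, kernel_size)
        # Final linear layer: one logit per case type
        self.fc = nn.Linear(conv_channels, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        x, _ = self.bilstm(x)                  # (batch, seq_len, 2 * lstm_hidden)
        x = self.conv(x.transpose(1, 2))       # (batch, conv_channels, seq_len')
        x = torch.relu(x).max(dim=2).values    # global max pooling over time
        return self.fc(x)                      # (batch, num_classes)
```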

Case Details Extractor

All is not lost for BERT though, as we used it to build our Case Details Extractor, which extracts relevant information from the feedback to fill predefined templates. The identification of relevant case details is simply a Named Entity Recognition (NER) problem, where we split the feedback text into tokens (words) and try to learn which tokens are relevant information to extract.

Training Data

Getting training data for the NER model was not as straightforward in this case because unlike the case type, the relevant words within a feedback text were not annotated in the OneService App workflow. We had to set up an annotation framework and get our MSO colleagues to help us annotate the different entity types which we would recognise and extract from the feedback text.

We set up a platform using Doccano for them to do the annotation. An example of an annotation on Doccano looks like this:

Sample NER annotation on Doccano

Altogether, our hardworking MSO colleagues, Christopher and Hong Wei, annotated 5,600 feedback texts, which we split into a training set (4,480 texts) and a test set (1,120 texts) for evaluation. The number of annotations for each entity type is as follows:

Number of annotations for each entity type

In addition to the annotated entity types, we also used regular expressions to extract other entity types with fixed patterns (see the sketch after this list), such as:

  1. Vehicle plate numbers
  2. Postal codes (6 digits)
  3. Bus-stop numbers (5 digits)
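Here is the sketch mentioned above: illustrative regular expressions for these fixed-pattern entity types (the exact patterns used in production may differ, especially for vehicle plates):

```python
import re

VEHICLE_PLATE = re.compile(r"\b[A-Z]{2,3}\s?\d{1,4}\s?[A-Z]\b")  # e.g. SKV1234A
POSTAL_CODE = re.compile(r"\b\d{6}\b")                           # 6-digit postal code
BUS_STOP_NO = re.compile(r"\b\d{5}\b")                           # 5-digit bus-stop number

text = "Car SKV1234A is always parked illegally near bus stop 14141, Singapore 123456."
print(VEHICLE_PLATE.findall(text))  # ['SKV1234A']
print(POSTAL_CODE.findall(text))    # ['123456']
print(BUS_STOP_NO.findall(text))    # ['14141']
```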

BERT For Token Classification

Using BERT for different downstream tasks

As explained in my Ask Jamie blog post, BERT is versatile because it does transfer learning to understand the language, and this (language) model can be adapted to perform many different tasks in the fine-tuning stage. The diagram above from the original BERT paper shows how BERT can be used to do four different tasks. We had previously used b) to do sentence (sequence) classification for the Ask Jamie data. Now, we will use d) to do token classification.

Implementing the Token Classifier for NER is quite straightforward, using the BertForTokenClassification class from the Hugging Face Transformers library. A sample script to do this can be found here. Instead of the albert-base-v2 pre-trained model which we used for sequence classification, we used the bert-base-cased model this time round. This is because the former was pre-trained on lower-cased text, while the latter retains capitalisation information, which usually improves NER accuracy (e.g., street names in addresses are usually capitalised).
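The snippet below is a minimal sketch of loading such a token classifier with the Transformers library (the label set is illustrative and the fine-tuning loop is omitted; see the sample script linked above for the full training code):

```python
from transformers import BertForTokenClassification, BertTokenizerFast

# Illustrative label set; the actual engine uses seven entity types.
labels = ["O", "B-ADDRESS", "I-ADDRESS", "B-LITTER", "I-LITTER"]

tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
model = BertForTokenClassification.from_pretrained("bert-base-cased",
                                                   num_labels=len(labels))

text = "Someone keeps throwing bottles out of the window at Blk 123 Example Ave 1"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)                    # logits: (1, num_tokens, num_labels)
predictions = outputs.logits.argmax(dim=-1)  # predicted label id per token
```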

Results

Evaluating at the token level, we got a very good overall accuracy of 86.1% across all seven entity types, with the F1 scores of the individual entity types shown below using the blue bars. The red dots show the number of tokens of each entity type in the test set (scale on the right axis).

NER results

Post Processing

Even though we had recognised the different entities in the feedback text, our job was not done. Remember our purpose of doing NER? It was to automatically fill up a template using information extracted from the feedback text. Some post-processing was required to clean up the extracted entities before we could present them to the user for verification. For example, consider the following feedback and its extracted entities:

Example of entities extracted from a feedback text (address has been replaced with a fictitious one)

Filling up the template accurately and precisely for this feedback’s case type, Potential Killer Litter, would make it look like this:

Filling up a template using the extracted entities

To do so, our intern, Cindy Wang, created a post-processing module which uses regular expressions and simple string-level text analytics to do the following (a simplified sketch is shown after the list):

  1. Extract sub-components of addresses (e.g., level, unit number) from text tagged as Address of Incident
  2. De-duplicate multiple mentions of the same object (e.g., TV console, console), including those with singular and plural forms (e.g., window, windows)
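Here is the simplified sketch mentioned above; the patterns and heuristics are my own illustrative simplifications of the actual post-processing module:

```python
import re

def extract_address_parts(address):
    """Pull level and unit out of text tagged as Address of Incident,
    assuming a '#level-unit' pattern (e.g. '#12-345')."""
    match = re.search(r"#(\d{1,3})-(\d{1,5})", address)
    if match:
        return {"level": match.group(1), "unit": match.group(2)}
    return {}

def deduplicate_mentions(mentions):
    """Collapse duplicate mentions of the same object, treating simple plurals
    ('window'/'windows') and substring mentions ('TV console'/'console')
    as duplicates, keeping the longest form."""
    unique = []
    for mention in sorted(mentions, key=len, reverse=True):
        key = mention.lower().rstrip("s")
        if not any(key in kept.lower() for kept in unique):
            unique.append(mention)
    return unique

print(extract_address_parts("Blk 123 Example Ave 1 #12-345"))
# {'level': '12', 'unit': '345'}
print(deduplicate_mentions(["TV console", "console", "windows", "window"]))
# ['TV console', 'windows']
```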

After filling up the template with the initial feedback text, we can then prompt the user for the missing information, fill them in, and present the final filled template to him/her for verification!

Agency Classifier

After identifying the case type of a feedback, automatically extracting the details to fill the corresponding template, and getting the user to verify that the information is correct, the last thing which needs to be done is to route the case to the correct handling agency. We use the following inputs from the user to decide which agency the case should be routed to:

  1. Feedback text
  2. Geolocation of incident location (x,y co-ordinates)
  3. Images of incident (optional)
  4. Case Type (verified by the user)

In terms of model architecture, the agency classifier was an extension of the case type classifier: we used the BiLSTM-CNN to represent the feedback text, and added the additional input features (i.e., the geolocation, images, and case type) just before the final linear layer of the neural network to learn which agency should handle a case, using training data from the OneService App. The model architecture is shown below, where the numbers in the feature boxes represent the number of dimensions of each input vector:

Combining additional features with text features for agency classification
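As a rough illustration of this late-fusion design, here is a minimal PyTorch sketch in which the extra feature vectors are concatenated with the text representation just before the final linear layer. The text encoder is assumed to be the BiLSTM-CNN from earlier without its last layer, and the feature dimensions and number of agencies are placeholders:

```python
import torch
import torch.nn as nn

class AgencyClassifier(nn.Module):
    """Sketch: concatenate text, geolocation, image, and case type features
    before a single linear layer that predicts the handling agency."""

    def __init__(self, text_encoder, text_dim, geo_dim=14, img_dim=116,
                 case_type_dim=62, num_agencies=20):
        super().__init__()
        self.text_encoder = text_encoder  # e.g. BiLSTM-CNN without its final layer
        self.fc = nn.Linear(text_dim + geo_dim + img_dim + case_type_dim,
                            num_agencies)

    def forward(self, token_ids, geo_feats, img_feats, case_type_onehot):
        text_feats = self.text_encoder(token_ids)     # (batch, text_dim)
        combined = torch.cat(
            [text_feats, geo_feats, img_feats, case_type_onehot], dim=1)
        return self.fc(combined)                      # (batch, num_agencies)
```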

One main difference between the agency classifier and the case type classifier is when the respective predictions are made.

For the agency classifier, the prediction is made only after the entire reporting process is complete. For the case type classifier, which uses the feedback to suggest relevant case types for the user to choose from (as opposed to selecting from a list of all possible case types, as in the OneService App), the prediction is made while the user waits for a response from the chatbot, so that he/she can verify the suggestion and answer the corresponding follow-up questions.

Therefore, while including geolocation and images could have improved the accuracy of the case type prediction (by 2–3%), we deliberately chose not to include them, as processing the additional inputs would increase the wait-time by up to a minute, adversely affecting the user experience.

Nonetheless, users can be assured that their cases will still be routed to the relevant agency fairly accurately, even with this marginal loss in case type prediction accuracy.

Transforming Geolocations to Features

Geolocations play an important role in agency classification because some types of cases can be handled by more than one agency based on the case description alone. Depending on which agency’s land an incident took place on, or is nearest to, the agency assigned to handle the case might differ. For example, if a Tree Pruning case is reported within a housing estate, the nearest Town Council will be assigned to handle the case. However, if a similar case happens in a park (e.g., West Coast Park), NParks will be assigned to handle the case instead. OneMap’s land query API was used to get information on the agencies in charge of different land areas.

Getting the agency in charge of a land area using the OneMap API

Other maps were used to further differentiate the type of land area owned by the agency (e.g., drain, car park, or nature reserve). With some additional processing to calculate the distances from points to land boundaries, my team-mate Amelia generated a 14-dimensional array for each geolocation entry, with each entry representing the distance of the point to the nearest boundary of the land owned by an agency (normalised to a value between 0 and 1), as follows:

Converting geolocation co-ordinates to feature vectors
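The sketch below illustrates the idea with a handful of made-up land polygons (in the real engine, the polygons come from OneMap land queries and other map layers, the vector has 14 entries, and the exact normalisation may differ):

```python
import numpy as np
from shapely.geometry import Point, Polygon

# Made-up land polygons for three agencies, in arbitrary planar co-ordinates.
AGENCY_LAND = {
    "NParks": [Polygon([(0, 0), (0, 100), (100, 100), (100, 0)])],
    "Town Council": [Polygon([(200, 0), (200, 50), (260, 50), (260, 0)])],
    "LTA": [Polygon([(0, 200), (0, 260), (300, 260), (300, 200)])],
}

def geolocation_features(x, y, max_distance=1000.0):
    """One entry per agency: distance from the reported point to the nearest
    boundary of that agency's land, capped and scaled to [0, 1]."""
    point = Point(x, y)
    features = []
    for polygons in AGENCY_LAND.values():
        d = min(point.distance(poly.exterior) for poly in polygons)
        features.append(min(d, max_distance) / max_distance)
    return np.array(features)

print(geolocation_features(50, 50))  # point inside NParks land -> [0.05, 0.15, 0.15]
```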

Transforming Images to Features

User-submitted images add additional information to the feedback text to help determine which agency should handle a case. For example, if there is not enough information in the feedback text and the images submitted captured cigarette butts, it is more likely that NEA should handle the case than LTA. The opposite is true if the images captured cars or traffic lights.

Our colleagues from the Video Analytics team used the You Only Look Once (YOLO) real-time object detection system to detect 58 types of objects in the images. The detected objects’ distances from the center of the images were also computed, as objects which were in focus (captured near the center of the image) were considered more important. Two arrays of 58 dimensions each were thus generated for each image, and they were then consolidated across all the images using max and min pooling for the objects’ existence and their distances from the center, respectively. This created the following effect during aggregation:

  1. An object which had appeared in any of the images was considered to have appeared in the consolidated image array
  2. A detected object’s importance to the consolidated image array was inversely proportional to its minimum distance to the center across all the images

The figure below shows how the consolidation was done to get the final two image arrays per case.

Converting images to feature vectors
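A minimal numpy sketch of this aggregation (assuming the detector output has already been converted into per-image existence and center-distance arrays; concatenating the two consolidated arrays into one vector is my own convenience, not necessarily how the engine stores them):

```python
import numpy as np

def consolidate_image_features(existence_per_image, distance_per_image):
    """Both inputs have shape (num_images, 58). Existence is max-pooled
    (an object seen in any image counts) and distance to the image center is
    min-pooled (the closest appearance is treated as the most important)."""
    existence = existence_per_image.max(axis=0)   # (58,)
    distance = distance_per_image.min(axis=0)     # (58,)
    return np.concatenate([existence, distance])  # 116-dim image feature vector

# e.g. three images submitted for one case
existence = np.random.randint(0, 2, size=(3, 58)).astype(float)
distances = np.random.rand(3, 58)
features = consolidate_image_features(existence, distances)  # shape: (116,)
```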

Results

The results of the agency classification are shown below. We achieved an overall accuracy of 87%, because the top six performing classes all had F1 scores above 85% and collectively contained 96% of the data. For the rest of the classes, the F1 scores were not as good, as the dataset was severely imbalanced and these classes lacked samples. This was the same problem we had with the Ask Jamie dataset and the case type classification. To improve these classes’ F1 scores, we have to collect more data, but this is not easy because people seldom report municipal issues which should be directed to these agencies. The good news, however, is that they make up only 4% of the cases, so their poorer performance does not affect the majority of residents.

Agency classification F1 scores (only the names of the top-6 agencies are shown)

The effect of adding each of the new features cumulatively is shown below. Adding geolocation and image features resulted in marginal increases in the overall accuracy. However, adding the validated case type improved the accuracy significantly. This is expected because the case type has a strong influence on which agency will eventually deal with the case (e.g., Dog Nuisance > NParks, Mosquitoes Breeding > NEA), while geolocations helped to distinguish between cases that were ambiguous from the text alone, based on the incident location. Images helped by adding distinguishing information which the text and associated case type could not capture. In our case, the improvement was marginal because the text captured enough information to decide which agency to route the case to for most cases, and images helped only for cases with very brief descriptions (e.g., “Please fix this”, “Spoilt again”).

Overall accuracies for different feature sets

Challenges

Other than some of the classes not having enough data to achieve good accuracies, another challenge we face is that the models cannot classify feedback text describing multiple issues very well. Ideally, this would be framed as a multi-label classification problem, where each case has one or more labels depending on whether it is about a single issue or multiple issues. However, because of the limitations of the workflow, each case is labelled only with the case type/agency of the most dominant issue, since most cases are single-issue cases. As a result, the engine will only predict the case type/agency of the most dominant issue for multiple-issue cases.

Many Hands Make Light Work

This AI analytics engine is the result of the work of many different parties. Within DSAID, the GovText and Video Analytics teams worked closely together to build the engine:

  1. Han Jing — Product Management
  2. Charlton Lim — Software Engineering
  3. Watson Chua — Text Analytics
  4. Cindy Wang — NER Post Processing
  5. Amelia Lee — Geolocation Feature Engineering
  6. Chen Xukun — Image Processing
  7. Chua Teck Wee — Image Processing

Building the engine involves training models, which in turn requires annotated data. It would not have been possible for us to build good models if not for the help of our MSO colleagues:

  1. Christopher Lee
  2. Tammy Tan
  3. Low Hong Wei (currently with MND’s Housing (Social Support))
  4. Chen Weijun (currently with SNDGO)

who helped us to painstakingly validate and annotate the data. Apart from that, they also patiently explained the use case and requirements to us, to help us better understand how we could use the data to solve their problem. Weijun was also the one who shared the BiLSTM-CNN and geolocation feature engineering techniques with us, based on the success of his own initial experiments!

Lastly, we also have to work together with the VICA team from GovTech’s MOL division to make sure that the chatbot and the analytics engine work seamlessly!

What to Expect

At the time of writing, the chatbot is in the closed public trial stage and is scheduled for full public roll-out by the end of the year. Since it is still in development, there might be some minor differences between what is in the final system and what I have described here (e.g., UI, conversation flow). However, the technical approach, which is what I intend to show with this post, is unlikely to change much.

Stay tuned for the full public roll-out, and feel free to suggest ways of improving the analytics engine to us!

Note: The official announcement on the roll-out of the chatbot stated that it can currently predict the case type and agency-in-charge with 80% accuracy, which is slightly different from the values reported in this article which are 78.7% and 87%, respectively. This is because the values reported here are based on earlier experiments with two years’ data from the MSO OneService App while the values from the announcement are from a closed trial. The inputs to the analytics engine for the closed trial were more similar to how the chatbot will eventually be used but the dataset is much smaller. The accuracies will continue to change as we conduct more trials, but we expect them to be within a 5% range from 80%.

Watson Chua
DSAID GovTech

I'm a Lead Data Scientist at GovTech, specialising in NLP and Generative AI