Building Text Classifiers to Handle Municipal Issues — Experiments with TF-IDF, GloVe, BiLSTM-CNN and BERT

Watson Chua
DSAID GovTech
Apr 5, 2021

This post is an extension of the main post on building an analytics engine for the MSO chatbot. The intended audience for this post is advanced readers who are interested in the Natural Language Processing and Deep Learning specifics.

In this post, we explain the model architectures of the different text classifiers we experimented with, which are:

  1. Term Frequency-Inverse Document Frequency (TF-IDF) Transformation + Linear Layer
  2. TF-IDF Transformation + 1 Hidden-Layer + Linear Layer
  3. GloVe Embeddings + Bidirectional Long Short Term Memory Model (BiLSTM) + Convolutional Neural Network (CNN) + Linear Layer
  4. ALBERT + Linear Layer

The models were trained in PyTorch, using the Adam optimiser to minimise the cross-entropy loss, on an AWS P2 instance with one Nvidia Tesla K80 GPU.
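All four architectures share the same training setup; a minimal sketch of it is shown below (the learning rate, number of epochs, and the `train_loader` variable are illustrative placeholders, not the exact settings we used):

```python
import torch
import torch.nn as nn

def train(model, train_loader, num_epochs=10, lr=1e-3, device="cuda"):
    """Generic training loop: Adam optimiser minimising the cross-entropy loss."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        model.train()
        for features, labels in train_loader:
            features, labels = features.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(features), labels)  # logits vs. gold case types
            loss.backward()
            optimizer.step()
    return model
```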

The objective of the experiment was to compare the model architectures in terms of their accuracies, inference time, and training time on our dataset, so that we could select the best one to use in our analytics engine.

TF-IDF Transformation + Linear Layer

TF-IDF is one of the basic text transformation methods in text analytics. It is simple and works well for many applications, in particular keyword-based information retrieval. It represents each document (feedback text) using all the distinct words in the corpus as features. If a document contains many occurrences of a certain feature (word), its term frequency score for that feature will be high. This score is then multiplied by the feature's IDF score: words which appear in fewer documents (rare words) get higher IDF scores, because rare words distinguish between documents better than common words.

As a keyword-based transformation method, TF-IDF has its limitations. For example, a document which mentions "vehicle" many times and another which mentions "car" often are not considered similar. As a result, many features (typically tens of thousands) are needed to train a good classifier. Here, we use TF-IDF + linear classifier as the baseline against which the other model architectures are compared.

TF-IDF + linear layer
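A minimal sketch of this baseline, assuming the TF-IDF features come from scikit-learn's TfidfVectorizer and the classifier is a single PyTorch linear layer (the toy texts and the two-class setup are placeholders for illustration):

```python
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the real feedback texts and case-type labels.
train_texts = ["python eating a cat at the park",
               "overflowing rubbish bin near the void deck"]
num_classes = 2

# Fit TF-IDF on the training texts; every distinct word becomes a feature.
vectorizer = TfidfVectorizer()
X_train = torch.tensor(vectorizer.fit_transform(train_texts).toarray(), dtype=torch.float32)

# The baseline classifier: one linear layer mapping TF-IDF vectors to class logits.
classifier = nn.Linear(X_train.shape[1], num_classes)
logits = classifier(X_train)   # shape: (2, num_classes)
```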

TF-IDF Transformation + 1 Hidden-Layer + Linear Layer

Using only a linear classifier on top of the TF-IDF transformation lets the model learn what a class's samples look like through linear combinations of the individual words. By adding a hidden layer, we let the classifier form some abstraction over the individual word features (e.g., different combinations of words which frequently appear in certain classes) to see whether the accuracy improves. We used a 100-node hidden layer.

TF-IDF + 1 hidden layer + linear layer
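A minimal sketch of this variant with the 100-node hidden layer (the choice of ReLU as the activation function here is an assumption for illustration):

```python
import torch.nn as nn

class TfidfMlpClassifier(nn.Module):
    """TF-IDF features -> 100-node hidden layer -> linear output layer."""

    def __init__(self, num_features, num_classes, hidden_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, hidden_dim),
            nn.ReLU(),                            # activation choice is illustrative
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)                        # class logits
```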

GloVe Embeddings + BiLSTM-CNN + Linear Layer

BiLSTM-CNN models are predecessors of BERT and have shown very good accuracy for text classification.

Bidirectional LSTM model

LSTM models are recurrent neural network models which model sequential information (which text is) very well. We used pre-trained word embeddings (specifically, 300-dimensional GloVe embeddings trained on a corpus of 6 billion tokens) as input to the LSTM models. By applying machine learning on large external corpora (e.g., Wikipedia, news articles), the semantic information of words is captured in their embeddings using their context (the words surrounding them). For example, “dog” and “puppy” have similar embeddings because they are both used in sentences like “I took my ____ to the vet because it’s not eating.” and “My __ chewed on my slippers.” Thus, we avoid the problem which we had with TF-IDF, where related words do not have similar representations.

The pre-trained GloVe vectors embed word contexts in external corpora. To also embed the context within our dataset, we enhanced the inputs by training 50-dimensional GloVe embeddings on our training data for 10,000 iterations and concatenating them with the original pre-trained GloVe embeddings. By doing so, words which were not found in the vocabulary of the pre-trained GloVe vectors could still be represented, and words which were already in the vocabulary had additional dimensions to model their contexts within all our feedback data.

GloVe embeddings used as input
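Conceptually, the concatenation looks like the sketch below (the toy lookup tables and the zero-vector fallback for missing words are illustrative assumptions):

```python
import numpy as np

# Toy embedding tables: word -> vector (300-d pre-trained GloVe, 50-d corpus-trained GloVe).
pretrained = {"bread": np.random.rand(300)}
corpus_specific = {"bread": np.random.rand(50), "hdb": np.random.rand(50)}

def combined_embedding(word):
    # Fall back to a zero vector when a word is missing from either table (an assumption).
    pre = pretrained.get(word, np.zeros(300))
    own = corpus_specific.get(word, np.zeros(50))
    return np.concatenate([pre, own])        # 350-d input vector for the BiLSTM

print(combined_embedding("hdb").shape)       # (350,), even for words unseen by pre-trained GloVe
```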

To generate the output/hidden state of the current word at each step, the LSTM takes as input the embedding of the current word and the hidden state of the previous word. The hidden state of the previous word is in turn generated from its own embedding and the hidden state of the word before it. As such, information is passed on from the first word to the last via the hidden states, and the hidden state of each word contains the contextual information of the sentence up to that word. To get the contextual information of a word based on both the words before and after it, we apply the LSTM from the front to the back of the sentence and vice versa, and concatenate the two hidden states of the word (which is why we call it a bidirectional LSTM). The figure below shows how this is done:

Adding contextual information to embeddings using BiLSTM

The hidden state for “upstairs” in the forward pass contains the propagated contextual information from the preceding words in the sentence, “someone” and “from”, while the corresponding hidden state in the backward pass contains the propagated contextual information from the words “keeps”, “throwing”, and “bread”. By using the concatenation of the two LSTM hidden state vectors to represent a word, we can capture the contextual information from the words preceding and following it. This is especially useful for words which can have different meanings depending on the context (e.g., Apple the company and apple the fruit). While GloVe represents a word with a single embedding regardless of how many senses it has, the BiLSTM can “tune” it to the correct meaning with respect to the sentence.
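In PyTorch, the bidirectional pass can be sketched as follows (the hidden size and the toy six-word sentence are illustrative):

```python
import torch
import torch.nn as nn

# Bidirectional LSTM over the 350-d concatenated GloVe inputs (hidden size is illustrative).
bilstm = nn.LSTM(input_size=350, hidden_size=128, batch_first=True, bidirectional=True)

# One toy sentence of six words, e.g. "someone from upstairs keeps throwing bread".
embeddings = torch.randn(1, 6, 350)
hidden_states, _ = bilstm(embeddings)
print(hidden_states.shape)   # (1, 6, 256): forward and backward hidden states concatenated per word
```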

Convolutional Neural Network

We next used the BiLSTM hidden states as input to a 1D-Convolutional Neural Network. This technique creates features representing different combinations of (similar) words within a window size, regardless of their location within the sequence. This is useful to us because in many cases, there are phrases within feedback texts which can distinguish the case types, regardless of which part of the feedback text they are located in. For example, consider the two sentences:

  1. “This morning, I was taking a walk with my dog in the park when I saw a python eating a cat. I went to look for help but couldn’t find anyone. The snake slithered away while I held on to my barking dog. I am afraid that it will go on to harm some other small animals or people. Please do something about it.”
  2. “Snake swallowing a cat! I was driving along Bukit Timah road when I witnessed this beside the Shell petrol station. The python should still be somewhere in the area.”

“Python/Snake swallowing/eating a cat” is strong evidence that these feedback texts belong to the Animal Issues case type, even though the phrase appears in different places in the two sequences. CNNs help us pick out such features in text and, by doing so, help the model focus on the important things being talked about, among possibly many other things, within the text when deciding on its best class.

For our implementation, we used a window size of 3 to represent the different kinds of word combinations within two words of one another. The figure below shows how convolution is applied with one kernel to the output word vectors of the BiLSTM. After applying the convolution, we do both a max pooling and an average pooling to get two output values per kernel.

Applying convolution and pooling to the BiLSTM output

We repeated the same convolution with 64 different kernels to get a 128-dimensional output vector from the CNN, consisting of 64 max-pooled values and 64 average-pooled values.
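A sketch of the convolution and pooling step in PyTorch (the padding and toy tensor sizes are illustrative; the window size of 3, the 64 kernels, and the max/average pooling follow the description above):

```python
import torch
import torch.nn as nn

# 1D convolution over the BiLSTM outputs: window size 3, 64 kernels.
conv = nn.Conv1d(in_channels=256, out_channels=64, kernel_size=3, padding=1)

hidden_states = torch.randn(1, 6, 256)            # (batch, seq_len, BiLSTM output dim)
conv_out = conv(hidden_states.transpose(1, 2))    # Conv1d expects (batch, channels, seq_len)

max_pooled = conv_out.max(dim=2).values           # 64 max-pooled values per document
avg_pooled = conv_out.mean(dim=2)                 # 64 average-pooled values per document
features = torch.cat([max_pooled, avg_pooled], dim=1)
print(features.shape)                             # (1, 128): the vector fed to the linear layer
```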

Overall Architecture

To summarise, the different components of the BiLSTM-CNN model have the following effects:

  1. GloVe Embeddings: Represent words as vectors using their contextual information both in external corpora and within the training data, to better model the relationships between words
  2. BiLSTM: Modifies the word vectors using contextual information within the sequence to better model the representation of the word with respect to the other words in the sequence
  3. CNN: Creates features representing groups of words which can be associated with their classes (labels), regardless of their positions within the sequence

The overall BiLSTM-CNN model architecture is as follows:

Overall BiLSTM-CNN architecture
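Putting the pieces together, a condensed sketch of the full model is shown below (dropout, padding handling, and other training details are omitted; layer sizes beyond those stated above are assumptions):

```python
import torch
import torch.nn as nn

class BiLstmCnnClassifier(nn.Module):
    """Condensed sketch of the BiLSTM-CNN classifier described above."""

    def __init__(self, embedding_matrix, num_classes, hidden_size=128, num_kernels=64):
        super().__init__()
        # Embedding layer initialised with the 350-d concatenated GloVe vectors.
        self.embedding = nn.Embedding.from_pretrained(embedding_matrix, freeze=False)
        self.bilstm = nn.LSTM(embedding_matrix.shape[1], hidden_size,
                              batch_first=True, bidirectional=True)
        self.conv = nn.Conv1d(2 * hidden_size, num_kernels, kernel_size=3, padding=1)
        self.classifier = nn.Linear(2 * num_kernels, num_classes)

    def forward(self, token_ids):
        x = self.embedding(token_ids)        # (batch, seq_len, 350)
        x, _ = self.bilstm(x)                # (batch, seq_len, 2 * hidden_size)
        x = self.conv(x.transpose(1, 2))     # (batch, num_kernels, seq_len)
        pooled = torch.cat([x.max(dim=2).values, x.mean(dim=2)], dim=1)
        return self.classifier(pooled)       # class logits
```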

ALBERT + Linear Layer

BERT was discussed extensively in my previous blog post and we used the same ALBERT model for this experiment. I will reproduce my explanation of the BERT architecture here with some modifications. If you have read the previous blog post, you can skip this part.

To briefly summarise, BERT makes use of the transformer architecture to learn how to represent text segments (e.g., sentences) as numbers. As in GloVe, words are represented as embeddings, but what is different is that the values in a word's embedding change depending on what the other words in the sentence are, through “self-attention” (a similar concept to the BiLSTM, but faster because the hidden state vectors do not have to be computed sequentially).

The embeddings are further fed into a feed-forward neural network to form an “encoder” block, and multiple encoder blocks are stacked to give the sentence a “deep” representation.

One encoding block (Image taken from Jay Alammar’s The Illustrated Transformer)
Stacking encoders to form BERT (Image taken from Jay Alammar’s The Illustrated BERT, ELMo, and co.)

BERT was taught how to “understand” English by making it read the whole English Wikipedia corpus, and other big corpora crawled from the internet. It shows its understanding by being able to represent text using embeddings which have mathematical properties similar to the properties of the actual texts in the English language. For example, the BERT embeddings for sentences 1 and 2 below have a higher cosine similarity than those for sentences 1 and 3:

  1. I live in London with my family.
  2. New York is my hometown.
  3. Sharks are endangered because they are widely hunted for their fins.
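The comparison can be sketched roughly as follows with the Hugging Face Transformers library (the albert-base-v2 checkpoint and the mean pooling over token embeddings are assumptions for illustration, not the exact setup we used):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("albert-base-v2")
model = AutoModel.from_pretrained("albert-base-v2")

def embed(sentence):
    # Mean-pool the token embeddings into one sentence vector (one simple pooling choice).
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

s1 = embed("I live in London with my family.")
s2 = embed("New York is my hometown.")
s3 = embed("Sharks are endangered because they are widely hunted for their fins.")

cos = torch.nn.functional.cosine_similarity
print(cos(s1, s2, dim=0), cos(s1, s3, dim=0))   # the first similarity should be higher
```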

The feedback texts were fed into a pre-trained model to get their embeddings. We used ALBERT (“A Lite BERT”) instead of the original BERT because experimental results showed that ALBERT was able to achieve better results with less memory consumption and run-time, due to a huge reduction in the number of parameters. The following diagram shows how a sentence is converted to an ALBERT embedding:

Getting an ALBERT embedding of a sentence

The [CLS] token is a special token whose embedding represents the whole sentence's embedding after the model is fine-tuned (explained in the next section). The alternative to using the [CLS] token for the sentence representation is to take the average, maximum, or minimum of each dimension across all the word embeddings in the sentence. Conceptually, some sort of pooling (e.g., max, min, average, or a combination) is done during the fine-tuning stage, through the multi-headed self-attention and the updating of the feed-forward neural network weights, to make the [CLS] embedding a sentence vector. You can think of it (in a complex way) as a weighted combination of all the tokens' word embeddings, with weights based on their similarities.

Fine-tuning for Classification

Just as we trained GloVe embeddings on our own training corpus for the BiLSTM-CNN model, we had to make the ALBERT model understand our feedback data to get more accurate embeddings in our context. This step of getting the BERT model to understand our data is known as fine-tuning: we use the model to perform a task, and the model weights get updated by the data through the task.

When the (AL)BERT models were pre-trained, “fake” tasks like masked word prediction and next sentence prediction were given to the models to update the weights in an unsupervised manner. In our case, however, we had a real task on hand: learning to predict the case type given a feedback text. Using Hugging Face Transformers' AlbertForSequenceClassification model, we were able to do both simultaneously by attaching a sequence (sentence) classification head, which is a one-layer feed-forward neural network, to the pre-trained ALBERT model. During training, both the weights of the feed-forward neural network and the weights of the ALBERT model are updated, so that the model learns to understand and classify the data at the same time. Diagrammatically, the model architecture looks like this:

ALBERT model architecture
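A minimal sketch of this fine-tuning setup with Hugging Face Transformers (the checkpoint name, the number of labels, the learning rate, and the toy example are illustrative):

```python
import torch
from transformers import AlbertTokenizer, AlbertForSequenceClassification

# Pre-trained ALBERT with a sequence-classification head attached on top.
tokenizer = AlbertTokenizer.from_pretrained("albert-base-v2")
model = AlbertForSequenceClassification.from_pretrained("albert-base-v2", num_labels=10)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# One fine-tuning step: both the classification head and the ALBERT weights receive gradients.
inputs = tokenizer("Python swallowing a cat along Bukit Timah Road!",
                   return_tensors="pt", truncation=True)
labels = torch.tensor([3])                    # illustrative case-type label
outputs = model(**inputs, labels=labels)      # returns both the loss and the class logits
outputs.loss.backward()
optimizer.step()
```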

More in-depth explanations of BERT have been published in many blog posts and papers. Jay Alammar's The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) and The Illustrated Transformer are highly recommended readings, as he explains how BERT works very clearly using illustrations.

Comparing the Model Architectures

Among the four model architectures we experimented with, the BiLSTM-CNN model gave us the best accuracy, and its inference time was fast enough for the chatbot to respond to the user quickly. Therefore, we used it for the analytics engine. The results are shown in the main post; check them out here!
