
From Classical NLP to Transformers for Aspect Based Sentiment Analysis

Arpan Sen
TrustYou Engineering
7 min read · May 20, 2022


Introduction

At TrustYou, we have been doing Aspect Based Sentiment Analysis (ABSA) for years to extract actionable insights for our clients. We do this with a combination of classical NLP techniques and statistical machine learning. Although our current system works very well, we wanted to improve it further.

In late 2018, BERT [1] was released, built on an architecture called the Transformer network. It was followed over the next few years by other language models such as XLNet [2], RoBERTa [3], DistilBERT [4], GPT-3 [5] and many more. These language models showed extremely good performance across different tasks and surpassed previous benchmarks.

We decided to experiment with this architecture, and after a few months of hard work we have a model running in production. In this blog post, I will talk about the different phases and challenges we faced along the way.

Different steps involved in productionizing a Transformer-based model from scratch

Training a new model from scratch and then deploying it as a production-ready service involves several steps. We are going to talk about each of the following:

  • Data collection and preparation
  • Finding a suitable language model
  • Fine-tuning the language model for our task
  • Evaluating results from our models
  • Deploying the models to production

Data collection and preparation

When training any machine learning system, this is perhaps the most important step. We need good-quality data not only for training but also for a proper evaluation of the models.

Data collection

From years of doing ABSA, we already had millions of reviews analysed by our system, which provided a good starting point. But since no system is perfect, ours included, we couldn't rely on this data as-is: it contained some incorrect predictions. We needed to modify the data to bring it up to the quality standard we desired.

The training data for ABSA (or Targeted ABSA) usually contains aspect terms and categories with their associated polarities. For "the pizza was good", for example, the training data would look something like {"term": "pizza", "category": "food", "polarity": "positive"}. In addition to this, for our task we also extracted the opinion text (the text span containing the aspect and the polarity).
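
To make the format concrete, here is a hypothetical annotated example in Python. Field names beyond term, category, polarity and the opinion text (e.g. "annotations", "opinion_text") are illustrative, not our actual schema.

    example = {
        "text": "The pizza was good but the service was slow.",
        "annotations": [
            {
                "term": "pizza",                       # aspect term
                "category": "food",                    # aspect category
                "polarity": "positive",
                "opinion_text": "The pizza was good",  # span containing aspect + polarity
            },
            {
                "term": "service",
                "category": "service",
                "polarity": "negative",
                "opinion_text": "the service was slow",
            },
        ],
    }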

To generate our training data, we took the predictions from our existing models and manually curated them with the help of internal annotators. Some of the things we had to keep in mind were:

  • Seasonality in reviews: Reviews change with the seasons; in winter, for example, you would expect tourists to talk more about hot water, heating, etc. than in summer. The training data should contain reviews from different seasons so as not to bias our models.
  • Real-world representation: Factors other than seasonality can also affect reviews. For example, pre-COVID-19 and post-COVID-19 reviews are very different, and we need to ensure that our data is representative.
  • Proper distribution of polarities and aspects: In addition to the above, our training data should also reflect the actual distribution of polarities and aspects. For very under-represented classes, we might need to do class balancing so that the model can learn them (see the sketch right after this list).
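
One common way to do such class balancing is to weight the loss by inverse class frequency. Here is a minimal sketch using scikit-learn's compute_class_weight; the labels are made up, and this is not necessarily how we balanced our own data.

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Hypothetical polarity labels from a training set
    labels = np.array(["positive", "positive", "positive", "negative", "neutral"])
    classes = np.unique(labels)

    # "balanced" weights are inversely proportional to class frequency,
    # so rare classes get a larger weight in the loss function
    weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
    print(dict(zip(classes, weights)))
    # These weights could then be passed to the loss,
    # e.g. torch.nn.CrossEntropyLoss(weight=...), during training.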

In case you do not have sufficient data, you can benefit from one of the data augmentation techniques for ABSA described in this blog post by TrustYou [6].

Data preparation

In this step, we converted the data into a format that the models can process. These are mostly the same steps needed for any machine learning problem: splitting the data into train and test sets, lowercasing the texts, converting the labels into one-hot encodings, removing duplicate or near-duplicate examples, shuffling the data to remove ordering bias, and so on.
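
As a minimal sketch of these preparation steps (assuming a hypothetical absa_data.csv with a text column; not our actual pipeline):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # "absa_data.csv" and the "text" column are assumed names for illustration
    df = pd.read_csv("absa_data.csv")
    df["text"] = df["text"].str.lower()           # lower-case the review texts
    df = df.drop_duplicates(subset="text")        # remove duplicate examples
    df = df.sample(frac=1.0, random_state=42)     # shuffle to remove ordering bias

    train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)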

In our case we used multiple data splits, which makes training and evaluation more robust by reducing data bias even further. We first set aside a held-out dataset for final evaluation and quality reporting, which we call the gold corpus. The remaining data went into a 5-fold cross-validation setup, each fold consisting of a training set and an evaluation set (~20%). The validation set used for early stopping was taken from the training set of each fold (~10%).
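
A minimal sketch of this splitting scheme, again on the hypothetical dataset above; the size of the gold corpus is illustrative, since it is not stated here:

    import pandas as pd
    from sklearn.model_selection import KFold, train_test_split

    df = pd.read_csv("absa_data.csv")  # same hypothetical dataset as above

    # Held-out gold corpus for final evaluation (the 10% size is an assumption)
    rest_df, gold_df = train_test_split(df, test_size=0.1, random_state=42)

    kfold = KFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_idx, eval_idx) in enumerate(kfold.split(rest_df)):
        fold_train = rest_df.iloc[train_idx]
        fold_eval = rest_df.iloc[eval_idx]        # ~20% of the data in each fold
        # carve a validation set for early stopping out of the training set
        fold_train, fold_val = train_test_split(fold_train, test_size=0.1,
                                                random_state=42)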

Finding a suitable language model

This step and the steps described in the next section went hand in hand: until you fine-tune a model properly, you never know which language model will suit your purpose.

We based our decision on three criteria:

  • Quality of the predictions
  • Latency, i.e. the time the model takes to make a prediction
  • Feasibility in terms of infrastructure

We experimented with a number of different pre-trained models:

  • bert-base-uncased [7]
  • bert-base-cased [8]
  • bert-large-uncased [9]
  • bert-large-cased [10]
  • distilbert-base-uncased [11]
  • distilbert-base-cased [12]

We decided to go with the uncased versions, as they gave better-quality results. Measured against the three criteria above, bert-large was the clear winner on quality, distilbert won on latency and feasibility, and bert-base was the sweet spot that did well on all three. Hence, we chose bert-base-uncased as our language model.
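
All of the checkpoints above are available on the Hugging Face Hub, so loading the one we settled on takes only a couple of lines. This is just the standard loading code, not our training pipeline:

    from transformers import AutoModel, AutoTokenizer

    # Swap the name for "bert-large-uncased", "distilbert-base-uncased", etc.
    # to try the other candidates
    model_name = "bert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    encoder = AutoModel.from_pretrained(model_name)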

Fine-tuning the language model for our task

The BERT family of language models mentioned above is pre-trained on two tasks, masked token prediction and next sentence prediction, neither of which can be used directly for ABSA. The main challenge here was to adapt the language model to ABSA. After analysing the various work done in this field, we were able to frame:

  • Opinion extraction as a question answering problem
  • Polarity detection as a classification problem
  • Aspect extraction as a multilabel classification problem

The opinion extractor and the polarity detector were trained jointly, which improved performance.
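
To make this framing concrete, here is a minimal sketch of what a jointly trained opinion extractor and polarity detector could look like on top of a shared BERT encoder. The class name, the plain linear heads and the three polarity classes are assumptions, not our production architecture; the aspect extractor would be a separate multilabel head with a sigmoid per aspect category.

    import torch.nn as nn
    from transformers import BertModel

    class OpinionPolarityModel(nn.Module):
        """Shared BERT encoder with a QA-style span head for the opinion text
        and a classification head for the polarity (illustrative sketch)."""

        def __init__(self, model_name="bert-base-uncased", num_polarities=3):
            super().__init__()
            self.encoder = BertModel.from_pretrained(model_name)
            hidden = self.encoder.config.hidden_size
            self.span_head = nn.Linear(hidden, 2)            # start/end logits per token
            self.polarity_head = nn.Linear(hidden, num_polarities)

        def forward(self, input_ids, attention_mask):
            out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
            start_logits, end_logits = self.span_head(out.last_hidden_state).split(1, dim=-1)
            polarity_logits = self.polarity_head(out.pooler_output)
            # Joint training would typically sum the span (start/end)
            # cross-entropy and the polarity cross-entropy into one loss.
            return start_logits.squeeze(-1), end_logits.squeeze(-1), polarity_logits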

Another challenge we faced was adapting the tokenizers to our tasks. The BERT models available in Hugging Face use the traditional BertTokenizer [13], with which reconstructing the original text from model output was extremely difficult because the tokenizer provides no reverse mapping from tokens back to the input text. We adapted the models to use BertTokenizerFast [14], which supports the missing functionality.
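
With the fast tokenizer, the offset mapping needed to project a predicted token span back onto the original text comes for free. A minimal sketch, with made-up predicted token indices:

    from transformers import BertTokenizerFast

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    text = "The pizza was good but the service was slow."
    enc = tokenizer(text, return_offsets_mapping=True)

    # Suppose the opinion extractor predicted a span from token 2 to token 4
    # (hypothetical model output); offset_mapping maps tokens back to characters
    start_tok, end_tok = 2, 4
    char_start = enc["offset_mapping"][start_tok][0]
    char_end = enc["offset_mapping"][end_tok][1]
    print(text[char_start:char_end])   # -> "pizza was good"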

Evaluating results from our models

There are several aspects to evaluating a deep learning system. As our system consisted of multiple models doing different tasks, at the most basic level the loss on the validation set was the first metric we used to track training progress and trigger early stopping when necessary. This is pretty standard in any ML training.

Once the models were trained, we had to evaluate the whole system by integrating the models and comparing it against our existing system. As mentioned previously, we used the held-out annotated dataset, the gold corpus, for this purpose. We generated predictions from the new system and compared them against the gold corpus in terms of precision, recall and F1. Since multiple models work together, we evaluated not only the final prediction but also the opinion text, polarity and aspects individually, as well as all possible combinations of them. Based on these metrics, we took decisions to further improve the models: for example, when we detected low scores for certain aspects we added more training data for them, or we adjusted the thresholds used by the different models to tune the precision-recall tradeoff.
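
Computing these metrics is straightforward with scikit-learn; the labels below are invented just to show the shape of the comparison:

    from sklearn.metrics import precision_recall_fscore_support

    # Hypothetical gold-corpus polarities vs. the new system's predictions
    y_true = ["positive", "negative", "neutral", "positive", "negative"]
    y_pred = ["positive", "negative", "positive", "positive", "negative"]

    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="micro")
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")

    # Per-class scores expose weak polarities or aspects that need more
    # training data or a different decision threshold.
    per_class = precision_recall_fscore_support(
        y_true, y_pred, labels=["positive", "negative", "neutral"], average=None)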

In the second part of the evaluation, we compared the new system against the existing one on our gold corpus. Seeing promising results, we went one step further and ran an internal A/B test of the current system versus the new system on live reviews from our database.

Deploying the models to production

Let's admit it: deploying deep learning models to production is hard. We need the right balance between the resources we spend and the efficiency we gain. After carefully weighing the pros and cons of an on-premise deployment versus a cloud deployment, we decided to go with AWS. The main reasons were better deployability, maintainability and scalability.

We used the AWS SAM CLI [15] framework for our deployment. The models are served by a Lambda function built from a container image stored in Amazon ECR, and the function is exposed through API Gateway. All of this is defined in a CloudFormation template and deployed with the SAM CLI.

Deployment Architecture for our Models

There were two major challenges while deploying the models:

  • Latency: We had to optimise the Lambda function so that the latency stays acceptable. We did this by tuning the allocated memory, since the CPU power allotted to a function is directly proportional to its memory [16].
  • Cold starts: AWS provisions the Lambda function on demand, so during the first call the models are initialised, making it much slower than subsequent calls. After initialisation, AWS keeps the function state active for some time, anticipating further calls and thereby increasing provisioning efficiency. We had two options to tackle the cold start problem: provisioned concurrency [17] and a warm-up call. Since our system uses a mini-batch processing mechanism in a distributed environment, we went for a warm-up call, where a dummy call is made to the Lambda function to initialise the models (see the sketch right after this list).
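
As an illustration of the warm-up idea, here is a minimal sketch of what such a Lambda handler could look like in Python; the event shape, the warmup flag and the placeholder model are assumptions, not our actual service code.

    import json
    from transformers import pipeline

    # Loading the model at module level means it is initialised once per
    # container (during the cold start) and reused by later invocations.
    # "bert-base-uncased" is a placeholder; a fine-tuned model would be used here.
    _model = pipeline("text-classification", model="bert-base-uncased")

    def lambda_handler(event, context):
        # The dummy warm-up call initialises the container without doing real work.
        if event.get("warmup"):
            return {"statusCode": 200, "body": json.dumps({"status": "warm"})}

        text = event.get("text", "")
        prediction = _model(text)
        return {"statusCode": 200, "body": json.dumps(prediction)}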

Final Thoughts

Transformer networks are gaining popularity with every passing day and are showing great results on many different tasks. We at TrustYou thoroughly enjoyed our journey with these deep learning models. I hope this blog post gives you a good insight into the different phases and challenges of productionizing deep learning models from scratch. For more articles like this, keep following TrustYou on Medium.
