Sentiment Classification with BERT Embeddings
Hands-on tutorial for sentiment classification on an Amazon review dataset using pre-trained BERT embeddings
Sentiment classification is one of the oldest and most important problems in Natural Language Processing (NLP). It is the task of telling whether someone likes or dislikes the thing they are talking about. Getting domain-specific annotated training data is usually a challenge, but with the help of word embeddings we can build good sentiment classifiers even with modestly sized labeled training sets. A plethora of pre-trained word embeddings is readily available these days, such as Word2Vec, GloVe, fastText and ConceptNet Numberbatch. Their problem is that they are context-independent: each word gets a single representation regardless of the context it occurs in, so the different senses of a polysemous word (for example, “bank” as a river bank versus a financial institution) are not distinguished. In this blog, we will explore embeddings from Google’s BERT model. It’s highly unlikely that you have not heard this name, as it is very popular in the machine learning community nowadays (its release has been referred to as the ImageNet moment for NLP). I will still summarize it a bit for newcomers.
What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is an NLP model developed by Google for pre-training language representations. It leverages an enormous amount of publicly available plain text (English Wikipedia and the BooksCorpus) and is trained in an unsupervised manner. It is a powerful model that learns the structure of language and its nuances by training on a language-modeling objective. BERT is deeply bidirectional, unlike ELMo, which is only shallowly bidirectional, and OpenAI GPT, which is unidirectional. Bidirectionality helps the model capture context from both the words before and the words after any given position t.
Getting into the nitty-gritty of its internal workings is beyond the scope of this blog. You can read more about it here. Improvements over the original BERT have been proposed, such as RoBERTa and XLNet. All of these models share a common trade-off between accuracy and speed. Read this for a detailed comparison between all the models.
Today, as part of this blog, we will go step by step through building a text classification system using word embeddings from a pre-trained BERT model.
We will be using a small fraction of the millions of Amazon reviews available online. You can download the subset from here. The data contains 10,000 reviews and two class labels.
Here, __label__1 and __label__2 correspond to the Negative and Positive classes, respectively.
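Reading such labels can be sketched as follows. This is a minimal sketch that assumes each line of the file holds the label, a space, and then the review text (the usual fastText layout). The helper name parse_line and the two sample lines are made up for illustration.

```python
# Mapping taken from the text: __label__1 is negative, __label__2 is positive.
LABEL_NAMES = {"__label__1": "negative", "__label__2": "positive"}


def parse_line(line):
    """Split '__label__N review text' into a (sentiment, text) pair."""
    label, _, text = line.strip().partition(" ")
    return LABEL_NAMES[label], text


# Made-up sample lines, only to show the assumed layout.
sample = [
    "__label__2 Great sound quality and fast shipping.",
    "__label__1 Stopped working after two days.",
]
rows = [parse_line(line) for line in sample]
```

The resulting list of (sentiment, text) pairs can be passed straight to pandas.DataFrame(rows, columns=["sentiment", "text"]) if you prefer to work with a data frame.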
- We will be working with bert-as-service, a Python library that uses BERT as a sentence encoder and hosts it as a service via ZeroMQ, allowing you to map sentences into fixed-length representations. You can install it from here.
- You can download the BERT Base model from here and keep it inside the models/ directory. Moving the downloaded model to the models/ folder is not required, but it is good practice for keeping things organized. You can find a list of all pre-trained models here.
Let’s do a bit of exploratory analysis to better understand our data.
- Class Distribution: This is one of the most important things to analyze in order to make an educated call on questions such as data augmentation, training penalties (class weighting), etc. Since our distribution (shown below) is balanced, we will train our model on the data as is, without any tampering.
- Review Length (Min, Avg, Max): This lets us decide on the maximum sequence length (max_seq_len) per utterance that we need to encode. One can play around with this parameter as a trade-off between accuracy and speed. For now, we will set it to NONE so that the server dynamically uses the longest sequence in a (mini)batch, as our best bet. For this dataset, the average review length is 438 characters, the maximum is 1015 and the minimum is 101. Note that max_seq_len counts tokens, whereas these lengths are measured in characters.
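Both statistics above can be computed in a few lines. This is a minimal sketch, assuming the reviews are available as (label, text) pairs. The function name describe and the toy rows are hypothetical, and lengths are measured in characters, as in the figures quoted above.

```python
from collections import Counter


def describe(rows):
    """Return class counts and character-length statistics for (label, text) pairs."""
    rows = list(rows)
    class_counts = Counter(label for label, _ in rows)
    lengths = [len(text) for _, text in rows]
    length_stats = {
        "min": min(lengths),
        "avg": sum(lengths) / len(lengths),
        "max": max(lengths),
    }
    return class_counts, length_stats


# Toy rows, for illustration only.
toy_rows = [("positive", "abcd"), ("negative", "ab"), ("positive", "abcdef")]
class_counts, length_stats = describe(toy_rows)
```

A roughly equal count per class indicates a balanced dataset, and the length statistics give a first idea of how much a fixed max_seq_len would truncate.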
These insights help us choose some of the parameters for launching our BERT server, which can be started with the following command:
$> bert-serving-start -model_dir models/uncased_L-12_H-768_A-12/ -num_worker=5 -port 8190 -max_seq_len=NONE
The command launches the uncased, 12-layer, 768-hidden, 12-attention-head model on port 8190 with 5 workers (up to 5 concurrent requests) and dynamic sequence padding. Once you see the message “all set, ready to serve request!”, we are good to go and can start crunching some embeddings for our input sentences.
First, we load our data into a pandas DataFrame and write a basic pre-processing pipeline. As part of the pipeline, we convert the text to lower case, remove numerals and special characters, and expand word contractions. The snippet below does all of this:
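The cleaning steps described here can be approximated as follows. This is a minimal sketch rather than the exact pipeline used for the experiments. CONTRACTIONS is a deliberately tiny, hypothetical map, and clean_text is an illustrative name. Contractions are expanded before special characters are stripped, since the apostrophes would otherwise be lost.

```python
import re

# Hypothetical, deliberately small contraction map; a real pipeline needs a fuller list.
CONTRACTIONS = {
    "can't": "cannot",
    "won't": "will not",
    "don't": "do not",
    "doesn't": "does not",
    "didn't": "did not",
    "isn't": "is not",
    "it's": "it is",
    "i'm": "i am",
    "i've": "i have",
}
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b"
)


def clean_text(text):
    """Lower-case, expand contractions, then drop digits and special characters."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    text = _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(1)], text)
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
```

With the reviews in a pandas DataFrame, this can be applied column-wise, for example df["text"].apply(clean_text), where the column name "text" is an assumption.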
The next step is to convert the text sequences to their contextually rich numeric representations using the pre-trained BERT base model’s token embeddings. The snippet below does this and returns a 768-dimensional vector representation of each input review.
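The encoding step can be sketched with the bert-as-service client as follows. This is a minimal sketch, assuming the server from the command above is running locally on port 8190 and that the bert-serving-client package is installed. The helper names make_client and embed_reviews and the client-side batch size are illustrative choices, not part of the library.

```python
import numpy as np


def make_client(port=8190):
    """Connect to a running bert-serving-start server on the given port."""
    # Imported lazily so this module can be loaded without bert-serving-client installed.
    from bert_serving.client import BertClient
    return BertClient(port=port)


def embed_reviews(texts, client, batch_size=256):
    """Encode a non-empty list of non-empty strings, one fixed-length vector per text.

    `client` only needs an `encode(list_of_str)` method that returns a 2-D array,
    which is what BertClient.encode provides.
    """
    batches = [
        client.encode(list(texts[start:start + batch_size]))
        for start in range(0, len(texts), batch_size)
    ]
    return np.vstack(batches)


# Typical use (requires the server to be up):
#   client = make_client(port=8190)
#   features = embed_reviews(cleaned_reviews, client)
```

For the BERT base model, the result should have shape (number of reviews, 768). Reviews that end up empty after pre-processing are best filtered out beforehand, since the client expects non-empty strings.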
The BERT client sends the input sentences to the server (over ZeroMQ); the server handles tokenization, out-of-vocabulary (OOV) words, adding the start and end tokens, etc., and returns the embeddings. After this, we finally train our classifier, with the review feature vector as input and its sentiment class as output.
We train 3 different classifiers with their default parameter settings and compare them on accuracy.
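Such a comparison can be sketched with scikit-learn as follows. The three classifiers here are an assumed selection for illustration and are not necessarily the ones used in the experiments. The synthetic X and y only stand in for the BERT feature vectors and sentiment labels so that the sketch runs without a server.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def compare_classifiers(X, y, seed=0):
    """Fit three default-parameter classifiers and return their held-out accuracy."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    models = {
        "logistic_regression": LogisticRegression(),
        "random_forest": RandomForestClassifier(random_state=seed),
        "knn": KNeighborsClassifier(),
    }
    return {
        name: model.fit(X_train, y_train).score(X_test, y_test)
        for name, model in models.items()
    }


# Synthetic stand-in data: 200 samples with 768 features, class 1 shifted by +1.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 768)) + y[:, None]
scores = compare_classifiers(X, y)
```

On the real data, X would be the matrix of review embeddings and y the corresponding sentiment labels. Holding out a stratified test split keeps the class balance the same in both parts.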
The numbers can be boosted further by fine-tuning the whole model with a dense layer trained on top of it, instead of just using the embeddings with a separate classifier. You can read more here. All the computation and experiments for the purpose of this blog were done on Intel DevCloud machines.
Feel free to comment and share your thoughts :)