Introduction to BERT and its application in Sentiment Analysis

Tarique Akhtar · Published in Analytics Vidhya · Nov 30, 2021

BERT is a super exciting algorithm, not only for me but for the whole NLP (Natural Language Processing) community.

It’s super powerful. It’s super interesting. And I’m really glad to share with you how to use it and how it works.

Before going to BERT, let’s just take a look at NLP in a more general way. NLP (Natural Language Processing) is the part of A.I. that deals with human language. It is actually pretty much everywhere. For instance, the web and your search engine use NLP to optimize results. More recently, voice assistants like Siri or Alexa use NLP techniques to understand what we say. NLP is also used in your email inbox for spam detection, it powers the translation tools that are widely used, and chatbots are built with it.

A lot of research has been done in this field, and BERT is an advanced NLP algorithm released by Google.

Here are a few points about BERT.

  1. Google’s NLP algorithm, released at the end of 2018
  2. The most game-changing result in NLP of the last 5 years
  3. Provides a better understanding of words and sentences in context
  4. Already used in the Google search engine, for instance

BERT is a tool that is meant to understand language and provide what we call language modeling. Google has already started to use it in its search engine. The image below shows an example before and after the BERT implementation.

Image from https://cloud.google.com/ai-platform/training/docs/algorithms/bert-start

Referring to the above image: with the BERT model, we can better understand that “for someone” is an important part of this query, whereas previously the meaning was missed and general results about filling prescriptions were returned.

What is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. I will split this full form into three parts.

Encoder Representations: BERT is a language modeling system that has been pre-trained on a huge corpus with huge processing power. We just need to use this pre-trained model and fine-tune it for our needs.

What is language modeling? It means that BERT gives the best, most efficient and most flexible representation for words and sequences. For example, we give one or two sentences to BERT and it generates a single vector, or a list of vectors, that we can use as we need.

from Transformers: The Transformer is the basic building block of BERT’s architecture. Google developed the Transformer to tackle sequence-to-sequence tasks, such as building a translator or a chatbot. They then took part of that same Transformer (the encoder) and used it in an even smarter way to create BERT.

Bidirectional: Most of the time in an NLP project, we want to predict the next word in a sentence. To predict the next word, i.e. the right part of the sentence, we need access to the left part of the sentence. Sometimes we only have access to the right part and need to predict the left part instead. A special case is when a model is trained on the left parts and the right parts of sentences separately and the two representations are then concatenated; such a model is only pseudo-bidirectional. BERT uses both the left and the right context when dealing with a word, which makes it a fully bidirectional model: it has access to the whole context, or sentence, to predict words. This is what makes BERT more powerful.

Having learned the above concepts, let’s jump to the ways in which BERT can be applied.

Applications of BERT:

  1. Use the tokenizer to process the data
  2. Use BERT as an embedding layer
  3. Fine-tune BERT, the core of your model

In this blog, we will learn how to use BERT’s tokenizer for data processing while building a sentiment analyzer.

Sentiment Analyzer:

In this project, we will try to improve our own model (in this case a CNN for classification) by using BERT’s tokenizer.

To start with, we will build a classification model to assess whether a tweet is positive or negative in terms of feeling.

Train and Test Data:

You can download the data from the link below. http://help.sentiment140.com/for-students

Complete project code:

I have used Google Colab as the editor for this project, as it is a hosted Jupyter notebook service that requires no setup and provides free access to computing resources, including GPUs. I have also uploaded the complete code to my GitHub repository.

https://github.com/Tariqueakhtar/Machine-Learning/tree/master/Sentiment%20Analysis%20BERT

Install and import packages related to BERT:

Apart from basic libraries like Pandas, NumPy and BeautifulSoup, this project needs two packages related to BERT, i.e. bert-for-tf2 and sentencepiece, installed in your environment as shown below.
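
In a Colab cell, the installation might look like this (a minimal sketch; versions are not pinned):

```python
!pip install bert-for-tf2
!pip install sentencepiece
```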

We also use libraries like tensorflow, bert and tensorflow_hub (a repository of trained machine learning models ready for fine-tuning and deployable anywhere; trained models like BERT and Faster R-CNN can be reused with just a few lines of code).
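
The imports for the rest of the notebook could then look like this (a sketch; the aliases are my choices):

```python
import re

import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

import tensorflow as tf
import tensorflow_hub as hub
import bert  # provided by the bert-for-tf2 package
```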

Loading data on colab:

Once we have loaded the training data using Pandas, let’s look at a sample of it.
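
A minimal loading sketch, assuming the downloaded Sentiment140 file has been uploaded to the Colab session (the path is an assumption on my part):

```python
cols = ["sentiment", "id", "date", "query", "user", "text"]
data = pd.read_csv(
    "training.1600000.processed.noemoticon.csv",  # hypothetical path in Colab
    header=None,          # the file ships without a header row
    names=cols,
    engine="python",
    encoding="latin1",
)
data.head()
```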

Image by author

Below are the columns in the dataset.

  1. sentiment (binary target variable)
  2. id
  3. date
  4. query
  5. user
  6. text (the tweet written by the user)

We only need two of these columns for this project, i.e. text and sentiment, as we just need to predict whether the feeling is positive or negative by analysing the user’s tweet.

Data Cleaning:

After importing the training data into Colab, we need to clean it by dropping the unnecessary columns, i.e. id, date, query and user. Below is the code for the same.
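
A sketch of that step, assuming the dataframe is called data as above:

```python
# Keep only the target and the tweet text
data.drop(["id", "date", "query", "user"], axis=1, inplace=True)
```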

The dataframe then looks like the image below.

Image by author

Now, let’s clean the text as it has special characters in it. Below is the code for cleaning.
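
A cleaning sketch along those lines (the exact rules are my assumptions: strip HTML, mentions and URLs, then keep only letters and basic punctuation):

```python
def clean_tweet(tweet):
    # Strip any HTML markup/entities
    tweet = BeautifulSoup(tweet, "html.parser").get_text()
    # Remove @mentions and URLs
    tweet = re.sub(r"@[A-Za-z0-9_]+", " ", tweet)
    tweet = re.sub(r"https?://\S+", " ", tweet)
    # Keep only letters and basic punctuation, collapse repeated spaces
    tweet = re.sub(r"[^A-Za-z.!?']", " ", tweet)
    tweet = re.sub(r" +", " ", tweet).strip()
    return tweet

cleaned_data = [clean_tweet(t) for t in data.text]
```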

Here is the result after cleaning.

Image by author

Tokenization using BERT:

In this part of the project, we use the BERT tokenizer as follows.
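
A sketch of that tokenizer setup (the TF Hub URL here is an assumption on my part; any compatible pre-trained BERT model would work the same way):

```python
# 1. Grab the FullTokenizer class shipped with bert-for-tf2
FullTokenizer = bert.bert_tokenization.FullTokenizer

# 2. Load a pre-built BERT model from TensorFlow Hub; we only read metadata
#    from it here, so trainable=False
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/1",
    trainable=False)

# 3. The vocabulary file bundled with the model
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()

# 4. Whether the model expects lower-cased input
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()

# 5. Build the tokenizer from the vocab file and the casing flag
tokenizer = FullTokenizer(vocab_file, do_lower_case)
```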

So what are we doing in the above code? Let’s go through it line by line.

  1. Grab the FullTokenizer class from bert.
  2. Create a BERT layer by calling tensorflow_hub and passing the path of the pre-built model as a URL to KerasLayer. In this case we use the model directly and don’t fine-tune it, which is why we pass trainable=False. This layer is only used to provide information for the tokenizer.
  3. Read the vocab file from the BERT layer.
  4. Read the lower-casing flag for the tokenizer.
  5. With all the above information, we can create the tokenizer.

Pass the tweets/text to Tokenizer:

Now we take all the tweets/text from cleaned_data, pass them to the tokenizer and convert each token to its id. Below is the code for the same.
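
A sketch of that step, continuing from the tokenizer created above:

```python
def encode_sentence(sentence):
    # Split into WordPiece tokens, then map each token to its vocabulary id
    return tokenizer.convert_tokens_to_ids(tokenizer.tokenize(sentence))

data_inputs = [encode_sentence(sentence) for sentence in cleaned_data]
```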

The sample output for above code looks like below image.

Image by author

Dataset Creation:

We will create padded batches (we pad the sentences of each batch independently), so that we add the minimum number of padding tokens possible. For that, we sort the sentences by length, apply padded_batch and then shuffle.

Source Code by author
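
For reference, a sketch of that batching logic (variable names are my assumptions; data_inputs are the token-id lists from the previous step):

```python
import random

BATCH_SIZE = 32

# Labels aligned with data_inputs; Sentiment140 encodes positives as 4,
# so remap them to 1 for a binary target
data_labels = data.sentiment.values.copy()
data_labels[data_labels == 4] = 1

# Keep each tokenized tweet together with its label and length
data_with_len = [[sent, data_labels[i], len(sent)]
                 for i, sent in enumerate(data_inputs)]
random.shuffle(data_with_len)

# Sort by length so that each padded batch groups tweets of similar size
data_with_len.sort(key=lambda x: x[2])
sorted_all = [(sent_lab[0], sent_lab[1]) for sent_lab in data_with_len]

all_dataset = tf.data.Dataset.from_generator(
    lambda: sorted_all, output_types=(tf.int32, tf.int32))

# Pad each batch only up to the longest tweet it contains
all_batched = all_dataset.padded_batch(BATCH_SIZE, padded_shapes=((None,), ()))
```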

Training and Test dataset:

In order to create the train and test datasets, we take BATCH_SIZE = 32 and apply this batch size to the whole dataset. We then shuffle the batches, keep a tenth of them as test data and use the rest as training data.
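
A split along those lines might look like this (a sketch, continuing from the batched dataset above):

```python
NB_BATCHES = len(sorted_all) // BATCH_SIZE
NB_BATCHES_TEST = NB_BATCHES // 10   # a tenth of the batches held out for testing

# Shuffle whole batches, then split them into test and train
all_batched = all_batched.shuffle(NB_BATCHES)
test_dataset = all_batched.take(NB_BATCHES_TEST)
train_dataset = all_batched.skip(NB_BATCHES_TEST)
```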

Model building and training:

So we are done with the data processing phase and are now ready to start building our model. This is the CNN I mentioned at the beginning of this post. The idea is to have three different convolutional filters of sizes two, three and four, take the max of each, concatenate everything and use the result to get our classification done.

Image by author
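
For illustration, a sketch of a CNN with that shape (class and parameter names are my assumptions, not necessarily the author's exact code):

```python
from tensorflow.keras import layers

class DCNN(tf.keras.Model):
    def __init__(self, vocab_size, emb_dim=128, nb_filters=50,
                 ffn_units=512, dropout_rate=0.1, name="dcnn"):
        super(DCNN, self).__init__(name=name)
        self.embedding = layers.Embedding(vocab_size, emb_dim)
        # Three parallel 1D convolutions looking at 2, 3 and 4 tokens at a time
        self.bigram = layers.Conv1D(filters=nb_filters, kernel_size=2,
                                    padding="valid", activation="relu")
        self.trigram = layers.Conv1D(filters=nb_filters, kernel_size=3,
                                     padding="valid", activation="relu")
        self.fourgram = layers.Conv1D(filters=nb_filters, kernel_size=4,
                                      padding="valid", activation="relu")
        self.pool = layers.GlobalMaxPool1D()
        self.dense_1 = layers.Dense(units=ffn_units, activation="relu")
        self.dropout = layers.Dropout(rate=dropout_rate)
        # Binary task: a single sigmoid output
        self.last_dense = layers.Dense(units=1, activation="sigmoid")

    def call(self, inputs, training=False):
        x = self.embedding(inputs)
        x_1 = self.pool(self.bigram(x))    # max over each bigram filter
        x_2 = self.pool(self.trigram(x))   # max over each trigram filter
        x_3 = self.pool(self.fourgram(x))  # max over each fourgram filter
        merged = tf.concat([x_1, x_2, x_3], axis=-1)  # concatenate the three views
        merged = self.dropout(self.dense_1(merged), training=training)
        return self.last_dense(merged)
```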

After building the structure of the model, we need to set its parameters to start training. Below are the parameters passed in this example.
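
A sketch of that step, with illustrative hyperparameter values (these particular numbers are my assumptions):

```python
VOCAB_SIZE = len(tokenizer.vocab)   # size of the BERT vocabulary
EMB_DIM = 200
NB_FILTERS = 100
FFN_UNITS = 256
DROPOUT_RATE = 0.2

Dcnn = DCNN(vocab_size=VOCAB_SIZE, emb_dim=EMB_DIM, nb_filters=NB_FILTERS,
            ffn_units=FFN_UNITS, dropout_rate=DROPOUT_RATE)

Dcnn.compile(loss="binary_crossentropy",
             optimizer="adam",
             metrics=["accuracy"])
```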

Fitting CNN model to training data:

We ran 5 epochs to fit the model and got the training loss and accuracy shown below.
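
The training call itself can be as simple as this (a sketch):

```python
Dcnn.fit(train_dataset, epochs=5)
```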

Epoch 1/5 — loss: 0.4289 — accuracy: 0.8025

Epoch 2/5 — loss: 0.4289 — accuracy: 0.8025

Epoch 3/5 — loss: 0.3412 — accuracy: 0.8517

Epoch 4/5 — loss: 0.3010 — accuracy: 0.8715

Epoch 5/5 — loss: 0.2638 — accuracy: 0.8885

image by author

Evaluate Model on Test dataset:

Once the CNN model is trained, we need to evaluate its performance on the test data; below are the loss and accuracy.
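
A sketch of the evaluation call:

```python
results = Dcnn.evaluate(test_dataset)
print(results)   # [loss, accuracy] on the held-out batches
```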

loss: 0.4114 — accuracy: 0.8322

image by author

Let’s also evaluate the model on a couple of sentences. I have written a separate function for this purpose, shown below.

Source code by author
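
A sketch of such a helper (the function name and the output format are my assumptions):

```python
def get_prediction(sentence):
    # Tokenize the sentence exactly as during training and add a batch dimension
    tokens = encode_sentence(sentence)
    inputs = tf.expand_dims(tokens, 0)
    output = Dcnn(inputs, training=False)
    score = output.numpy()[0][0]
    sentiment = "positive" if score > 0.5 else "negative"
    print("Model output: {:.4f} -> predicted sentiment: {}".format(score, sentiment))

get_prediction("This movie was pretty interesting.")
```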

We just need to pass a sentence to this function to get its sentiment as positive or negative.

image by author
image by author

Conclusion:

The difference between the train and test accuracy is about 5%, which suggests that we still have to fine-tune the model; I leave that to you as a task. Let me know your thoughts on tuning this model.

Thanks for reading!


Tarique Akhtar
Data Science Professional, love to learn new things! We can connect through LinkedIn (https://www.linkedin.com/in/tarique-akhtar-6b902651/).