[Fine Tune] Fine Tuning BERT for Sentiment Analysis

Shawn Ding
8 min read · Jan 2, 2023

--

Key tags: fine-tune; BERT; sentiment analysis; classification

In this blog, we will walk through fine-tuning the BERT model for IMDB review sentiment analysis. The tutorial is written to be general, so it can also be applied to fine-tuning BERT for sentiment analysis on your own dataset.

The 🐋 IMDB Review Sentiment Analysis Dataset 🐋 will be used for a sentiment classification task. It contains movie reviews from IMDB with accompanying sentiment labels. The reviews are lengthy (most exceed 200 words), which helps ensure their quality. The dataset is split into training, development, and test sets containing 39,723, 4,998, and 4,995 reviews, respectively. Each row in the file has two columns: the text of the review and the corresponding sentiment label (0: positive, 1: negative).
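
If you want to double-check those statistics yourself, a quick inspection of the CSV files (the same Train.csv / Valid.csv / Test.csv loaded in Step 2) could look like this:

import pandas as pd

df_train = pd.read_csv('./dataset/Train.csv')
df_val = pd.read_csv('./dataset/Valid.csv')
df_test = pd.read_csv('./dataset/Test.csv')

print(len(df_train), len(df_val), len(df_test))            # split sizes
print(df_train['label'].value_counts())                    # label distribution
print(df_train['text'].str.split().str.len().describe())   # review length in words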

💬 The IMDB Review Sentiment Analysis Dataset 💬 https://github.com/xding2/Fine-Tuning-NLP-Model/tree/main/Sentiment%20Analysis

Before we start coding, let me answer a general question.

👁️‍🗨️ Is any pre-processing needed when fine-tuning BERT?

Stop words do not have to be removed when fine-tuning a BERT model. Some practitioners remove them as a preprocessing step because it can help the model focus on the most informative parts of the text and may improve performance; others leave them in, since BERT was pre-trained on full sentences and can use stop words as context. Whether to include this step depends on the specific problem you are trying to solve and the goals of your fine-tuning. If removing stop words improves the model’s performance, it is a useful preprocessing step; if it degrades performance, it is best to leave them in. Ultimately, the decision should be based on experiments and evaluations on your own dataset and task.

If you don’t want to do any text pre-processing, there is no problem fine-tuning the BERT model directly on the raw reviews. In this tutorial, however, I use str.replace() and a lambda function to remove stop words based on nltk’s stopwords list.
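
As a tiny illustration of what this does to a review (and of the trade-off involved), here is the same idea applied to a single sentence; the full preprocessing over the data frames is shown in Step 2:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

sample = "This movie was not what I expected, but the acting is brilliant."
print(' '.join(w for w in sample.lower().split() if w not in stop_words))
# -> "movie expected, acting brilliant."
# note that "not" is in the stop word list, which is one reason removal can hurt sentiment tasks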

✍ STEP.1 IMPORT PYTHON LIBRARY 💦

# if you do not have transformers, please !pip install transformers
import transformers
from transformers import get_linear_schedule_with_warmup
from transformers import BertTokenizer
from transformers import BertForSequenceClassification
from transformers import AdamW

# if you do not have torch, please refer to https://pytorch.org/ [INSTALL PYTORCH]
import torch
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from torch.utils.data import random_split

import pandas as pd
import re
import string
import operator
import numpy as np
import random

from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

This code prints the version of the transformers library, sets a seed value of 38 for Python’s random module, NumPy, and PyTorch (including CUDA) so the results are reproducible, and selects the GPU as the device, which is then printed to the console. It does not run any model training or inference; it only sets some initial values and prints information:

print(transformers.__version__)
seed = 38
device = torch.device('cuda')
print('\n')
print(device)

random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True

✍ STEP.2 Text Pre-Processing 💦

This code performs the preprocessing. It first imports stopwords from the nltk library and downloads the English stop word list, then reads the data from the three CSV files into df_train, df_test, and df_val. It also initializes a BERT tokenizer from the ‘bert-base-uncased’ model with the do_lower_case parameter set to True. Next, we print the shapes of the three data frames and build a new column called ‘pre_text’ in the training and validation frames (the test set is read but not preprocessed here). The preprocessing applied to the ‘text’ column consists of: converting the text to lowercase, removing HTML line breaks, removing punctuation, and removing stop words, with the result stored in the ‘pre_text’ column:

from nltk.corpus import stopwords
import nltk

nltk.download('stopwords')
stop = stopwords.words('english')

df_train = pd.read_csv('./dataset/Train.csv')
df_test = pd.read_csv('./dataset/Test.csv')
df_val = pd.read_csv('./dataset/Valid.csv')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

print(df_train.shape, df_test.shape, df_val.shape)
print('\n')
# check the model max len = 512
# print(tokenizer)
# get the list of {content, token, ids}

# each step operates on 'pre_text' so the previous step is preserved; '<br />' is removed before punctuation
df_val['pre_text'] = df_val['text'].str.lower()
df_val['pre_text'] = df_val['pre_text'].str.replace('<br />', ' ')
df_val['pre_text'] = df_val['pre_text'].str.replace(r'[^\w\s]+', '', regex=True)
df_val['pre_text'] = df_val['pre_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

df_train['pre_text'] = df_train['text'].str.lower()
df_train['pre_text'] = df_train['pre_text'].str.replace('<br />', ' ')
df_train['pre_text'] = df_train['pre_text'].str.replace(r'[^\w\s]+', '', regex=True)
df_train['pre_text'] = df_train['pre_text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))

print('Text Pre-Processing Finish!')

# To simplify the process, I make all df_val['text'] = df_val['pre_text']; df_train['text'] = df_train['pre_text']
df_val['text'] = df_val['pre_text']
df_train['text'] = df_train['pre_text']

print(df_train.shape, df_test.shape, df_val.shape)
content = df_train['text'].values
labels = df_train['label'].values

✍ STEP.3 Encode the Input Token 💦

This code defines a function encoding_process which takes in a list of text strings and returns a tensor of encoded token IDs. The function iterates over the list of text strings and uses the encode method of the tokenizer object to convert each text string into a sequence of token IDs. The add_special_tokens argument is set to True, which adds special tokens such as '[CLS]' and '[SEP]' to the beginning and end of the tokenized input, respectively. The max_length argument is set to 256 and pad_to_max_length is set to True, so the input is padded with the [PAD] token (ID 0) if it is shorter than 256 tokens and truncated if it is longer. (In recent versions of transformers, pad_to_max_length is deprecated; padding='max_length' together with truncation=True gives the same behavior.) The return_tensors argument is set to 'pt', which returns the encoded token IDs as a PyTorch tensor. The list of encoded token ID tensors is then concatenated into a single tensor using torch.cat, and this tensor is returned by the function:

def encoding_process(_content):
    get_ids = []
    for text in _content:
        input_ids = tokenizer.encode(
            text,
            add_special_tokens=True,
            max_length=256,
            pad_to_max_length=True,
            return_tensors='pt')
        get_ids.append(input_ids)

    get_ids = torch.cat(get_ids, dim=0)
    return get_ids
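
Before running this over the whole dataset, it is worth sanity-checking the tokenizer on one short sentence and decoding the IDs back; a quick check using the tokenizer and the function defined above:

sample_ids = encoding_process(['the movie was great'])
print(sample_ids.shape)  # torch.Size([1, 256])
print(tokenizer.convert_ids_to_tokens(sample_ids[0][:8].tolist()))
# ['[CLS]', 'the', 'movie', 'was', 'great', '[SEP]', '[PAD]', '[PAD]']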

The code then calls the encoding_process function on the content variable and assigns the returned tensor to the get_ids variable. The labels variable is converted into a tensor using the torch.tensor function. The return_dict variable is set to False.

The val_content variable is set to the 'text' column of a df_val dataframe, and the val_labels variable is set to the 'label' column of the same dataframe. The encoding_process function is then called on the val_content variable and the result is assigned to the val_get_ids variable. The val_labels variable is converted into a tensor using the torch.tensor function:

# make sure return_dict is not default
return_dict = False

# Training dataset
content = df_train['text'].values
labels = df_train['label'].values
get_ids = encoding_process(content)
labels = torch.tensor(labels)

# Validation dataset
val_content = df_val['text'].values
val_labels = df_val['label'].values
val_get_ids = encoding_process(val_content)
val_labels = torch.tensor(val_labels)

✍ STEP.4 Preparation: Define the variables before fine-tuning 💦

This code sets the number of training epochs to 3 and the batch size to 16. It creates two PyTorch TensorDataset objects, one from the training inputs get_ids and labels and one from the validation inputs val_get_ids and val_labels, and wraps each in a DataLoader so the data can be loaded in batches. It then loads the BertForSequenceClassification model from the transformers library and specifies some of its hyperparameters, such as the number of output labels and whether to output attentions and hidden states. The model is moved to the GPU for acceleration, and an AdamW optimizer is defined for training. The code also specifies an output file path for the trained model and sets up a learning rate scheduler to adjust the learning rate during training:
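
Concretely, the datasets and loaders described above can be built like this (3 epochs and a batch size of 16 as described; shuffling the training loader is my own choice):

epochs = 3
batch_size = 16

# wrap the encoded inputs and labels from Step 3 into TensorDatasets
train_dataset = TensorDataset(get_ids, labels)
val_dataset = TensorDataset(val_get_ids, val_labels)

# DataLoaders feed the model in batches during fine-tuning
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)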

# Bert-based-model
# reference
# https://huggingface.co/transformers/model_doc/bert.html
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2, output_attentions=False, output_hidden_states=False)
model.cuda()
optimizer = AdamW(model.parameters(), lr=2e-5)
output_model = './content/model/imdb_bert.pth'
total_steps = len(train_dataloader) * epochs
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)

# save
def save(model, optimizer):
    # save the fine-tuned weights and the optimizer state
    torch.save({
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict()
    }, output_model)

# reference
# https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

def accuracy_calc(preds, labels):
    pre = np.argmax(preds, axis=1).flatten()
    real = labels.flatten()
    return accuracy_score(real, pre)

def f1_accuracy(preds, labels):
    pre = np.argmax(preds, axis=1).flatten()
    real = labels.flatten()
    return f1_score(real, pre)

✍ STEP.5 Fine-tuning BERT model for our dataset 💦

This code is training a model using the train_dataloader and val_dataloader to iterate over the training and validation datasets, respectively. The training loop consists of the following steps:

  1. Set the model to training mode using model.train().
  2. Initialize variables for accumulating the total training loss, validation loss, evaluation accuracy, and F1 scores.
  3. For each batch in the training set: zero the gradients with model.zero_grad(); pass the input IDs, attention mask, and labels to the model, which returns the loss and the logits; backpropagate the loss, clip the gradients to prevent them from getting too large, then step the optimizer and the learning rate scheduler; finally, accumulate the F1 score for the training batch.
  4. Set the model to evaluation mode using model.eval().
  5. For each batch in the validation set: run the model inside torch.no_grad() so no gradients are computed and no parameters are updated, and accumulate the validation loss, accuracy, and F1 score.
  6. Compute the average training loss, validation loss, evaluation accuracy, and F1 scores for the epoch and print them to the console.
  7. Save the model and optimizer using the save() function.

This training loop will repeat for a specified number of epochs. At the end of each epoch, the model’s performance on the training and validation sets will be printed to the console and the model will be saved.

# 💥 IMPORTANT: Please create the directory in your environment,
# such as './content/model/', so the model can be saved locally!
for epoch in range(epochs):
    model.train()
    total_loss, total_val_loss = 0, 0
    total_eval_accuracy = 0
    _f1 = 0
    _train_f1 = 0
    for step, batch in enumerate(train_dataloader):
        model.zero_grad()
        loss, tval_ = model(batch[0].to(device), token_type_ids=None, attention_mask=(batch[0] > 0).to(device), labels=batch[1].to(device), return_dict=False)
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        tval_ = tval_.detach().cpu().numpy()
        label_ids = batch[1].to('cpu').numpy()
        _train_f1 += f1_accuracy(tval_, label_ids)

    model.eval()
    for i, batch in enumerate(val_dataloader):
        with torch.no_grad():
            loss, val_ = model(batch[0].to(device), token_type_ids=None, attention_mask=(batch[0] > 0).to(device), labels=batch[1].to(device), return_dict=False)

        total_val_loss += loss.item()

        val_ = val_.detach().cpu().numpy()
        label_ids = batch[1].to('cpu').numpy()
        total_eval_accuracy += accuracy_calc(val_, label_ids)
        _f1 += f1_accuracy(val_, label_ids)

    training_loss = total_loss / len(train_dataloader)
    valid_loss = total_val_loss / len(val_dataloader)
    _accuracy = total_eval_accuracy / len(val_dataloader)
    _f1_score = _f1 / len(val_dataloader)
    train_f1_score = _train_f1 / len(train_dataloader)

    print('Training loss is', training_loss)
    print('Valid loss is:', valid_loss)
    print('Acc score is:', _accuracy)
    print('F1_score is:', _f1_score)
    print('train_F1_score is:', train_f1_score)
    print('\n')

    save(model, optimizer)
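
Once fine-tuning finishes, the saved checkpoint can be reloaded for predictions on new reviews. Here is a minimal sketch, assuming the same BertForSequenceClassification setup, the encoding_process function from Step 3, and the output_model path defined above:

# rebuild the architecture, then restore the fine-tuned weights
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
checkpoint = torch.load(output_model)
model.load_state_dict(checkpoint['model_state_dict'])
model.to(device)
model.eval()

review = 'One of the best films I have seen in years.'
input_ids = encoding_process([review]).to(device)
with torch.no_grad():
    logits = model(input_ids, attention_mask=(input_ids > 0))[0]
print('predicted label:', logits.argmax(dim=1).item())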

In this blog, we introduced fine-tuning the BERT model for IMDB Review Sentiment Analysis. We discussed the role of pre-processing and how to encode the input tokens. We also went over how to fine-tune the BERT model and showed how to train it, evaluate its performance, and save it. With this blog, we hope to have provided a practical guide to fine-tuning BERT for sentiment analysis.

If you want the full code, please let me know! 👋
