A Gentle Introduction to Implementing BERT Using Hugging Face!

Rajat Bhatnagar · Published in Analytics Vidhya · May 31, 2020

In this article, I’m going to share my learnings from implementing Bidirectional Encoder Representations from Transformers (BERT) using the Hugging Face library. BERT is a state-of-the-art model developed by Google for various Natural Language Processing (NLP) tasks. In this post, we are going to build a sentiment analysis classifier using the Stanford Treebank dataset, which contains positive and negative sentences from movie reviews. We will be using the BertForSequenceClassification module from the Hugging Face library for this purpose. The code is available at https://github.com/rajatbhatnagar94/bert_getting_started.

Quick Notes

To understand this article completely, I highly recommend that you have a basic understanding of PyTorch concepts. I found this very helpful for picking up the basics, and this article useful for understanding the internal workings of the BERT model.

Downloading and Saving the Dataset

For simplicity, I have uploaded the data to the GitHub repository here. You can download the training, development, and testing sets from there and save them in any directory (data/<dataset_type>.csv in my case), or directly clone the repository. The dataset is pretty small, so it will not take long to train or to clone the repository.

Let’s Begin!

Let’s import all the libraries which will be used throughout the tutorial. Instructions for installing the libraries are given in the README.md.

import os
import pandas as pd
import torch
import transformers
import sklearn
from transformers import BertTokenizer, BertForSequenceClassification
from IPython.core.display import display, HTML

Read all the datasets

The training, testing, and development datasets can be imported as follows:

dataset = {
    "name": "Stanford treebank",
    "train_path": "data/train.csv",
    "dev_path": "data/dev.csv",
    "test_path": "data/test.csv",
    "classes": ["neg", "pos"]
}

def read_data():
    train = pd.read_csv(dataset['train_path'], sep='\t')
    dev = pd.read_csv(dataset['dev_path'], sep='\t')
    test = pd.read_csv(dataset['test_path'], sep='\t')
    return train, dev, test

train, dev, test = read_data()
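
As a quick sanity check, you can peek at the loaded dataframes. This snippet is just an illustration; it assumes the CSVs contain the text and classification columns that are used later in the tutorial.

# Optional sanity check: the CSVs are expected to contain
# 'text' and 'classification' columns, which are used below.
print(train.shape, dev.shape, test.shape)
print(train.head())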

Making Batches and Using DataLoader

Once we have the raw data, we need to do two things before training:

  1. Tokenize the text sentences and convert them to vectorized form
  2. Create batches of the vectorized tokens using DataLoader for the training, development, and test sets

Tokenize the text sentences and convert them to vectorized form

Convert the data into the format that we will pass to the BERT model. For this we will use the tokenizer.encode_plus function provided by Hugging Face. First we define the tokenizer; we’ll be using BertTokenizer for this.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

We’ll be passing two variables to BERT’s forward function later, namely input_ids and attention_mask. The input_ids are simply the numeric representations of the tokens. The attention_mask is useful when we add padding to the input tokens: it tells the model which input_ids correspond to padding. Padding is added because we want all the input sentences to be of the same length (at least within a batch) so that we can form the tensor objects properly. We will use the tokenizer.encode_plus function to obtain both input_ids and attention_mask.
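
To get a feel for what encode_plus returns, here is a quick illustrative sketch on a single made-up sentence (the sentence and the sample variable name are mine, not part of the original code), before we wrap the call in the encode helper below.

# Illustration only: encode one made-up sentence and inspect the output.
sample = tokenizer.encode_plus("the movie was surprisingly good",
                               max_length=128,
                               add_special_tokens=True,
                               pad_to_max_length=True,
                               return_attention_mask=True)
print(sample['input_ids'][:10])       # token ids; starts with the [CLS] id (101)
print(sample['attention_mask'][:10])  # 1 for real tokens, 0 for [PAD]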

def encode(data, tokenizer):
    input_ids = []
    attention_mask = []
    for text in data:
        tokenized_text = tokenizer.encode_plus(text,
                                               max_length=128,
                                               add_special_tokens=True,
                                               pad_to_max_length=True,
                                               padding_side='right',
                                               return_attention_mask=True)
        input_ids.append(tokenized_text['input_ids'])
        attention_mask.append(tokenized_text['attention_mask'])
    return torch.tensor(input_ids, dtype=torch.long), torch.tensor(attention_mask, dtype=torch.long)

The encode function above iterates over all sentences; for each sentence it tokenizes the text, truncates it or adds padding to make it of length 128, adds the special tokens ([CLS], [SEP], [PAD]), and also returns the attention_mask. We will need all of these to pass to the forward function of the BERT classifier.
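To see the special tokens in action, here is a small sketch (the ids, mask, and tokens names are mine) that runs encode on the first training sentence and converts the ids back to tokens; for a short sentence, the trailing positions are padding.

# Illustration: encode the first training sentence and look at the resulting tokens.
ids, mask = encode(list(train['text'].values)[:1], tokenizer)
tokens = tokenizer.convert_ids_to_tokens(ids[0].tolist())
print(tokens[:8])   # begins with [CLS] followed by word pieces
print(tokens[-3:])  # trailing positions are [PAD] for a short sentence
print(mask[0].tolist()[:8])  # 1 for real tokens, 0 for padding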

Creating batches of the vectorized tokens using DataLoader for the training, development, and test sets

The second thing we need to do before passing the input to the train function is to make batches of the dataset. We create a function get_batches, which calls the encode function above to create batches. We use the TensorDataset and DataLoader classes from PyTorch for this task.

def get_batches(df, tokenizer, batch_size=2):
    x = list(df['text'].values)
    y_indices = df['classification'].apply(lambda each_y: dataset['classes'].index(each_y))
    y = torch.tensor(list(y_indices), dtype=torch.long)
    input_ids, attention_mask = encode(x, tokenizer)
    tensor_dataset = torch.utils.data.TensorDataset(input_ids, attention_mask, y)
    tensor_randomsampler = torch.utils.data.RandomSampler(tensor_dataset)
    tensor_dataloader = torch.utils.data.DataLoader(tensor_dataset, sampler=tensor_randomsampler, batch_size=batch_size)
    return tensor_dataloader

Making Batches for train, test and dev sets:

batch_train = get_batches(train, tokenizer, batch_size=2)
batch_dev = get_batches(dev, tokenizer, batch_size=2)
batch_test = get_batches(test, tokenizer, batch_size=2)

Now we have the batches batch_train, batch_dev, and batch_test. Each element of these batches is a tuple that contains input_ids (batch_size x max_sequence_length), attention_mask (batch_size x max_sequence_length), and labels (batch_size), which are required for training our model!
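
If you want to confirm these shapes yourself, a minimal sketch (the variable names here are mine) is to pull one batch from the loader:

# Peek at one batch to confirm the tensor shapes described above.
input_ids_b, attention_mask_b, labels_b = next(iter(batch_train))
print(input_ids_b.shape)       # (batch_size, max_sequence_length) -> torch.Size([2, 128])
print(attention_mask_b.shape)  # torch.Size([2, 128])
print(labels_b.shape)          # (batch_size,) -> torch.Size([2])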

Writing the Train function

Now we are all set to train our model. This train function works just like training a normal PyTorch model. We first set the mode to training, then we iterate through each batch and transfer it to the GPU. We pass the input_ids, attention_mask, and labels to the model. It gives us the output, which consists of loss, logits, hidden_states_output, and attention_mask_output. The loss contains the classification loss value. We call the backward function on the loss to calculate the gradients of the parameters of the BERT model. We then call clip_grad_norm_ to prevent the gradient norm from growing too large. Then we call optimizer.step() to update the parameters using the gradients calculated by loss.backward(). scheduler.step() updates the learning rate according to the scheduler.

def train_model(batch, model, optimizer, scheduler, epochs, device):
    model.train()  # Set the mode to training
    for e in range(epochs):
        for i, batch_tuple in enumerate(batch):
            batch_tuple = (t.to(device) for t in batch_tuple)
            input_ids, attention_mask, labels = batch_tuple
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss, logits, hidden_states_output, attention_mask_output = outputs
            if i % 100 == 0:
                print("loss - {0}, iteration - {1}/{2}".format(loss, e + 1, i))
            model.zero_grad()
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), parameters['max_grad_norm'])
            optimizer.step()
            scheduler.step()

Evaluation

The evaluation function is similar to the train_model function we wrote earlier. We set the mode of the model to eval. We then iterate over each batch and execute the forward function of the model under torch.no_grad(), which ensures that this time we are not calculating gradients. We obtain output similar to what we obtained in the training step. We will make use of the logits variable to get the predictions: it contains the raw score for each class, before the softmax, so we simply take the argmax of the logits to get the predicted label.

One more interesting output field is attention_mask_output, which contains the “attention” at each layer. It is a tuple of size 12, representing the 12 layers of the BERT model. Each element of the tuple is an attention tensor of shape (batch_size (2), number_of_heads (12), max_sequence_length (128), max_sequence_length (128)). We select the last layer and last head of attention_mask_output and return the values corresponding to the [CLS] token. These values give us an overall impression of the importance assigned to each token in the sentence when the classifier made its prediction. This is a tensor of shape (batch_size (2), max_sequence_length (128)); the values for each example sum to 1, and the values for the padded tokens are 0. We will use this to display the attention at the end.

def evaluate(batch, model, device):
    input_ids, predictions, true_labels, attentions = [], [], [], []
    model.eval()
    for i, batch_cpu in enumerate(batch):
        batch_gpu = (t.to(device) for t in batch_cpu)
        input_ids_gpu, attention_mask, labels = batch_gpu
        with torch.no_grad():
            loss, logits, hidden_states_output, attention_mask_output = model(input_ids=input_ids_gpu, attention_mask=attention_mask, labels=labels)
            logits = logits.cpu()
            prediction = torch.argmax(logits, dim=1).tolist()
            true_label = labels.cpu().tolist()
            input_ids_cpu = input_ids_gpu.cpu().tolist()
            attention_last_layer = attention_mask_output[-1].cpu()  # select the last attention layer
            attention_softmax = attention_last_layer[:, -1, 0].tolist()  # select the last head's attention for the [CLS] token
            input_ids += input_ids_cpu
            predictions += prediction
            true_labels += true_label
            attentions += attention_softmax
    return input_ids, predictions, true_labels, attentions

Finally, Running the Code

Now we have all the functions for executing our code. We define some hyperparameters: the number of epochs, learning rate, warmup steps, number of training steps, and max_grad_norm. We also initialize our model from BertForSequenceClassification and move it to the device (GPU if available). Finally, we define the optimizer and scheduler with the hyperparameters chosen above.

epochs = 2
parameters = {
    'learning_rate': 2e-5,
    'num_warmup_steps': 1000,
    'num_training_steps': len(batch_train) * epochs,
    'max_grad_norm': 1
}

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2, output_hidden_states=True, output_attentions=True)
model.to(device)
optimizer = transformers.AdamW(model.parameters(), lr=parameters['learning_rate'], correct_bias=False)
scheduler = transformers.get_linear_schedule_with_warmup(optimizer,
                                                         num_warmup_steps=parameters['num_warmup_steps'],
                                                         num_training_steps=parameters['num_training_steps'])

Train the model. This shouldn’t take long since the training set is small; it took around 8 minutes for me with 1 GPU. We print the loss at every 100th iteration. You could also calculate an interim accuracy on the development set by calling the evaluate function, but I have left that out of this basic implementation.

train_model(batch_train, model, optimizer, scheduler, epochs, device)

After training the model, we can evaluate the development and testing sets. We print the classification report using sklearn.metrics after evaluating on the development set.

input_ids, predictions, true_labels, attentions = evaluate(batch_dev, model, device)
print(sklearn.metrics.classification_report(true_labels, predictions))
Results for Stanford Treebank Dataset using BERT classifier

With very little hyperparameter tuning we get an F1 score of 92%. The score can be improved by using different hyperparameters, optimizers, and schedulers, but we will not discuss those in this article.
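
If you also want numbers on the held-out test set, the same evaluate call can be reused on batch_test. Here is a minimal sketch (the test_* variable names are mine):

# Evaluate on the test batches in the same way as the dev set.
test_input_ids, test_predictions, test_true_labels, test_attentions = evaluate(batch_test, model, device)
print(sklearn.metrics.classification_report(test_true_labels, test_predictions))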

So this was a very basic way to build a sentiment classifier using BERT. You can follow a similar approach to explore different tasks which can be solved using BERT.

The next sections show a visualization of the importance given to each token in the sentence by the [CLS] token. This gives us some information about why the model made a particular prediction and which tokens in the sentence mattered most.

Visualization of Attention Layer

When I initially made this classifier, I was not able to figure out why a particular prediction was made. There are various methods to visualize predictions; LIME is a pretty well-known library for this, but it is very slow. Instead, I have used the attention values obtained earlier for visualization, highlighting each token with an intensity proportional to its attention score. A few examples are shown in the image below.

def get_length_without_special_tokens(sentence):
    length = 0
    for i in sentence:
        if i == 0:  # stop at the first [PAD] token (id 0)
            break
        else:
            length += 1
    return length

def print_attention(input_ids_all, attentions_all, tokenizer):
    for input_ids, attention in zip(input_ids_all, attentions_all):
        html = []
        len_input_ids = get_length_without_special_tokens(input_ids)
        input_ids = input_ids[:len_input_ids]
        attention = attention[:len_input_ids]
        for input_id, attention_value in zip(input_ids, attention):
            token = tokenizer.convert_ids_to_tokens(input_id)
            html.append('<span style="background-color: rgb(255,255,0,{0})">{1}</span>'.format(10 * attention_value, token))
        html_string = " ".join(html)
        display(HTML(html_string))

print_attention(input_ids, attentions, tokenizer)
Highlighting the important tokens which are given by the attention layer

Saving and Loading the Model

It is important to save the model so that you don’t have to train it again. You can call the model.save_pretrained and tokenizer.save_pretrained functions to save the model and the tokenizer respectively. You can reload the model by passing the same path to BertForSequenceClassification.from_pretrained('/path') when initializing the model.

def save(model, tokenizer):
    output_dir = './output'
    if not os.path.exists(output_dir):
        os.makedirs(output_dir)
    print("Saving model to {}".format(output_dir))
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)

save(model, tokenizer)
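
For completeness, here is a minimal sketch of reloading the saved model and tokenizer from the output directory above (the loaded_* names are mine, not part of the original code):

# Reload the fine-tuned model and tokenizer from the saved directory.
loaded_model = BertForSequenceClassification.from_pretrained('./output')
loaded_tokenizer = BertTokenizer.from_pretrained('./output')
loaded_model.to(device)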

Conclusion

That’s it! You have successfully implemented a simple BERT classifier for classifying a movie review as positive or negative. This was a very basic implementation to just let you get started. Hope you enjoyed it.
