Fine Tuning BERT with CoLA Dataset

Ashwin N
12 min read · Jan 17, 2023

Think of the original Transformer as a model built with LEGO® bricks. The construction set contains bricks such as encoders, decoders, embedding layers, positional encoding methods, multi-head attention layers, masked multi-head attention layers, post-layer normalization, feed-forward sub-layers, and linear output layers. The bricks come in various sizes and forms. You can spend hours building all sorts of models using the same building kit! Some constructions will only require some of the bricks. Other constructions will add a new piece, just like when we obtain additional bricks for a model built using LEGO® components.

Figure 1: BERT Peak

BERT added a new piece to the Transformer building kit: a bidirectional multi-head attention sub-layer. Because BERT marks a departure from LSTM-based approaches to NLP, we can place it at the top of the Transformer and call it “BERT Peak”.

When we humans are having problems understanding a sentence, we do not just look at the past words. BERT, like us, looks at all the words in the same sentence at the same time.

In this chapter, we will first explore the architecture of Bidirectional Encoder Representations from Transformers (BERT). BERT only uses the blocks of the encoders of the Transformer in a novel way and does not use the decoder stack.

Then we will fine-tune a pretrained BERT model. The BERT model we will fine-tune was trained by a third party and uploaded to Hugging Face. Transformers can be pretrained. Then, a pretrained BERT, for example, can be fine-tuned on several NLP tasks. We will go through this fascinating experience of downstream Transformer usage using Hugging Face modules.

This article will cover the following topics:

  • Preparing the pretraining environment
  • Defining pretraining encoder layers
  • Defining fine-tuning
  • Downstream multitasking
  • Building a fine-tuning BERT model
  • Loading an acceptability judgement dataset
  • Creating attention masks
  • BERT model configuration
  • Measuring the performance of the fine-tuned model

CoLA Dataset

We will fine-tune a BERT model to predict the downstream task of Acceptability Judgements and measure the predictions with the Matthews Correlation Coefficient (MCC). The BERT model we will fine-tune will be trained on The Corpus of Linguistic Acceptability (CoLA).

Activating the GPU

Pretraining a multi-head attention transformer model requires the parallel processing GPUs can provide.

import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

# Found GPU at: /device:GPU:0

Installing the Hugging Face PyTorch interface for BERT

Hugging Face provides a pretrained BERT model. Hugging Face developed a base class named PreTrainedModel; through this class, we can load a model from a pretrained model configuration.

Hugging Face provides modules in TensorFlow and PyTorch. I recommend that developers feel comfortable with both environments.

We will install the modules required as follows:

# !pip install -q transformers
import transformers
import torch
import datetime
from torch.utils.data import (TensorDataset, DataLoader, RandomSampler,
                              SequentialSampler)
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertConfig, BertModel
from transformers import (AdamW, BertForSequenceClassification,
                          get_linear_schedule_with_warmup)
from tqdm import tqdm, trange

Let us import the other common libraries:

import pandas as pd
import io
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

Specifying CUDA as the device for torch

We will now specify that torch uses the Compute Unified Device Architecture (CUDA) to put the parallel computing power of the NVIDIA card to work for our multi-head attention model.
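The notebook cell for this step is not shown above, but the later cells reference a device variable, so a minimal sketch of the setup (the variable name matches its later use; the exact cell is an assumption) looks like this:

# Minimal sketch (assumption): define the `device` used by the later cells
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)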

1. Loading Dataset

We will now load CoLA, based on the Warstadt et al. (2018) paper. More information on this dataset can be found at https://huggingface.co/datasets/shivkumarganesh/CoLA

df = pd.read_csv("/kaggle/input/cola-public/in_domain_train.tsv",
                 delimiter='\t', header=None,
                 names=['sentence_source', 'label', 'label_notes', 'sentence'])

df.shape

# (8551, 4)

Each sample in the .tsv files contains four tab-separated columns:

  • sentence_source: the source of the sentence (code)
  • label: the label (0=unacceptable, 1=acceptable)
  • label_notes: the author’s original acceptability annotation
  • sentence: the sentence to be classified

Creating sentences, label lists, and adding BERT tokens:

# Adding special tokens to the sentences
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in df.sentence.values]
labels = df.label.values

2. Activating the BERT Tokenizer

In this section, we will initialize a pretrained BERT tokenizer. This will save the time it would take to train it from scratch.

The program selects an uncased tokenizer, activates it, and displays the first tokenized sentence:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

print ("Tokenize the first sentence:")
print (tokenized_texts[0])

# Tokenize the first sentence:
# ['[CLS]', 'our', 'friends', 'won', "'", 't', 'buy', 'this', 'analysis', ',', 'let', 'alone', 'the', 'next', 'one', 'we', 'propose', '.', '[SEP]']

3. Processing the data

# Now let us find the longest sentence in the dataset
# and use that as the maximum length for padding
max_len = 0
for sent in tokenized_texts:
    if len(sent) > max_len:
        max_len = len(sent)

print("Maximum length is: {}".format(max_len))

# Maximum length is: 47

The longest tokenized sequence in our training set is 47 tokens, but we will leave some headroom anyway. We set the maximum sequence length to 128 and pad every sequence to that length.

Let’s convert tokens to IDs:

max_len = 128

# Use the BERT tokenizer to convert the tokens to their index numbers
# in the BERT vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]

# Pad our input tokens
input_ids = pad_sequences(input_ids, maxlen=max_len, dtype="long",
                          truncating="post", padding="post")

Creating attention masks

Now comes a tricky part of the process. We padded the sequences in the previous cell. But we want to prevent the model from performing attention on those padded tokens!

The idea is to apply a mask with a value of 1 for each token, which will be followed by 0s for padding:

# Create attention masks
attention_masks = []

# Create a mask of 1s for each token followed by 0s for padding
for seq in input_ids:
    seq_mask = [float(i > 0) for i in seq]
    attention_masks.append(seq_mask)
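As a side note, recent versions of the transformers tokenizer can produce the input IDs and attention masks in a single call. The sketch below is an alternative, not the path used in the rest of this article; it passes the raw sentences because the tokenizer adds [CLS] and [SEP] itself:

# Alternative sketch (not used below): one call pads, truncates,
# adds special tokens, and builds the attention masks
encoded = tokenizer(list(df.sentence.values),
                    padding="max_length",
                    truncation=True,
                    max_length=max_len,
                    return_tensors="pt")
# encoded["input_ids"] and encoded["attention_mask"] are ready-made torch tensors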

Splitting and converting to Torch Tensors

train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(
    input_ids, labels, random_state=42, test_size=0.1)

train_masks, validation_masks, _, _ = train_test_split(
    attention_masks, input_ids, random_state=42, test_size=0.1)

The data is ready for training, but it still needs to be converted into torch tensors.

# Convert all of our data into torch tensors, the required datatype for our model
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

Selecting a batch size and creating an iterator

Now, the program selects a batch size and creates an iterator. The iterator is a clever way of avoiding a loop that would load all the data into memory at once. The iterator, coupled with the torch DataLoader, can batch-train huge datasets without exhausting the machine’s memory.

In this model, the batch size is 32:

# Select a batch size for training. For fine-tuning BERT on a specific task,
# the authors recommend a batch size of 16 or 32
batch_size = 32

# Create an iterator of our data with torch DataLoader.
# This helps save on memory during training because, unlike a for loop,
# with an iterator the entire dataset does not need to be loaded into memory
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler,
                              batch_size=batch_size)

validation_data = TensorDataset(validation_inputs,
                                validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data,
                                   sampler=validation_sampler,
                                   batch_size=batch_size)

The data has been processed and is all set. The program can now load and configure the BERT model.

4. Loading the Hugging Face BERT uncased base model

# Load BertForSequenceClassification, the pretrained BERT model.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2).to(device)
# model.cuda()

Optimizer grouped parameters

Next, we initialize the optimizer for the model’s parameters. Fine-tuning a model begins with the pretrained model’s parameter values (not their names).

The parameters of the optimizer include a weight decay rate to avoid overfitting, and some parameters are filtered.

The goal is to prepare the model’s parameters for the training loop:

# Optimize model parameters
param_optimizer = list(model.named_parameters())
no_decay = [‘bias’, ‘LayerNorm.weight’]

# Separate the 'weight' parameters from the 'bias' parameters.
# - For the 'weight' parameters, this specifies a 'weight_decay_rate'
#   of 0.1.
# - For the 'bias' parameters, the 'weight_decay_rate' is 0.0.
optimizer_grouped_parameters = [
    # Filter for all parameters which are not bias or LayerNorm.
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.1},
    # Filter for parameters which are bias or LayerNorm.
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]

The hyperparameters for the training loop

The hyperparameters for the training loop are critical, though they seem innocuous. For example, AdamW applies weight decay, and the learning-rate scheduler can include a warm-up phase.

The learning rate (lr) and warm-up rate (warmup) should be set to a very small value early in the optimization phase and gradually increase after a certain number of iterations. This avoids large gradients and overshooting the optimization goals.

Some researchers argue that the gradients at the output level of the sub-layers before layer normalization do not require a warm-up rate. Solving this problem requires many experimental runs.

epochs = 10

optimizer = AdamW(optimizer_grouped_parameters,
                  lr=5e-5,
                  eps=1e-8)

# Total number of training steps is number of batches * number of epochs.
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=0,
                                            num_training_steps=total_steps)
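The run below keeps num_warmup_steps=0, matching the code above. If you do want an actual warm-up phase, a common heuristic (an assumption here, not part of the original run) is to warm up over roughly 10% of the total steps:

# Hypothetical variant: warm up over ~10% of the training steps
warmup_steps = int(0.1 * total_steps)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=warmup_steps,
                                            num_training_steps=total_steps)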

Accuracy measurement helper function

def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)

# Function to calculate the elapsed time
def format_time(elapsed):
    # Round to the nearest second.
    elapsed_rounded = int(round(elapsed))
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))
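A quick sanity check of flat_accuracy on made-up logits (hypothetical values, only to illustrate the argmax-and-compare behaviour):

# Hypothetical check: the argmax of each logit row is compared to the label
dummy_logits = np.array([[2.1, 0.3], [0.2, 1.7], [1.5, 0.4]])
dummy_labels = np.array([0, 1, 1])
print(flat_accuracy(dummy_logits, dummy_labels))  # 0.666...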

The data is ready, the parameters are ready. It’s time to activate the training loop!

The training loop

The training loop follows standard learning practice. The number of epochs is set to 10, and loss and accuracy are measured so they can be plotted. The loop uses the dataloader to load and train batches, and the training process is measured and evaluated as it runs.

The code starts by initializing train_loss_set, which will store the loss values for plotting. It then iterates over the epochs and runs a standard training loop, as shown in the following excerpt:

# The training loop
t = []

# Store our loss and accuracy for plotting
train_loss_set = []

# trange is a tqdm wrapper around the normal python range
for _ in trange(epochs, desc="Epoch"):

    # Training

    # Set our model to training mode (as opposed to evaluation mode)
    model.train()

    # Tracking variables
    tr_loss = 0
    nb_tr_examples, nb_tr_steps = 0, 0

    # Train the data for one epoch
    for step, batch in enumerate(train_dataloader):
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # batch = tuple(t for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Clear out the gradients (by default they accumulate)
        optimizer.zero_grad()
        # Forward pass
        outputs = model(b_input_ids, token_type_ids=None,
                        attention_mask=b_input_mask, labels=b_labels)
        loss = outputs['loss']
        train_loss_set.append(loss.item())
        # Backward pass
        loss.backward()
        # Update parameters and take a step using the computed gradient
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

        # Update tracking variables
        tr_loss += loss.item()
        nb_tr_examples += b_input_ids.size(0)
        nb_tr_steps += 1

    print("Train loss: {}".format(tr_loss / nb_tr_steps))

    # Validation
    # Put model in evaluation mode to evaluate loss on the validation set
    model.eval()

    # Tracking variables
    eval_loss, eval_accuracy = 0, 0
    nb_eval_steps, nb_eval_examples = 0, 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from our dataloader
        b_input_ids, b_input_mask, b_labels = batch
        # Telling the model not to compute or store gradients,
        # saving memory and speeding up validation
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            logits = model(b_input_ids, token_type_ids=None,
                           attention_mask=b_input_mask)

        # Move logits and labels to CPU
        logits = logits['logits'].detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        tmp_eval_accuracy = flat_accuracy(logits, label_ids)

        eval_accuracy += tmp_eval_accuracy
        nb_eval_steps += 1

    print("Validation Accuracy: {}".format(eval_accuracy / nb_eval_steps))
###

Epoch: 0%| | 0/10 [00:00<?, ?it/s]
Train loss: 0.4866324247660973
Epoch: 10%|█ | 1/10 [01:31<13:47, 91.95s/it]
Validation Accuracy: 0.8032407407407407
Train loss: 0.2726411976198438
Epoch: 20%|██ | 2/10 [03:02<12:09, 91.13s/it]
Validation Accuracy: 0.8063271604938271
Train loss: 0.15303814302186997
Epoch: 30%|███ | 3/10 [04:32<10:35, 90.81s/it]
Validation Accuracy: 0.8097993827160493
Train loss: 0.08949965936483437
Epoch: 40%|████ | 4/10 [06:03<09:03, 90.65s/it]
Validation Accuracy: 0.8171296296296297
Train loss: 0.05705505336878273
Epoch: 50%|█████ | 5/10 [07:33<07:32, 90.55s/it]
Validation Accuracy: 0.8206018518518519
Train loss: 0.043270988243968224
Epoch: 60%|██████ | 6/10 [09:04<06:02, 90.51s/it]
Validation Accuracy: 0.8182870370370371
Train loss: 0.02780074113690105
Epoch: 70%|███████ | 7/10 [10:34<04:31, 90.46s/it]
Validation Accuracy: 0.8179012345679012
Train loss: 0.023910069168657372
Epoch: 80%|████████ | 8/10 [12:04<03:00, 90.44s/it]
Validation Accuracy: 0.8240740740740741
Train loss: 0.013467846731983774
Epoch: 90%|█████████ | 9/10 [13:35<01:30, 90.42s/it]
Validation Accuracy: 0.8225308641975309
Train loss: 0.008648735464669386
Epoch: 100%|██████████| 10/10 [15:05<00:00, 90.56s/it]
Validation Accuracy: 0.8213734567901234
###

The model is trained. We can now display the training evaluation.

Training Evaluation

The loss and accuracy values were stored in train_loss_set as defined at the beginning of the training loop.

plt.figure(figsize=(15, 8))
plt.title("Training Loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

The model has been fine-tuned. We can now run predictions.

5. Prediction and Evaluation on the Hold-out Dataset

The BERT downstream model was trained with the in_domain_train.tsv dataset. The program will now make predictions using the holdout dataset contained in the out_of_domain_dev.tsv file. The goal is to predict whether each sentence is grammatically acceptable.

test_df = pd.read_csv("/kaggle/input/cola-public/out_of_domain_dev.tsv",
                      delimiter="\t", header=None,
                      names=['sentence_source', 'label', 'label_notes', 'sentence'])

# Create sentence and label lists
sentences = test_df.sentence.values

# We need to add special tokens at the beginning and end of each
# sentence for BERT to work properly
sentences = ["[CLS] " + sentence + " [SEP]" for sentence in sentences]
labels = test_df.label.values
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]

../...
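The elided cell mirrors the preprocessing applied to the training set. A sketch of what it needs to produce (the prediction_dataloader used below), assuming the same max_len and batch_size as before:

# Sketch (assumption): repeat the training preprocessing for the holdout set
input_ids = [tokenizer.convert_tokens_to_ids(x) for x in tokenized_texts]
input_ids = pad_sequences(input_ids, maxlen=max_len, dtype="long",
                          truncating="post", padding="post")
attention_masks = [[float(i > 0) for i in seq] for seq in input_ids]

prediction_inputs = torch.tensor(input_ids)
prediction_masks = torch.tensor(attention_masks)
prediction_labels = torch.tensor(labels)

# Batch the holdout data with a sequential sampler; no shuffling is needed for inference
prediction_data = TensorDataset(prediction_inputs, prediction_masks, prediction_labels)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler,
                                   batch_size=batch_size)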

The program then runs batch predictions using the dataloader:

# Put model in evaluation mode
model.eval()

# Tracking variables
predictions, true_labels = [], []

# Predict
for batch in prediction_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)

    # Unpack the inputs from our dataloader
    b_input_ids, b_input_mask, b_labels = batch

    with torch.no_grad():
        logits = model(b_input_ids, token_type_ids=None,
                       attention_mask=b_input_mask)

    # Move logits and labels to CPU
    logits = logits['logits'].detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Store predictions and true labels
    predictions.append(logits)
    true_labels.append(label_ids)

Evaluating using Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) was initially designed to measure the quality of binary classifications and can be extended to a multi-class correlation coefficient. Each prediction in a two-class problem falls into one of four outcomes:

  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative

Brian W. Matthews, a biochemist, designed it in 1975, inspired by his predecessors’ phi function. Since then, it has been expressed in several equivalent forms, such as the following one:
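MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))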

MCC produces a value between -1 and +1: +1 indicates a perfect prediction, -1 a completely inverse prediction, and 0 a prediction that is no better than random. GLUE evaluates Linguistic Acceptability with MCC.

MCC is imported from sklearn.metrics:

# Import and evaluate each test batch using the Matthews correlation
# coefficient
from sklearn.metrics import matthews_corrcoef

matthews_set = []

for i in range(len(true_labels)):
    matthews = matthews_corrcoef(true_labels[i],
                                 np.argmax(predictions[i], axis=1).flatten())
    matthews_set.append(matthews)

matthews_set

###
[0.049286405809014416,
-0.050964719143762556,
0.4040950971038548,
0.30508307783296046,
0.2321726094326961,
0.5510387687779837,
0.4879500364742666,
0.6831300510639733,
0.9229582069908973,
0.7704873741021288,
1.0,
0.8333333333333334,
0.8150678894028793,
0.7948717948717948,
0.4547940268270977,
0.6457765999379483,
0.0]
###

The output produces MCC values between -1 and +1 as expected. Almost all the MCC values are positive, which is good news. Let’s see what the evaluation is for the whole dataset.

# Flatten the predictions and true values for an aggregate Matthews
# correlation evaluation on the whole dataset

flat_predictions = [item for sublist in predictions for item in sublist]
flat_predictions = np.argmax(flat_predictions, axis=1).flatten()
flat_true_labels = [item for sublist in true_labels for item in sublist]

matthews_corrcoef(flat_true_labels, flat_predictions)

# 0.5657382497020534

The output confirms that the overall MCC is positive, which shows a positive correlation between the model’s predictions and the true labels for this dataset.

6. Summary

BERT brings bidirectional attention to transformers. Predicting sequences from left to right and masking the future tokens to train a model has serious limitations. If the masked sequence contains the meaning we are looking for, the model will produce errors. BERT attends to all of the tokens of a sequence at the same time.

We explored the architecture of BERT, which only uses the encoder stack of the Transformer. BERT was designed as a two-step framework: the first step is to pretrain a model, and the second is to fine-tune it. We built a fine-tuning BERT model for an Acceptability Judgement downstream task and went through the whole fine-tuning process: we loaded the dataset and the necessary pretrained modules, trained the model, and measured its performance.

Fine-tuning a pretrained model takes fewer machine resources than training downstream tasks from scratch. Fine-tuned models can perform a variety of tasks. BERT proves that we can pretrain a model on two tasks only, which is remarkable in itself. But producing a multitask fine-tuned model based on the trained parameters of the BERT pretrained model is extraordinary.

7. References

  1. The complete implementation can be found at https://www.kaggle.com/ashwinnaidu/finetuningbertwithcoladataset
  2. Further reference material: https://github.com/PacktPublishing
