Fine-tune DistilBERT for a multi-label text classification task

Dhaval Taunk
Published in Analytics Vidhya · Sep 17, 2020
Source — https://developer.nvidia.com/blog/efficient-bert-finding-your-optimal-model-with-multimetric-bayesian-optimization-part-1/

In one of my previous blog posts, How to fine-tune BERT on a text classification task, I explained fine-tuning BERT for a multi-class text classification task. In this post, I will explain how to fine-tune DistilBERT for a multi-label text classification task. I have also made a GitHub repo containing the complete code explained below. You can visit the link below to see it, fork it, and use it.

https://github.com/DhavalTaunk08/Transformers_scripts

Introduction

The DistilBERT model (https://arxiv.org/pdf/1910.01108.pdf), released by Hugging Face, is a distilled version of the BERT model released by Google (https://arxiv.org/pdf/1810.04805.pdf).

According to the authors:

They leverage knowledge distillation during the pre-training phase and show that it is possible to reduce the size of a BERT model by 40% while retaining 97% of its language understanding capabilities and being 60% faster.

So let’s start with the details and the process to fine-tune the model.

Multi-Class vs Multi-Label Classification

First of all, it is important to understand the difference between multi-class and multi-label classification. Multi-class classification means classifying each sample into exactly one of three or more available classes, while in multi-label classification one sample can belong to more than one class. Let me explain this more clearly with an example:

Multi-class classification: Let's say we have 10 fruits. Each can belong to one of three classes: 'apple', 'mango', or 'banana'. If we are asked to classify the fruits into these classes, each fruit can belong to only one of them. Therefore, this is a multi-class classification problem.

Multi-label classification: Let's say we have a few movie names and our task is to classify these movies into the genres they belong to, like 'action', 'comedy', 'horror', 'sci-fi', 'drama', etc. A movie can belong to more than one genre. For example, 'The Matrix' series belongs to both the 'action' and the 'sci-fi' genre. This is called multi-label classification.

Data Formatting

First of all, the data needs to be formatted. The data should contain two columns: one column with the text to be classified, and another column with the labels for that sample. A sketch of such a data frame is shown below.
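
As a minimal sketch, such a data frame could look like the following (the movie descriptions and the six genre classes are made up for illustration):

import pandas as pd

# Hypothetical classes: ['action', 'comedy', 'drama', 'horror', 'romance', 'sci-fi']
df = pd.DataFrame({
    'text': [
        'A hacker discovers that reality is a simulation',
        'Two strangers fall in love on a sinking ship',
        'A stand-up comedian gets tangled in a family feud',
    ],
    'labels': [
        [1, 0, 0, 0, 0, 1],   # action, sci-fi
        [0, 0, 1, 0, 1, 0],   # drama, romance
        [0, 1, 1, 0, 0, 0],   # comedy, drama
    ],
})
print(df)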

The example above shows that we have six different classes, and a sample can belong to any number of those classes.

But how do we convert the labels into this format? Here, scikit-learn comes to the rescue!

Below is an example of how to convert these labels to the required format.

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> mlb = MultiLabelBinarizer()
>>> mlb.fit_transform([{'sci-fi', 'thriller'}, {'comedy'}])
array([[0, 1, 1],
       [1, 0, 0]])
>>> list(mlb.classes_)
['comedy', 'sci-fi', 'thriller']

Also, you can refer to the link below for more details about it.

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html
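
Applied to the sketch data frame from above, the conversion could look like this (the raw genre lists and the explicit class list are assumptions for illustration):

from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical raw genre lists, one list per movie in df
raw_labels = [['action', 'sci-fi'], ['drama', 'romance'], ['comedy', 'drama']]

# Passing the classes explicitly keeps all six genres, even if one
# (here 'horror') does not appear in this tiny sample
mlb = MultiLabelBinarizer(classes=['action', 'comedy', 'drama', 'horror', 'romance', 'sci-fi'])
df['labels'] = list(mlb.fit_transform(raw_labels))
print(list(mlb.classes_))   # column order of the multi-hot vectors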

Code

Now let's get to the code: the required libraries, how to write the DataLoader, and the model class for this task.

Required libraries

transformers==3.0.2

torch

scikit-learn

numpy

pandas

tqdm

These can be installed with the 'pip install' command, for example as shown below.
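
A single command works; only transformers is version-pinned here, to match the list above:

pip install transformers==3.0.2 torch scikit-learn numpy pandas tqdm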

Importing libraries

import numpy as np
import pandas as pd
import transformers
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from transformers import DistilBertModel, DistilBertTokenizer
from tqdm import tqdm
from sklearn import metrics  # needed later for accuracy and F1 scores
from sklearn.preprocessing import MultiLabelBinarizer
from torch import cuda
device = 'cuda' if cuda.is_available() else 'cpu'

The last line selects the GPU as the device if one is available; otherwise, the CPU is used.

Training parameters

MAX_LEN = 256
TRAIN_BATCH_SIZE = 8
VALID_BATCH_SIZE = 4
EPOCHS = 1
LEARNING_RATE = 1e-05

These parameters can be tuned according to one's needs. But there is one important point to note here:

DistilBERT accepts a max_sequence_length of 512 tokens.

We cannot set max_sequence_length to anything larger than this. If you need sequence lengths of more than 512 tokens, you can try the Longformer model (https://arxiv.org/pdf/2004.05150).
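
As a quick sanity check (a sketch, not part of the original pipeline), you can tokenize a sample text and inspect its length to decide whether MAX_LEN = 256 is enough for your data:

check_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

sample_text = "The Matrix is an action and sci-fi movie about a simulated reality."
token_ids = check_tokenizer.encode(sample_text, add_special_tokens=True)
print(len(token_ids))   # should stay within MAX_LEN, and can never exceed 512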

DataLoader

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

class MultiLabelDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.text
        self.targets = self.data.labels
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )
        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        # DistilBERT does not use token_type_ids, so only the input ids and
        # attention mask are returned along with the multi-hot targets.
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

The tokenizer is loaded at the top of the block above, and the datasets are wrapped next. Here, train_dataset and test_dataset are the training and validation splits in pandas data frame format with the column names ['text', 'labels'].
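
For completeness, one possible way to create these splits is with scikit-learn's train_test_split (a sketch, assuming the full labelled data frame df built earlier):

from sklearn.model_selection import train_test_split

# Hypothetical 80/20 split of the full data frame into train and test sets
train_dataset, test_dataset = train_test_split(df, test_size=0.2, random_state=42)

# Reset the indices so that the positional lookups inside MultiLabelDataset work
train_dataset = train_dataset.reset_index(drop=True)
test_dataset = test_dataset.reset_index(drop=True)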

training_set = MultiLabelDataset(train_dataset, tokenizer, MAX_LEN)
testing_set = MultiLabelDataset(test_dataset, tokenizer, MAX_LEN)

train_params = {'batch_size': TRAIN_BATCH_SIZE,
                'shuffle': True,
                'num_workers': 0
                }

test_params = {'batch_size': VALID_BATCH_SIZE,
               'shuffle': True,
               'num_workers': 0
               }

training_loader = DataLoader(training_set, **train_params)
testing_loader = DataLoader(testing_set, **test_params)

The above step converts the data into the required format using the MultiLabelDataset class and PyTorch's DataLoader. You can read more about DataLoader by visiting the link below:

https://pytorch.org/tutorials/beginner/data_loading_tutorial.html

Model Class

class DistilBERTClass(torch.nn.Module):
    def __init__(self):
        super(DistilBERTClass, self).__init__()
        self.l1 = DistilBertModel.from_pretrained("distilbert-base-uncased")
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.dropout = torch.nn.Dropout(0.3)
        self.classifier = torch.nn.Linear(768, num_classes)

    def forward(self, input_ids, attention_mask):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]          # (batch_size, seq_len, 768)
        pooler = hidden_state[:, 0]         # representation of the [CLS] token
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.ReLU()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)    # raw logits, one per class
        return output

Here, I have used two linear layers on top of the DistilBERT model, along with a dropout layer and ReLU as the activation function. num_classes is the number of classes available in your dataset. The model returns the logit scores for each class. The model can be instantiated as shown below.
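
Note that num_classes is not defined in the snippet above; one simple way to set it (assuming the mlb binarizer fitted during data formatting) is:

# Number of output classes, taken from the fitted MultiLabelBinarizer
num_classes = len(mlb.classes_)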

model = DistilBERTClass()
model.to(device)

Loss function and optimizer

def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)

optimizer = torch.optim.Adam(params=model.parameters(), lr=LEARNING_RATE)

Here, BCEWithLogitsLoss is used, which is the loss generally used for multi-label classification because it treats each class as an independent binary decision. You can read more by visiting the link below:

https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html
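
As a small illustration with made-up numbers (not from the original post), the loss takes raw logits and multi-hot targets of the same shape and applies the sigmoid internally:

logits = torch.tensor([[2.0, -1.0, 0.5]])   # raw model outputs for 3 classes
labels = torch.tensor([[1.0, 0.0, 1.0]])    # sample belongs to classes 0 and 2

example_loss = torch.nn.BCEWithLogitsLoss()(logits, labels)
print(example_loss.item())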

Training function

def train_model(epoch):
    model.train()
    for _, data in enumerate(training_loader, 0):
        ids = data['ids'].to(device, dtype=torch.long)
        mask = data['mask'].to(device, dtype=torch.long)
        targets = data['targets'].to(device, dtype=torch.float)

        outputs = model(ids, mask)
        loss = loss_fn(outputs, targets)

        if _ % 1000 == 0:
            print(f'Epoch: {epoch}, Loss: {loss.item()}')

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

for epoch in range(EPOCHS):
    train_model(epoch)

The above function is used for training the model for the specified number of epochs.

Validation

def validation(testing_loader):
    model.eval()
    fin_targets = []
    fin_outputs = []
    with torch.no_grad():
        for _, data in enumerate(testing_loader, 0):
            ids = data['ids'].to(device, dtype=torch.long)
            mask = data['mask'].to(device, dtype=torch.long)
            targets = data['targets'].to(device, dtype=torch.float)
            outputs = model(ids, mask)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            # sigmoid converts logits into independent per-class probabilities
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

outputs, targets = validation(testing_loader)
outputs = np.array(outputs) >= 0.5
accuracy = metrics.accuracy_score(targets, outputs)
f1_score_micro = metrics.f1_score(targets, outputs, average='micro')
f1_score_macro = metrics.f1_score(targets, outputs, average='macro')
print(f"Accuracy Score = {accuracy}")
print(f"F1 Score (Micro) = {f1_score_micro}")
print(f"F1 Score (Macro) = {f1_score_macro}")

Here, I have used accuracy and F1 score for now. But usually, Hamming loss and Hamming score are better metrics for evaluating multi-label classification tasks. I will discuss them in my next post.
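
For reference, scikit-learn already provides a Hamming loss implementation, which measures the fraction of individual labels predicted incorrectly; a quick sketch using the outputs from the validation step above:

# Fraction of individual label predictions that are wrong, averaged over all samples
hamming = metrics.hamming_loss(targets, outputs)
print(f"Hamming Loss = {hamming}")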

So this is it for now. Stay tuned for the next post for more details on Hamming loss, Hamming score, and other metrics. If you want to read more, you can visit my profile for other posts.
