Fine-Tuning Transformers with a Custom Dataset: Classification Task

Ganesh Lokare
14 min read · Feb 11, 2023


While pre-trained transformer models have many benefits, using them out of the box on a custom dataset also has drawbacks: they may not perform as well as fine-tuned models. Fine-tuning allows for greater control, adaptation, and optimization for the specific task and dataset, resulting in improved performance.

What is fine-tuning in transformers?

Fine-tuning is a process in which a pre-trained model is further trained on a new task using task-specific data. In the context of Transformer models, fine-tuning refers to the process of using a pre-trained Transformer model as the starting point for training on a new task.

The idea behind fine-tuning Transformer models is that they have already been trained on a large corpus of text data, and therefore have already learned many useful representations of language. By fine-tuning the model on a new task, the model can use these pre-learned representations as a good starting point, and learn task-specific information from the new task data.

The process of fine-tuning a Transformer model involves unfreezing some or all of the layers of the pre-trained model and training them on the new task data using a task-specific loss function. The remaining layers can be kept frozen, preserving the pre-learned representations and preventing overfitting on the small task-specific data.

In conclusion, fine-tuning is a powerful technique for leveraging pre-trained Transformer models for new NLP tasks, allowing practitioners to achieve state-of-the-art results with relatively small amounts of task-specific data. Fine-tuning has become a popular approach in NLP due to the high performance of Transformer models and the availability of large pre-trained models.

How do we fine-tune Transformer models for a specific task?

Fine-tuning a Transformer model for a specific task typically involves the following steps:

  1. Prepare the task-specific data
  2. Tokenize the data
  3. Choose a pre-trained model
  4. Define a fine-tuning architecture
  5. Compile the model
  6. Train the model
  7. Evaluate the model

By following these steps, you can fine-tune a Transformer model for a specific task, leveraging the pre-learned representations of the model to achieve high performance with limited task-specific data.

Approaches for fine-tuning architecture

  1. Chop off the final layer and add a new one:

This approach is often used when the task for which the transformer model is being fine-tuned is different from the task for which it was pre-trained. In this approach, the final layer of the pre-trained transformer model is removed, and a new layer is added to match the specific requirements of the target task. The new layer is then trained from scratch on the target task’s data, while the rest of the model is kept frozen. The idea behind this approach is to preserve the learned representations of the pre-trained model, which can be useful for the target task, and only update the final layer to make predictions for the new task.
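A minimal sketch of this approach, assuming a BERT-style checkpoint and the Hugging Face transformers API used later in this post (the checkpoint name is just an example):

# Approach 1 sketch: freeze the pre-trained body, train only the new head
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels = 3)

# Freeze every parameter of the pre-trained encoder ...
for param in model.base_model.parameters():
    param.requires_grad = False
# ... so only the freshly initialized classification head is updated during training.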

  2. Fine-tune everything:

In this approach, all the parameters of the pre-trained transformer model are updated during training on the target task’s data. This is usually done when the target task is similar to the task for which the model was pre-trained, and the pre-trained model’s learned representations can be fine-tuned for the target task. During fine-tuning, a smaller learning rate is often used to avoid undoing the learned representations from the pre-training step. This approach can lead to better performance than the previous one as the entire model is optimized for the target task.
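A sketch of this approach with the same API: nothing is frozen, and the learning rate is kept small (the value below is illustrative; TrainingArguments already defaults to 5e-5, and values around 2e-5 to 5e-5 are common for BERT-style models):

# Approach 2 sketch: all parameters stay trainable; use a small learning rate
from transformers import AutoModelForSequenceClassification, TrainingArguments

model = AutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels = 3)

training_args = TrainingArguments(output_dir = 'training_dir',
                                  learning_rate = 2e-5)  # small LR avoids undoing pre-trained representations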

Both of these approaches have their own advantages and disadvantages, and the choice between them depends on the specific use case and the similarities between the target task and the pre-training task.

In this post, we use the fine-tune-everything approach.

We fine-tune a model for sentiment analysis, but the same procedure can be used for any classification task.

Fine-tuning for sentiment analysis on a custom dataset

You can find all the code here: https://github.com/GaneshLokare/Transformers

# install transformers
!pip install transformers

The command pip install transformers is used to install the transformers package, which provides access to state-of-the-art Transformer-based models for NLP tasks, including sentiment analysis.

Once the transformers package is installed, you can import and use the Transformer-based models in your own projects.

1. Prepare the task-specific data

Gather and prepare the annotated data for the specific task, such as text classification, sentiment analysis, or named entity recognition. This data will be used for fine-tuning the pre-trained model.

Download dataset

# download data from provided link
!wget -nc https://www.dropbox.com/s/lkd0eklmi64m9xm/AirlineTweets.csv?dl=0

Import required libraries

# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns

import torch

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

Load dataset

# wget saved the file under the name from the URL, including the '?dl=0' suffix
df = pd.read_csv('AirlineTweets.csv?dl=0')

Check what the data looks like

df.head()
df.info()

Keep required columns only

df = df[['airline_sentiment','text']]

Check the data again

As we can see, it now has only the two columns we selected.

Check the distribution of classes

df['airline_sentiment'].hist()

As we can see, the class distribution is imbalanced. We will see whether our model can handle the imbalanced dataset or whether it will be biased toward the majority class.

Map classes to integers

target_map = { 'positive': 1, 'negative': 0, 'neutral': 2}
df['target'] = df['airline_sentiment'].map(target_map)

The first line defines a dictionary target_map that maps the categorical target variable 'airline_sentiment' to a numerical representation, with 'positive' mapped to 1, 'negative' mapped to 0, and 'neutral' mapped to 2.

The second line applies this mapping to the ‘airline_sentiment’ column of the DataFrame df using the map method and saves the result as a new column 'target' in the same DataFrame. This can be useful for training machine learning models, which often require numerical input variables.

Save the data to a new CSV file, because the transformers ecosystem requires the dataset in a specific format, which we will create using the load_dataset function. We will see next what this format looks like.

df1 = df[['text','target']]
df1.columns = ['sentence','label']
df1.to_csv('data.csv', index = False)
  1. df1 = df[['text', 'target']]: This line selects the 'text' and 'target' columns from the data frame df and assigns them to a new data frame df1.
  2. df1.columns = ['sentence', 'label']: This line renames the columns in df1 to 'sentence' and 'label'. The transformers Trainer requires the target column to be named 'label'; otherwise it will raise an error.
  3. df1.to_csv('data.csv', index=False): This line saves the data frame df1 as a CSV file named 'data.csv'. The index argument is set to False, so the data frame's index will not be written to the CSV file.

The resulting “data.csv” file will contain two columns, “sentence” and “label”, which are the pre-processed features for the text sequence and target label, respectively.

!pip install datasets

The “!pip install datasets” command installs the “datasets” library, which provides a unified API for accessing a variety of publicly available datasets for natural language processing tasks such as sentiment analysis, machine translation, and summarization.
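Load the saved CSV file with the load_dataset function (the two lines below are the code described in the list that follows):

from datasets import load_dataset
raw_dataset = load_dataset('csv', data_files = 'data.csv')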

  1. "from datasets import load_dataset" imports the load_dataset function from the datasets library.
  2. "raw_dataset = load_dataset('csv', data_files = 'data.csv')" uses the load_dataset function to load the dataset stored in the CSV file 'data.csv', which we saved above.

Check how loaded dataset looks like

raw_dataset

The DatasetDict is a dictionary-like object that, in this case, contains one dataset named "train". A DatasetDict can hold one or more datasets.

The “Dataset” object represents a single dataset and provides information about the features and structure of the data. The “features” attribute is a list of strings that specifies the names of the features in the dataset. In this case, the dataset has two features: “sentence” and “label”.

The “num_rows” attribute specifies the number of rows (examples) in the dataset. In this case, the “train” dataset has 14640 rows.

Split dataset into train and test

split = raw_dataset['train'].train_test_split(test_size=0.3, seed=42)
  1. “raw_dataset[‘train’]” accesses the “train” dataset from the “raw_dataset” object.
  2. “.train_test_split(test_size=0.3, seed=42)” uses the “train_test_split” method of the “Dataset” class to split the “train” dataset into training and test sets.

The “test_size” argument is a float that specifies the proportion of the dataset to be used for testing. In this case, 0.3 means that 30% of the data will be used for testing, and 70% will be used for training.

The “seed” argument is an integer that sets the random seed for the split. This ensures that the split is deterministic and reproducible.

Check what we have got back

split

We now have both a train and a test set.

How to handle multiple files

The two snippets below apply only if you have multiple files, or if you already have separate train and test sets.

# if we have multiple csv files
raw_dataset = load_dataset('csv', data_files = ['file1.csv','file2.csv'])

If we already have train test split

raw_dataset = load_dataset('csv',
                           data_files = {'train': ['train1.csv', 'train2.csv'],
                                         'test': 'test.csv'})

2. Tokenize the data

Convert the task-specific data into a numerical representation suitable for input into the Transformer model. This typically involves tokenizing the text into subwords or words, mapping the tokens to integers, and encoding the input as a tensor.


# Import AutoTokenizer and create tokenizer object
from transformers import AutoTokenizer
checkpoint = 'bert-base-cased'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

The code is using the AutoTokenizer class from the transformers library to load a pre-trained tokenizer for the BERT model with the "base" architecture and the "cased" version. The pre-trained tokenizer will be used to convert input sequences of text into numerical representations (tokens) that can be fed into the model. The checkpoint variable specifies the name of the pre-trained tokenizer to use, and the from_pretrained method is used to load the tokenizer from the transformers library's pre-trained models.
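A quick, optional check of what the tokenizer produces (the exact token ids depend on the checkpoint's vocabulary):

# inspect the tokenizer output for one sentence
example = tokenizer("I love this airline!")
print(example['input_ids'])                                   # integer ids, wrapped in [CLS] ... [SEP]
print(tokenizer.convert_ids_to_tokens(example['input_ids']))  # the corresponding subword tokens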

Define tokenizer function

def tokenize_fn(batch):
    return tokenizer(batch['sentence'], truncation = True)

  1. "def tokenize_fn(batch):" defines the function tokenize_fn, which takes a single argument batch. The batch argument is expected to be a dictionary-like object that contains the text data to be tokenized.
  2. "return tokenizer(batch['sentence'], truncation = True)" applies the tokenizer to the 'sentence' feature of the batch. The truncation argument is set to True, which means sequences longer than the maximum length the model supports will be truncated.

tokenized_dataset = split.map(tokenize_fn, batched = True)

  1. "split.map(tokenize_fn, batched = True)" applies the tokenize_fn function to each example in the split dataset, which we obtained by splitting the raw dataset into training and test sets.
  2. "batched = True" specifies that the tokenization function is applied to batches of examples rather than to individual ones. This can improve performance by allowing the tokenization to be parallelized.
  3. The result is tokenized data that we can feed directly to our model.
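To verify the result, we can inspect one tokenized example; for a BERT checkpoint the tokenizer adds input_ids, token_type_ids, and attention_mask alongside the original columns:

# inspect the keys of one tokenized example
print(tokenized_dataset['train'][0].keys())
# dict_keys(['sentence', 'label', 'input_ids', 'token_type_ids', 'attention_mask'])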


3. Choose a pre-trained model

Select a pre-trained Transformer model that is well-suited for the task. There are many pre-trained models available in popular NLP libraries such as Hugging Face Transformers, TensorFlow Hub, or the AllenNLP library. Choose a model with a good balance between the size of the model and the task complexity.

Import classification model

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

The code imports three classes from the transformers library: AutoModelForSequenceClassification, Trainer, and TrainingArguments.

AutoModelForSequenceClassification is a class from the transformers library that implements a sequence classification model, a type of model used to predict the class of a sequence of inputs (e.g., a sentence). The Auto class automatically selects the model architecture that matches the given checkpoint.

Trainer is a class that provides a high-level API for training a machine learning model. It can be used to train a model using any torch.nn.Module instance, including models implemented using the transformers library.

TrainingArguments is a class that defines the arguments used to configure a training run. It includes arguments such as the number of training steps, the learning rate, the batch size, and many others. When using the Trainer class, an instance of TrainingArguments is passed to the constructor to specify the configuration for a training run.

4. Define a fine-tuning architecture

Decide which layers of the pre-trained model to fine-tune, and add additional layers as needed to perform the specific task. For example, for a text classification task, a dense layer with a softmax activation function may be added on top of the pre-trained model to produce class predictions.

Load model

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 3)

If you have a binary classification problem (for example, positive/negative sentiment only), use num_labels = 2.

Install the torchinfo library.

!pip install torchinfo

torchinfo is a Python library for getting information about PyTorch models and tensors. It provides a convenient way to inspect the architecture of a PyTorch model, including the shape and size of the tensors that are passed between the layers, as well as the number of parameters and memory usage of each layer.

Print the model summary

from torchinfo import summary
summary(model)

The model has more than 108M parameters, all of which are trainable.
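As a cross-check, the trainable parameter count can also be computed directly in PyTorch, without torchinfo:

# count trainable parameters directly
num_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {num_trainable:,}")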

5. Compile the model

Choose a loss function and an optimizer suitable for the task, and compile the model by defining the training and evaluation procedures.

training_args = TrainingArguments(output_dir='training_dir',
                                  evaluation_strategy='epoch',
                                  save_strategy='epoch',
                                  num_train_epochs=3,
                                  per_device_train_batch_size=16,
                                  per_device_eval_batch_size=64)

The above code creates an object of the TrainingArguments class with specified arguments for training a model. The output_dir argument sets the directory where the model and training-related files will be saved. The evaluation_strategy argument sets how often evaluation should be done, and in this case, it's set to be done every epoch.

The save_strategy argument sets when the model should be saved, and it's set to be saved every epoch. The num_train_epochs argument sets the number of training epochs, and it's set to 3.

The per_device_train_batch_size argument sets the batch size for training, and it's set to 16. The per_device_eval_batch_size argument sets the batch size for evaluation, and it's set to 64.

Define evaluation metrics, which we will pass during training

def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    acc = np.mean(predictions == labels)
    f1 = f1_score(labels, predictions, average = 'micro')
    return {'accuracy': acc, 'f1_score': f1}

The above code defines a function compute_metrics that takes a tuple of logits_and_labels as input and computes two evaluation metrics: accuracy and F1 score.

The function first unpacks the tuple into logits and labels. Then it calculates the predictions using np.argmax along the last axis of logits. The accuracy is calculated as the mean of the equality of predictions and labels. The F1 score is calculated using the f1_score function from scikit-learn with average='micro'. The accuracy and F1 score are returned as a dictionary.
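A tiny sanity check with made-up logits illustrates the function (note that for single-label multiclass problems, micro-averaged F1 is equal to accuracy):

# 3 examples, 3 classes; argmax predictions are [0, 1, 2]
dummy_logits = np.array([[2.0, 0.1, 0.3],
                         [0.2, 1.5, 0.1],
                         [0.1, 0.2, 3.0]])
dummy_labels = np.array([0, 1, 1])
print(compute_metrics((dummy_logits, dummy_labels)))   # accuracy and micro F1 are both 2/3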

6. Train the model

Train the model on the task-specific data using a suitable number of epochs, and monitor the performance of the model on a validation set. If necessary, adjust the model architecture or training procedure, and repeat the training process until satisfactory performance is achieved.

trainer = Trainer(model,
                  training_args,
                  train_dataset = tokenized_dataset["train"],
                  eval_dataset = tokenized_dataset["test"],
                  tokenizer = tokenizer,
                  compute_metrics = compute_metrics)

The above code creates an object of the Trainer class and assigns it to the variable trainer. This class is used for training and evaluating a machine learning model.

The Trainer class takes several arguments:

  • model is the model to be trained.
  • training_args is an instance of the TrainingArguments class that contains the arguments for training the model.
  • train_dataset is the training dataset, which is assigned tokenized_dataset["train"] in this case.
  • eval_dataset is the evaluation dataset, which is assigned tokenized_dataset["test"] in this case.
  • tokenizer is the tokenizer used to preprocess the input data; we pass the tokenizer object created earlier.
  • compute_metrics is the function used to compute the evaluation metrics, which is compute_metrics in this case.

With these arguments, the Trainer object is configured to train and evaluate the specified model using the specified datasets, tokenizer, and evaluation metrics.

Call trainer object and train the model

trainer.train()

Since we set 3 epochs, we get evaluation metrics after each of the 3 epochs. Notice that our model is overfitting: the validation loss increases with each epoch. So one epoch would be sufficient for this dataset, as we already reach about 83% accuracy and F1 score. Next I will explain how to do more fine-tuning to improve accuracy.
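One way to act on this observation, assuming the same TrainingArguments API, is to let the Trainer restore the best checkpoint automatically instead of reducing the epoch count:

# optional: keep the best checkpoint instead of the last one
training_args = TrainingArguments(output_dir = 'training_dir',
                                  evaluation_strategy = 'epoch',
                                  save_strategy = 'epoch',
                                  num_train_epochs = 3,
                                  per_device_train_batch_size = 16,
                                  per_device_eval_batch_size = 64,
                                  load_best_model_at_end = True,        # reload the best checkpoint when training ends
                                  metric_for_best_model = 'f1_score')   # must match a key returned by compute_metrics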

After training, the checkpoints are saved in training_dir.

! ls training_dir

Output: checkpoint-1282  checkpoint-1923  checkpoint-641  runs

As we can see, there are three checkpoints, one saved at the end of each epoch. We will use the first one listed, checkpoint-1282, as it had the best performance.

7. Evaluate the model

Evaluate the performance of the fine-tuned model on a test set, and compare it to other models or baselines.

Import pipeline

from transformers import pipeline

The above code imports the pipeline class from the transformers library. The pipeline class is a high-level API for using pre-trained models for a variety of tasks, such as text classification, sequence labeling, and generation.

saved_model = pipeline('text-classification',
                       model = 'training_dir/checkpoint-1282')

The above code creates an instance of the pipeline class and assigns it to the variable saved_model. The pipeline class is used to perform text classification tasks.

The pipeline class takes two arguments:

  • 'text-classification' is the task being performed, in this case text classification.
  • model is the model to be used, which is 'training_dir/checkpoint-1282' in this case.

This code creates a text classification pipeline that uses the specified pre-trained model stored in the directory 'training_dir/checkpoint-1282'. With this pipeline, it's possible to perform text classification tasks using the pre-trained model with just a few lines of code.
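For example, a single sentence can be classified directly (the score below is illustrative):

# classify one sentence; the pipeline returns a list with one dict per input
print(saved_model("The flight was delayed for three hours."))
# e.g. [{'label': 'LABEL_0', 'score': 0.98}] -- LABEL_0 corresponds to 'negative' in our target_map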

Get test set

split['test']

Get predictions

predictions = saved_model(split['test']['sentence'])

Print a few predictions

predictions[:10]

We get back a list of dictionaries. Next we will write a function to extract just the labels.

def get_label(d):
    return int(d['label'].split('_')[1])

predictions = [get_label(d) for d in predictions]

The pipeline returns labels as strings of the form 'LABEL_0', 'LABEL_1', 'LABEL_2'. The function splits the string stored under the key 'label' on the underscore and converts the second element to an integer, which is the predicted class.

The second line uses a list comprehension to apply the get_label function to each element in the list predictions and returns a new list with the results.
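For example (the score value is illustrative):

get_label({'label': 'LABEL_1', 'score': 0.93})   # returns 1, i.e. 'positive'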

Once we have the labels, calculate accuracy using the accuracy_score function.

print("acc:",accuracy_score(split['test']['label'], predictions))

acc: 0.8376593806921676

On the test set we get 83.76% accuracy.

Calculate f1 score

print("f1:",f1_score(split['test']['label'], predictions, average = 'macro'))

f1: 0.7837583256590012

On the test set we get a macro F1 score of 0.78. Macro averaging weights each class equally, so it is more sensitive to the minority classes than the micro average we used during training.

Plot confusion matrix

# create function for plotting confusion matrix
def plot_cm(cm):
    classes = ['negative', 'positive', 'neutral']
    df_cm = pd.DataFrame(cm, index=classes, columns=classes)
    ax = sns.heatmap(df_cm, annot = True, fmt='g')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

cm = confusion_matrix(split['test']['label'], predictions, normalize = 'true')
plot_cm(cm)
Confusion Matrix

As we know, our dataset has a higher proportion of negative classes, and according to the confusion matrix, it appears that the model has a slight bias towards the negative class.

The same procedure can be used for any classification task.

Next, I will demonstrate how to train and fine-tune transformer models for other tasks on custom datasets.

If you enjoyed this post, please follow me and be sure to check out some of my other blog posts for more insights and information. I’m constantly exploring new topics and writing about my findings, so there’s always something new and interesting to discover.
