Fine-Tuning Transformers with custom dataset: Classification task
Pre-trained Transformer models have many benefits, but when they are used as-is on a custom dataset they often perform worse than a model fine-tuned on that dataset. Fine-tuning allows for greater control, adaptation, and optimization for the specific task and dataset, which usually results in improved performance.
What is fine-tuning in transformers?
Fine-tuning is a process in which a pre-trained model is further trained on a new task using task-specific data. In the context of Transformer models, fine-tuning refers to the process of using a pre-trained Transformer model as the starting point for training on a new task.
The idea behind fine-tuning Transformer models is that they have already been trained on a large corpus of text data, and therefore have already learned many useful representations of language. By fine-tuning the model on a new task, the model can use these pre-learned representations as a good starting point, and learn task-specific information from the new task data.
The process of fine-tuning a Transformer model involves unfreezing some or all of the layers of the pre-trained model and training them on the new task data using a task-specific loss function. The remaining layers can be kept frozen, preserving the pre-learned representations and preventing overfitting on the small task-specific data.
In short, fine-tuning is a powerful technique for leveraging pre-trained Transformer models for new NLP tasks, allowing practitioners to achieve state-of-the-art results with relatively small amounts of task-specific data. Fine-tuning has become a popular approach in NLP due to the high performance of Transformer models and the availability of large pre-trained models.
How do we fine-tune Transformer models for a specific task?
Fine-tuning a Transformer model for a specific task typically involves the following steps:
- Prepare the task-specific data
- Tokenize the data
- Choose a pre-trained model
- Define a fine-tuning architecture
- Compile the model
- Train the model
- Evaluate the model
By following these steps, you can fine-tune a Transformer model for a specific task, leveraging the pre-learned representations of the model to achieve high performance with limited task-specific data.
Approaches for fine-tuning architecture
1. Chop off the final layer and add a new one:
This approach is often used when the task for which the transformer model is being fine-tuned is different from the task for which it was pre-trained. In this approach, the final layer of the pre-trained transformer model is removed, and a new layer is added to match the specific requirements of the target task. The new layer is then trained from scratch on the target task’s data, while the rest of the model is kept frozen. The idea behind this approach is to preserve the learned representations of the pre-trained model, which can be useful for the target task, and only update the final layer to make predictions for the new task.
2. Fine-tune everything:
In this approach, all the parameters of the pre-trained transformer model are updated during training on the target task’s data. This is usually done when the target task is similar to the task for which the model was pre-trained, and the pre-trained model’s learned representations can be fine-tuned for the target task. During fine-tuning, a smaller learning rate is often used to avoid undoing the learned representations from the pre-training step. This approach can lead to better performance than the previous one as the entire model is optimized for the target task.
Both of these approaches have their own advantages and disadvantages, and the choice between them depends on the specific use case and the similarities between the target task and the pre-training task.
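The difference between the two approaches comes down to which parameters receive gradient updates. Here is a minimal PyTorch sketch, using a tiny stand-in network instead of a real pre-trained Transformer. The names encoder, head and count_trainable are hypothetical and chosen only for illustration.

```python
import torch.nn as nn

# Hypothetical stand-in for a pre-trained encoder (NOT a real Transformer).
encoder = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
# New task-specific layer with 3 output classes.
head = nn.Linear(8, 3)
model = nn.Sequential(encoder, head)


def count_trainable(module):
    """Number of parameters that will be updated by the optimizer."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


# Approach 2 (fine-tune everything): all parameters are trainable by default.
total = count_trainable(model)

# Approach 1 (new final layer only): freeze the encoder, train just the head.
for param in encoder.parameters():
    param.requires_grad = False
head_only = count_trainable(model)

print(total, head_only)
```

With a real checkpoint the idea is the same: freezing sets requires_grad to False on the layers you want to keep unchanged.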
In this tutorial we use the "fine-tune everything" approach.
We fine-tune a model for sentiment analysis, but the same procedure can be used for any classification task.
Fine-tuning for sentiment analysis on a custom dataset
All the code is available here: https://github.com/GaneshLokare/Transformers
# install transformers
!pip install transformers
The command pip install transformers installs the transformers package, which provides access to state-of-the-art Transformer-based models for NLP tasks, including sentiment analysis.
Once the transformers package is installed, you can import and use the Transformer-based models in your own projects.
1. Prepare the task-specific data
Gather and prepare the annotated data for the specific task, such as text classification, sentiment analysis, or named entity recognition. This data will be used for fine-tuning the pre-trained model.
Download dataset
# download data from provided link
!wget -nc https://www.dropbox.com/s/lkd0eklmi64m9xm/AirlineTweets.csv?dl=0
Import required libraries
# import required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import torch
from sklearn.metrics import f1_score, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
Load dataset (wget keeps the ?dl=0 suffix in the saved file name, so we read the file under that name)
df = pd.read_csv('AirlineTweets.csv?dl=0')
Check what the data looks like
df.head()
df.info()
Keep required columns only
df = df[['airline_sentiment','text']]
Check the data again
As we can see, it now has only the 2 columns we selected.
Check the distribution of classes
df['airline_sentiment'].hist()
As we can see, the class distribution is imbalanced. We will see whether our model can handle the imbalanced dataset or whether it becomes biased towards the majority class.
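To put a number on the imbalance, it helps to look at class proportions and at the accuracy of a trivial model that always predicts the majority class. The sketch below uses a small made-up list of labels, not the real tweet counts.

```python
from collections import Counter

# Hypothetical toy labels standing in for df['airline_sentiment'].
toy_labels = ["negative"] * 6 + ["neutral"] * 3 + ["positive"] * 1

counts = Counter(toy_labels)
n_examples = sum(counts.values())

# Share of each class in the data.
proportions = {label: count / n_examples for label, count in counts.items()}

# Accuracy of a model that always predicts the most frequent class.
majority_label, majority_count = counts.most_common(1)[0]
majority_baseline = majority_count / n_examples

print(proportions)
print(majority_label, majority_baseline)
```

Any fine-tuned model should clearly beat this majority baseline, otherwise it has learned little beyond the class frequencies.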
Map classes to integers
target_map = { 'positive': 1, 'negative': 0, 'neutral': 2}
df['target'] = df['airline_sentiment'].map(target_map)
The first line defines a dictionary, target_map, that maps each value of the categorical 'airline_sentiment' column to an integer: 'negative' is mapped to 0, 'positive' to 1 and 'neutral' to 2.
The second line applies this mapping to the 'airline_sentiment' column of the DataFrame df using the map method and saves the result as a new column, 'target', in the same DataFrame. This is needed because machine learning models require numerical targets.
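Two small helpers are often useful next to such a mapping: the inverse mapping (integer back to class name) and a check for unexpected values, since pandas' map silently produces NaN for keys missing from the dictionary. A plain-Python sketch follows. The names id2label, class_names and encode_sentiment are illustrative and not part of the original code.

```python
target_map = {'positive': 1, 'negative': 0, 'neutral': 2}

# Inverse mapping: integer id -> class name.
id2label = {idx: name for name, idx in target_map.items()}

# Class names ordered by integer id, handy for labelling plot axes later.
class_names = [id2label[i] for i in sorted(id2label)]


def encode_sentiment(sentiment):
    """Map a sentiment string to its id, failing loudly on unknown values."""
    if sentiment not in target_map:
        raise ValueError(f"unexpected sentiment value: {sentiment!r}")
    return target_map[sentiment]


print(class_names)
print(encode_sentiment('neutral'))
```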
Save the data to a new CSV file. The datasets library, which we use to feed data to the Transformer model, loads data from files through its load_dataset function, so we first write the columns we need to disk. We will see next what format the dataset must have.
df1 = df[['text','target']]
df1.columns = ['sentence','label']
df1.to_csv('data.csv', index = False)
- df1 = df[['text', 'target']]: This line selects the "text" and "target" columns from the data frame df and assigns them to a new data frame df1.
- df1.columns = ['sentence', 'label']: This line renames the columns in df1 to "sentence" and "label". The Trainer expects the target column to be named 'label' (or 'labels'). With any other name the model does not receive the targets and training raises an error.
- df1.to_csv('data.csv', index=False): This line saves the data frame df1 as a CSV file named "data.csv". The "index" argument is set to False, which means that the index of the data frame will not be saved to the CSV file.
The resulting “data.csv” file will contain two columns, “sentence” and “label”, which are the pre-processed features for the text sequence and target label, respectively.
!pip install datasets
The “!pip install datasets” command installs the “datasets” library, which provides a unified API for accessing a variety of publicly available datasets for natural language processing tasks such as sentiment analysis, machine translation, and summarization.
from datasets import load_dataset
raw_dataset = load_dataset('csv', data_files = 'data.csv')
- "from datasets import load_dataset" imports the "load_dataset" function from the "datasets" library.
- "raw_dataset = load_dataset('csv', data_files = 'data.csv')" uses the "load_dataset" function to load the dataset stored in the CSV file "data.csv", which we saved above.
Check what the loaded dataset looks like
raw_dataset
The “DatasetDict” is a dictionary-like object that here contains one dataset named “train”. A “DatasetDict” can hold one or more datasets.
The “Dataset” object represents a single dataset and provides information about the features and structure of the data. The “features” attribute is a list of strings that specifies the names of the features in the dataset. In this case, the dataset has two features: “sentence” and “label”.
The “num_rows” attribute specifies the number of rows (examples) in the dataset. In this case, the “train” dataset has 14640 rows.
Split dataset into train and test
split = raw_dataset['train'].train_test_split(test_size=0.3, seed=42)
- "raw_dataset['train']" accesses the "train" dataset from the "raw_dataset" object.
- “.train_test_split(test_size=0.3, seed=42)” uses the “train_test_split” method of the “Dataset” class to split the “train” dataset into training and test sets.
The “test_size” argument is a float that specifies the proportion of the dataset to be used for testing. In this case, 0.3 means that 30% of the data will be used for testing, and 70% will be used for training.
The “seed” argument is an integer that sets the random seed for the split. This ensures that the split is deterministic and reproducible.
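To see why a fixed seed makes the split reproducible, here is a plain-Python sketch of a seeded shuffle-and-cut split. It only illustrates the idea and is not the algorithm the datasets library uses internally. The function name seeded_split is hypothetical.

```python
import math
import random


def seeded_split(rows, test_size=0.3, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut off the test part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

# Same seed, same split: the two calls return identical partitions.
print(test_a == test_b, len(train_a), len(test_a))
```

Changing the seed gives a different but equally valid split. Keeping it fixed lets you compare runs on exactly the same test examples.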
Check what we got back
split
We now have a train set and a test set.
How to handle multiple files
The two snippets below apply only when the data is spread across multiple files, or when separate train and test files already exist.
# if we have multiple csv files
raw_dataset = load_dataset('csv', data_files = ['file1.csv','file2.csv'])
If we already have a train/test split
raw_dataset = load_dataset('csv',
                           data_files = {'train': ['train1.csv', 'train2.csv'],
                                         'test': 'test.csv'})
2. Tokenize the data
Convert the task-specific data into a numerical representation suitable for input into the Transformer model. This typically involves tokenizing the text into subwords or words, mapping the tokens to integers, and encoding the input as a tensor.
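To make the text-to-subwords-to-integers idea concrete, here is a toy greedy longest-match tokenizer in the spirit of WordPiece. The vocabulary is made up for illustration and the function names are hypothetical. The real BERT tokenizer has a vocabulary of tens of thousands of entries and additional normalization rules.

```python
# Hypothetical toy vocabulary; "##" marks a piece that continues a word.
toy_vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "my": 3,
             "flight": 4, "was": 5, "delay": 6, "##ed": 7}


def split_word(word, vocab):
    """Greedily take the longest vocabulary piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no piece matches: the whole word is unknown
            return ["[UNK]"]
        pieces.append(piece)
        start = end
    return pieces


def encode_text(text, vocab):
    """Split on whitespace, tokenize each word, add special tokens, map to ids."""
    tokens = ["[CLS]"]
    for word in text.split():
        tokens.extend(split_word(word, vocab))
    tokens.append("[SEP]")
    return tokens, [vocab[token] for token in tokens]


tokens, input_ids = encode_text("my flight was delayed", toy_vocab)
print(tokens)
print(input_ids)
```

The word "delayed" is not in the toy vocabulary, so it is split into the pieces "delay" and "##ed", each of which has its own integer id.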
# Import AutoTokenizer and create tokenizer object
from transformers import AutoTokenizer
checkpoint = 'bert-base-cased'
tokernizer = AutoTokenizer.from_pretrained(checkpoint)
The code uses the AutoTokenizer class from the transformers library to load a pre-trained tokenizer for the BERT model with the "base" architecture and the "cased" version. The pre-trained tokenizer will be used to convert input sequences of text into numerical representations (tokens) that can be fed into the model. The checkpoint variable specifies the name of the pre-trained tokenizer to use, and the from_pretrained method loads that tokenizer.
Define tokenizer function
def tokenize_fn(batch):
    return tokernizer(batch['sentence'], truncation = True)
- “def tokenize_fn(batch):” defines the function “tokenize_fn”, which takes a single argument “batch”. The “batch” argument is expected to be a dictionary-like object that contains the text data to be tokenized.
- "return tokernizer(batch['sentence'], truncation = True)" returns the result of applying the tokenizer, "tokernizer", to the "sentence" feature of the batch. The "truncation" argument is set to "True", which means that sequences longer than the maximum length supported by the model are truncated.
tokenized_dataset = split.map(tokenize_fn, batched = True)
- “split.map(tokenize_fn, batched = True)” applies the “tokenize_fn” function to each example in the “split” dataset, which was obtained by splitting the “raw_dataset” into training and test sets.
- "batched = True" specifies that the tokenization function receives batches of examples rather than individual examples. This improves performance because the tokenizer can process many sentences in a single call.
- After that we will get tokenized data which we can directly feed to our model.
3. Choose a pre-trained model
Select a pre-trained Transformer model that is well-suited for the task. There are many pre-trained models available in popular NLP libraries such as Hugging Face Transformers, TensorFlow Hub, or the AllenNLP library. Choose a model with a good balance between the size of the model and the task complexity.
Import classification model
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
The code imports three classes from the transformers library: AutoModelForSequenceClassification, Trainer, and TrainingArguments.
AutoModelForSequenceClassification is a class that loads a sequence classification model, a type of model that is used to predict the class of a sequence of inputs (e.g., a sentence). It is one of the Auto classes, which read the configuration of the given checkpoint and instantiate the matching architecture (here, BERT with a classification head).
Trainer is a class that provides a high-level API for training a machine learning model. It can be used to train a torch.nn.Module instance, including models implemented using the transformers library.
TrainingArguments is a class that defines the arguments used to configure a training run. It includes arguments such as the number of training steps, the learning rate, the batch size, and many others. When using the Trainer class, an instance of TrainingArguments is passed to the constructor to specify the configuration for a training run.
4. Define a fine-tuning architecture
Decide which layers of the pre-trained model to fine-tune, and add additional layers as needed to perform the specific task. For example, for a text classification task, a dense layer with a softmax activation function may be added on top of the pre-trained model to produce class predictions.
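As a small illustration of what the classification layer produces, the sketch below turns raw scores (logits) into class probabilities with a softmax, using made-up numbers. Since softmax preserves the ordering of the scores, taking the argmax of the logits gives the same predicted class as taking the argmax of the probabilities.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for classes 0 (negative), 1 (positive), 2 (neutral).
example_logits = [2.0, 0.5, -1.0]
probs = softmax(example_logits)

predicted_from_probs = max(range(len(probs)), key=probs.__getitem__)
predicted_from_logits = max(range(len(example_logits)),
                            key=example_logits.__getitem__)

print(probs, predicted_from_probs)
```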
Load model
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels = 3)
For a binary classification problem (for example, positive vs. negative sentiment), use num_labels = 2.
Install the torchinfo library.
!pip install torchinfo
torchinfo is a Python library for getting information about PyTorch models and tensors. It provides a convenient way to inspect the architecture of a PyTorch model, including the shape and size of the tensors that are passed between the layers, as well as the number of parameters and memory usage of each layer.
Print the model summary
from torchinfo import summary
summary(model)
The model has more than 108 M parameters, and all of them are trainable.
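The 108 M figure can be roughly reproduced by hand. The arithmetic below assumes the published bert-base-cased configuration (vocabulary of 28,996 tokens, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 token types) and a single linear classification layer with 3 outputs. Treat these values as assumptions rather than something read from the summary output.

```python
# Assumed bert-base-cased configuration values.
vocab_size, hidden, n_layers = 28996, 768, 12
ffn_size, n_positions, n_token_types, num_labels = 3072, 512, 2, 3

layer_norm = 2 * hidden  # one weight and one bias per hidden unit

# Word, position and token-type embeddings, followed by a LayerNorm.
embeddings = (vocab_size + n_positions + n_token_types) * hidden + layer_norm

# Self-attention: query, key, value and output projections (weights + biases).
attention = 4 * (hidden * hidden + hidden) + layer_norm

# Feed-forward block: up-projection, down-projection, LayerNorm.
feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden) + layer_norm

encoder = n_layers * (attention + feed_forward)
pooler = hidden * hidden + hidden
classifier = hidden * num_labels + num_labels

total_params = embeddings + encoder + pooler + classifier
print(f"{total_params:,}")
```

Under these assumptions almost all parameters sit in the embeddings and the 12 encoder layers. The new classification layer contributes only a few thousand.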
5. Compile the model
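Choose a loss function and an optimizer suitable for the task, and define the training and evaluation procedures. With the Trainer API there is no explicit compile step: the model computes the loss itself when labels are provided and the Trainer uses the AdamW optimizer by default, so here this step amounts to setting the training arguments.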
training_args = TrainingArguments(
    output_dir='training_dir',
    evaluation_strategy='epoch',  # named eval_strategy in recent transformers releases
    save_strategy='epoch',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
)
The above code creates an object of the TrainingArguments class with the specified arguments for training a model. The output_dir argument sets the directory where the model and training-related files will be saved. The evaluation_strategy argument sets how often evaluation is done. Here it is done every epoch.
The save_strategy argument sets when the model is saved. Here it is saved every epoch. The num_train_epochs argument sets the number of training epochs, which is 3.
The per_device_train_batch_size argument sets the batch size for training, which is 16. The per_device_eval_batch_size argument sets the batch size for evaluation, which is 64.
Define evaluation metrics, which we will pass during training
def compute_metrics(logits_and_labels):
    logits, labels = logits_and_labels
    predictions = np.argmax(logits, axis=-1)
    acc = np.mean(predictions == labels)
    f1 = f1_score(labels, predictions, average = 'micro')
    return {'accuracy': acc, 'f1_score': f1}
The above code defines a function compute_metrics that takes a tuple logits_and_labels as input and computes two evaluation metrics: accuracy and F1 score.
The function first unpacks the tuple into logits and labels. Then it calculates the predictions using np.argmax along the last axis of logits. The accuracy is calculated as the mean of the equality of predictions and labels. The F1 score is calculated using the f1_score function from scikit-learn with average='micro'. The accuracy and F1 score are returned as a dictionary.
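One thing worth knowing about average='micro': for single-label classification, the micro-averaged F1 score is equal to accuracy, whereas the macro average weights every class equally and is more sensitive to minority classes. The sketch below computes both by hand on a small made-up example. The helper names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), defined as 0 when the denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(y_true, y_pred, classes):
    tp = {c: 0 for c in classes}
    fp = {c: 0 for c in classes}
    fn = {c: 0 for c in classes}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative
    # Micro: pool the counts of all classes, then compute one F1.
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro


# Hypothetical imbalanced toy data: class 0 is the majority class.
toy_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 2]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
micro_f1, macro_f1 = micro_macro_f1(toy_true, toy_pred, classes=[0, 1, 2])
print(toy_accuracy, micro_f1, macro_f1)
```

In this toy example the macro F1 is lower than the accuracy, because the mistakes fall on the two minority classes.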
6. Train the model
Train the model on the task-specific data using a suitable number of epochs, and monitor the performance of the model on a validation set. If necessary, adjust the model architecture or training procedure, and repeat the training process until satisfactory performance is achieved.
trainer = Trainer(model,
                  training_args,
                  train_dataset = tokenized_dataset["train"],
                  eval_dataset = tokenized_dataset["test"],
                  tokenizer=tokernizer,
                  compute_metrics=compute_metrics)
The above code creates an object of the Trainer class and assigns it to the variable trainer. This class is used for training and evaluating a machine learning model.
The Trainer class takes several arguments:
- model is the model to be trained.
- training_args is an instance of the TrainingArguments class that contains the arguments for training the model.
- train_dataset is the training dataset, which is tokenized_dataset["train"] in this case.
- eval_dataset is the evaluation dataset, which is tokenized_dataset["test"] in this case.
- tokenizer is the tokenizer used for the input data, which is tokernizer in this case. Passing it lets the Trainer pad each batch to a common length.
- compute_metrics is the function used to compute the evaluation metrics, which is compute_metrics in this case.
With these arguments, the Trainer object is configured to train and evaluate the specified model using the specified datasets, tokenizer, and evaluation metrics.
Call trainer object and train the model
trainer.train()
Since we set 3 epochs, we get 3 sets of evaluation metrics, one per epoch. Notice that the model is overfitting: the validation loss increases with each epoch. One epoch is therefore sufficient for this dataset, as it already gives about 83% accuracy and F1 score (the two match because micro-averaged F1 equals accuracy for single-label classification). Next, I will explain how to fine-tune further to improve accuracy.
After training, the model checkpoints are saved in training_dir
! ls training_dir
Output: checkpoint-1282, checkpoint-1923, checkpoint-641 runs
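As we can see, there are 3 checkpoints, one saved at the end of each epoch. We select the checkpoint with the best validation performance. Note that the code below loads checkpoint-1282, which was saved at the end of the second epoch.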
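The checkpoint names are simply the number of optimizer steps completed at the end of each epoch. A quick sanity check of the arithmetic, assuming training runs on a single device without gradient accumulation:

```python
import math

total_rows = 14640   # rows in the full dataset
test_fraction = 0.3  # test_size used in train_test_split
batch_size = 16      # per_device_train_batch_size
n_epochs = 3         # num_train_epochs

test_rows = math.ceil(test_fraction * total_rows)
train_rows = total_rows - test_rows

# The last, partially filled batch still counts as one step.
steps_per_epoch = math.ceil(train_rows / batch_size)
checkpoint_steps = [steps_per_epoch * epoch for epoch in range(1, n_epochs + 1)]

print(train_rows, steps_per_epoch, checkpoint_steps)
```

So checkpoint-641 is the model after the first epoch, checkpoint-1282 after the second, and checkpoint-1923 after the third.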
7. Evaluate the model
Evaluate the performance of the fine-tuned model on a test set, and compare it to other models or baselines.
Import pipeline
from transformers import pipeline
The above code imports the pipeline function from the transformers library. It is a high-level API for using pre-trained models for a variety of tasks, such as text classification, sequence labeling, and generation.
saved_model = pipeline('text-classification',
                       model = 'training_dir/checkpoint-1282')
The above code creates a pipeline and assigns it to the variable saved_model. This pipeline is used to perform text classification.
The pipeline function takes two arguments here:
- 'text-classification' is the task being performed, in this case text classification.
- model is the model to be used, which is 'training_dir/checkpoint-1282' in this case.
This code creates a text classification pipeline that uses the fine-tuned model stored in the directory 'training_dir/checkpoint-1282'. With this pipeline, we can classify text with the fine-tuned model in just a few lines of code.
Get test set
split['test']
Get predictions
predictions = saved_model(split['test']['sentence'])
Print a few predictions
predictions[:10]
We get a list of dictionaries. Next, we will write a function to extract only the labels.
def get_label(d):
    return int(d['label'].split('_')[1])
predictions = [get_label(d) for d in predictions]
The function takes the string stored under the key 'label' in the dictionary, splits it on the underscore using the split method, and returns the second element of the resulting list as an integer. This integer is the predicted label.
The second line uses a list comprehension to apply the get_label function to each element in the list predictions and returns a new list with the results.
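To illustrate the parsing step on its own, the sketch below applies the same idea to a few made-up pipeline outputs and also maps the integer ids back to class names. The scores are invented, and id2label is a hypothetical helper that inverts the target_map defined earlier.

```python
# Hypothetical pipeline output; real labels and scores will differ.
raw_predictions = [
    {'label': 'LABEL_0', 'score': 0.98},
    {'label': 'LABEL_2', 'score': 0.61},
    {'label': 'LABEL_1', 'score': 0.87},
]

# Inverse of target_map = {'positive': 1, 'negative': 0, 'neutral': 2}.
id2label = {0: 'negative', 1: 'positive', 2: 'neutral'}


def label_id(prediction):
    """Extract the integer after the underscore, e.g. 'LABEL_2' -> 2."""
    return int(prediction['label'].split('_')[1])


predicted_ids = [label_id(p) for p in raw_predictions]
predicted_names = [id2label[i] for i in predicted_ids]
print(predicted_ids, predicted_names)
```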
Once we have the labels, we calculate accuracy using the accuracy_score function.
print("acc:",accuracy_score(split['test']['label'], predictions))
acc: 0.8376593806921676
On the test set we get 83.77% accuracy.
Calculate f1 score
print("f1:",f1_score(split['test']['label'], predictions, average = 'macro'))
f1: 0.7837583256590012
On the test set we get a macro-averaged F1 score of 0.78.
Plot confusion matrix
# create function for plotting confusion matrix
def plot_cm(cm):
    # class names ordered by label id: 0 = negative, 1 = positive, 2 = neutral
    classes = ['negative','positive','neutral']
    df_cm = pd.DataFrame(cm, index=classes, columns=classes)
    ax = sns.heatmap(df_cm, annot = True, fmt='g')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
cm = confusion_matrix(split['test']['label'],predictions, normalize = 'true')
plot_cm(cm)
As we know, our dataset has a higher proportion of negative examples, and according to the confusion matrix, the model appears to have a slight bias towards the negative class.
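When reading the plot, keep in mind that normalize = 'true' divides every row by the number of actual examples of that class, so each row sums to 1 and the diagonal holds the per-class recall. Here is a plain-Python sketch of that normalization on made-up labels. The function name is hypothetical.

```python
def row_normalized_confusion(y_true, y_pred, n_classes):
    """Count (actual, predicted) pairs, then divide each row by its total."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for truth, pred in zip(y_true, y_pred):
        counts[truth][pred] += 1
    normalized = []
    for row in counts:
        row_total = sum(row)
        normalized.append([c / row_total if row_total else 0.0 for c in row])
    return counts, normalized


# Hypothetical labels: 0 = negative, 1 = positive, 2 = neutral.
cm_true = [0, 0, 0, 0, 1, 1, 2, 2]
cm_pred = [0, 0, 0, 1, 1, 0, 2, 0]

cm_counts, cm_normalized = row_normalized_confusion(cm_true, cm_pred, n_classes=3)
print(cm_counts)
print(cm_normalized)
```

In this toy example the first column of every row is large, which is the same pattern as a model that leans towards predicting the negative class.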
The same procedure can be used for any classification task.
Next, I will demonstrate how to train and fine-tune transformer models for other tasks on custom datasets.