How to Fine-Tune an NLP Classification Model with HuggingFace

Phonex Chemutai
7 min read · Nov 12, 2023


This tutorial is a practical guide to training a custom Natural Language Processing (NLP) classification model with the Hugging Face transformers library. We start from a pre-trained model and fine-tune it using transfer learning, walking through each step so that you can build and improve your own NLP classification models.

Classification Model

For demonstration purposes, we will build a classification model that predicts whether a tweet's sentiment is positive, negative, or neutral. Hugging Face is an open-source platform and provider of machine learning technologies. You can install its packages to access pre-built models and either use them directly or fine-tune them (retrain them on your own dataset while leveraging the knowledge gained during their original training), then host your trained models on the platform so you can reuse them later on other devices and in other apps.

The Hugging Face models are deep learning based, so they need significant GPU compute to train. Use Google Colab, another GPU cloud provider, or a local machine with an NVIDIA GPU.

Install the Required Libraries

For this tutorial, install the following libraries:
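In a Colab notebook, a typical install command might look like the one below; the exact package list is an assumption based on the imports used later in this tutorial.

!pip install transformers datasets evaluate accelerate torch scikit-learn pandas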

Load the Data

We have the train and test datasets stored as CSV files in Google Drive. Note that Hugging Face expects the data to be a DatasetDict.

I manually split the training set into a training subset (the data the model will learn from) and an evaluation subset (the data the model will use to compute metric scores, which helps catch training problems such as overfitting).

Next, we save the split subsets and load them as datasets.
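A minimal sketch of this step, assuming the CSV lives in a mounted Google Drive folder; the file path and the 80/20 split ratio are assumptions.

import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import load_dataset

# Hypothetical path; point this at your own CSV in Google Drive.
train_df = pd.read_csv('/content/drive/MyDrive/Train.csv')

# Hold out part of the training data for evaluation.
train_subset, eval_subset = train_test_split(train_df, test_size=0.2, random_state=42)

# Save the split subsets to disk ...
train_subset.to_csv('train_subset.csv', index=False)
eval_subset.to_csv('eval_subset.csv', index=False)

# ... and load them back as a DatasetDict with 'train' and 'eval' splits.
dataset = load_dataset('csv', data_files={'train': 'train_subset.csv',
                                          'eval': 'eval_subset.csv'})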

We then prepare our data for training a machine learning model by converting text data into tokenized format and ensuring the labels are in a numerical format suitable for modeling.

The ‘transform_labels’ function takes a dictionary with a ‘label’ key and maps the label to numerical values. The labels -1, 0, and 1 are mapped to 0, 1, and 2, respectively.

The ‘tokenize_data’ function then tokenizes the ‘safe_text’ column of each example using a tokenizer. It pads the sequences to a maximum length of 128. Additionally, it removes unnecessary columns (‘tweet_id’, ‘label’, ‘safe_text’, ‘agreement’).
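A sketch of these two functions and how they are applied with Dataset.map; the tokenizer checkpoint and the exact implementation details are assumptions based on the description above.

from transformers import AutoTokenizer

checkpoint = 'cardiffnlp/twitter-roberta-base-sentiment'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def transform_labels(example):
    # Map the original labels -1, 0, 1 to 0, 1, 2.
    label_map = {-1: 0, 0: 1, 1: 2}
    return {'labels': label_map[example['label']]}

def tokenize_data(example):
    # Tokenize the tweet text, padding every sequence to 128 tokens.
    return tokenizer(example['safe_text'], padding='max_length',
                     max_length=128, truncation=True)

dataset = dataset.map(tokenize_data, batched=True)
dataset = dataset.map(transform_labels,
                      remove_columns=['tweet_id', 'label', 'safe_text', 'agreement'])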

Output:

A DatasetDict object containing training and evaluation datasets, each with features like ‘input_ids’, ‘attention_mask’, and ‘labels’.

* input_ids: These are tokenized input sequences converted to IDs, ready for model input.

* attention_mask: This is a binary mask that indicates which tokens the model should pay attention to (usually 1 for tokens, 0 for padding).

* labels: These are the target labels for the text classification task.

Fine-Tune the Model

In this problem, we will fine-tune the sentiment analysis model called “cardiffnlp/twitter-roberta-base-sentiment”. The model is based on RoBERTa, a variant of the BERT architecture, and has been fine-tuned for sentiment analysis on Twitter data by the Cardiff NLP research group, so it should perform well on tweets and similar short-form text. The sentiment labels for our dataset are 0 (Negative), 1 (Neutral), and 2 (Positive).

We start by loading the model using the Hugging Face Transformers library in Python.
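A minimal sketch of loading the checkpoint for three-class classification:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'cardiffnlp/twitter-roberta-base-sentiment', num_labels=3)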

We then instantiate the train and validation datasets as shown below.
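A minimal sketch, assuming the DatasetDict splits created earlier are named 'train' and 'eval':

train_dataset = dataset['train'].shuffle(seed=10)
eval_dataset = dataset['eval'].shuffle(seed=10)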

The ‘shuffle’ method is used to shuffle the examples within the training and validation sets. The ‘seed’ parameter is set to 10, ensuring reproducibility if you need to recreate the same shuffled datasets.

Training arguments

We then configure the training arguments for our model using the Hugging Face Transformers library.
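A sketch of such a configuration; the specific hyperparameter values below are illustrative assumptions, not the exact ones used in the original run.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='roberta_fine_tuned_sentiment',
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy='epoch',   # renamed to eval_strategy in newer transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    push_to_hub=True,              # requires a Hugging Face login; used later to push the model
)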

These arguments provide a good starting point for training the model.

Pad Sequences

Since we are working with a sequence-based model that requires input sequences to be of the same length within a batch, we implement a ‘DataCollatorWithPadding’, which is a class from the Hugging Face Transformers library that helps collate and pad sequences in a dataset.

from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, padding='max_length', max_length=128, return_tensors='pt')

It handles input sequences of varying lengths and efficiently batches them together. In natural language processing (NLP) tasks, input text sequences often have varying lengths, while neural networks typically require fixed-length inputs. By using ‘DataCollatorWithPadding’, we ensure that our input sequences are appropriately padded, making them compatible with transformer-based models and improving overall training efficiency.

Define a custom trainer

We then define a custom trainer by subclassing the ‘Trainer’ class and overriding the ‘compute_loss’ method.

While the default trainers provided by Hugging Face are powerful and cover many use cases, defining a custom trainer becomes essential as we need to extend or modify the training process to suit the specific needs of the problem.
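A sketch of such a trainer, following the custom-loss pattern from the Transformers documentation; the class weights below are placeholders and should be derived from your actual label distribution.

import torch
from torch import nn
from transformers import Trainer

# Placeholder weights for the three classes; compute real ones from the label counts.
class_weights = torch.tensor([1.0, 2.0, 1.5])

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        # Retrieve the labels and run a forward pass with the provided inputs.
        labels = inputs.get('labels')
        outputs = model(**inputs)
        logits = outputs.get('logits')
        # Cross-entropy with class weights to counter the class imbalance.
        loss_function = nn.CrossEntropyLoss(weight=class_weights.to(logits.device))
        loss = loss_function(logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss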

The ‘compute_loss’ method retrieves the labels from the input batch and performs a forward pass through the model using the provided inputs. It then defines a custom loss function using cross-entropy loss with class weights and computes the loss by applying it to the logits and labels.

We use nn.CrossEntropyLoss with class weights because we are dealing with an imbalanced dataset: some classes have significantly fewer examples than others, and without weighting the model may become biased toward the majority class.

Instantiate the trainer

With the training arguments and the custom loss function set up, we define the evaluation metric and instantiate the trainer.
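A sketch of the metric and trainer setup, using the evaluate library for accuracy; the names follow the earlier snippets.

import numpy as np
import evaluate

accuracy_metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    # Convert logits to predicted class IDs and compare against the references.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

custom_trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)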

This configuration sets up the custom_trainer with the necessary components for training our model.

Train the model

The last step is to train the model using the “train” method.

custom_trainer.train()

This will initiate the training loop using the RoBERTa model, the training and evaluation datasets, and the training arguments. The CustomTrainer handles the training process, including loss computation, parameter updates, and evaluation.


To evaluate the trained model on the provided evaluation dataset we use the “evaluate” method.

custom_trainer.evaluate()

This method computes various evaluation metrics based on the predictions of the model on the evaluation dataset. In this case, we are using the accuracy metric to evaluate the performance of the model.

Output:

The eval_loss value represents the average loss computed on the evaluation dataset. It measures how well the model minimizes the difference between predicted and actual labels; lower values indicate better performance. An accuracy of 0.771 means that approximately 77.1% of the instances in the evaluation set were predicted correctly.

Model Improvement

To improve the accuracy of the model, several strategies can be applied, such as fine-tuning the hyperparameters: experiment with different settings for the learning rate, batch size, and number of training epochs. Alternatively, consider trying different model architectures or variations; for example, you could experiment with larger or smaller versions of your current model, or try a different pre-trained model from the Hugging Face model hub.
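For instance, a follow-up experiment might lower the learning rate, train for more epochs, or swap in a different checkpoint; the values and model name below are purely illustrative.

training_args = TrainingArguments(
    output_dir='roberta_fine_tuned_sentiment_v2',
    num_train_epochs=5,              # train longer
    learning_rate=1e-5,              # smaller learning rate
    per_device_train_batch_size=32,  # larger batches
    evaluation_strategy='epoch',
    save_strategy='epoch',
    load_best_model_at_end=True,
)

# Or try a different pre-trained model from the Hub:
# model = AutoModelForSequenceClassification.from_pretrained('roberta-large', num_labels=3)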

Push the Model

The last step is to push your fine-tuned model and tokenizer to the Hugging Face Model Hub.

custom_trainer.push_to_hub()
tokenizer.push_to_hub("roberta_fine_tuned_sentiment")

Conclusion

In conclusion, training a custom sentiment analysis model using transformers and fine-tuning techniques with Hugging Face’s library provides a powerful framework for creating models tailored to specific tasks. The process involves leveraging pre-trained transformer models, adapting them to domain-specific data, and optimizing performance through hyperparameter tuning.

Here is the link to the Hugging Face Space where the trained model is hosted: https://huggingface.co/spaces/phinm/pretrained_sentiment_Roberta

Appreciation

I highly recommend Azubi Africa for its comprehensive and effective programs. Read more articles at https://medium.com/@azubiafrica and take a few minutes to visit this link to learn more about Azubi Africa’s life-changing programs.

https://bit.ly/41CGCwK
