Fine-Tuning Hugging Face Models for COVID-19 Vaccine Sentiment Analysis on Twitter Data

Newton Kimathi
8 min read · Sep 17, 2023


The emergence of social media platforms has significantly transformed the landscape of public discussion and information sharing. Twitter is one of the most popular microblogging platforms and has become a hub for discussions on a wide range of topics, including public health issues. In the context of the COVID-19 pandemic, social media played a pivotal role in shaping public opinion and spreading information about vaccination efforts.

Understanding the sentiment expressed in Twitter posts related to COVID-19 vaccines is of paramount importance. It offers insights into the public’s perception of vaccination initiatives, which can be invaluable for governments and public health agencies in tailoring effective communication strategies and policies. To address this need, this project focuses on developing a machine learning model for sentiment classification of COVID-19 vaccine-related tweets. Leveraging Hugging Face Transformers, we aim to fine-tune state-of-the-art models to accurately classify tweets as positive, negative, or neutral in sentiment.

Fine-Tuning

Fine-tuning refers to the process of taking a pre-trained model, which has already learned extensive knowledge from a vast corpus of data, and further training it on a specific task or dataset (Cloud, 2023).

Imagine you have a language model that has been pre-trained on a massive amount of text from the internet. This model has already learned grammar, vocabulary, and even some level of common sense. However, it doesn’t know how to perform task-specific jobs, like sentiment analysis or text classification.

Fine-tuning comes into play to adapt this general-purpose model to your specific needs. You provide it with a smaller, task-specific dataset and train it for a shorter period. During this process, the model learns to adjust its internal parameters to perform well on your particular task, whether it’s sentiment analysis, text generation, translation, or any other natural language processing task.

In the context of this project, fine-tuning Hugging Face models for COVID-19 vaccine sentiment analysis means taking a pre-trained language model and training it to understand and classify Twitter posts’ sentiment regarding COVID-19 vaccines. This allows the model to recognize positive, negative, and neutral sentiments in tweets, making it a valuable tool for analyzing public opinion and improving communication strategies related to vaccine efforts.

Sentiment Analysis Using Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies (Hugging Face, n.d.). You can install their packages to access pre-built models, use them directly or fine-tune them (retrain them on your own dataset, leveraging the knowledge gained during the original training), and then host your trained models on the platform so you can use them later on other devices and in other applications.

Please [visit their website and sign in](https://huggingface.co/) to access all the features of the platform. You can read more about text classification with Hugging Face in their documentation.

Hugging Face models are deep-learning based, so they need substantial GPU compute to train.
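
Before starting, it helps to confirm that a GPU is actually visible to your runtime. A minimal check, assuming the PyTorch backend that the Transformers Trainer uses by default:

import torch

# Prints True if a CUDA GPU is available; otherwise training will fall back to the (much slower) CPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))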

Data Understanding

The data comes from tweets collected and classified through Crowdbreaks.org [Muller, Martin M., and Marcel Salathe. “Crowdbreaks: Tracking Health Trends Using Public Social Media Data and Crowdsourcing.” Frontiers in public health 7 (2019).]. Tweets have been classified as pro-vaccine (1), neutral (0) or anti-vaccine (-1). The tweets have had usernames and web addresses removed.

The variables are described below:

  • tweet_id: Unique identifier of the tweet.
  • safe_tweet: Text contained in the tweet. Sensitive information such as usernames and URLs has been removed.
  • label: Sentiment of the tweet (-1 for negative, 0 for neutral, 1 for positive).
  • agreement: The tweets were labelled by three reviewers; agreement indicates the percentage of the three reviewers who agreed on the given label. You may use this column in training, but agreement data is not shared for the test set.

The objective is to develop a machine learning model that assesses whether a Twitter post related to vaccinations is positive, neutral, or negative.


BERT Model


Bidirectional Encoder Representations from Transformers (BERT) is a state-of-the-art transformer-based model developed by Google. BERT was pre-trained on the BooksCorpus dataset and English Wikipedia, and it obtained state-of-the-art results on eleven natural language processing tasks (Kamath et al., 2022).

BERT was pre-trained on two tasks simultaneously:

  • Masked language modelling (MLM): 15% of the tokens were masked, and the model was trained to predict the masked tokens (see the sketch after this list).
  • Next Sentence Prediction (NSP): given two sentences A and B, predict whether B follows A.
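
To get an intuition for the masked-language-modelling objective, you can query BERT's fill-mask head through the Transformers pipeline API. This is only an illustration of the pre-training objective, not part of the fine-tuning workflow, and the example sentence is made up:

from transformers import pipeline

# BERT predicts the most likely tokens for the [MASK] position
fill_mask = pipeline("fill-mask", model="bert-base-cased")
for prediction in fill_mask("Vaccines help protect people from [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))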

BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

BERT has several variants, including “bert-base-uncased” and “bert-base-cased,” which differ in their tokenization strategies (whether text is lowercase or case-sensitive), and there are larger versions of BERT with more parameters for even better performance.
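
To see the difference in practice, here is a small illustrative sketch that tokenizes the same sentence with both variants:

from transformers import AutoTokenizer

cased = AutoTokenizer.from_pretrained("bert-base-cased")
uncased = AutoTokenizer.from_pretrained("bert-base-uncased")

# The cased tokenizer preserves capitalization; the uncased one lowercases the text first
print(cased.tokenize("COVID Vaccines Work"))
print(uncased.tokenize("COVID Vaccines Work"))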

Installations and Imports

The core components we’ll be using are Hugging Face’s Transformers and Datasets libraries. These libraries simplify the process of working with state-of-the-art natural language processing (NLP) models. We need to install the following modules:

pip install huggingface_hub
pip install datasets
pip install transformers
  • huggingface_hub: The Hugging Face Hub library lets us easily access and share pretrained models.
  • transformers: The library provides access to a wide range of pretrained models, including BERT, GPT-2, and more.
  • datasets: The library loads datasets and provides an intuitive API for working with them.
from datasets import load_dataset

We use load_dataset from the datasets library to download datasets from the Hugging Face Hub, or to load our own custom dataset.
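
As a sketch, assuming the labelled tweets are stored locally as CSV files (the file names train.csv and eval.csv below are placeholders for wherever you keep the data), the dataset can be loaded into train and evaluation splits like this:

from datasets import load_dataset

# Load local CSV files into a DatasetDict with 'train' and 'eval' splits
dataset = load_dataset(
    "csv",
    data_files={"train": "train.csv", "eval": "eval.csv"},
)
print(dataset)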

Tokenization is the process of splitting raw text into smaller units called tokens (Hvitfeldt & Silge, 2021). Tokens are typically words or subwords. Tokenization is a fundamental step in NLP because it breaks down text data into manageable units that can be processed by NLP models. In this case we are using a pretrained tokenizer from Hugging Face’s Transformers library.

We will import the AutoTokenizer class from the Hugging Face Transformers library, then create an instance of it initialized with a pre-trained BERT tokenizer.

# Import AutoTokenizer
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

We define a function, tokenize_function, that takes a single argument: a dictionary-like batch of examples containing a 'clean_text' key holding the pre-processed tweet text. The function calls the tokenizer object initialized earlier to tokenize that text.

Padding is the process of adding special tokens (often <PAD>) to sequences to make them uniform in length. Setting padding='max_length' in the tokenizer pads every tokenized sequence to the same fixed maximum length, so that all sequences in a batch can be stacked into tensors of equal size.
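
For instance, using the tokenizer created above with a small fixed max_length (20 here, purely for illustration), tweets of different lengths come back with input_ids of identical length:

# Two tweets of different lengths, padded (and truncated) to the same fixed length
batch = tokenizer(
    ["Got my shot today", "Still reading up on the different vaccines before deciding"],
    padding="max_length",
    max_length=20,
    truncation=True,
)
print([len(ids) for ids in batch["input_ids"]])  # [20, 20]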

# A function to tokenize the data
def tokenize_function(df):
    return tokenizer(df['clean_text'], padding='max_length')

We apply the tokenize_function to the dataset using the map function, in a batched manner. This tokenizes each text field and pads the sequences.

The argument batched=True allows the function to tokenize and pad multiple examples at once. Batch processing is essential for speeding up NLP tasks, since it processes many examples in parallel, which is especially beneficial when working with large datasets.

# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_function, batched=True)
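
After mapping, each example carries the tokenizer outputs alongside the original columns; a quick inspection (illustrative, assuming the splits described above) shows the new fields:

# The BERT tokenizer adds input_ids, token_type_ids and attention_mask to every example
print(dataset["train"].column_names)
print(dataset["train"][0]["input_ids"][:10])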

We define a function that transforms the sentiment labels -1, 0, and 1 into the zero-based class indices the model expects: -1 (negative) becomes 0, 0 (neutral) becomes 1, and 1 (positive) becomes 2.

def transform_labels(label):
    label = label['label']
    num = 0
    if label == -1:    # Negative
        num = 0
    elif label == 0:   # Neutral
        num = 1
    elif label == 1:   # Positive
        num = 2
    return {'labels': num}

# Transform labels and remove the useless columns
remove_columns = ['tweet_id', 'label', 'agreement','clean_text']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)
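
A quick spot check confirms the mapping behaves as intended:

# -1 (negative) -> 0, 0 (neutral) -> 1, 1 (positive) -> 2
print(transform_labels({'label': -1}))  # {'labels': 0}
print(transform_labels({'label': 0}))   # {'labels': 1}
print(transform_labels({'label': 1}))   # {'labels': 2}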

We will then specify the training arguments as below:

# Specifying the training arguments
from transformers import TrainingArguments

# Configure the training parameters like `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
training_args = TrainingArguments(
    "Covid_Vaccine_Sentiment_Analysis_Bert_based_Model",
    num_train_epochs=3,
    load_best_model_at_end=True,
    push_to_hub=True,
    evaluation_strategy="steps",
    save_strategy="steps",
)

The AutoModelForSequenceClassification class provides a unified interface for loading various pre-trained models (such as BERT, RoBERTa, etc.) and fine-tuning them for sequence classification tasks. We will load the pre-trained BERT model and configure it for sequence classification with the specified number of labels.

# We import the AutoModelForSequenceClassification class from the Transformers library
from transformers import AutoModelForSequenceClassification

# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)

# We shuffle the dataset to randomize the data and avoid any ordering bias
train_dataset = dataset['train'].shuffle(seed=10)
eval_dataset = dataset['eval'].shuffle(seed=10)

We initialize the Trainer object from the Transformers library. The Trainer class is a high-level API that simplifies the training and evaluation of transformer-based models for various NLP tasks.

# We import the Trainer class from the Transformers library
from transformers import Trainer

# We initialize the trainer object and specify the arguments
trainer = Trainer(
    model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)

# Launch the learning process: training
trainer.train()

Training will run for the number of epochs specified. Once training is done, we can push the model to the Hugging Face Hub.

# Push the trained model to the Hugging Face Model Hub
trainer.push_to_hub()
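
Once pushed, the fine-tuned model can be loaded back anywhere by name. A sketch, where your-username is a placeholder for the account the model was pushed under; the predicted label comes back as LABEL_0/LABEL_1/LABEL_2, matching the 0 = negative, 1 = neutral, 2 = positive mapping we defined:

from transformers import pipeline

# Load the fine-tuned model from the Hub and classify a new tweet
classifier = pipeline(
    "text-classification",
    model="your-username/Covid_Vaccine_Sentiment_Analysis_Bert_based_Model",
)
print(classifier("I finally got my second dose and feel great!"))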

We run trainer.evaluate() to check the accuracy, but before that we need to load a metric and define a compute_metrics function.

import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)
# Launch the final evaluation
trainer.evaluate()

# Output
[250/250 01:02]
{'eval_loss': 0.6179400682449341,
'eval_accuracy': 0.744,
'eval_runtime': 63.1494,
'eval_samples_per_second': 31.671,
'eval_steps_per_second': 3.959}

The datasets library offers a wide range of metrics. We are using accuracy here. On our data, we got an accuracy of 74.4% by training for only 3 epochs.
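
For example, a weighted F1 score (often more informative than plain accuracy when the sentiment classes are imbalanced) could be swapped in with a small change to compute_metrics. A sketch, assuming the same Trainer setup as above:

import numpy as np
from datasets import load_metric

f1_metric = load_metric("f1")

def compute_f1(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # 'weighted' averages the per-class F1 scores by class frequency
    return f1_metric.compute(predictions=predictions, references=labels, average="weighted")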

Accuracy could be increased further by training for longer or by doing more data pre-processing, but due to computational cost we will leave it here for now.

Thanks for reading.

References

Cloud. (2023, September 8). How to fine-tune a pre-trained model. Saturn Cloud | Your data science cloud environment. https://saturncloud.io/blog/how-to-fine-tune-a-pre-trained-model/

Hugging Face. (n.d.). Hugging Face — The AI community building the future. https://huggingface.co/

Hvitfeldt, E., & Silge, J. (2021). Tokenization. Supervised Machine Learning for Text Analysis in R, 9–36. https://doi.org/10.1201/9781003093459-3

Kamath, U., Graham, K. L., & Emara, W. (2022). Bidirectional encoder representations from transformers (BERT). Transformers for Machine Learning, 43–70. https://doi.org/10.1201/9781003170082-3

