Decoding Emotions: Sentiment Analysis with DistilBERT

Aditya Jethani
9 min read · Oct 16, 2023


1. Introduction

1.1 Sentiment Analysis

a. Understanding Sentiment Analysis

Examining and identifying the feelings, opinions, and attitudes expressed in a text is called sentiment analysis. The goal of this technique, which belongs to the field of natural language processing (NLP), is to ascertain whether a text carries a positive, negative, or neutral emotional tone. Numerous industries, including marketing, finance, politics, and customer service, find substantial use for this analytical approach, which enables businesses to gain valuable insights from customer feedback.

b. Sentiment Analysis Varieties

Machine learning techniques are used in sentiment analysis to classify text sentiment as either positive, negative, or neutral. Sentiment analysis comes in three main flavors:

  1. Comprehensive Sentiment Analysis: This type divides text sentiment into multiple categories, such as positive, negative, and neutral, to allow for a more in-depth analysis.
  2. Binary Sentiment Analysis: Text sentiment is divided into just two categories: positive or negative (see the short example after this list).
  3. Emotion Recognition: This type identifies the precise emotions expressed in a text, such as joy, sorrow, anger, and more.
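
For instance, the binary flavor is easy to try with an off-the-shelf Hugging Face pipeline. This is a quick sketch for illustration; the default model behind the pipeline is downloaded automatically and is not the model we fine-tune later.

from transformers import pipeline

# Quick sketch of binary sentiment analysis with a default pre-trained pipeline
classifier = pipeline("sentiment-analysis")
print(classifier("I absolutely loved this movie!"))
# Example output: [{'label': 'POSITIVE', 'score': 0.99...}]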

1.2 Huggingface
a. An Overview of Hugging Face

Hugging Face, a well-known open-source NLP library, provides a variety of pre-trained Sentiment Analysis models as well as other NLP capabilities. It has won wide praise from the NLP community thanks to its simple interface, reliable API, and thorough documentation.

b. The Value of Hugging Face in NLP

With pre-trained models for tasks such as sentiment analysis, Hugging Face has revolutionized NLP for researchers and developers by lowering the time and resource requirements for model training. Its fine-tuning tools and APIs let practitioners adapt these models to particular datasets, enabling state-of-the-art performance and faster experimentation.

2. Dataset Exploration

In this post we will go over how to fine-tune the DistilBERT model for emotion classification, aiming to sort text into emotion categories such as joy, sadness, love, anger, fear, and surprise. We will use the Hugging Face library and the "emotion" dataset for training and evaluation.

Note: If your machine has a GPU or TPU, you can configure your Jupyter notebook to use it. Otherwise, I advise switching to Google Colab.
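
A quick way to confirm what hardware is available (a minimal check using PyTorch):

import torch

# Check whether PyTorch can see a GPU
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))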

2.1 Read, Split and Prepare data

Before delving into fine-tuning, let's first explore the "emotion" dataset by loading it and inspecting its properties.

from datasets import load_dataset

# Load the "emotion" dataset from the Hugging Face Hub
emotions = load_dataset("emotion")
train_ds = emotions["train"]

# Inspect the column types and a few raw examples
print(train_ds.features)
print(train_ds[:5])
print(train_ds["text"][:5])
Details of Training, Testing, Validating datasets

The code snippet above loads the "emotion" dataset and retrieves the training partition. We print the features of the dataset, which contains "text" and "label" columns, and then display the first five examples along with their texts. This helps us get a feel for the data we are working with.
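
Since the "label" column is a ClassLabel feature, we can also look up the human-readable class names, which is a quick check worth doing:

# The "label" column is a ClassLabel, so we can convert between ids and names
print(train_ds.features["label"].names)
# ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
print(train_ds.features["label"].int2str(0))  # 'sadness'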

2.2 Converting to DataFrames

We transform the dataset into a pandas DataFrame to make data management and visualization easier. This makes it simple for us to carry out a variety of analysis and plotting tasks.

import pandas as pd

# Return dataset rows as pandas objects
emotions.set_format(type="pandas")
df = emotions["train"][:]

# Map integer labels to their string names so the plots below are readable
df["label_name"] = df["label"].apply(emotions["train"].features["label"].int2str)
df.head()

We utilize the set_format method to return rows in pandas format, and we add a label_name column by converting each integer label to its string name with int2str, since the plotting code below relies on it.
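
One detail worth knowing: set_format only changes how the dataset returns rows, and since df is now a standalone DataFrame, you can revert to the default format at any point so that later steps such as tokenization receive plain Python batches:

# Revert the dataset's output format to the default (plain Python objects)
emotions.reset_format()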

2.3 Class distribution analysis

An understanding of the class distribution is important in any text classification problem. We want to check whether our dataset is balanced across emotions, since imbalances in the class distribution can affect both training and evaluation design.

import matplotlib.pyplot as plt

# Plot how many examples each emotion class contains
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()

In this code, we plot a horizontal bar chart to visualize the frequency of the emotion classes, which reveals important imbalances: "joy" and "sadness" appear frequently, while "love" and "surprise" are much rarer. Dealing with this imbalance can be costly, so for simplicity we work with the raw, imbalanced frequencies in this blog.

Whenever you are working on text classification problems, it is a good idea to examine the distribution of examples across the classes. A dataset with a skewed class distribution might require a different treatment in terms of the training loss and evaluation metrics than a balanced one.
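
As an illustration of such treatment (a sketch only; this post keeps the raw frequencies), one common option is to weight the training loss inversely to class frequency:

from collections import Counter
import torch

# Sketch: weight the loss inversely to class frequency (not used in this post)
counts = Counter(df["label"])
total = sum(counts.values())
class_weights = torch.tensor(
    [total / counts[i] for i in range(len(counts))], dtype=torch.float
)
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)

Plugging such a weighted loss into the Trainer would require subclassing it and overriding its compute_loss method, which is beyond the scope of this post.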

Plotted Graph between Types of Emotions and Number of Such Labels

2.4 Text length analysis

Understanding the distribution of text lengths in our dataset helps us assess data quality and anticipate potential challenges in training.

df["Words Per Tweet"] = df["text"].str.split().apply(len)
df.boxplot("Words Per Tweet", by="label_name", grid=False, showfliers=False, color="black")
plt.suptitle("")
plt.xlabel("")
plt.show()

By splitting the text on whitespace and applying the len function, the code snippet above determines how many words each tweet contains. The distribution of words per tweet for each emotion class is then depicted with a boxplot. This lets us see whether there are any text length disparities across classes that merit further investigation.
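
Since DistilBERT can only attend to a fixed number of tokens (512 for this checkpoint), it is reassuring to confirm that even the longest tweets fit comfortably:

# The longest tweet is far below DistilBERT's 512-token context limit
print(df["Words Per Tweet"].max())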

How long are our tweets?

3. Working with Model

3.1 Tokenization

In order for the model to process text, it must first be tokenized, which is a critical step in any NLP task. Here, we tokenize our text data using the DistilBERT tokenizer.

from transformers import AutoTokenizer

# Load the tokenizer that matches the DistilBERT checkpoint
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

def tokenize(batch):
    # Pad and truncate each batch of texts to a common length
    return tokenizer(batch["text"], padding=True, truncation=True)

# Tokenize every split of the dataset in batches
emotions_encoded = emotions.map(tokenize, batched=True)

We initialize the DistilBERT tokenizer from the "distilbert-base-uncased" checkpoint. We define a tokenize function that takes a batch of texts and applies tokenization with padding and truncation. We then use the map method of the emotions dataset to tokenize the text data in batches.
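
To see what the tokenizer actually produces, it helps to run it on a single sentence (an illustrative example):

# Tokenize one sentence and inspect the output
sample = tokenizer("I love machine learning!")
print(sample["input_ids"])       # integer token ids, with [CLS] and [SEP] added
print(sample["attention_mask"])  # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))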

3.2 Model Initialization and Configuration

We must initialize the pre-trained DistilBERT model with a classification head before we can commence fine-tuning. We also define the device, the number of labels, and the mappings between label indices and emotion names.

from transformers import AutoModelForSequenceClassification
import torch

# Use a GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_labels = 6
id2label = {
    "0": "sadness",
    "1": "joy",
    "2": "love",
    "3": "anger",
    "4": "fear",
    "5": "surprise",
}
label2id = {
    "sadness": 0,
    "joy": 1,
    "love": 2,
    "anger": 3,
    "fear": 4,
    "surprise": 5,
}

# Load DistilBERT with a fresh classification head for our six emotions
model = AutoModelForSequenceClassification.from_pretrained(
    model_ckpt,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id,
).to(device)

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(pred):
    # pred carries both the model's raw predictions and the true label ids
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    f1 = f1_score(labels, preds, average="weighted")
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}

Using the AutoModelForSequenceClassification class from the transformers library, we instantiate a DistilBERT model for sequence classification. We call the from_pretrained method to load the pre-trained "distilbert-base-uncased" checkpoint, and we pass the number of labels along with the mappings between label indices and emotion names. Finally, we define a compute_metrics function that reports accuracy and a weighted F1 score during evaluation.
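
Before training, a quick optional sanity check is to push one example through the untrained model and confirm the output shape matches our six classes:

# Optional sanity check: one forward pass through the untrained model
inputs = tokenizer("I am so happy today!", return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 6]) -- one logit per emotion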

Before proceeding further, you need to log in to Hugging Face with an Access Token. Go to the Access Tokens page in your Hugging Face settings and create a new token as shown below.

Click on the New token button and generate a new access token

Next, enter the following code in the notebook:

from huggingface_hub import notebook_login

notebook_login()

3.3 Initialization and Training Configuration

We must specify the training configuration and initialize the Trainer object before we can fine-tune the model.

from transformers import Trainer, TrainingArguments

batch_size = 64
# Log once per epoch's worth of steps
logging_steps = len(emotions_encoded["train"]) // batch_size
model_name = f"{model_ckpt}-finetuned-emotion"

training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=2,
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    evaluation_strategy="epoch",
    disable_tqdm=False,
    logging_steps=logging_steps,
    push_to_hub=True,
    log_level="error",
)

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=emotions_encoded["train"],
    eval_dataset=emotions_encoded["validation"],
    tokenizer=tokenizer,
)

The code snippet above defines the training arguments: the output directory, the number of training epochs, the learning rate, batch sizes, weight decay, the evaluation strategy, and logging settings. We then build a Trainer object from the model, the training arguments, the metric computation function, the training and validation datasets, and the tokenizer.

Once you have entered your HF access token at the notebook_login() prompt above, you are authenticated and ready to train.

We can now use the Trainer object to begin the fine-tuning process.

trainer.train()

4. Evaluation Metrics

We next assess the performance of our model once the training procedure is over. We use several criteria: accuracy, which measures the percentage of correctly classified examples, and the F1 score, which balances precision and recall to offer a more thorough performance assessment. These metrics give us a way to judge how effectively and reliably the trained model predicts the dataset's emotional states.

preds_output = trainer.predict(emotions_encoded["validation"])
preds_output.metrics

The predict method of the Trainer object produces predictions for the validation dataset. After generating predictions, we read the evaluation metrics from the preds_output object, including test loss, test accuracy, and test F1 score. These metrics offer a thorough assessment of the model's performance on the validation split, helping us understand its predictive power.
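
Beyond aggregate scores, a confusion matrix (a sketch using scikit-learn) reveals which emotions the model confuses with one another, such as "love" being predicted as "joy":

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Normalized confusion matrix over the validation predictions
y_preds = preds_output.predictions.argmax(-1)
y_true = preds_output.label_ids
emotion_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
cm = confusion_matrix(y_true, y_preds, normalize="true")
ConfusionMatrixDisplay(cm, display_labels=emotion_names).plot(cmap="Blues")
plt.show()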

5. Comparing DistilBERT with RoBERTa

In this section we compare our fine-tuned DistilBERT model with a fine-tuned RoBERTa model. The fine-tuning process follows the same steps as for DistilBERT. The full code can be found in the repository linked below.

5.1 Metrics Evaluation

We shall focus on the comparison of the two models based on their Metrics.

# Evaluate RoBERTa (trainer_roberta and emotions_encoded_roberta come from
# the RoBERTa fine-tuning run in the linked repository)
preds_output_roberta = trainer_roberta.predict(emotions_encoded_roberta["validation"])
metrics_roberta = preds_output_roberta.metrics
print("RoBERTa Metrics:", metrics_roberta)

# Evaluate DistilBERT (reusing 'preds_output' from Section 4)
metrics_distilbert = preds_output.metrics
print("DistilBERT Metrics:", metrics_distilbert)

This code collects the metric dictionaries that we compare below.

# Compare metrics
print("Comparison of Metrics:")
#comparing the test accuracy.
print("Accuracy - DistilBERT:", metrics_distilbert["test_accuracy"], ", RoBERTa:", metrics_roberta["test_accuracy"])

This will give the output as follows:

Comparison between test accuracy

Similarly, you can compare other metrics, such as the F1 score, with the following code:

print("F1 Score - DistilBERT:", metrics_distilbert["test_f1"], ", RoBERTa:", metrics_roberta["test_f1"])
Output truncated due to screen size

5.2 Other Advantages of DistilBERT

DistilBERT’s scalability and ease of deployment make it particularly attractive, especially in resource-constrained environments. Remarkably, it achieves performance comparable to RoBERTa while requiring less time and resources for fine-tuning, thus proving itself as an efficient and effective tool for a wide array of NLP applications.
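
To put the resource argument in perspective, you can count each model's trainable parameters. This is a rough illustration; model_roberta is a hypothetical handle to the RoBERTa model fine-tuned in the linked repository.

# Rough size comparison: count trainable parameters
def count_params(m):
    return sum(p.numel() for p in m.parameters())

print(f"DistilBERT parameters: {count_params(model):,}")  # roughly 67M for this checkpoint
# print(f"RoBERTa parameters: {count_params(model_roberta):,}")  # roughly 125M for roberta-base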

6. Deploying the Model to Hugging Face

Submitting a model to Hugging Face requires saving the optimized model and uploading it to Hugging Face Model Hub. This allows other users to access and manipulate the model through Hugging Face’s API. Once the model is uploaded, it can be shared with others and integrated into applications and workflows. The Hugging Face Model Hub also provides version control and allows models to be updated and refined over time.

6.1 Create a Model Repository on Hugging Face

Head to the Hugging Face homepage. Click on Sign Up and follow the instructions to create a Hugging Face account. Then click on your profile, navigate to Settings, and create an Access Token. This will later be used to log in to your account and access a model repository.

Now let's create a model repository. Navigate to your profile, click "New Model," and create your repository.

Hugging Face Repository Window

6.2 Push Fine-tuned Model to Hugging Face

Before you push your model to your repository from your Colab notebook, you have to run the following commands.

# first install git
!apt-get install git -y
# This is to help save and cache your access token
!git config --global credential.helper store
# to login to hugging face
!huggingface-cli login

After executing the commands above, a prompt asking for your access token will appear. Visit your Hugging Face account, select "Access Tokens," copy your token, and paste it into the prompt. You can now access your repository. Next, run the following in your notebook to push the model and tokenizer to the Hugging Face Hub:

# push your model and tokenizer to hugging face. This code even creates a model card for you.
trainer.push_to_hub()

After running this code, you can check your model repository to confirm the changes. If you set your repository to 'public' when you created it, the model can be used by anyone.
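
For example, anyone can now load the model directly from the Hub (replace "your-username" with your own Hugging Face username):

from transformers import pipeline

# Load the fine-tuned model straight from the Hub and classify a new text
classifier = pipeline(
    "text-classification",
    model="your-username/distilbert-base-uncased-finetuned-emotion",
)
print(classifier("I am over the moon about these results!"))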

It is good practice to train more than one model (DistilBERT, BERT, RoBERTa) during a project so that you can choose the one that works best. Training on multiple datasets and tuning hyperparameters can also improve the model's accuracy and F1 score.

7. Conclusion

In conclusion, our journey through emotion classification, guided by the power of DistilBERT and Hugging Face, has showcased the immense potential of NLP in understanding and decoding human sentiments. In our comparison with RoBERTa, DistilBERT stood out with higher test accuracy while demanding far less time and fewer resources for fine-tuning. As technology continues to advance, emotion classification offers a gateway to valuable insights across various industries.
