Photo by Magdalena Smolnicka on Unsplash

Training an NLP Humor Model Using Habana Gaudi HPUs

Exploratory Data Analysis, Text Tokenization, and Model Training

Benjamin Consolvo
6 min read · Dec 9, 2022

--

In a world where negativity in speech and media is prominent, humor can lift the spirit. Machine learning and deep learning can produce powerful language models, but “[creating] a method or model to discover the structures behind humor, recognize humor … remains a challenge because of its subjective nature” (Jain, 2017). A proposed challenge is to teach a computer how to distinguish between humorous and non-humorous English statements. In this tutorial, I will walk you through the training of a binary text classification model to determine whether a statement is humorous or not, all while running on a Habana Gaudi HPU (Habana Processing Unit) accelerator.

Model Architecture

I use a distilled version of the BERT transformer-based model architecture, called DistilBERT. BERT stands for Bidirectional Encoder Representations from Transformers, and it is a deep learning model for natural language processing (NLP) that can be used for a variety of language tasks. You can find a brief description of the model on Hugging Face.

Hardware

I will perform training using a Habana Gaudi HPU accelerator, hosted on AWS (a dl1.24xlarge EC2 instance). According to Habana, the HPU beats comparable GPU-based instances by up to 40% on price/performance (see https://habana.ai/training/gaudi/). Similar to NVIDIA’s watch nvidia-smi command-line tool, we can see HPU memory usage with watch hl-smi (Figure 1).

Figure 1. The “watch hl-smi” command can be used to see the memory usage of Habana Gaudi HPUs.

Python Libraries

I want to briefly highlight the key Python libraries I am using to train the model:

  • The Habana SynapseAI fork of PyTorch, which looks and feels like stock PyTorch but is optimized for Habana Gaudi HPUs.
  • The Hugging Face transformers library, used to pull down the DistilBERT pretrained model and its associated configuration prior to training.
  • The optimum.habana library for setting up training, which is “the interface between the Transformers library and Habana’s Gaudi HPU.”

Now, we need to set the PyTorch device to the HPU:
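A minimal sketch of this step, assuming the Habana SynapseAI PyTorch packages are installed on the instance:

import torch
import habana_frameworks.torch.core as htcore  # loads the Habana PyTorch bridge (assumed installed)

# Point PyTorch at the HPU; tensors and models moved with .to(device) will run on Gaudi
device = torch.device("hpu")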

Exploratory Data Analysis (EDA)

Let’s look at the data. Keep in mind that similar processing and model building could be used for any binary text classification dataset. We are working with a humor/not humor dataset. The data consist of a piece of text and an associated true/false humor label, e.g.:

Text: “An open letter to state farm about climate denial”
Humor Label: False

The data come from the “200K Short Texts for Humor Detection” dataset. The goal is to train a model that obtains the highest possible F1 score (a combination of precision and recall) on an unseen test dataset at the fastest possible inference speed.

Now I load the data, converting the humor label to 1 (True) or 0 (False), and then split the data into training, validation, and test sets with the numpy split function:
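A sketch of this step is below; the CSV filename and the “text”/“humor” column names are my assumptions based on the dataset description, so adjust them to match your copy of the data:

import numpy as np
import pandas as pd

# Load the short-text humor dataset (file name is an assumption)
df = pd.read_csv("dataset.csv")

# Convert the boolean humor label to an integer: True -> 1, False -> 0
df["humor"] = df["humor"].astype(int)

# Shuffle, then split into roughly 80% train, 10% validation, 10% test with numpy
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
train_df, val_df, test_df = np.split(df, [int(0.8 * len(df)), int(0.9 * len(df))])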

Using the Python wordcloud library, we can create word clouds of the humorous and non-humorous words. If you create one yourself, keep in mind that I eliminated the default STOPWORDS (i.e., common, low-information words like “the”) to keep the word cloud a little more interesting. Figure 2 is a word cloud of the examples labelled as humorous. Figure 3 is the non-humorous word cloud.
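For reference, here is one way such a word cloud can be generated with the wordcloud library (the “humor” and “text” column names are again assumptions, and this is a sketch rather than the original plotting code):

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Join all humorous examples into one string and build the word cloud,
# filtering out common low-information words via STOPWORDS
humor_text = " ".join(train_df[train_df["humor"] == 1]["text"])
wc = WordCloud(stopwords=STOPWORDS, background_color="white", width=800, height=400)
wc.generate(humor_text)

plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()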

Figure 2. Most common humorous words represented in a word cloud.
Figure 3. Most common non-humorous words represented in a word cloud.

Text Tokenization

To train the NLP model, we must first tokenize the text data. Tokenization converts each text sample into a vector of integers, where each word, subword, or symbol is mapped to a number according to the tokenizer’s vocabulary.
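For example, the tokenizer matching the DistilBERT checkpoint can be loaded from Hugging Face (I am using AutoTokenizer here as one reasonable choice; it is not necessarily the exact call in the original code):

from transformers import AutoTokenizer

# Load the tokenizer that matches the "distilbert-base-uncased" checkpoint
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")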

Now that we have instantiated the tokenizer, we can convert the samples from text to vectors of integers. Each sample is padded to a fixed length (50 works here) so that the input vectors are consistent across all datasets and models; the model will not perform properly if the input vectors are different sizes. We can now tokenize the training, validation, and test data:
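A sketch of tokenizing the three splits, padding every sample to 50 tokens (the “text” column name is an assumption):

max_length = 50  # fixed vector length for all samples

train_encodings = tokenizer(list(train_df["text"]), truncation=True,
                            padding="max_length", max_length=max_length)
val_encodings = tokenizer(list(val_df["text"]), truncation=True,
                          padding="max_length", max_length=max_length)
test_encodings = tokenizer(list(test_df["text"]), truncation=True,
                           padding="max_length", max_length=max_length)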

Now that we have tokenized the data, we can create a class to convert the dataset into the torch.tensor format that PyTorch expects before training.
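One common way to write this class, following the pattern in the Hugging Face documentation (the class name here is my own, not necessarily the one in the original code):

import torch

class HumorDataset(torch.utils.data.Dataset):
    """Wraps the tokenizer output and labels as torch tensors for training."""

    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = HumorDataset(train_encodings, list(train_df["humor"]))
val_dataset = HumorDataset(val_encodings, list(val_df["humor"]))
test_dataset = HumorDataset(test_encodings, list(test_df["humor"]))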

Let’s explore the tokenized data to see the text converted into a vector:
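For instance:

# Print one tokenized sample: input_ids padded to length 50 plus the attention mask
sample = train_dataset[0]
print(sample["input_ids"])
print(sample["attention_mask"])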

We see in Figure 4 a torch tensor padded to length 50, as well as the corresponding attention mask. The attention mask simply tells the model to pay attention to the vector positions marked with 1’s and to ignore the positions marked with 0’s.

Figure 4. Tokenized data example output

Training Setup

Let’s first define our training arguments and metrics, load a pretrained model, and then start training. I begin by defining training arguments using the previously loaded GaudiTrainingArguments class from the optimum.habana framework. Of note here, I have provided the number of training epochs (25), the batch size (128, the number of text examples to handle at once in memory on the HPU), and the configuration from the optimum.habana framework.
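A sketch of these arguments is below. The output directory and the Gaudi configuration name are assumptions on my part, and the exact options may differ slightly between optimum.habana versions, so check the documentation for your installed release:

from optimum.habana import GaudiTrainingArguments

training_args = GaudiTrainingArguments(
    output_dir="output_results",          # where checkpoints are saved
    num_train_epochs=25,                  # number of passes over the training data
    per_device_train_batch_size=128,      # examples handled at once in HPU memory
    per_device_eval_batch_size=128,
    evaluation_strategy="epoch",          # run validation at regular intervals
    use_habana=True,                      # run on the Gaudi HPU
    use_lazy_mode=True,                   # Habana lazy execution mode
    gaudi_config_name="Habana/distilbert-base-uncased",  # assumed Gaudi config name
)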

I now define a metrics function so that during training we can use our previously defined validation dataset to measure the progress with inference at certain intervals during training. I am loading the F1 metric, because that is what we want to measure during model training.
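A minimal version of such a metrics function, here using the evaluate library (the original code may use a different metrics backend):

import evaluate

# Load the F1 metric used to track progress on the validation set
f1_metric = evaluate.load("f1")

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair produced by the trainer
    logits, labels = eval_pred
    predictions = logits.argmax(axis=-1)
    return f1_metric.compute(predictions=predictions, references=labels)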

Before launching the training, you can go to a command-line prompt and type watch hl-smi to monitor the HPUs during training.

Training

During training, we are fine-tuning a pretrained NLP model called “distilbert-base-uncased” (applying transfer learning). I first load the pretrained model and make sure to place it on the HPU device. Then, I define the trainer with the GaudiTrainer class, passing the training_args defined previously, the training dataset, the validation dataset, and the compute_metrics function. I am also using a timer to see how long the training takes:
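Putting the pieces together, a sketch of the training step might look like this (the exact GaudiTrainer arguments can vary slightly between optimum.habana versions):

import time
from transformers import AutoModelForSequenceClassification
from optimum.habana import GaudiTrainer

# Load the pretrained DistilBERT model with a 2-class classification head
# and place it on the HPU device selected earlier ("device" from the setup step)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)
model.to(device)

trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)

start = time.time()
trainer.train()
print(f"Training took {time.time() - start:.1f} seconds")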

We can see that the training loss decreases with more training steps, which is what we want (Figure 5). The model saves checkpoints to the output_results folder at several points during training.

Figure 5. Running fine-tuning training of DistilBERT model for humor text dataset.

In this example, I ran training on only one of the eight available HPUs because the dataset is small. However, parallel training is also possible (see the Hugging Face documentation here). Training 25 epochs on 11,700 examples took only a couple of minutes on the HPU accelerator, so in my case I did not need to run parallel training.

Inference on Test Dataset

Now, we can evaluate the model’s performance by doing inference on a previously unseen test dataset. We must set up the trainer as before, but this time with the test_dataset, which our model hasn’t seen before. We run trainer.evaluate() to measure the F1 score on the test dataset, the loss, and the speed at which it runs:
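A sketch of that evaluation step:

# Re-use the trained model, but point the trainer at the unseen test set
eval_trainer = GaudiTrainer(
    model=model,
    args=training_args,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)
results = eval_trainer.evaluate()
print(results)  # includes the eval F1 score, eval loss, and samples/second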

Figure 6 shows the output of the inference. Here we are showing an F1 score of 98%. A good F1 score generally would be 80% or above, so we have done really well with this model.

Figure 6. Running inference on an unseen dataset to measure the F1 score.

I encourage you to check out other Intel and Habana tools on the Habana developer site. The complete code can be found at this GitHub repository.


Benjamin Consolvo
Intel Analytics Software

AI Software Engineering Manager at Intel. I like to write on topics in AI to help other developers along their coding journey.