Text Classification with Transformers

Ashwin N
May 23, 2022


Introduction

Text classification is one of the most common tasks in NLP; it can be used for a broad range of applications, such as tagging customer feedback into categories or routing support tickets according to their language.

For example: Chances are that your email program’s spam filter is using text classification to protect your inbox from a deluge of unwanted junk!

Another common type of text classification is sentiment analysis, which aims to identify the polarity of a given text.

Now imagine that you are a data scientist who needs to build a system that can automatically identify emotional states such as “anger” or “joy” that people express about your company’s product on Twitter. In this article, we’ll tackle this task using a variant of BERT called DistilBERT.

  1. The main advantage of this model is that it achieves comparable performance to BERT, while being significantly smaller and more efficient.
  2. This enables us to train a classifier in a few minutes, and if you want to train a larger BERT model you can simply change the checkpoint of the pretrained model.

Note: A checkpoint corresponds to the set of weights that are loaded into a given transformer architecture.
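For instance (a purely illustrative sketch; any compatible checkpoint ID on the Hub would work), switching to a larger model later on just means changing the checkpoint string:

# Default checkpoint used throughout this article
model_ckpt = "distilbert-base-uncased"
# A larger alternative, if you want to trade speed for accuracy
# model_ckpt = "bert-base-uncased"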

This will also be our first encounter with three of the core libraries from the Hugging Face ecosystem: Datasets, Tokenizers, and Transformers. As shown in Figure 1–1 below, these libraries will allow us to quickly go from raw text to a fine-tuned model that can be used for inference on new tweets.

Figure 1–1. A typical pipeline for training transformer models with the Datasets, Tokenizers, and Transformers libraries

The Dataset

To build our emotion detector we will use a great dataset from an article that explored how emotions are represented in English Twitter messages. Unlike most sentiment analysis datasets, which involve just “positive” and “negative” polarities, this dataset contains six basic emotions: anger, fear, joy, love, sadness, and surprise. Given a tweet, our task will be to train a model that can classify it into one of these emotions.

We will use the Datasets library to download the data from the Hugging Face Hub.

from datasets import load_dataset

emotions = load_dataset("emotion")
train_ds = emotions["train"]
emotions
...
DatasetDict({
train: Dataset({
features: ['text', 'label'],
num_rows: 16000
})
validation: Dataset({
features: ['text', 'label'],
num_rows: 2000
})
test: Dataset({
features: ['text', 'label'],
num_rows: 2000
})
})

In the emotions dataset, the data type of the text column is string, while the label column is a special ClassLabel object that contains information about the class names and their mapping to integers. We can also access several rows with a slice:

print(train_ds[:5])
...
{'text': ['i didnt feel humiliated', 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake', 'im grabbing a minute to post i feel greedy wrong', 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property', 'i am feeling grouchy'], 'label': [0, 0, 3, 2, 3]}
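The labels in this output are plain integers. To see which emotion each integer corresponds to, we can print the features of the training split (a quick check added here for clarity; the exact representation may vary slightly across versions of the Datasets library):

print(train_ds.features)
...
{'text': Value(dtype='string', id=None), 'label': ClassLabel(names=['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'], id=None)}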

What if my dataset is not on the Hub? You can load your own CSV or text files into the Datasets format with load_dataset().
Example: load_dataset("csv", data_files="my_file.csv")

We can easily convert a Dataset to a pandas DataFrame, but the labels in the DataFrame are represented as integers, so let’s use the int2str() method of the label feature to create a new column in our DataFrame with the corresponding label names.

import pandas as pd

# Convert the training split to a pandas DataFrame
df = emotions["train"].to_pandas()

def label_int2str(row):
    return emotions["train"].features["label"].int2str(row)

df["label_name"] = df["label"].apply(label_int2str)
df.head()

Whenever you are working on text classification problems, it is a good idea to examine the distribution of examples across the classes. A dataset with a skewed class distribution might require a different treatment in terms of the training loss and evaluation metrics than a balanced one.
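As a quick way to inspect this for our data (a small addition; it assumes the df DataFrame created above and matplotlib for plotting), we can count the examples per class and plot them:

import matplotlib.pyplot as plt

# Count the number of tweets per emotion and plot them as a horizontal bar chart
df["label_name"].value_counts(ascending=True).plot.barh()
plt.title("Frequency of Classes")
plt.show()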

In this case, we can see that the dataset is heavily imbalanced; the joy and sadness classes appear frequently, whereas love and surprise are about 5–10 times rarer.

Transformer models have a maximum input sequence length that is referred to as the maximum context size. For applications using DistilBERT, the maximum context size is 512 tokens, which amounts to a few paragraphs of text. Hence, let us get a rough estimate of tweet lengths per emotion by looking at the distribution of words per tweet.

df["Words Per Tweet"] = df["text"].str.split().apply(len)
df.boxplot("Words Per Tweet", by="label_name", grid=False,
showfliers=False)
plt.suptitle("")
plt.xlabel("")
plt.show()

From the plot we see that for each emotion, most tweets are around 15 words long and the longest tweets are well below DistilBERT’s maximum context size. Texts that are longer than a model’s context size need to be truncated, which can lead to a loss in performance if the truncated text contains crucial information; in this case, it looks like that won’t be an issue.

From text to tokens — Subword Tokenization

The basic idea behind subword tokenization is to combine the best aspects of character and word tokenization. On the one hand, we want to split rare words into smaller units to allow the model to deal with complex words and misspellings. On the other hand, we want to keep frequent words as unique entities so that we can keep the length of our inputs to a manageable size. The main distinguishing feature of subword tokenization (as well as word tokenization) is that it is learned from the pretraining corpus using a mix of statistical rules and algorithms.

There are several subword tokenization algorithms that are commonly used in NLP, but let’s start with WordPiece, which is used by the BERT and DistilBERT tokenizers. The easiest way to understand how WordPiece works is to see it in action. Transformers provides a convenient AutoTokenizer class that allows you to quickly load the tokenizer associated with a pretrained model: we just call its from_pretrained() method, providing the ID of a model on the Hub or a local file path. Let’s start by loading the tokenizer for DistilBERT:

from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

Let’s examine how this tokenizer works by feeding it our simple “Tokenizing text is a core task of NLP.” example text:

text = "Tokenizing text is a core task of NLP."
encoded_text = tokenizer(text)
print(encoded_text)
...
{'input_ids': [101, 19204, 6026, 3793, 2003, 1037, 4563, 4708, 1997, 17953, 2361, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Now that we have the input_ids, we can convert them back into tokens by using the tokenizer’s convert_ids_to_tokens() method:

tokens = tokenizer.convert_ids_to_tokens(encoded_text.input_ids)
print(tokens)
...
['[CLS]', 'token', '##izing', 'text', 'is', 'a', 'core', 'task', 'of', 'nl', '##p', '.', '[SEP]']

We can observe three things here. First, some special [CLS] and [SEP] tokens have been added to the start and end of the sequence. These tokens differ from model to model, but their main role is to indicate the start and end of a sequence. Second, the tokens have each been lowercased, which is a feature of this particular checkpoint. Finally, we can see that “tokenizing” and “NLP” have been split into two tokens, which makes sense since they are not common words. The ## prefix in ##izing and ##p means that the preceding string is not whitespace; any token with this prefix should be merged with the previous token when you convert the tokens back to a string. The AutoTokenizer class has a convert_tokens_to_string() method for doing just that, so let’s apply it to our tokens:

print(tokenizer.convert_tokens_to_string(tokens))
...
[CLS] tokenizing text is a core task of nlp. [SEP]
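Before tokenizing the whole corpus, it is worth peeking at a few attributes of the tokenizer itself (a small aside; the values shown are those for the distilbert-base-uncased checkpoint):

# Vocabulary size, maximum context size, and the input fields the model expects
print(tokenizer.vocab_size)         # 30522
print(tokenizer.model_max_length)   # 512
print(tokenizer.model_input_names)  # ['input_ids', 'attention_mask']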

To tokenize the whole corpus, we’ll use the map() method of our DatasetDict object. To get started, the first thing we need is a processing function to tokenize our examples with:

def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)

This function applies the tokenizer to a batch of examples; padding=True will pad the examples with zeros to the size of the longest one in a batch, and truncation=True will truncate the examples to the model’s maximum context size. Below is the output for the first two records:

print(tokenize(emotions["train"][:2]))
...
{'input_ids': [[101, 1045, 2134, 2102, 2514, 26608, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [101, 1045, 2064, 2175, 2013, 3110, 2061, 20625, 2000, 2061, 9636, 17772, 2074, 2013, 2108, 2105, 2619, 2040, 14977, 1998, 2003, 8300, 102]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

Also note that in addition to returning the encoded tweets as input_ids, the tokenizer returns a list of attention_mask arrays. This is because we do not want the model to get confused by the additional padding tokens: the attention mask allows the model to ignore the padded parts of the input. Figure 3–1 provides a visual explanation of how the input IDs and attention masks are padded.

Figure 3–1. For each batch, the input sequences are padded to the maximum sequence length in the batch; the attention mask is used in the model to ignore the padded areas of the input tensors

emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)

By default, the map() method operates individually on every example in the corpus, so setting batched=True will encode the tweets in batches. Because we’ve set batch_size=None, our tokenize() function will be applied on the full dataset as a single batch. This ensures that the input tensors and attention masks have the same shape globally, and we can see that this operation has added new input_ids and attention_mask columns to the dataset.

emotions_encoded
...
DatasetDict({
train: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 16000
})
validation: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 2000
})
test: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask'],
num_rows: 2000
})
})

Training a Text Classifier

Models like DistilBERT are pretrained to predict masked words in a sequence of text. However, we can’t use these language models directly for text classification; we need to modify them slightly. To understand what modifications are necessary, let’s take a look at the architecture of an encoder-based model like DistilBERT, which is depicted in Figure 4–1.

Figure 4–1. The architecture used for sequence classification with an encoder-based transformer; it consists of the model’s pretrained body combined with a custom classification head

First, the text is tokenized and represented as one-hot vectors called token encodings. The size of the tokenizer vocabulary determines the dimension of the token encodings, and it usually consists of 20k–200k unique tokens. Next, these token encodings are converted to token embeddings, which are vectors living in a lower-dimensional space. The token embeddings are then passed through the encoder block layers to yield a hidden state for each input token. For the pretraining objective of language modeling, each hidden state is fed to a layer that predicts the masked input tokens. For the classification task, we replace the language modeling layer with a classification layer.
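To make the last step concrete, here is a minimal, purely illustrative Keras sketch of such a classification head sitting on top of a single hidden state (the hidden size of 768 and the 6 labels are assumptions matching DistilBERT and our dataset; this is not the exact head used by the library):

import tensorflow as tf

hidden_dim, num_labels = 768, 6  # assumed DistilBERT hidden size and our six emotions

# A tiny head: take the [CLS] hidden state, apply dropout, and project to class logits
cls_hidden_state = tf.keras.Input(shape=(hidden_dim,))
x = tf.keras.layers.Dropout(0.1)(cls_hidden_state)
logits = tf.keras.layers.Dense(num_labels)(x)
classification_head = tf.keras.Model(cls_hidden_state, logits)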

We have two options to train such a model on our Twitter dataset:

1. Transformers as Feature Extractors

Using a transformer as a feature extractor is fairly simple. As shown in Figure 4–2, we freeze the body’s weights during training and use the hidden states as features for the classifier. The advantage of this approach is that we can quickly train a small or shallow model. Such a model could be a neural classification layer or a method that does not rely on gradients, such as a random forest. This method is especially convenient if GPUs are unavailable, since the hidden states only need to be precomputed once.

Figure 4–2. In the feature-based approach, the DistilBERT model is frozen and just provides features for a classifier

1.a Using pretrained models

We will use another convenient auto class from Transformers called TFAutoModel. Similar to the AutoTokenizer class, TFAutoModel has a from_pretrained() method to load the weights of a pretrained model. Let’s use this method to load the DistilBERT checkpoint.

from transformers import TFAutoModel 
tf_model = TFAutoModel.from_pretrained(model_ckpt)

If a checkpoint only provides PyTorch weights, you can specify a from_pt=True argument to the TFAutoModel.from_pretrained() function, and the library will automatically download and convert the PyTorch weights for you.
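For example, the call might look like this (the checkpoint ID below is made up purely for illustration):

# Hypothetical checkpoint that only ships PyTorch weights;
# from_pt=True tells the library to convert them to TensorFlow
# tf_model = TFAutoModel.from_pretrained("some-pytorch-only-checkpoint", from_pt=True)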

1.b Extracting last hidden states

Let’s retrieve the last hidden states for a single string. The first thing we need to do is encode the string and convert the tokens to tensors. For example:

text = "this is a test"
inputs = tokenizer(text, return_tensors='tf')
inputs
...
{'input_ids': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[ 101, 2023, 2003, 1037, 3231, 102]])>, 'attention_mask': <tf.Tensor: shape=(1, 6), dtype=int32, numpy=array([[1, 1, 1, 1, 1, 1]])>}

As we can see, both the input_ids and attention_mask tensors have the shape [batch_size, n_tokens]. Now let us pass these tensors through the loaded model.

outputs = tf_model(inputs)
outputs
...
TFBaseModelOutput(last_hidden_state=<tf.Tensor: shape=(1, 6, 768), dtype=float32, numpy=
array([[[-0.15651299, -0.18619648, 0.05277675, ..., -0.11881144,
0.06620621, 0.5470156 ],
[-0.35751376, -0.64835584, -0.06178998, ..., -0.30401963,
0.35076842, 0.52206874],
[-0.27718487, -0.44594443, 0.18184263, ..., -0.0947793 ,
-0.00757531, 0.9958287 ],
[-0.28408554, -0.39167657, 0.37525558, ..., -0.21505763,
-0.11725216, 1.0526482 ],
[ 0.26608223, -0.5093635 , -0.31801343, ..., -0.42029822,
0.01444213, -0.21489497],
[ 0.9440607 , 0.01117252, -0.47139427, ..., 0.14394698,
-0.7287834 , -0.16194956]]], dtype=float32)>, hidden_states=None, attentions=None)

Depending on the model configuration, the output can contain several objects, such as the hidden states, losses, or attentions, arranged in a class similar to a namedtuple in Python. In our example, the model output is an instance of TFBaseModelOutput, and we can simply access its attributes by name. The current model returns only one attribute, which is the last hidden state, so let’s examine its shape.

outputs.last_hidden_state.shape
...
TensorShape([1, 6, 768])

Looking at the hidden state tensor, we see that it has the shape [batch_size, n_tokens, hidden_dim]. In other words, a 768-dimensional vector is returned for each of the 6 input tokens. Now we know how to get the last hidden state for a single string; let’s do the same for the whole dataset by creating a new hidden_state column that stores all these vectors. As we did with the tokenizer, we’ll use the map() method of DatasetDict to extract all the hidden states in one go. The first thing we need to do is wrap the previous steps in a processing function.

emotions_encoded.reset_format()

def extract_hidden_states(batch):
    # First convert text to tokens
    inputs = tokenizer(batch["text"], padding=True,
                       truncation=True, return_tensors='tf')
    # Extract last hidden states
    outputs = tf_model(inputs)
    # Return vector for [CLS] token
    return {"hidden_state": outputs.last_hidden_state[:, 0].numpy()}

We can then go ahead and extract the hidden states across all splits in one go.

emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True, batch_size=512)
emotions_hidden
...
DatasetDict({
train: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask', 'hidden_state'],
num_rows: 16000
})
validation: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask', 'hidden_state'],
num_rows: 2000
})
test: Dataset({
features: ['text', 'label', 'input_ids', 'attention_mask', 'hidden_state'],
num_rows: 2000
})
})

Notice that we set batch_size=512 in this case (the default is 1000) to avoid any out-of-memory errors.

As expected, applying the extract_hidden_states() function has added a new hidden_state column to our dataset. The next step is to train a classifier on these hidden states. To do that, we’ll need a feature matrix. Let’s take a look.

1.c Creating feature matrix

The preprocessed dataset now contains all the information we need to train a classifier on it. We will use the hidden states as input features and the labels as targets. We can easily create the corresponding arrays in the well-known Scikit-learn format as follows.

import numpy as np

X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])
X_train.shape, X_valid.shape
...
((16000, 768), (2000, 768))

Let us use the hidden states to train a logistic regression model with Scikit-learn.

from sklearn.linear_model import LogisticRegression 
# We increase `max_iter` to guarantee convergence
lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)
lr_clf.score(X_valid, y_valid)
...
0.6315
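To put this accuracy into context (a small addition to the original walkthrough), we can compare it against a simple majority-class baseline with scikit-learn’s DummyClassifier:

from sklearn.dummy import DummyClassifier

# A baseline that always predicts the most frequent class in the training set
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
print(dummy_clf.score(X_valid, y_valid))  # noticeably lower than the logistic regression score

On this imbalanced dataset the majority-class baseline sits well below the logistic regression score, so the frozen DistilBERT features are already doing useful work.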

In the next section we will explore the fine-tuning approach, which leads to superior classification performance. But first, let us plot the confusion matrix for the feature-based classifier.

from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_cm(y_true, y_pred, figsize=(15, 15)):
    cm = confusion_matrix(y_true, y_pred, labels=np.unique(y_true))
    cm_sum = np.sum(cm, axis=1, keepdims=True)
    cm_perc = cm / cm_sum.astype(float) * 100
    annot = np.empty_like(cm).astype(str)
    nrows, ncols = cm.shape
    for i in range(nrows):
        for j in range(ncols):
            c = cm[i, j]
            p = cm_perc[i, j]
            if i == j:
                s = cm_sum[i]
                annot[i, j] = '%.1f%%\n%d/%d' % (p, c, s)
            elif c == 0:
                annot[i, j] = ''
            else:
                annot[i, j] = '%.1f%%\n%d' % (p, c)
    cm = pd.DataFrame(cm, index=np.unique(y_true), columns=np.unique(y_true))
    cm.index.name = 'Actual'
    cm.columns.name = 'Predicted'
    fig, ax = plt.subplots(figsize=figsize)
    sns.heatmap(cm, cmap="YlGnBu", annot=annot, fmt='', ax=ax)

# Predict on the validation set with the logistic regression model
y_preds = lr_clf.predict(X_valid)

df_eval = pd.DataFrame({'y_true': y_valid, 'y_preds': y_preds})
df_eval['y_true'] = df_eval['y_true'].apply(label_int2str)
df_eval['y_preds'] = df_eval['y_preds'].apply(label_int2str)
plot_cm(df_eval['y_true'], df_eval['y_preds'])

We can see that anger and fear are most often confused with sadness, and that love and surprise are frequently mistaken for joy.

2. Fine-Tuning Transformers

With the fine-tuning approach we do not use the hidden states as fixed features, but instead train them, as shown in Figure 4–3. This requires the classification head to be differentiable, which is why this method usually uses a neural network for classification.

Figure 4–3. When using the fine-tuning approach the whole DistilBERT model is trained along with the classification head.

Training the hidden states that serve as inputs to the classification model will help us avoid the problem of working with data that may not be well suited for the classification task. Instead, the initial hidden states adapt during training to decrease the model loss and thus increase its performance. Since we are working with the TensorFlow version of DistilBERT, we’ll use the Keras API to set up the training loop. Let’s look at the ingredients we need.

2.a Loading a pretrained model

Let us first load DistilBERT as a TensorFlow model. The TFAutoModelForSequenceClassification model has a classification head on top of the pretrained model outputs, which can be easily trained together with the base model. We just need to specify how many labels the model has to predict (6 in our case).

from transformers import TFAutoModelForSequenceClassification

tf_model = (TFAutoModelForSequenceClassification
            .from_pretrained(model_ckpt, num_labels=6))

You will see a warning that some parts of the model are randomly initialized. This is normal since the classification head has not yet been trained.

Next, we’ll convert our datasets into the tf.data.Dataset format. Because we have already padded our tokenized inputs, we can do this conversion easily by applying the to_tf_dataset() method to emotions_encoded.

from transformers import DataCollatorWithPadding

tokenizer_columns = tokenizer.model_input_names

# Define a batch size
batch_size = 512
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")

tf_train_dataset = emotions_encoded["train"].to_tf_dataset(
    columns=tokenizer_columns, label_cols=["label"], shuffle=True,
    batch_size=batch_size, collate_fn=data_collator)
tf_eval_dataset = emotions_encoded["validation"].to_tf_dataset(
    columns=tokenizer_columns, label_cols=["label"], shuffle=False,
    batch_size=batch_size, collate_fn=data_collator)

tf_train_dataset
...
<PrefetchDataset shapes: ({input_ids: (512, None), attention_mask: (512, None)}, (512,)), types: ({input_ids: tf.int64, attention_mask: tf.int64}, tf.int64)>

To be able to build batches, the data collator applies some processing (such as padding each batch to a common length) along with batching.
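As a quick illustration of what the collator does (the token IDs are adapted from the earlier “this is a test” example; this snippet is just for inspection):

# Two examples of different lengths; the collator pads them to a common length
features = [{"input_ids": [101, 2023, 102]},
            {"input_ids": [101, 2023, 2003, 1037, 3231, 102]}]
batch = data_collator(features)
print(batch["input_ids"].shape)  # (2, 6) after padding, with a matching attention_mask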

Here we have also shuffled the training set, and defined the batch size for it and the validation set. The last thing to do is compile and train the model.

import tensorflow as tf

tf_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=5e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=tf.metrics.SparseCategoricalAccuracy())
tf_model.fit(tf_train_dataset, validation_data=tf_eval_dataset, epochs=2)

loss, eval_accuracy = tf_model.evaluate(tf_eval_dataset)
print("Loss: {}\t Test Accuracy: {}".format(loss, eval_accuracy))
...
Loss: 0.14925077557563782 Test Accuracy: 0.9330000281333923

2.b Prediction on evaluation set

Let us predict on tf_eval_dataset.

output_logits = tf_model.predict(tf_eval_dataset).logits
pred_labels = np.argmax(output_logits, axis=-1)
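As a quick sanity check (an addition here), the accuracy computed from these predictions should match the value reported by tf_model.evaluate() above; y_valid from the feature-extraction section holds the same validation labels in the same order:

from sklearn.metrics import accuracy_score

# Compare predicted labels against the true validation labels
print(accuracy_score(y_valid, pred_labels))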

Finally, we create a DataFrame with the texts and the predicted/true labels.

emotions_encoded.set_format("pandas") 
cols = ["text", "label", "predicted_label", "loss"]
df_test = emotions_encoded["validation"][:][cols]
df_test["label"] = df_test["label"].apply(label_int2str)
df_test["predicted_label"] = (df_test["predicted_label"] .apply(label_int2str))
df_test.head(4)

With the predictions, we can plot the confusion matrix again.

plot_cm(df_test.label, df_test.predicted_label)

This is much closer to the ideal diagonal confusion matrix. The love category is still often confused with joy, which seems natural. surprise is also frequently mistaken for fear. Overall the performance of the model seems quite good.

Conclusion

Congratulations, you now know how to train a transformer model to classify the emotions in tweets! We have seen two complementary approaches based on features and fine-tuning, and investigated their strengths and weaknesses.

References

Natural Language Processing with Transformers

Code Link on Github: https://github.com/ashushekar/Transformers.git
