STEP-BY-STEP GUIDE

Hugging Face DistilBERT & TensorFlow for Custom Text Classification.

How to fine-tune DistilBERT for binary text classification via the Hugging Face API for TensorFlow.

Galina Blokh
Geek Culture


Photo by Jason Leung on Unsplash

Intro.

In this tutorial, you will see a binary text classification implementation using the transfer learning technique. For this purpose, we will use DistilBERT, a pre-trained model from the Hugging Face Transformers library, and its API for TensorFlow.

Why DistilBERT.

Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT.

The header of this review article from Hugging Face on Medium fully explains why we should use this model for our task. We have a small data set, and this model is a nice first choice to try. Another article on Medium also suggests using DistilBERT as a fast baseline model: it can achieve a sensible lower bound on BERT's performance with the advantage of quicker training. The API we use is described in the very well written Hugging Face documentation and the TensorFlow blog. You will see how easy and intuitive it is to apply.

… a smaller general-purpose language representation model, called DistilBERT, which can then be fine-tuned with good performances on a wide range of tasks like its larger counterparts.

If you are still in doubt about which model to choose from the Hugging Face library, you can use their filter to select a model by task, library, language, etc. DistilBERT is the first model listed for the text classification task (a fine-tuned checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2). So we chose it!

A retrospective look at the data.

The data for the code example comes from my previous scraping project. I collected it from a recipe website and split it into train and test sets (split proportion 0.2). Both sets contain a column with the source text (you may read about the data set here or check this notebook) and a column with labels. The business goal is to determine, for each paragraph, whether its label is "ingredients" or "recipe instructions".

Let's install and import the libraries, and define constants for the model's hyperparameters:

!pip install transformers

import pandas as pd
import tensorflow as tf
import transformers
from transformers import DistilBertTokenizer
from transformers import TFDistilBertForSequenceClassification

pd.set_option('display.max_colwidth', None)

MODEL_NAME = 'distilbert-base-uncased-finetuned-sst-2-english'
MAX_LEN = 512  # maximum tokenized input length supported by DistilBERT (used later in predict_proba)
BATCH_SIZE = 16
N_EPOCHS = 3

At the moment, we are interested only in the "paragraph" and "label" columns. Look at the picture below (Pic.1): the text in "paragraph" is the source text, and it is in byte representation. The X_train set has 3898 rows and the X_test set has 973 rows. There are no NaNs or empty strings in these sets.

Pic.1 Load Train and Test data sets, a sample from X_train, shape check.

The target variable is “1” if the paragraph is “recipe ingredients” and “0” if it is “instructions”. The proportion of labels is about 20% ones and 80% zeroes. Now, let’s move on to the next step.
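As a minimal sketch of this loading step (the file names and storage format are my assumptions; the original notebook handles this part), the shape and label-proportion checks could look like this:

import pandas as pd

# hypothetical pickled splits produced by the scraping project
train_df = pd.read_pickle('train_recipes.pkl')
test_df = pd.read_pickle('test_recipes.pkl')

X_train, y_train = train_df['paragraph'], train_df['label']
X_test, y_test = test_df['paragraph'], test_df['label']

print(X_train.shape, X_test.shape)           # expected: (3898,) (973,)
print(y_train.value_counts(normalize=True))  # roughly 0.8 zeroes / 0.2 ones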

Prepare data as model input.

Before the text becomes a model input, we first have to tokenize it. The DistilBertTokenizer accepts text of type "str" (single example), "List[str]" (batch or single pretokenized example), or "List[List[str]]" (batch of pretokenized examples). Thus, we need to transform the byte representation into a string. A lambda function is a nice solution.

X_train = X_train.apply(lambda x: str(x[0], 'utf-8'))
X_test = X_test.apply(lambda x: str(x[0], 'utf-8'))
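A quick sanity check (not in the original notebook) confirms that the paragraphs are now plain strings:

# the paragraphs should now be plain Python strings
print(type(X_train.iloc[0]))  # <class 'str'>
print(X_train.iloc[0][:80])   # first 80 characters of the first paragraph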

The maximum supported input length in DistilBERT is 512 tokens.

# define a tokenizer object
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

# tokenize the text
train_encodings = tokenizer(list(X_train.values),
                            truncation=True,
                            padding=True)
test_encodings = tokenizer(list(X_test.values),
                           truncation=True,
                           padding=True)

The parameters we pass to the tokenizer are our set in "List[str]" representation, truncation=True, and padding=True. If a tokenized sentence is longer than the model's maximum input length, the tokenizer truncates it to that length. If it is shorter than the longest tokenized sentence in the batch, the tokenizer pads it with zeroes up to that length. In the picture below, you see an example of the result:

Pic.2 An example of a sentence tokenized by DistilBertTokenizer.
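If you want to reproduce such an example yourself, here is a small sketch that inspects the first tokenized training example (it relies on the standard input_ids and attention_mask keys of the tokenizer output):

# peek at the first tokenized training example
print(train_encodings['input_ids'][0][:12])       # token ids, starting with 101 ([CLS])
print(train_encodings['attention_mask'][0][:12])  # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(train_encodings['input_ids'][0][:12]))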

The DistilBertTokenizer refers to the superclass BertTokenizer. It returns a dictionary-like object holding the input ids and attention masks. Now we only need to turn our labels and encodings into a TensorFlow Dataset object:

train_dataset = tf.data.Dataset.from_tensor_slices((dict(train_encodings),
                                                    list(y_train.values)))
test_dataset = tf.data.Dataset.from_tensor_slices((dict(test_encodings),
                                                   list(y_test.values)))
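To confirm the dataset structure, you can inspect a single element (a quick check, not part of the original code):

# each element is a (features, label) pair
features, label = next(iter(train_dataset))
print(features.keys())  # dict_keys(['input_ids', 'attention_mask'])
print(label)            # a scalar tensor holding 0 or 1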

Fine-tuning with native TensorFlow.

In the next step, we take a TFDistilBertForSequenceClassification and pass the model's name as a parameter. We set a learning rate, define the loss function, compile the model, and run the model.fit() method for training.

# load the pre-trained model
model = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)

# choose the optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-5)

# define the loss function
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

# build the model
model.compile(optimizer=optimizer,
              loss=loss,
              metrics=['accuracy'])

# train the model (the dataset is already batched, so no batch_size argument is needed)
model.fit(train_dataset.shuffle(len(X_train)).batch(BATCH_SIZE),
          epochs=N_EPOCHS)

Below you can see how accurate our model is. By the second epoch, we already reach an accuracy of 100%:

>>> Epoch 1/3
>>> 244/244 [==============================] - 131s 374ms/step -
>>> loss: 0.1468 - accuracy: 0.9568
>>> Epoch 2/3
>>> 244/244 [==============================] - 95s 388ms/step -
>>> loss: 3.1370e-04 - accuracy: 1.0000
>>> Epoch 3/3
>>> 244/244 [==============================] - 97s 396ms/step -
>>> loss: 5.7763e-05 - accuracy: 1.0000

Model evaluation.

The Hugging Face API for TensorFlow has methods that are intuitive for any data scientist. Let's evaluate the model on the test set and on new, previously unseen data:

# model evaluation on the test set (again, the dataset is already batched)
model.evaluate(test_dataset.shuffle(len(X_test)).batch(BATCH_SIZE),
               return_dict=True)

>>> 61/61 [==============================] - 10s 147ms/step -
>>> loss: 1.7124e-05 - accuracy: 1.0000
>>> {'accuracy': 1.0, 'loss': 1.7123966244980693e-05}

We got pretty good results! Now, to estimate the model on other text paragraphs, we create a function that returns the prediction probability for each class (so we can see how confident the model is in its predictions):

def predict_proba(text_list, model, tokenizer):
    # tokenize the text
    encodings = tokenizer(text_list,
                          max_length=MAX_LEN,
                          truncation=True,
                          padding=True)
    # transform to tf.Dataset
    dataset = tf.data.Dataset.from_tensor_slices((dict(encodings)))
    # predict
    preds = model.predict(dataset.batch(1)).logits
    # transform to an array of probabilities
    res = tf.nn.softmax(preds, axis=1).numpy()
    return res

Here we take a .txt file that contains ten URLs to ten recipe pages. Our model hasn't seen the text from them yet. Assuming you took the data from the first URL, the list of strings you feed into the model for prediction will look like the cell below (a list whose first string contains the ingredients and whose following three strings contain instructions):

strings_list = ["""
1 pound green beans, trimmed
½ head radicchio, sliced into strips
Scant ¼ cup thinly sliced red onion
Honey Mustard Dressing, for drizzling
2 ounces goat cheese
2 tablespoons chopped walnuts
2 tablespoons sliced almonds
¼ cup tarragon
Flaky sea salt""",
"""Bring a large pot of salted water to a boil and set a bowl of ice water nearby. Drop the green beans into the boiling water and blanch for 2 minutes. Remove the beans and immediately immerse in the ice water long enough to cool completely, about 15 seconds. Drain and place on paper towels to dry.""",
"""Transfer the beans to a bowl and toss with the radicchio, onion, and a few spoonfuls of the dressing.""",
"""Arrange on a platter and top with small dollops of goat cheese, the walnuts, almonds, and tarragon. Drizzle with more dressing, season to taste with flaky salt, and serve."""]

When you call the predict_proba() function on this new data, the result will be a NumPy array with shape (4, 2): four rows (one for each paragraph), each with two probability values (for class 0 and class 1):

predict_proba(strings_list, model, tokenizer)

>>> array([[1.63417135e-05, 9.99983668e-01],
>>>        [9.99986053e-01, 1.39580325e-05],
>>>        [9.99986053e-01, 1.39833473e-05],
>>>        [9.99988914e-01, 1.11078716e-05]], dtype=float32)

The first paragraph gets a probability close to 1 for class 1 ("ingredients"), while the other three get a probability close to 1 for class 0 ("instructions"). The model is very confident in its predictions, and they are correct.
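If you prefer hard labels instead of probabilities, a small sketch on top of predict_proba (using the label convention 1 = "ingredients", 0 = "instructions" defined earlier) does the mapping:

import numpy as np

probs = predict_proba(strings_list, model, tokenizer)
# pick the most probable class and map it back to text labels
labels = np.where(np.argmax(probs, axis=1) == 1, 'ingredients', 'instructions')
print(labels)  # expected: ['ingredients' 'instructions' 'instructions' 'instructions']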

Conclusions.

In this article, you learned how to fine-tune DistilBERT, a pre-trained model from the Hugging Face Transformers library, using its TensorFlow API on a binary classification task with a small custom text data set.

The Google Colab notebook for this article.

A Hugging Face documentation page that collects resources around 🤗 Transformers developed by the community.


Galina Blokh is an NLP Data Scientist at EPAM, passionate about technologies and challenges. https://www.linkedin.com/in/galina-blokh/