Fine-tuning a Hugging Face Transformer Using the Keras API (TF)

Vishakh Prakash

Fine-tuning is a way to train your models faster: we take a pre-trained model and, like tuning a guitar, make minor tweaks to its weights and biases with our own dataset.

What are the benefits?

  • Better for the environment

Fine-tuning is more eco-friendly than training a model from scratch. Training any NLP model from scratch is resource-intensive, so we should use a more economical approach that also gives better results.

  • Faster to train

What do we mean by pre-trained? Some very friendly people have already trained a model for us, so when we start fine-tuning, the weights and biases are at a much better starting point than random initialization, and we only need to retrain a few extra layers if needed. This makes training the model really fast.

  • Abstraction

When we use a pre-trained model, we don't need to know its internal architecture, which saves us the hassle of designing one for simple tasks. You may still need to understand some of the internals if you want to add more layers or make major changes.

Before I jump into the explanation

Here is the dataset.

Here is the notebook.

Install the hugging face package

!pip install transformers

In Google Colab, run the above command to install the Hugging Face transformers package.

I love to split’em up

As a sadistic divorce lawyer would say: "You would be better off separate!"

We have to separate all the words in a sentence into tokens. For that, we first import an object called the tokenizer; the tokenizer class contains all the methods related to creating tokens.

Tokenizing means converting the words in a sentence into unique numbers, because models only understand numbers. We could create our own tokenizer if we were building the model from scratch, but since we are taking a pre-trained model, we use the corresponding tokenizer so that the words mean the same thing in our dataset as they did during pre-training.

from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

The transformer we are going to use is DistilBERT which is a lightweight counterpart of BERT, so we will download the corresponding tokenizer.

Next, we create an instance of the tokenizer, loading the pre-trained vocabulary with from_pretrained. 'distilbert-base-uncased' is the uncased checkpoint: the tokenizer lowercases all text, so it is case-insensitive.
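To get a feel for what the tokenizer produces, you can run a quick check. It returns token IDs plus an attention mask (the exact numbers below are illustrative, not real output):

sample = tokenizer("sarcasm is hard to detect")
print(sample)
# Something like: {'input_ids': [101, ..., 102], 'attention_mask': [1, 1, ..., 1]}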

Create the dataset

Import the dataset and create a DataFrame using pandas (don't forget to import pandas):

import pandas as pd

df = pd.read_json("/content/drive/MyDrive/Sarcasm_Headlines_Dataset.json")
df = df[['headline', 'is_sarcastic']]
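A quick look at the first few rows confirms the DataFrame has the two columns we expect, the headline text and a 0/1 sarcasm label (a small sanity check, not required for training):

print(df.head())
print(df['is_sarcastic'].value_counts())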

Now we will create a TensorFlow dataset. Instead of using NumPy arrays directly, it is better to use a tf.data.Dataset, as it provides a host of functionality: shuffling, batching, prefetching, and more.

import numpy as np
import tensorflow as tf

batch_size = 64

# Boolean mask: roughly 80% of the rows go to the training set
is_train = np.random.uniform(size=len(df)) < 0.8

train_raw = (
    tf.data.Dataset.from_tensor_slices((
        dict(tokenizer(list(df['headline'][is_train]), padding=True, truncation=True)),
        np.array(df['is_sarcastic'])[is_train],
    ))
    .shuffle(len(df))
    .batch(batch_size, drop_remainder=True)
    .prefetch(1)  # prefetch returns a new dataset, so it has to stay in the chain
)

test_raw = (
    tf.data.Dataset.from_tensor_slices((
        dict(tokenizer(list(df['headline'][~is_train]), padding=True, truncation=True)),
        np.array(df['is_sarcastic'])[~is_train],
    ))
    .shuffle(len(df))
    .batch(batch_size, drop_remainder=True)
    .prefetch(1)
)

The data is divided into two datasets: train_raw for training and test_raw for evaluation. Both have the same structure, with 80% of the data in train_raw and the rest in test_raw.
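If you want to verify the structure, pull one batch and inspect the shapes (the sequence length depends on the longest headline in the split, so the second dimension will vary):

for features, labels in train_raw.take(1):
    print(features['input_ids'].shape)  # (64, max_sequence_length)
    print(labels.shape)                 # (64,)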

The line that I think needs the most explanation is this one (shown here for the training split; the test split is identical except that it uses ~is_train):

from_tensor_slices((dict(tokenizer(list(df['headline'][is_train]), padding=True, truncation=True)), np.array(df['is_sarcastic'])[is_train]))

from_tensor_slices is the method used to create the dataset. It receives a tuple of the form (inputs, labels),

where the input is —

dict(tokenizer(list(df['headline'][is_train]), padding=True, truncation=True))

I will explain this line from the inside out —

First, the headlines are selected from the DataFrame at the indexes where is_train is True. The result is then converted to a list by passing it to list().

The list of headlines is then passed into the tokenizer we initialized. padding=True pads every sequence to the length of the longest one, and truncation=True cuts off anything beyond the model's maximum length, so all the tensors end up the same size.

We then convert the tokenizer's output into a plain dictionary. This step is really important: the tokenizer returns a BatchEncoding object, which is a subclass of dict, but Keras does not recognize it, so the model won't run unless it is converted to an ordinary dictionary.
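You can see the conversion for yourself (a minimal sketch; the two headlines are made up):

enc = tokenizer(["headline one", "a second headline"], padding=True, truncation=True)
print(type(enc))        # <class 'transformers.tokenization_utils_base.BatchEncoding'>
print(type(dict(enc)))  # <class 'dict'>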

The output is —

np.array(df['is_sarcastic'])[is_train]

Here the is_sarcastic column of the DataFrame is selected, which contains 1 if the headline is sarcastic and 0 if not. The column is converted to a NumPy array, and from this array only the indexes where is_train is True are selected.
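That last step is NumPy boolean mask indexing, which works like this (toy values for illustration):

labels = np.array([1, 0, 1, 0])
mask = np.array([True, False, True, True])
print(labels[mask])  # [1 1 0] -- only positions where the mask is True survive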

Load the Guns and fire

Look at the third import line: we load TFDistilBertForSequenceClassification, which is DistilBERT with a sequence-classification head on top. It supports multi-class classification, but here we will use it for binary classification (two labels).

The optimizer used here is Adam; you can use your own custom optimizer too. The loss function used is sparse categorical cross-entropy; you could also use binary cross-entropy loss.

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from transformers import TFDistilBertForSequenceClassification

num_epochs = 3
model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')

model.compile(
    optimizer=Adam(learning_rate=5e-5),
    loss=SparseCategoricalCrossentropy(from_logits=True),  # the model outputs raw logits
    metrics=["accuracy"],
)
model.fit(
    train_raw,
    validation_data=test_raw,
    epochs=num_epochs,
)
(Screenshot: training output while fine-tuning the model.)

Check your accuracy

To check the accuracy of the model, we evaluate it on test_raw:

model.evaluate(test_raw)
(Screenshot: evaluation output.)

You can also use the model to predict for yourself whether a line is sarcastic or not. One thing to notice is that the model outputs raw logits, which we pass through a softmax layer so that the probabilities add up to 1.

text = "man walks on his hand going to work"
text_tok = dict(tokenizer(text , truncation = True, padding = True, return_tensors = 'tf' ))
prediction = model.predict(text_tok)

logits = prediction.logits

probabilities = tf.nn.softmax(logits, axis = 1)
print(probabilities)
class_probability = probabilities[0]
clp = class_probability.numpy()
print(class_probability.numpy())

if clp[0] > clp[1]:
print("this line is not sarcastic")
else:
print("this line is sarcastic")

If you have any doubts, you can ask in the comments section. Criticism will be really helpful.
