Fine-Tuning the Pre-Trained T5-Small Model in Hugging Face for Text Summarization
This is a series of short tutorials about using Hugging Face. The table of contents is here.
In the previous lesson 3.1, we learned how to use ChatGPT as a technical assistant to guide us in using datasets and models in Hugging Face for text summarization.
In this lesson, we will fine-tune the T5-small model on the California state bill subset of the BillSum dataset. We could also fine-tune other models, including Google’s PEGASUS model that we used in the previous lesson 3.1. However, for illustration, this tutorial only demonstrates the fine-tuning steps using the smaller model, t5-small.
Install Transformers and Datasets from Hugging Face
! pip install transformers datasets
Load the BillSum dataset from Hugging Face
Let us load the BillSum dataset from the Hugging Face datasets library.
from datasets import load_dataset
billsum = load_dataset("billsum", split="ca_test")
The loaded billsum dataset contains only one Dataset object. For fine-tuning and later evaluation, we should split the dataset into a train set and a test set with the train_test_split method:
billsum = billsum.train_test_split(test_size=0.2)
Each instance in the dataset is a dictionary with 3 keys:
- text: the text for summarization.
- summary: a given summary of the text.
- title: the title of the text.
Prepare the Data for Fine-Tuning the T5-Small Model
Text-To-Text Transfer Transformer (T5) is a pre-trained encoder-decoder model that handles all NLP tasks in a unified text-to-text format, where the input and output are always text strings. T5-Small is the checkpoint with about 60 million parameters.
For different tasks, we need to prepend different prefixes to the input to tell T5 what the task is. For example, if the task is to translate English text to German text, the prefix "translate English to German:" should be prepended to the input English text. For text summarization, we need to prepend the prefix "summarize:" to the input text.
To prepare the billsum data for fine-tuning the t5-small model, we will prepend the prefix 'summarize: ' to the text field of each instance. We then load the T5 tokenizer to process the text and summary fields.
The following code shows the prepending and tokenization for a single example:
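from transformers import AutoTokenizer
# Load the tokenizer that matches the t5-small checkpoint
tokenizer = AutoTokenizer.from_pretrained("t5-small")
example = billsum['train'][0]
pref_text = "summarize: " + example['text']
tokenized_text = tokenizer(pref_text)
tokenized_summary = tokenizer(example['summary'])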
We will create a function preprocess_function(examples) to preprocess the training and test data in batches. When it is called through map with batched=True, the function receives a batch of examples as the parameter examples. It returns a dictionary model_inputs with the fields generated by the tokenizer and an additional labels field holding the targets. The function performs the following actions:
- Prepend the prefix summarize: to each text field to tell the T5 model that the task at hand is summarization.
- Convert the input texts and summary labels into a tokenized format that can be processed by the T5 model.
- Set the max_length parameter to ensure that the tokenized inputs and labels do not exceed a certain length, truncating any text that is too long.
- Assign the tokenized labels to the labels field of model_inputs, which will be used during training to calculate the loss and optimize the model’s parameters.
The implementation of the function preprocess_function(examples) is available in the accompanying Colab Notebook (see its link below).
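For readers without the notebook at hand, the four actions listed above can be sketched roughly as follows. This is a minimal sketch and not necessarily the notebook's code: the factory make_preprocess_function and the limits max_input_length=1024 and max_target_length=128 are assumptions chosen for illustration. Only the prefix, the tokenizer calls, and the labels field come from the description above.

```python
def make_preprocess_function(tokenizer, prefix="summarize: ",
                             max_input_length=1024, max_target_length=128):
    """Build a batch preprocessing function bound to a given tokenizer.

    The helper name and the two length limits are illustrative assumptions.
    """
    def preprocess_function(examples):
        # Prepend the task prefix to every document in the batch.
        inputs = [prefix + doc for doc in examples["text"]]
        # Tokenize the inputs, truncating anything longer than the limit.
        model_inputs = tokenizer(inputs, max_length=max_input_length,
                                 truncation=True)
        # Tokenize the reference summaries as targets.
        labels = tokenizer(text_target=examples["summary"],
                           max_length=max_target_length, truncation=True)
        # The label token ids become the training targets.
        model_inputs["labels"] = labels["input_ids"]
        return model_inputs

    return preprocess_function
```

With the T5 tokenizer loaded, preprocess_function = make_preprocess_function(tokenizer) gives a function with the single-argument signature that the map method expects.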
We apply the preprocessing function to the entire dataset using the Hugging Face Datasets map method. We can speed up the map function by setting batched=True to process multiple elements of the dataset at once:
tokenized_billsum = billsum.map(preprocess_function, batched=True)
After the pre-processing, tokenized_billsum has the following structure:
DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 248
    })
})
The fine-tuning process will use the input_ids, attention_mask, and labels fields for training the model.
Define an Evaluation Metric for Training
We will use the ROUGE metric, which measures the overlap between a generated summary and a reference summary, to monitor the quality of the model during training. The Hugging Face Evaluate library provides an implementation for computing ROUGE scores, which we will load.
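To make the metric concrete, here is a simplified sketch of ROUGE-1, the unigram-overlap variant (ROUGE-2 uses bigrams, and ROUGE-L uses the longest common subsequence). The function rouge1_scores is a hypothetical helper written for this illustration. It assumes lowercase whitespace tokenization with no stemming, so its numbers will generally differ from those of the rouge_score package.

```python
from collections import Counter


def rouge1_scores(prediction, reference):
    """Simplified ROUGE-1: unigram overlap between two texts.

    Assumes lowercase whitespace tokenization and no stemming; this is an
    illustration, not the rouge_score implementation.
    """
    pred_counts = Counter(prediction.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: a word counts at most as often as it occurs in both.
    overlap = sum((pred_counts & ref_counts).values())
    precision = overlap / max(sum(pred_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge1_scores("the bill amends the tax code",
                       "the bill changes the state tax code")
print(scores)
```

In this toy pair, 5 of the 6 predicted words also occur in the 7-word reference, so precision is 5/6 and recall is 5/7.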
Install the evaluate and rouge_score packages:
!pip install evaluate rouge_score
Import evaluate and create a rouge instance:
import evaluate
rouge = evaluate.load("rouge")
To use the ROUGE scores during training, we will create a function compute_metrics that receives the predictions and labels as a single parameter eval_pred and calculates the ROUGE metric as follows:
- The eval_pred tuple is unpacked into predictions and labels.
- The np.where function is used to replace any instances of -100 in the labels array with the tokenizer’s pad_token_id, as -100 is used to mark tokens that should be ignored during loss calculation and cannot be decoded.
- The tokenizer’s batch_decode method is used to decode the tokenized predictions and labels back to text, skipping any special tokens like padding tokens.
- The compute method of rouge is called to calculate the ROUGE metric between the predictions and labels.
- The length of each prediction is calculated by counting the number of non-padding tokens, and the mean prediction length is added to the result dictionary under the key "gen_len".
- Finally, the values in the result dictionary are rounded to 4 decimal places for cleaner output, and the result is returned.
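The steps listed above can be sketched as follows. This is a hedged sketch rather than the notebook's exact code: the factory make_compute_metrics is an assumption introduced here so that the tokenizer and the rouge object are passed in explicitly, while the body follows the description.

```python
import numpy as np


def make_compute_metrics(tokenizer, rouge):
    """Build a compute_metrics function bound to a tokenizer and a ROUGE metric.

    The factory wrapper is an illustrative assumption; the inner function
    follows the steps described in the text.
    """
    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        predictions = np.asarray(predictions)
        labels = np.asarray(labels)
        # Decode the generated token ids back to text.
        decoded_preds = tokenizer.batch_decode(predictions,
                                               skip_special_tokens=True)
        # Replace the ignore index (-100) with the pad id so labels decode.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels,
                                                skip_special_tokens=True)
        result = rouge.compute(predictions=decoded_preds,
                               references=decoded_labels, use_stemmer=True)
        # Mean number of non-padding tokens per prediction.
        prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id)
                           for pred in predictions]
        result["gen_len"] = np.mean(prediction_lens)
        # Round every value to 4 decimal places for cleaner output.
        return {k: round(float(v), 4) for k, v in result.items()}

    return compute_metrics
```

Calling compute_metrics = make_compute_metrics(tokenizer, rouge) produces a function that takes only eval_pred, which is the form the trainer expects.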
Train for Fine-Tuning
For text summarization, we load the AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, and Seq2SeqTrainer classes from the Hugging Face transformers library:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
We load the t5-small model as follows:
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
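The trainer created later in this section also receives a data_collator, which is not defined in the text of this lesson. A common choice is DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model) from the transformers library, which pads each batch dynamically to its longest example. The pure-Python sketch below only illustrates that idea. The helper pad_batch is hypothetical and is not the library's implementation, which also returns tensors rather than lists.

```python
def pad_batch(features, pad_token_id=0, label_pad_token_id=-100):
    """Pad tokenized examples to the longest length found in the batch.

    Hypothetical helper illustrating dynamic padding. pad_token_id=0 matches
    the T5 tokenizer, and -100 is the label value ignored by the loss.
    """
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        # Inputs are padded with the pad token and masked out with zeros.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(f["attention_mask"] + [0] * n_pad)
        # Labels are padded with -100 so padding does not affect the loss.
        n_label_pad = max_label - len(f["labels"])
        batch["labels"].append(f["labels"] + [label_pad_token_id] * n_label_pad)
    return batch


batch = pad_batch([
    {"input_ids": [21, 22, 23], "attention_mask": [1, 1, 1], "labels": [7, 8]},
    {"input_ids": [31], "attention_mask": [1], "labels": [9]},
])
print(batch)
```

Padding per batch rather than to a global maximum keeps short batches small, which is why a collator is used instead of padding during preprocessing.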
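We define the training hyperparameters by creating an instance of Seq2SeqTrainingArguments. The output_dir parameter specifies where the model is saved; in the library version used for this tutorial it is a required parameter. Note that fp16=True requires a CUDA-capable GPU, and that newer Transformers releases rename evaluation_strategy to eval_strategy, so try the new name if the argument is rejected.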
training_args = Seq2SeqTrainingArguments(
    output_dir="my_fine_tuned_t5_small_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
)
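As a rough sanity check on these settings, the number of optimizer steps can be estimated from the training-set size shown earlier (989 rows). The calculation below is a sketch that assumes a single device and no gradient accumulation; with several GPUs or accumulation the numbers would be smaller.

```python
import math

train_rows = 989   # size of the training split shown earlier
batch_size = 16    # per_device_train_batch_size
epochs = 4         # num_train_epochs

# Assumes a single device and no gradient accumulation.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 62 248
```

Under those assumptions the progress bar of the training run should count up to about 248 steps.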
We create a Seq2SeqTrainer instance by passing the training arguments along with the model, dataset, tokenizer, data collator, and the compute_metrics function:
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)
Now, we train the model by calling the train() method:
trainer.train()
After training, we can save the model as:
trainer.save_model("my_fine_tuned_t5_small_model")
Use the Fine-Tuned Model to Summarize Text
Now we have fine-tuned the t5-small model on the billsum dataset. We can use it for inference.
We will use an example from the test dataset.
text = billsum['test'][100]['text']
text = "summarize: " + text
The simplest way to try out your fine-tuned model for inference is to use it in a pipeline() function. Create a pipeline object for summarization with the fine-tuned model, and pass the text to it:
from transformers import pipeline
summarizer = pipeline("summarization", model="my_fine_tuned_t5_small_model")
pred = summarizer(text)
Evaluate the result
We can compute the ROUGE scores for the predicted summary against the given reference summary as follows:
preds = [pred[0]['summary_text']]
labels = [billsum['test'][100]['summary']]
rouge.compute(predictions=preds, references=labels, use_stemmer=True)
We received the following results:
{'rouge1': 0.22745098039215686,
'rouge2': 0.05905511811023622,
'rougeL': 0.12156862745098039,
'rougeLsum': 0.1647058823529412}
Great!! We have fine-tuned a pre-trained model in Hugging Face for text summarization.
The colab notebook is available here:
The table of contents of the entire course is here: https://medium.com/@anyuanay/tutorials-on-working-with-hugging-face-models-and-datasets-a01dea1f1a81