Fine-Tuning the Pre-Trained T5-Small Model in Hugging Face for Text Summarization

Yuan An, PhD
5 min read · Oct 22, 2023


This is a series of short tutorials about using Hugging Face. The table of contents is here.

In the previous lesson 3.1, we learned how to use ChatGPT as a technical assistant to guide us in using datasets and models in Hugging Face for text summarization.

In this lesson, we will fine-tune the T5-small model on the California state bill subset of the BillSum dataset. We could also fine-tune other models, including Google’s PEGASUS model that we used in the previous lesson 3.1. However, for illustration, we will only demonstrate the fine-tuning steps with the smaller t5-small model in this tutorial.

Install Transformers and Datasets from Hugging Face

! pip install transformers datasets

Load the BillSum dataset from Hugging Face

Let us load the BillSum dataset from the Hugging Face datasets library.

from datasets import load_dataset

billsum = load_dataset("billsum", split="ca_test")

The loaded billsum dataset has only one Dataset object. For fine-tuning and later evaluation, we should split it into train and test sets with the train_test_split method:

billsum = billsum.train_test_split(test_size=0.2)

Each instance in the dataset is a dictionary with 3 keys (we inspect an example right after this list):

  • text: the text for summarization.
  • summary: a given summary of the text.
  • title: the title of the text.
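To confirm this structure, we can peek at the first training record (the slicing below is only to keep the printout short):

# Peek at the first training record to confirm the three fields.
example = billsum["train"][0]

print(example.keys())
print(example["title"])
print(example["text"][:200])     # first 200 characters of the bill text
print(example["summary"][:200])  # first 200 characters of the reference summary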

Prepare the Data for Fine-Tuning the T5-Small Model

Text-To-Text Transfer Transformer (T5) is a pre-trained encoder-decoder model that handles all NLP tasks in a unified text-to-text format where the input and output are always text strings. T5-Small is the checkpoint with 60 million parameters.

For different tasks, we need to prepend different prefixes to the input to tell T5 what the task is. For example, if the task is to translate English text to German, the prefix translate English to German: should be prepended to the input English text. For text summarization, we need to prepend the prefix summarize: to the input text.

To prepare the billsum data for fine-tuning the t5-small model, we will prepend the prefix summarize: to the text field of each instance. We then load the T5 tokenizer to process text and summary.

The following code shows loading the tokenizer, prepending the prefix, and tokenizing a single example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")

example = billsum['train'][0]
pref_text = "summarize: " + example['text']

tokenized_text = tokenizer(pref_text)
tokenized_summary = tokenizer(example['summary'])

We will create a function preprocess_function(examples) to preprocess the training and test data in batches. The function receives a batch of examples as the parameter examples. It returns a dictionary model_inputs with the fields generated by the tokenizer and an additional labels field as the targets. The function performs the following actions:

  • Prepend the prefix summarize: to each text field to tell the T5 model that the task at hand is summarization.
  • Convert the input texts and summary labels into a tokenized format that can be processed by the T5 model.
  • Set the max_length parameter to ensure that the tokenized inputs and labels do not exceed a certain length, truncating any text that is too long.
  • Assign the tokenized labels to the labels field of model_inputs, which will be used during training to calculate the loss and optimize the model’s parameters.

The implementation of the function preprocess_function(examples) is available in the accompanying Colab notebook (see its link below).
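For reference, here is a minimal sketch of such a function following the standard Hugging Face summarization recipe; the max_length values (1024 for the inputs, 128 for the summaries) are illustrative choices, and the notebook’s implementation may differ in details:

def preprocess_function(examples):
    # Prepend the task prefix so T5 knows the task is summarization.
    inputs = ["summarize: " + doc for doc in examples["text"]]
    model_inputs = tokenizer(inputs, max_length=1024, truncation=True)

    # Tokenize the reference summaries; they become the training labels.
    labels = tokenizer(text_target=examples["summary"], max_length=128, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs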

We apply the preprocessing function to the entire dataset using the Hugging Face Datasets map method. We can speed up the map function by setting batched=True to process multiple elements of the dataset at once:

tokenized_billsum = billsum.map(preprocess_function, batched=True)

After the pre-processing, the tokenized_billsum has the following structure:

DatasetDict({
    train: Dataset({
        features: ['text', 'summary', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 989
    })
    test: Dataset({
        features: ['text', 'summary', 'title', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 248
    })
})

The fine-tuning process will use the input_ids and the labels fields for training the model.

Define an Evaluation Metric for Training

We will use the ROUGE metric, which measures the overlap between a generated summary and a reference summary, to track progress during training. The Hugging Face Evaluate library provides a method for computing ROUGE scores, which we will load below.

Install the evaluate and rouge_score packages:

!pip install evaluate rouge_score

Import evaluate and create a rouge instance:

import evaluate

rouge = evaluate.load("rouge")

To use the ROUGE scores during training, we will create a function compute_metrics that receives the predictions and labels as a tuple eval_pred and calculates the ROUGE metric as follows (a minimal sketch follows the list):

  • The eval_pred tuple is unpacked into predictions and labels.
  • The np.where function is used to replace any instances of -100 in the labels array with the tokenizer’s pad_token_id, because -100 marks tokens that should be ignored during loss calculation and cannot be decoded.
  • The tokenizer’s batch_decode method is used to decode the tokenized predictions and the cleaned labels back to text, skipping special tokens such as padding.
  • The rouge instance’s compute method is called to calculate the ROUGE metric between the decoded predictions and labels.
  • The length of each prediction is calculated by counting the number of non-padding tokens, and the mean prediction length is added to the result dictionary under the key "gen_len".
  • Finally, the values in the result dictionary are rounded to 4 decimal places for cleaner output, and the result is returned.
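A minimal sketch of such a compute_metrics function, assuming the tokenizer and rouge objects created above (the notebook’s version may differ in details):

import numpy as np

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Decode the generated token ids back to text.
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    # Replace -100 (positions ignored by the loss) with the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Compute ROUGE between the decoded predictions and reference summaries.
    result = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    # Track the average generated length in non-padding tokens.
    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}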

Train for Fine-Tuning

For text summarization, we load AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, and Seq2SeqTrainer classes from the Hugging Face transformers library:

from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer

We load the t5-small model as follows:

model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
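The trainer created below also expects a data_collator to batch the tokenized examples. A standard choice for summarization (and presumably what the accompanying notebook uses) is DataCollatorForSeq2Seq, which dynamically pads the inputs and labels in each batch:

from transformers import DataCollatorForSeq2Seq

# Dynamically pad inputs and labels to the longest sequence in each batch.
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)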

We define the training hyperparameters by creating an instance of Seq2SeqTrainingArguments. The only required parameter is output_dir, which specifies where to save the model.

training_args = Seq2SeqTrainingArguments(
    output_dir="my_fine_tuned_t5_small_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=4,
    predict_with_generate=True,
    fp16=True,
)

We create a Seq2SeqTrainer instance by passing the training arguments along with the model, dataset, tokenizer, data collator, and the compute_metrics function:

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

Now, we train the model by calling the train() method:

trainer.train()

After training, we can save the model as:

trainer.save_model("my_fine_tuned_t5_small_model")

Use the Fine-Tuned Model to Summarize Text

Now we have fine-tuned the t5-small model on the billsum dataset. We can use it for inference.

We will use an example from the test dataset.

text = billsum['test'][100]['text']
text = "summarize: " + text

The simplest way to try out the fine-tuned model for inference is to use it in a pipeline() function. Create a summarization pipeline with the fine-tuned model and pass the text to it:

from transformers import pipeline

summarizer = pipeline("summarization", model="my_fine_tuned_t5_small_model")
pred = summarizer(text)

Evaluate the Result

We can compute the ROUGE scores for the predicted summary against the reference summary as follows:

preds = [pred[0]['summary_text']]

labels = [billsum['test'][100]['summary']]

rouge.compute(predictions=preds, references=labels, use_stemmer=True)

We received the following results:

{'rouge1': 0.22745098039215686,
'rouge2': 0.05905511811023622,
'rougeL': 0.12156862745098039,
'rougeLsum': 0.1647058823529412}

Great!! We have fine-tuned a pre-trained model in Hugging Face for text summarization.

The Colab notebook is available here:

