Utilizing ChatGPT to Navigate the Usage of Hugging Face for Text Summarization

Yuan An, PhD
9 min read · Oct 18, 2023


This is a series of short tutorials about using Hugging Face. The table of contents is here.

For every data scientist, we now have access to two powerful toolsets: (1) ChatGPT for text and data processing, and (2) Hugging Face for its libraries of transformers and datasets. Despite ChatGPT's exceptional ability at text summarization, there are concerns about privacy and subscription limits. Hugging Face, on the other hand, provides a vast number of open-source transformers. However, building a tailored text summarization approach with Hugging Face requires sifting through its libraries. The good news is that we can use ChatGPT to browse the Hugging Face libraries and offer suggestions.

In this lesson, we will learn how to use ChatGPT as a technical assistant when we need to summarize text using models in Hugging Face.

Load a Data Set

We first asked ChatGPT to find the most popular dataset on Hugging Face for text summarization and to write Python code to download it.

ChatGPT browsed Hugging Face and found two datasets commonly used for text summarization tasks: the BillSum dataset and the CNN/DailyMail dataset. We then asked ChatGPT to write code to download the BillSum dataset.

!pip install datasets

from datasets import load_dataset

# Load the dataset
billsum = load_dataset("billsum", split="ca_test")
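
As a quick sanity check (an optional step we added, assuming the 'text' and 'summary' columns described on the BillSum dataset card), we looked at the size of the split and peeked at the first entry:

# Optional sanity check: confirm the split size and peek at the first entry
print(len(billsum))                  # number of examples in the ca_test split
print(billsum[0]['text'][:300])      # first 300 characters of the bill text
print(billsum[0]['summary'][:300])   # first 300 characters of the reference summary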

Find the Most Popular Model in Hugging Face for Text Summarization

We asked ChatGPT to find the most popular model in Hugging Face for text summarization. ChatGPT returned a list of models:

  • T5-small Fine-tuned Models
  • PEGASUS Model by Google
  • BART Model

We decided to use the PEGASUS model and asked ChatGPT to write code to use the model for text summarization.

!pip install transformers

import torch
from transformers import PegasusForConditionalGeneration, PegasusTokenizer


# Load the pre-trained PEGASUS model and tokenizer
model_name = 'google/pegasus-large' # You can replace this with 'google/pegasus-billsum' if it becomes available
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)
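
Since the import above brings in torch, we also added an optional step to run the model on a GPU when one is available (if you do this, remember to move the tokenized inputs to the same device later; everything below also works on CPU):

# Optional: use a GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
# Note: if the model is on the GPU, move inputs there too before calling generate(),
# e.g. inputs = {k: v.to(device) for k, v in inputs.items()}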

Since we were not familiar with the PegasusForConditionalGeneration model, we asked ChatGPT to tell us more about it. ChatGPT browsed the website and provided the following detailed explanation:

PegasusForConditionalGeneration is a class provided by the Hugging Face Transformers library for working with the PEGASUS model. PEGASUS stands for Pre-training with Extracted Gap-sentences for Abstractive SUmmariZation Sequence-to-sequence. It's a model designed specifically for text summarization tasks, and it's built on the Transformer architecture.

Here are some key points about PegasusForConditionalGeneration and the PEGASUS model:

Pre-training Technique: PEGASUS employs a novel pre-training objective called gap sentence generation. During pre-training, certain sentences are removed (masked) from the input text, and the model learns to generate these masked sentences. This pre-training strategy aligns well with the summarization task where the model needs to generate a concise summary from the input text.

Architecture: PEGASUS utilizes the Transformer’s encoder-decoder architecture, which is effective for sequence-to-sequence tasks like summarization. The encoder processes the input text, and the decoder generates the summary.

Conditional Generation: The term “Conditional Generation” in PegasusForConditionalGeneration refers to the model's ability to generate text conditionally based on the input text. The model generates a summary based on the context provided by the input text.

Implementation in Hugging Face: PegasusForConditionalGeneration is implemented in the Hugging Face Transformers library, which provides a high-level, easy-to-use interface for working with the PEGASUS model. With this class, you can load pre-trained PEGASUS models, fine-tune them on your data, and generate summaries from input text.

Usage: The PegasusForConditionalGeneration class provides methods for loading pre-trained models (from_pretrained) and generating text (generate); fine-tuning is typically done through the library's Trainer API. It also includes configurations for handling different variants of the PEGASUS model.

Tokenization: Similar to other models in the Transformers library, PEGASUS relies on tokenization to convert input text into a format that can be processed by the model. The library provides a separate tokenizer class (PegasusTokenizer) for this purpose.

Customization and Fine-tuning: You can fine-tune pre-trained PEGASUS models on your data using the PegasusForConditionalGeneration class. This is often done to adapt the model to specific summarization tasks or domains.

By using the PegasusForConditionalGeneration class, you can leverage the power of the PEGASUS model for text summarization tasks in a straightforward and Pythonic way.
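
ChatGPT's explanation mentions fine-tuning several times. We did not fine-tune PEGASUS in this lesson, but for completeness, here is a minimal fine-tuning skeleton (our own addition, not verified end to end; it assumes a recent version of the Transformers library that supports the Seq2SeqTrainer API and the text_target tokenizer argument, and the hyperparameters are placeholders):

# A minimal fine-tuning sketch (our own addition; not run in this lesson)
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, DataCollatorForSeq2Seq

def preprocess(batch):
    # Tokenize the bill text as the model input and the reference summary as the label
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=150, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_billsum = billsum.map(preprocess, batched=True, remove_columns=billsum.column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="pegasus-billsum-demo",   # hypothetical output directory
    per_device_train_batch_size=1,       # PEGASUS-large is big; keep the batch small
    num_train_epochs=1,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()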

When we ran the code to create the pre-trained model, an error indicated that the package 'SentencePiece' was missing. We had to install the package and restart the runtime (the install command is shown after ChatGPT's explanation below). We asked ChatGPT to explain the package 'SentencePiece'. ChatGPT provided detailed information as follows:

PegasusForConditionalGeneration, like many other models within the Hugging Face Transformers library, utilizes SentencePiece for tokenization. SentencePiece is a data-driven, unsupervised text tokenizer, and detokenizer mainly for Neural Network-based text processing tasks. Here are the reasons why SentencePiece is utilized with PegasusForConditionalGeneration:

Vocabulary Generation: SentencePiece helps in building a subword vocabulary from the training data. Unlike word-based tokenization, subword tokenization helps in handling a wide range of vocabulary, including out-of-vocabulary words, by breaking down words into smaller pieces.

Language Agnostic Tokenization: SentencePiece is designed to be language-agnostic and can handle multiple languages without the need for pre-segmented text, making it suitable for multilingual models like PEGASUS.

Effective Handling of Rare Words: In subword tokenization, rare or out-of-vocabulary words are broken down into smaller, known subwords. This is beneficial in text summarization tasks, especially when dealing with rare terms or names that may not appear in the pre-trained vocabulary.

Consistency in Tokenization across Models: Using SentencePiece provides a consistent tokenization approach across different models. This consistency is crucial when switching between different models or working with models pre-trained on different datasets but tokenized using SentencePiece.

Efficient and Compact Representation: SentencePiece tokenization results in a more compact representation of text, which can be beneficial in reducing the memory and computational requirements of the model.

Easy Integration with Hugging Face: SentencePiece has been well-integrated within the Hugging Face Transformers library, making it a convenient choice for tokenization when working with models like PEGASUS.

End-to-End Framework: SentencePiece provides an end-to-end, trainable tokenization framework. It allows for training a custom tokenizer on new data, which can be advantageous when working with domain-specific text or languages not well covered by existing pre-trained tokenizers.

The use of SentencePiece with PegasusForConditionalGeneration ensures that the text is tokenized effectively and consistently, which is crucial for achieving good performance in text summarization tasks.
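
For reference, the fix in our case was simply installing the missing package (followed by a runtime restart in Colab), after which the subword behavior described above is easy to see:

!pip install sentencepiece

# After restarting the runtime and re-running the cells above, the tokenizer loads
# without errors, and we can see how SentencePiece splits words into subword pieces:
print(tokenizer.tokenize("The appropriations bill was recommitted."))
# rarer words come back as several pieces (the exact pieces depend on the vocabulary)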

Great! We now had a better understanding of the model and the required packages. We asked ChatGPT to use the model to summarize a piece of text from the BillSum dataset. ChatGPT provided the following instructions:

1. Choose a text from the BillSum dataset to summarize:

# Assume the text to summarize is in the 'text' column of the first entry of the dataset
text_to_summarize = billsum[0]['text']

2. Tokenize the text:

# Tokenize the text
inputs = tokenizer(text_to_summarize, truncation=True, return_tensors="pt", max_length=512)

3. Generate the summary:

summary_ids = model.generate(inputs.input_ids, num_beams=4, length_penalty=2.0, max_length=150, min_length=40)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

4. Print the summary:

print(summary)

We were not familiar with the options in the tokenizer() function, so we asked ChatGPT to explain them in detail:

truncation=True: Truncation is the process of cutting down the text to a specified maximum length. By setting truncation=True, you are instructing the tokenizer to truncate the text if it exceeds the specified maximum length (max_length=512 in this case).

return_tensors="pt": This argument instructs the tokenizer to return the tokenized text as PyTorch tensors ("pt" stands for PyTorch). The other possible values for this argument are "tf" for TensorFlow tensors, and "np" for NumPy arrays. By default, the tokenizer returns tokenized text as lists, but since PEGASUS and other models in the Hugging Face library work with tensors, you need to specify return_tensors="pt" to get the tokenized text in the correct format.

max_length=512: This argument specifies the maximum length of the tokenized text. If the text has more tokens than this maximum length, it will be truncated down to this length if truncation=True. The value 512 is often used as it's a common maximum sequence length for many transformer models, but you may need to adjust this value based on your specific model and task.
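
To see the effect of these options, we also printed what the tokenizer returns (a small check we added; the exact token count depends on the input text):

# Inspect the tokenized output: a dictionary-like object of PyTorch tensors
print(inputs.keys())             # typically dict_keys(['input_ids', 'attention_mask'])
print(inputs.input_ids.shape)    # shape (1, n), with n <= 512 because of truncation
print(inputs.input_ids.dtype)    # integer token IDs (torch.int64)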

We were not familiar with the options in the model.generate() function, so we asked ChatGPT to explain them in detail:

num_beams=4: This argument sets the number of beams in beam search, which is a search algorithm used to generate text. Beam search keeps track of the num_beams most likely sequences at each step, which helps in finding a more optimal sequence of tokens compared to a greedy search. A higher value of num_beams will result in better quality summaries, but will also increase the time and computational resources required to generate the summary.

length_penalty=2.0: The length_penalty argument is used to control the length of the generated summary. A value greater than 1.0 encourages the model to generate longer sequences, while a value less than 1.0 encourages the model to generate shorter sequences.

max_length=150: This argument sets the maximum length of the generated summary. If the generated summary reaches this length, the generation process will stop.

min_length=40: This argument sets the minimum length of the generated summary. The generation process will continue until at least this length is reached.
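
To get a feel for these options, we also compared beam search against simple greedy decoding (an optional experiment we added; the other arguments are the same ones explained above):

# Greedy decoding (num_beams=1) for comparison with the beam-search summary above
greedy_ids = model.generate(inputs.input_ids, num_beams=1, max_length=150, min_length=40)
print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))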

We were also not familiar with the statement summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True), so we asked ChatGPT to explain it in detail:

decode Method: The decode method is used to convert a sequence of token IDs back into a string of text. This is the reverse process of tokenization.

summary_ids[0]: summary_ids is an array of token IDs generated by the model, where each token ID corresponds to a word or subword in the generated summary. summary_ids[0] refers to the first (and typically the only) sequence of token IDs in summary_ids, which represents the generated summary.

skip_special_tokens=True: This argument instructs the decode method to skip any special tokens when converting the token IDs back into text. Special tokens are tokens used by the model for specific purposes, such as marking the beginning or end of a sequence. By setting skip_special_tokens=True, you ensure that these special tokens are not included in the decoded text, resulting in a cleaner, more readable summary.
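
A quick way to see what skip_special_tokens does is to decode the same output twice (a small check we added):

# Decoding with and without special tokens; the second version keeps markers
# such as padding and end-of-sequence tokens in the output string
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
print(tokenizer.decode(summary_ids[0], skip_special_tokens=False))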

Evaluate the Summarization Result

After the model generated the summary, we asked ChatGPT to write code to evaluate the generated summary against the given summary. ChatGPT suggested using the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric, which is commonly used for evaluating text summarization.

ChatGPT told us to install the rouge library:

!pip install rouge

ChatGPT told us to import the Rouge class and compute the ROUGE scores as follows:

from rouge import Rouge

given_summary = billsum[0]['summary']

# Evaluate the generated summary against the given summary using the ROUGE metric
rouge = Rouge()

scores = rouge.get_scores(summary, given_summary, avg=True)

print(scores)

We got the following results:

{'rouge-1': {'r': 0.25773195876288657, 'p': 0.30864197530864196, 'f': 0.28089887144489334}, 
'rouge-2': {'r': 0.03896103896103896, 'p': 0.05357142857142857, 'f': 0.04511277707954149},
'rouge-l': {'r': 0.20618556701030927, 'p': 0.24691358024691357, 'f': 0.22471909616399455}}

Finally, we asked ChatGPT to explain and assess the results in reference to the state-of-the-art scores:

Here’s what each of the numbers in your results represents:

ROUGE-1 Scores:

Recall (r): 0.2577 - This is the proportion of unigrams (individual words) in the reference summary that also appear in the generated summary.

Precision (p): 0.3086 - This is the proportion of unigrams in the generated summary that also appear in the reference summary.

F1-score (f): 0.2809 - This is the harmonic mean of precision and recall, which provides a balanced measure of both properties.

ROUGE-2 Scores:

Recall (r): 0.0390 - This is the proportion of bigrams (two-word sequences) in the reference summary that also appear in the generated summary.

Precision (p): 0.0536 - This is the proportion of bigrams in the generated summary that also appear in the reference summary.

F1-score (f): 0.0451 - This is the harmonic mean of precision and recall for bigrams.

ROUGE-L Scores:

Recall (r): 0.2062 - This is the proportion of the longest common subsequence (LCS) between the reference and generated summaries.

Precision (p): 0.2469 - This is the proportion of the LCS in the generated summary.

F1-score (f): 0.2247 - This is the harmonic mean of precision and recall for the LCS.

Assessment:

The ROUGE-1 and ROUGE-L scores are relatively low, indicating that there might be room for improvement in the model’s ability to capture key information from the original text. The ROUGE-2 scores are notably lower, which suggests that the model may be struggling to capture consecutive word pairings found in the reference summaries.

When comparing these scores to state-of-the-art (SOTA) models, these scores are lower. As of the last few years, SOTA models on some summarization tasks have achieved ROUGE-1 F1-scores above 0.4 and ROUGE-2 F1-scores above 0.2. However, the exact SOTA scores can vary depending on the specific dataset and task.

The evaluation might be impacted by various factors including the quality of the dataset, the domain specificity of the text, the length of the summaries, and the training and fine-tuning of the model.
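
To make the ROUGE-1 definitions above concrete, here is a small hand-worked example on two toy sentences (our own illustration, not part of ChatGPT's output; the rouge library's implementation handles details such as averaging over multiple pairs that this sketch ignores):

from collections import Counter

reference = "the bill requires annual safety inspections".split()
generated = "the bill mandates annual inspections".split()

# Count unigrams that appear in both the generated and the reference summary
overlap = sum((Counter(generated) & Counter(reference)).values())   # 4 shared words

recall = overlap / len(reference)      # 4 / 6 ~= 0.67: how much of the reference is covered
precision = overlap / len(generated)   # 4 / 5 = 0.80: how much of the generation is relevant
f1 = 2 * precision * recall / (precision + recall)
print(recall, precision, f1)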

Terrific! We leveraged ChatGPT to guide us step-by-step to use a popular model in Hugging Face for text summarization. For the technical details that were unfamiliar to us, we also asked ChatGPT to explain them and learned from it.

Disclaimer: We used the ‘Browse with Bing’ function of ChatGPT for creating the content. We didn’t manually verify the quality and accuracy of the generated content. Users should exercise their own discretion when learning under ChatGPT’s guidance.

The Colab notebook is available here:


Yuan An, PhD

Faculty member in the College of Computing and Informatics at Drexel University; Doing research in NLP, Machine Learning, Ontology, Knowledge Graph, Embeddings