Multi-Document Summarization with BART

Ashwin N
12 min read · Jun 6, 2022


1. Introduction

Summarization is a central problem in Natural Language Processing with increasing applications as the desire to receive content in a concise and easily-understood format increases. Recent advances in neural methods for text summarization have largely been applied in the setting of single-document news summarization and headline generation. At one point or another, you’ve probably needed to summarize a document, be it a research article, a financial earnings report, or a thread of emails. If you think about it, this requires a range of abilities, such as understanding long passages, reasoning about the contents, and producing fluent text that incorporates the main topics from the original document. Moreover, accurately summarizing a news article is very different from summarizing a legal contract, so being able to do so requires a sophisticated degree of domain generalization.

For these reasons, text summarization is a difficult task for neural language models, including transformers. Despite these challenges, text summarization offers the prospect for domain experts to significantly speed up their workflows and is used by enterprises to condense internal knowledge, summarize contracts, automatically generate content for social media releases, and more.

To help you understand the challenges involved with summarization, let us fine-tune a pretrained transformer model to summarize documents. In short, summarization is a classic sequence-to-sequence (seq2seq) task with an input text and a target text. This is where encoder-decoder transformers excel.

Now let us begin by taking a look at one of the canonical datasets for summarization, the Multi-News dataset.

2. The Dataset

Multi-News consists of news articles and human-written summaries of these articles from the site newser.com. Each summary is professionally written by editors and includes links to the original articles cited. The dataset contains 56,216 article–summary pairs and is notably the first large-scale dataset for multi-document summarization (MDS) on news articles. The summaries are notably long, about 260 words on average. While compressing information into a shorter text is the goal of summarization, this dataset tests the ability of abstractive models to generate text that is fluent and concise in meaning while also coherent across its generally longer output, which makes it an interesting challenge.

Table 1: An example from our multi-document summarization (MDS) dataset showing the input documents and their summary. Content found in the summary is color-coded.

Now let us look at the features available in this dataset.

from datasets import load_dataset

dataset = load_dataset("multi_news")
print(f"Features: {dataset['train'].column_names}")
...
Features: ['document', 'summary']

The dataset has two columns: document, which contains news articles from multiple sources as shown in Table 1, and summary, which holds the human-written summary covering all these sources.
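If you also want to check how large each split is, printing the DatasetDict shows the available splits and their row counts (for Multi-News these are roughly 45k training, 5.6k validation, and 5.6k test examples):

# Show the train/validation/test splits and their sizes
print(dataset)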

Let us explore one sample:

sample = dataset["train"][1]
print(f"""Document (excerpt of 2000 characters, total length: {len(sample["document"])}):""")
print(sample["document"][:2000])
print(f'\nSummary (length: {len(sample["summary"])}):')
print(sample["summary"])
...
Document (excerpt of 2000 characters, total length: 5389):
LOS ANGELES (AP) — In her first interview since the NBA banned her estranged husband, Shelly Sterling says she will fight to keep her share of the Los Angeles Clippers and plans one day to divorce Donald Sterling.

(Click Prev or Next to continue viewing images.) .....
Summary (length: 501):
– Shelly Sterling plans "eventually" to divorce her estranged husband Donald, she tells Barbara Walters at ABC News. As for her stake in the Los Angeles Clippers, she plans to keep it, the AP notes. Sterling says she would "absolutely" fight any NBA decision to force her to sell the team. The team is her "legacy" to her family, she says. "To be honest with you, I'm wondering if a wife of one of the owners … said those racial slurs, would they oust the husband? Or would they leave the husband in?"

We see that the articles can be very long compared to the target summary. Long articles pose a challenge to most transformer models since the context size is usually limited to 1024 tokens or so, which is equivalent to a few paragraphs of text. The standard, yet crude way to deal with this for summarization is to simply truncate the texts beyond the model’s context size. Obviously there could be important information for the summary toward the end of the text, but for now we need to live with this limitation of the model architectures.

3. Training a BART Summarization Model

3.a Introduction to BART

BART (Bidirectional and Auto-Regressive Transformers) is a sequence-to-sequence model trained as a denoising autoencoder. This means that a fine-tuned BART model can take a text sequence (for example, English) as input and produce a different text sequence as output (for example, French). This type of model is relevant for machine translation (translating text from one language to another), question answering (producing answers for a given question on a specific corpus), text summarization (giving a summary of or paraphrasing a long text document), or sequence classification (categorizing input text sentences or tokens). Another task is sentence entailment, which, given two or more sentences, evaluates whether the sentences are logical extensions of or logically related to a given statement.

BART was trained as a denoising autoencoder, so the training data includes “corrupted” or “noisy” text, which is mapped back to the clean, original text. So what exactly counts as “noisy” for text data? The authors of BART settle on using some existing and some new noising techniques for pretraining. The noising schemes they use are Token Masking, Token Deletion, Text Infilling, Sentence Permutation, and Document Rotation. However, not all transformations are employed in training the final BART model. Based on a comparative study of pre-training objectives, the authors use only the text infilling and sentence permutation transformations, with about 30% of tokens being masked and all sentences permuted.
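To get an intuition for these two noising schemes, here is a toy illustration of sentence permutation and text infilling (a simplified sketch for intuition only, not BART's actual preprocessing; the real implementation samples span lengths from a Poisson distribution and works on subword tokens):

import random

text = "The cat sat on the mat. It was a sunny day. The dog barked loudly."
sentences = text.split(". ")

# Sentence permutation: shuffle the order of the sentences
random.shuffle(sentences)

# Text infilling: replace a short contiguous span of tokens with a single <mask> token
tokens = " ".join(sentences).split()
span_length = 3
start = random.randrange(len(tokens) - span_length)
noisy_tokens = tokens[:start] + ["<mask>"] + tokens[start + span_length:]

print(" ".join(noisy_tokens))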

These transformations are applied to roughly 160 GB of text, the same corpus used to pretrain RoBERTa, which includes BookCorpus and English Wikipedia among other sources. BART reuses the GPT-2 byte-pair-encoding vocabulary of roughly 50,000 tokens, and the maximum sequence length is 1,024 tokens.

3.b Using BART for summarization

Figure 2: Fine-tuning BART for multi-document summarization.

For our multi-document summarization, we will fine-tune the sshleifer/distilbart-cnn-6-6 model from the Hugging Face Hub. We can find the model card for this model on the Hugging Face website, where we can see that the distilbart checkpoints come in two flavors, distilled from BART models fine-tuned on the CNN/DailyMail dataset and on the Extreme Summarization (XSum) dataset; the cnn in the name tells us this checkpoint is the CNN/DailyMail variant. The numbers 6 and 6 in the model name refer to the number of encoder layers and decoder layers, respectively.

We can load this model and its tokenizer using the Transformers library:

import torch
from transformers import BartForConditionalGeneration, AutoTokenizer

model_ckpt = "sshleifer/distilbart-cnn-6-6"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = BartForConditionalGeneration.from_pretrained(model_ckpt)
device = "cuda" if torch.cuda.is_available() else "cpu"  # reused later for generation
model.to(device)
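As a quick sanity check, we can confirm the layer configuration from the loaded model's config (attribute names as in BartConfig); if the checkpoint matches its name, this should print 6 6:

# The distilled checkpoint should report 6 encoder and 6 decoder layers
print(model.config.encoder_layers, model.config.decoder_layers)
...
6 6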

3.c Fine-Tuning BART

Before we process the data for training, let us have a quick look at the length distribution of the input and outputs:

d_len = [len(tokenizer.encode(s)) for s in dataset["validation"]["document"]]
s_len = [len(tokenizer.encode(s)) for s in dataset["validation"]["summary"]]

Note: Here, lengths are computed on the validation set to keep execution time short, but you can compute them on the training set too by replacing dataset["validation"] with dataset["train"].

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5), sharey=True)
axes[0].hist(d_len, bins=20, color="C0", edgecolor="C0")
axes[0].set_title("Document Token Length")
axes[0].set_xlabel("Length")
axes[0].set_ylabel("Count")
axes[1].hist(s_len, bins=20, color="C0", edgecolor="C0")
axes[1].set_title("Summary Token Length")
axes[1].set_xlabel("Length")
plt.tight_layout()
plt.show()
...
Figure 3: Document token length vs. summary token length

We see that most documents contain around 1,000–1,500 tokens, while the summaries are much shorter, at around 250–300 tokens, in line with the average summary length noted earlier.
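Since we will later truncate inputs at 1,024 tokens, it is also worth checking how many documents exceed that limit (a quick check on the validation split, using the d_len list computed above):

# Fraction of validation documents longer than the 1,024-token limit
frac_truncated = sum(l > 1024 for l in d_len) / len(d_len)
print(f"{frac_truncated:.1%} of documents exceed 1024 tokens and will be truncated")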

Let’s keep those observations in mind as we build the data collator for the Trainer. First we need to tokenize the dataset. For now, we will set the maximum lengths to 1024 and 256 for the documents and summaries, respectively:

def convert_examples_to_features(example_batch):
    input_encodings = tokenizer(example_batch["document"], max_length=1024, truncation=True)

    with tokenizer.as_target_tokenizer():
        target_encodings = tokenizer(example_batch["summary"], max_length=256, truncation=True)

    return {"input_ids": input_encodings["input_ids"],
            "attention_mask": input_encodings["attention_mask"],
            "labels": target_encodings["input_ids"]}

dataset_pt = dataset.map(convert_examples_to_features, batched=True)

A new element in this tokenization step is the tokenizer.as_target_tokenizer() context. Some models require special tokens in the decoder inputs, so it's important to differentiate between the tokenization of encoder and decoder inputs. Inside the with statement (a context manager), the tokenizer knows that it is tokenizing for the decoder and can process sequences accordingly.
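Note that more recent versions of Transformers deprecate as_target_tokenizer() in favor of passing the targets directly via the text_target argument; if you are on a newer version, the equivalent call looks roughly like this:

# Newer Transformers API: tokenize the targets via text_target
target_encodings = tokenizer(text_target=example_batch["summary"], max_length=256, truncation=True)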

Now, we need to create the data collator. This function is called in the Trainer just before the batch is fed through the model. In most cases we can use the default collator, which collects all the tensors from the batch and simply stacks them. For the summarization task we need to not only stack the inputs but also prepare the targets on the decoder side. BART is an encoder-decoder transformer and thus has the classic seq2seq architecture. In a seq2seq setup, a common approach is to apply “teacher forcing” in the decoder. With this strategy, the decoder receives input tokens (as in decoder-only models such as GPT-2) that consist of the labels shifted by one position, in addition to the encoder outputs; so, when predicting the next token, the decoder gets the ground truth shifted by one as its input.

We shift it by one so that the decoder only sees the previous ground truth labels and not the current or future ones. Shifting alone suffices since the decoder has masked self-attention that masks all inputs at present and in the future.
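A tiny worked example makes the shift clearer (the label IDs below are purely illustrative; this is not the collator's actual code):

# Illustrative label IDs for a short target sequence: <s> ... </s>
labels = [0, 8774, 1437, 232, 2]

# BART uses </s> (ID 2) as its decoder start token; the decoder inputs are the
# labels shifted one position to the right with that start token prepended
decoder_start_token_id = 2
decoder_input_ids = [decoder_start_token_id] + labels[:-1]

print(decoder_input_ids)
...
[2, 0, 8774, 1437, 232]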

So, when we prepare our batch, we set up the decoder inputs by shifting the labels to the right by one. After that, we make sure the padding tokens in the labels are ignored by the loss function by setting them to –100. We actually don’t have to do this manually, though, since the DataCollatorForSeq2Seq comes to the rescue and takes care of all these steps for us:

from transformers import DataCollatorForSeq2Seq

seq2seq_data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Note: The data collator dynamically pads the inputs it receives to the longest sequence in each batch, rather than to a fixed maximum length.
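To see this in action, you can call the collator on a couple of tokenized examples of different lengths and inspect the resulting shapes (a small sketch; the field names match what our convert_examples_to_features function produces):

# Grab two tokenized examples of different lengths and let the collator pad them
features = [{k: dataset_pt["train"][i][k]
             for k in ["input_ids", "attention_mask", "labels"]} for i in range(2)]
batch = seq2seq_data_collator(features)

# Inputs are padded to the longest sequence in this small batch, and padded
# label positions are set to -100 so the loss function ignores them
print({k: v.shape for k, v in batch.items()})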

3.d Training the model

Let us set up the TrainingArguments for training.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir='bart-multi-news', num_train_epochs=1,
                                  warmup_steps=500, per_device_train_batch_size=1,
                                  per_device_eval_batch_size=1, weight_decay=0.01,
                                  logging_steps=10, push_to_hub=False,
                                  evaluation_strategy='steps', eval_steps=500,
                                  save_steps=1e6, gradient_accumulation_steps=16)

Here we are setting one new argument, gradient_accumulation_steps, to 16. Since the model is quite big, we had to set the batch size to 1. However, a batch size that is too small can hurt convergence. To resolve that issue, we can use a nifty technique called gradient accumulation: instead of computing the gradients of the full batch all at once, we make smaller batches and accumulate their gradients, and once enough gradients have been accumulated, we run the optimization step. With per_device_train_batch_size=1 and gradient_accumulation_steps=16, the effective batch size is 16. Naturally this is a bit slower than doing it in one pass, but it saves us a lot of GPU memory.
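Conceptually, a manual gradient accumulation loop looks something like this (a simplified PyTorch sketch of the idea, not the Trainer's internal implementation; dataloader and optimizer are assumed to be set up already):

accumulation_steps = 16

for step, batch in enumerate(dataloader):          # dataloader assumed to exist
    outputs = model(**batch)
    loss = outputs.loss / accumulation_steps       # scale so gradients match a full batch
    loss.backward()                                # gradients accumulate across mini-batches

    if (step + 1) % accumulation_steps == 0:       # update only every 16 mini-batches
        optimizer.step()                           # optimizer assumed to exist
        optimizer.zero_grad()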

We now have everything we need to initialize the trainer with the model, tokenizer, training arguments, and data collator, as well as the training and evaluation sets:

trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer,
                  data_collator=seq2seq_data_collator,
                  train_dataset=dataset_pt["train"],
                  eval_dataset=dataset_pt["validation"])

trainer.train()
...
TrainOutput(global_step=2810, training_loss=2.5926377455959115, metrics={'train_runtime': 9350.5084, 'train_samples_per_second': 4.81, 'train_steps_per_second': 0.301, 'total_flos': 4.560567352732877e+16, 'train_loss': 2.5926377455959115, 'epoch': 1.0})

4. Generating Multi-Document Summaries

Let us see what a summary generated on a sample from the test set looks like:

sample_text = dataset["test"][1]["document"]
reference = dataset["test"][1]["summary"]
print("Document:")
print(sample_text)
print("\nReference Summary:")
print(reference)
...
Document:
UPDATE: 4/19/2001 Read Richard Metzger: How I, a married, middle-aged man, became an accidental spokesperson for gay rights overnight on Boing Boing
It’s time to clarify a few details about the controversial “Hey Facebook what’s SO wrong with a pic of two men kissing?” story, as it now beginning to be reported in the mainstream media, and not always correctly.
First of all, with regards to the picture:

The photo which was used to illustrate my first post about the John Snow Kiss-In is a promotional still from the British soap opera “Eastenders.” It features one of the main characters from the show (Christian Clarke, played by the actor John Partridge- left) and someone else who I don’t know. I am not a regular viewer so I can’t say if the man on the right is an extra or an actual character.

This picture has itself caused scandal in the UK, as it was a gay kiss that was broadcast before the watershed, and as such led to a number of complaints to the BBC. However, since this episode aired (October 2008) Christian now has a boyfriend and a few more gay kisses have taken place.

In relation to the John Snow Kiss-In event, I used this particular photo because I considered it to be quite mild (no groping, no tongues). The photos I had considered using before I chose that one are much more racy. Oh the irony!

Secondly, the removal of the Facebook John Snow Kiss-In event:

It turns out that the Facebook event for the John Snow Kiss-In was not blocked by Facebook, but made private by the creator of the event itself. Paul Shetler, the organizer, left this comment on the previous thread: .........

Reference Summary:
– It turns out Facebook is only guilty of about half of what it’s been accused of in the gay kiss incident. The social networking site apologized yesterday for taking down an image used to promote a “kiss-in” event in London. “The photo in question does not violate our Statement of Rights and Responsibilities, and was removed in error,” the site said in a statement, according to the Advocate. But Facebook did not, as has been reported in several places, take down the kiss-in event itself. Here’s what happened: The photo Facebook took down was posted by the Dangerous Minds blog to promote the event. In its initial write-up about the incident, the blog observed that the page organizing the protest had been taken down. But it was actually the organizer himself who "removed" the event, Dangerous Minds clarified. Organizer Paul Shetler explains that he decided to switch it from a public event to a private one, as "there were starting to be trolls posting abusive nonsense on it."

Converting the text to tokens (with attention mask and padding) is straightforward:

input_ids = tokenizer(sample_text, max_length=1024, truncation=True,
                      padding='max_length', return_tensors='pt').to(device)

Now we will use the built-in generate() function from Transformers, which also supports more sophisticated decoding methods. Here we set max_length of the generated summary to 256 tokens. The device variable (defined when we loaded the model) places the tensors on the same device as the model ('cuda' or 'cpu').

summaries = model.generate(input_ids=input_ids['input_ids'],
                           attention_mask=input_ids['attention_mask'],
                           max_length=256)
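If you want to experiment with more sophisticated decoding, generate() also accepts beam search parameters; the values below are just illustrative starting points, not tuned settings:

# Beam search with n-gram repetition blocking (illustrative settings)
summaries = model.generate(input_ids=input_ids['input_ids'],
                           attention_mask=input_ids['attention_mask'],
                           max_length=256, num_beams=4,
                           no_repeat_ngram_size=3, early_stopping=True)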

The summaries generated here are token IDs. To convert them back into text, we use the tokenizer's decode method.

decoded_summaries = [tokenizer.decode(s, skip_special_tokens=True, clean_up_tokenization_spaces=True) for s in summaries]

Displaying the generated summary alongside the reference:

print("\nReference Summary:")
print(reference)
print("\nModel Summary:")
print(decoded_summaries[0])
Reference Summary:
– It turns out Facebook is only guilty of about half of what it’s been accused of in the gay kiss incident. The social networking site apologized yesterday for taking down an image used to promote a “kiss-in” event in London. “The photo in question does not violate our Statement of Rights and Responsibilities, and was removed in error,” the site said in a statement, according to the Advocate. But Facebook did not, as has been reported in several places, take down the kiss-in event itself. Here’s what happened: The photo Facebook took down was posted by the Dangerous Minds blog to promote the event. In its initial write-up about the incident, the blog observed that the page organizing the protest had been taken down. But it was actually the organizer himself who "removed" the event, Dangerous Minds clarified. Organizer Paul Shetler explains that he decided to switch it from a public event to a private one, as "there were starting to be trolls posting abusive nonsense on it."

Model Summary:
– Facebook has removed a photo of two men kissing in protest of a London pub's decision to eject a same-sex couple for kissing, reports the Guardian. "The photo in question does not violate our Statement of Rights and Responsibilities and was removed in error," says a Facebook statement. "Shares that contain nudity, or any kind of graphic or sexually suggestive content, are not permitted on Facebook." The photo was used to promote a "gay kiss-in" demonstration in London, and was quickly removed from the Facebook page. The photo has prompted scores of people to post their own photos of the same sex couples kissing. "I am not a regular viewer so I can’t say if the man on the right is an extra or an actual character," says one commenter.

Note: You can also evaluate the generations as part of the training loop: use the extension of TrainingArguments called Seq2SeqTrainingArguments and specify predict_with_generate=True. Pass it to the dedicated Trainer called Seq2SeqTrainer, which then uses the generate() function instead of the model’s forward pass to create predictions for evaluation. Give it a try!
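A minimal sketch of that setup might look like this (argument values mirror the ones used above; a compute_metrics function, typically ROUGE, is omitted here but would normally be added for evaluation):

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

seq2seq_args = Seq2SeqTrainingArguments(output_dir='bart-multi-news', num_train_epochs=1,
                                        per_device_train_batch_size=1, per_device_eval_batch_size=1,
                                        gradient_accumulation_steps=16, evaluation_strategy='steps',
                                        eval_steps=500, predict_with_generate=True)

seq2seq_trainer = Seq2SeqTrainer(model=model, args=seq2seq_args, tokenizer=tokenizer,
                                 data_collator=seq2seq_data_collator,
                                 train_dataset=dataset_pt["train"],
                                 eval_dataset=dataset_pt["validation"])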

5. Conclusion

Text summarization poses some unique challenges compared to tasks that can be framed as classification problems, such as sentiment analysis, named entity recognition, or question answering.

A common question when working with summarization models is how we can summarize documents where the texts are longer than the model’s context length. Unfortunately, there is no single strategy to solve this problem, and to date this is still an open and active research question.

6. References

Code Link on Kaggle: https://www.kaggle.com/code/ashwinnaidu/textsummarization/notebook
