Generating Abstractive Summaries Using Google’s PEGASUS Model

Akash Chauhan · TheCyPhy · Jun 25, 2020


In the last week of December 2019, the Google Brain team released its state-of-the-art summarization model PEGASUS, whose name expands to Pre-training with Extracted Gap-sentences for Abstractive Summarization. In this article, we will only look at how to generate summaries using the pre-trained model; for details on how the pre-training was done, refer here.

As the original paper shows, the model produces impressive abstractive summaries. For example, one of its fine-tuned models (on the XSum dataset) produced the following for an input:

Fig 1: Example

Not bad for a machine-generated summary, eh?

Coming to the point of this article, let's see how we can use the pre-trained model to generate summaries for our own text. Since this is ongoing research, there is no quick, official way to summarize custom text yet. Until the authors provide one, the workaround in this article will do the job.

As the first step, visit the GitHub repository and follow the steps in the documentation to install the library and download the model checkpoints. Be careful about how you install gsutil: on some Linux distributions, the system package manager installs an unrelated package with the same name. The documentation has since been updated, so just read through the steps carefully.

The next step is to install the dependencies listed in requirements.txt. Caution is required here as well: keep track of the versions of the dependencies you are using. In my case, everything worked flawlessly with TensorFlow version 1.15.

Great! Now that we are done with the setup, let's get to the action. The pegasus directory is laid out as follows:

Fig 2: pegasus cloned repository.

In the top-most directory, named ckpt, we have the model checkpoint pre-trained on the C4 dataset. Alongside it, you will find models fine-tuned on 12 TensorFlow datasets. Refer to Fig 3.

Fig 3: Model checkpoints.

Everything seems fine so far, so one could use any of these checkpoints to generate summaries for custom text. But before getting excited about these models, consider that the model must expect its input in some specific format, right? So let's work on creating the input data first.

The input needs to be a .tfrecord. So let's see how to create our input data; the following piece of code ought to do it for you. Just one thing to take care of here: make sure the .tfrecord is saved inside the testdata directory, which is inside pegasus/data/.
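The original gist is not embedded here, so below is a minimal sketch of what that code looks like. The feature keys "inputs" and "targets", and the file name test_pattern_1.tfrecord, are my assumptions for illustration; replace the example texts and save_path with your own.

import tensorflow as tf

# Documents to summarize; replace with your own text.
input_texts = [
    "PEGASUS was released by the Google Brain team in December 2019 "
    "and achieves state-of-the-art results on abstractive summarization."
]
# Ground-truth summaries; empty strings suffice for inference only.
target_texts = [""] * len(input_texts)

# Must live under pegasus/data/testdata/ (see below).
save_path = "pegasus/data/testdata/test_pattern_1.tfrecord"

def _bytes_feature(text):
    # Encode a string as a single-element BytesList feature.
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[text.encode("utf-8")]))

with tf.io.TFRecordWriter(save_path) as writer:
    for inp, tgt in zip(input_texts, target_texts):
        example = tf.train.Example(features=tf.train.Features(feature={
            "inputs": _bytes_feature(inp),
            "targets": _bytes_feature(tgt),
        }))
        writer.write(example.SerializeToString())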

In the snippet above, you will see that targets are passed as well. The targets list is supposed to hold the actual summaries, i.e. the ground truth. Since we are only generating summaries from the model and not training it, you can pass empty strings, but you can't omit the field, because the model expects input in that format.

Awesome! Now that our data is prepared, there is just one more step before we start getting summaries: registering our tfrecord in pegasus's registry (locally). Just remember to keep track of the save_path from the code we used to generate the input data.
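Again, the gist is not reproduced here; the snippet below is a sketch modeled on the dataset entries already present in pegasus/params/public_params.py (where registry and transformer_params are already imported). The name test_transformer must match the --params flag you pass on the command line, and the tfrecord path must match the save_path from above; the hyperparameter values simply mirror the file's other entries.

@registry.register("test_transformer")
def test_transformer(param_overrides):
  # Point all three splits at the same tfrecord; only the test split
  # matters for inference.
  return transformer_params(
      {
          "train_pattern": "tfrecord:pegasus/data/testdata/test_pattern_1.tfrecord",
          "dev_pattern": "tfrecord:pegasus/data/testdata/test_pattern_1.tfrecord",
          "test_pattern": "tfrecord:pegasus/data/testdata/test_pattern_1.tfrecord",
          "max_input_len": 1024,
          "max_output_len": 256,
          "train_steps": 180000,
          "learning_rate": 0.0001,
          "batch_size": 8,
      }, param_overrides)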

In the pegasus directory on your system, go to pegasus/params/public_params.py and paste the above code at the end of the script. You will see that all three patterns, train_pattern, dev_pattern and test_pattern, are assigned the same tfrecord. You may create different tfrecords for each, but since we are only running inference, it doesn't matter. And we are done!

Switch to the pegasus directory in your terminal and run the command:

python3 pegasus/bin/evaluate.py --params=test_transformer \
--param_overrides=vocab_filename=ckpt/pegasus_ckpt/c4.unigram.newline.10pct.96000.model,batch_size=1,beam_size=5,beam_alpha=0.6 \
--model_dir=ckpt/pegasus_ckpt/

This will start generating summaries for your input data. Once done, you will find three text files in the directory of the model you picked, corresponding to the input text, the target text and the predicted summaries.
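To skim the results without hunting through the directory, something like the following works. The glob patterns are an assumption: evaluate.py attaches a suffix to its output file names, so check your model_dir for the actual files containing "inputs", "targets" and "predictions".

import glob

# File name patterns below are guesses; adjust after inspecting
# ckpt/pegasus_ckpt/ to see what evaluate.py actually wrote.
for kind in ("inputs", "targets", "predictions"):
    for path in glob.glob(f"ckpt/pegasus_ckpt/*{kind}*"):
        print(f"== {path} ==")
        with open(path) as f:
            print(f.read())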

As you analyze the summaries, you might notice that they appear extractive rather than abstractive. That can be cured by fine-tuning the model on a very small sample of your own data. See this note from the contributors.

Conclusion and Endnote:

This article presented one workaround for generating summaries from the pre-trained abstractive summarization models provided by the Google Brain team. It may not be a clean or efficient method, but it ought to do the job until we get such functionality from the authors. If readers have some other way to use these models for creating summaries, please comment or reach out.

Thank you so much for taking the time to read this article. Find me at https://chauhanakash23.github.io/
