Text Summarization, Part 4 — Twitter bot for Automatic Summarization of Paper Abstracts

Mohamed Bamouh
Besedo Engineering Blog
4 min read · Mar 24, 2022

The last three chapters (chapter 1, chapter 2 and chapter 3) were dedicated to the theoretical aspects and underlying structure of various text summarization methods. This chapter shows a concrete application of this NLP sub-field for a specific purpose: scientific paper summarization.

The objective is to fit the main idea of a paper or article into one or two sentences, and to share this summary on Twitter using a bot specifically engineered for the task.

This chapter presents the steps taken to extract the data, compress it into a much shorter text, and format it to fit within 280 characters (the current length limit of a single tweet on Twitter).

Similar Work

While Automatic Text Summarization was once a fairly niche field, it has received a recent boost thanks to the advent of the powerful Transformer architecture [1], as well as the release of Python libraries such as HuggingFace's Transformers, which make it easy to load, train and evaluate state-of-the-art models.

It then comes as no surprise that many solutions tackling paper summarization rely on A.I. This article, for example, presents an A.I.-powered solution capable of producing ‘science abstracts a second grader can understand’.

Here is a non-exhaustive list of popular frameworks using automatic text summarization methods for easier knowledge sharing:

Our Solution

Our implementation of a scientific paper summarizer, developed by the Besedo Data Science Research team during a workshop, is called paperTLDR, and consists of a Twitter Bot which periodically posts one-sentence summaries of popular scientific papers.

Step One: Data Extraction

We obtain the list of trending papers from the paperswithcode website.

Since the latter mainly features trending papers from the arXiv website, our Twitter bot summarizes articles published on arXiv.
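As a rough sketch of this step, one can fetch the trending page and pull out the arXiv abstract links with a regular expression. The URL pattern and the helper name below are illustrative assumptions, not the bot's actual code, and the page's structure may change:

```python
import re

def extract_arxiv_links(html: str) -> list[str]:
    """Return the unique arXiv abstract URLs found in a page's HTML,
    preserving their order of first appearance."""
    links = re.findall(r"https?://arxiv\.org/abs/\d{4}\.\d{4,5}", html)
    unique = []
    for link in links:
        if link not in unique:
            unique.append(link)
    return unique

# The live page could then be fetched with e.g. the requests library:
#   html = requests.get("https://paperswithcode.com/").text
#   papers = extract_arxiv_links(html)
```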

Step Two: Text Summarization Model

Before settling on a specific model, we had to choose a summarization strategy: abstractive or extractive summarization?

We chose to focus on abstractive summarization because of its flexibility when generating summaries: they usually consist of sequences of words not found in the original text, arranged to maximize information density in a few sentences. Extractive summarization, by contrast, only selects a couple of sentences directly from the original text.

(For a more detailed definition of abstractive/extractive summarization, see chapter 1)

In our case, since we are limited by the number of characters, extracting one or two sentences from the text isn’t enough to form a summary that encompasses the entire text’s context and meaning.

We also opted for abstractive summarization due to the extended control it gives over the summaries’ minimum and maximum length, whereas extractive summarization is constrained by the number of extracted sentences, which can vary greatly in length.

For the model, we chose Google’s PEGASUS model [2], whose weights are already fine-tuned on an arXiv dataset. We downloaded the model architecture and weights using the HuggingFace library (Link to the model).
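A minimal loading-and-generation sketch with the Transformers library could look as follows. The checkpoint id `google/pegasus-arxiv` and the generation parameters are assumptions for illustration; the import is done lazily inside the function because the checkpoint weighs several gigabytes:

```python
def summarize(text: str,
              model_name: str = "google/pegasus-arxiv",
              max_length: int = 60,
              min_length: int = 10) -> str:
    """Generate an abstractive summary of `text` with a PEGASUS
    checkpoint fine-tuned on arXiv papers (assumed model id)."""
    # Lazy import: the dependency and the weights are heavy.
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    tokenizer = PegasusTokenizer.from_pretrained(model_name)
    model = PegasusForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(text, truncation=True, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"],
                                 max_length=max_length,
                                 min_length=min_length,
                                 num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

The min/max length arguments are the knobs used later to keep the generated summary short enough for a tweet.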

In order to evaluate this model, we tested it on the scientific_papers dataset, downloaded using the HuggingFace library (Link to the dataset).

The model was tasked with generating an abstract given a paper. We obtained the following results:

"rouge1": 40.32884297875105, "rouge2": 14.837473285509, "rougeL": 24.964702230256545

These results are encouraging.

(For more information about ROUGE scores and other metrics to evaluate text summarization models, see chapter 1)
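As a quick reminder of what those numbers measure, ROUGE-1 is essentially an F1 score over unigram overlap between the generated summary and the reference. In practice one would compute it with HuggingFace tooling over the whole dataset; the toy function below, with no stemming or other preprocessing, is only meant to illustrate the idea:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Toy ROUGE-1 F1: unigram-overlap F-score between a candidate
    summary and a reference (case-folded, whitespace tokenization)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, the candidate "the cat" against the reference "the cat sat on the mat" gives precision 1.0 and recall 1/3, hence an F1 of 0.5.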

These fine-tuned weights prove useful for compressing abstracts and/or articles into a much shorter version while retaining most of the scientific and technical language typical of such texts.

Step Three: Tweeting the Summary

The tweet is made of three parts:

  • The arXiv paper link, formatted to 23 characters by Twitter
  • The compression rate: indicates the length ratio between the summary and the abstract (from 0% to 100%), shown in the tweet next to the compression emoji (🗜). For example, given an abstract of length 500 and a generated summary of length 50, the compression rate would be (500 − 50) / 500 = 90%.
  • The automatically generated summary, produced with min/max length parameters chosen so that the whole tweet fits within Twitter’s character limit (280).
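Assembling the three parts can be sketched as a small helper (`compose_tweet` is a hypothetical name, not the bot's actual code; Twitter itself rewrites any URL to a fixed 23-character t.co link, so only the summary length really varies):

```python
def compose_tweet(link: str, summary: str, abstract_len: int) -> str:
    """Assemble the tweet: arXiv link, compression-rate line with the
    compression emoji, then the generated summary."""
    # Compression rate = (abstract length - summary length) / abstract length
    rate = round(100 * (abstract_len - len(summary)) / abstract_len)
    return f"{link} \U0001F5DC {rate}%\n\n{summary}"
```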

We provide two fail-safe measures in case the entire tweet’s length somehow exceeds the 280-character limit:

  1. Segment the summary into sentences, and take the first one.
  2. Truncate the tweet’s last 5 characters and append “[…]” at the end of the summary.
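The two fail-safes can be sketched as follows (illustrative code, assuming a naive sentence split on end-of-sentence punctuation):

```python
import re

def enforce_limit(tweet: str, summary: str, limit: int = 280) -> str:
    """Apply the two fail-safes when the assembled tweet is too long:
    (1) replace the summary with only its first sentence; if that is
    still too long, (2) truncate and append an ellipsis marker."""
    if len(tweet) <= limit:
        return tweet
    # Fail-safe 1: keep only the summary's first sentence.
    first_sentence = re.split(r"(?<=[.!?])\s+", summary)[0]
    shortened = tweet.replace(summary, first_sentence)
    if len(shortened) <= limit:
        return shortened
    # Fail-safe 2: hard truncation with an ellipsis marker.
    return shortened[:limit - 5] + "[…]"
```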

Here is an example of a tweeted summary:

arxiv.org/abs/2201.05273 🗜 91%

This survey we present recent advances achieved in the topic of paradigms for text generation.

The Twitter bot can be found here:

Thanks for reading!

References

[1] Vaswani et al. Attention Is All You Need. 2017. arXiv: 1706.03762

[2] Jingqing Zhang et al. PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization. 2020. arXiv: 1912.08777
