Text Summarization, Part 1 — A Gentle Introduction to Automatic Text Summarization

Mohamed Bamouh
Besedo Engineering Blog
8 min read · Mar 3, 2022
Photo by Aaron Burden on Unsplash

This blog post series is about a well-known field at the crossroads of Artificial Intelligence and Linguistics: Text Summarization.

Although this blog post series is intended for readers with minimal knowledge of NLP (Natural Language Processing), this first introductory chapter can be read and understood by anyone curious about how a task such as Text Summarization, which is already tricky for humans, is approached by an algorithm or AI.

The second and third chapters, however, delve deeper into the state of the art of Automatic Text Summarization and may require passing familiarity with the Transformer neural network architecture.

If you have never heard of Transformers, I suggest reading Jay Alammar’s excellent article, The Illustrated Transformer, which clearly introduces the concept.

Definition

Text Summarization is the ability to write a shorter, condensed version of a paragraph, an article, or a book while retaining most of the original text’s meaning.

Illustration of the Text Summarization process

Automatic Text Summarization means automating that task, without human intervention, using algorithms, linguistic rules, or artificial intelligence.

Natural Language Processing

In Artificial Intelligence, Automatic Text Summarization is a subcategory of NLP (Natural Language Processing).

NLP focuses on three essential points:

  1. Turning raw text into mathematical features, also called representations (vectors, matrices…), which retain at least some of the text’s syntactic and/or semantic properties and can be processed by an algorithm (Word Embeddings, POS-tagging…).
  2. Statistical modeling to infer rules about language (e.g., Conditional Random Fields).
  3. Using machine learning to train models that learn latent patterns in text features for a specific task (classification, generation…).

The best-known example of turning words into representations is Word2Vec [7], which encodes words as vectors and captures analogies between words (vector(“king”) − vector(“man”) + vector(“woman”) ≈ vector(“queen”)).
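As an aside, that analogy can be checked in a few lines of Python using the gensim library and its downloadable pre-trained Google News vectors. This is a sketch, not part of the original post: the model is a large download on first use, and the exact neighbor score may vary.

```python
# Testing the king - man + woman ≈ queen analogy with pre-trained Word2Vec
# vectors, fetched through gensim's dataset downloader (~1.6 GB on first use).
import gensim.downloader

vectors = gensim.downloader.load("word2vec-google-news-300")

# most_similar adds the "positive" vectors and subtracts the "negative" ones,
# then returns the nearest words by cosine similarity.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> [('queen', 0.71...)]
```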

Why is Automatic Text Summarization so Important?

Buddha said:
“The trouble is, you think you have time.”

In an increasingly complex and interconnected world, where knowledge is constantly created, updated, twisted, and deleted, there is simply no time to stay updated on the state of the world, on your interests, or even on your area of expertise at work.

This is particularly dangerous in the era of social networks, where misinformation and fake news run rampant. Most people don’t have time to learn about the complexity of the world and its subtle nuances, and end up believing clickbait headlines and then spreading false claims, which can be dangerous.

Automatic Text Summarization may be a solution: by compressing news articles, technical documentation, books, essays, conferences, and meetings into a much more digestible format with minimal information loss, it can save people a considerable amount of time in their work and help them stay updated on the state of the world.

A bit of History

Until the 21st century, the idea of an AI performing abstractive automatic text summarization was inconceivable. The most an algorithm could do was extract relevant sentences from a text based on word frequency. [2]

Indeed, to truly summarize a text, one needs a deep understanding of the ideas it conveys and a good command of the language it is written in.

Recent advances in Word Embeddings and Recurrent Neural Networks, as well as the advent of the Transformer architecture, have made this once-unreachable goal attainable.

Types of Automatic Text Summarization

There are two types of Automatic Text Summarization: Abstractive Summarization, which generates new sentences that rephrase the source’s ideas, and Extractive Summarization, which selects the most relevant sentences verbatim from the source.

Types of Automatic Text Summarization

Abstractive Summarization Pros and Cons:

  • (- -) Difficult to implement, since it requires the algorithm/AI to have a deep understanding of both the language and the text.
    (- -) It therefore needs a precise syntactic and semantic representation of the text data.
  • (++) Summaries are concise and rich in information.
    (- -) But they may contain factual errors when the AI/algorithm does not grasp the text’s context well enough.

Extractive Summarization Pros and Cons:

  • (++) Can be done with a variety of methods, some of which don’t even require machine learning (see the previous section, A bit of History); a minimal sketch follows this list.
  • (++) Suited to cases where key sentences hold most of the text’s information, such as news articles.
    (- -) By the same logic, it is not suited to cases where information is spread thinly across the text.
  • (++) The summaries are factually and grammatically correct, since the sentences are copied verbatim from the source.
    (- -) But some extracted sentences may lack the overall context of the text.
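As mentioned in the list above, here is a minimal sketch of the no-machine-learning route: a frequency-based extractive summarizer, loosely inspired by the word-frequency idea of [2] but not a faithful reimplementation. The stop-word list, regexes, and scoring are illustrative simplifications.

```python
# Minimal frequency-based extractive summarizer (a simplified sketch).
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "to",
             "and", "in", "on", "for", "with", "it", "this", "that"}

def summarize(text, num_sentences=2):
    # Naive sentence splitting on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)

    # A sentence's score is the total corpus frequency of its content words.
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Keep the top-scoring sentences, restored to their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(sorted(top, key=sentences.index))
```

Because every output sentence is copied verbatim from the source, the result is grammatical by construction, which is exactly the extractive trade-off described above.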

For example, let’s summarize this article [1]:

Source: [1]

A summary generated with the abstractive method would reformulate the article’s key points:

Summary (Abstractive method)

While the extractive method would yield this summary by extracting the first and second-to-last sentences:

Summary (Extractive method)

Evaluation Metrics of Automatic Text Summarization

How do we evaluate a summary’s quality?

For a long time, it was challenging to find metrics that numerically capture how faithful a generated summary is to the original text. After all, there is objectively no single “best” summary for a given text.

This is why, to avoid over-complicating things and to have a single standard everyone can agree on, researchers proposed simple solutions that directly compare two raw texts: the reference summary, written by a human, and the summary predicted/generated/extracted by the algorithm.

Some of the metrics used in Automatic Text Summarization

ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [3] is a metric specially made for Automatic Text Summarization.

Most papers use ROUGE-N (ROUGE-1, ROUGE-2) and ROUGE-L for evaluating their methods/models.

ROUGE-N is a recall-inspired metric, calculated by dividing the number of N-grams shared between the reference (human-written) summary and the AI-written summary by the total number of N-grams in the reference summary.

Therefore, ROUGE-1 only considers the summary’s uni-grams, and ROUGE-2 its bi-grams.

But ROUGE-N alone cannot always capture the similarity between two sentences, since it takes neither overall n-gram order nor meaning into account.

For instance, let’s take the following example:

S1 (reference): journalist answered the guest
S2 (candidate): journalist answer the guest
S3 (candidate): the guest answered the journalist

ROUGE-2 (S1, S2) = len([“the guest”]) / len([“journalist answered”, “answered the”, “the guest”]) = 1/3

ROUGE-2 (S1, S3) = len([“answered the”, “the guest”]) / len([“journalist answered”, “answered the”, “the guest”]) = 2/3

We notice that S3 scores higher than S2, even though S3 reverses the meaning of the reference while S2 merely misconjugates a verb: n-gram overlap says little about meaning.
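As a sanity check, here is a minimal ROUGE-N (recall) sketch in Python that reproduces the two scores above. It is an illustrative toy, not a replacement for an evaluation library: it ignores stemming, stop words, and multiple references.

```python
# Minimal ROUGE-N (recall) sketch reproducing the scores above.
# For real evaluations, a maintained library (e.g. rouge-score) is preferable.
from collections import Counter

def ngrams(text, n):
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(reference, candidate, n=2):
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())   # clipped count of shared n-grams
    return overlap / sum(ref.values())     # divided by reference n-gram count

s1 = "journalist answered the guest"
print(rouge_n(s1, "journalist answer the guest"))        # S2: 1/3 ≈ 0.33
print(rouge_n(s1, "the guest answered the journalist"))  # S3: 2/3 ≈ 0.67
```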

That’s why, instead of computing ROUGE-N for many values of N, we prefer to compute only ROUGE-1 and ROUGE-2, and add an additional metric for longer sequences, called ROUGE-L.

ROUGE-L replaces N-gram counting with the LCS (Longest Common Subsequence), which is much more robust against cases such as the previous example (more details in the paper [3]). The metric is computed as an F-score combining two measures:

R_lcs = LCS(X, Y) / m
P_lcs = LCS(X, Y) / n
F_lcs = (1 + β²) · R_lcs · P_lcs / (R_lcs + β² · P_lcs)

Source: [3]

with X the reference summary of length m, Y the predicted summary of length n, and β a constant that sets the relative weight of recall versus precision.
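Following those definitions, here is a minimal ROUGE-L sketch, again an illustrative simplification of [3] (whitespace tokenization, a single reference, sentence-level only):

```python
# Minimal ROUGE-L sketch: recall and precision over the LCS length,
# combined into an F-score. beta weights recall relative to precision.
def lcs_length(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(reference, candidate, beta=1.0):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    r, p = lcs / len(ref), lcs / len(cand)            # R_lcs, P_lcs
    return (1 + beta**2) * r * p / (r + beta**2 * p)  # F_lcs

# The LCS of S1 and S3 has length 2 ("answered the" or "the guest"),
# so R = 2/4, P = 2/5, and F1 ≈ 0.444.
print(rouge_l("journalist answered the guest", "the guest answered the journalist"))
```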

BLEU

BLEU (Bilingual Evaluation Understudy) [8] is an evaluation metric mainly used for machine translation.

It is computed as a weighted geometric mean of N precision scores p_n (usually over uni-grams, bi-grams, tri-grams, and 4-grams), multiplied by a brevity penalty (BP) that penalizes translations shorter than the reference.

BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n )
BP = 1 if c > r, and exp(1 − r/c) otherwise

with c the candidate’s length and r the reference’s length. Source: [8]

Each precision score is computed by dividing the number of N-grams shared between the reference (human-written) summary and the AI-written summary by the total number of N-grams in the AI-written summary.
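In practice, BLEU is rarely computed by hand; NLTK, for example, ships an implementation. Here is a minimal usage sketch (the smoothing choice below is one common option, not a requirement of the metric):

```python
# Computing BLEU with NLTK's implementation.
# Smoothing avoids a zero score when some higher-order n-gram has no match,
# which is typical for short sentences like this example.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "journalist answered the guest".split()
candidate = "journalist answer the guest".split()

score = sentence_bleu(
    [reference],                 # BLEU supports several references per candidate
    candidate,                   # default weights: uniform over 1- to 4-grams
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```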

Even though BLEU is intended for evaluating the quality of machine translations, a study carried out in 2003 [4] found it effective for automatic text summarization as well, reporting a significant correlation between the metric and human-judgment scores.

Limitations of ROUGE/BLEU

ROUGE and BLEU rely on word/n-gram frequency to compute a similarity measure between two raw texts. This similarity is purely syntactic and considers neither the individual meaning of the n-grams nor the overall purpose of the text.

This is a problem in Automatic Text Summarization, where one document can have multiple valid summaries: a summary worded differently from the reference receives low ROUGE and BLEU scores even though it is perfectly valid.

Research is ongoing to find more suitable metrics for comparing raw texts, either by giving more weight to some words/n-grams depending on their rarity (such as NIST [5]) or by giving more importance to word/n-gram alignment and order in the compared texts (such as METEOR, Metric for Evaluation of Translation with Explicit ORdering [6]).

Conclusion

Automatic Text Summarization is an ever-growing field in the world of NLP. I hope this first chapter has laid a good foundation for understanding this domain and cleared some doubts about the current methods and metrics for achieving this task.

The next chapter will be more technical as we delve right into the heart of the domain by presenting the state-of-the-art methods for Automatic Text Summarization.

Link to Part 2

References

[1] https://edition.cnn.com/travel/article/space-tourism-20-year-anniversary-scn/index.html

[2] H. P. Luhn. “The Automatic Creation of Literature Abstracts.” In: IBM Journal of Research and Development 2.2 (1958), pp. 159–165.

[3] Chin-Yew Lin. “ROUGE: A Package for Automatic Evaluation of Summaries.” In: Text Summarization Branches Out. Barcelona, Spain: Association for Computational Linguistics, July 2004, pp. 74–81.

[4] Chin-Yew Lin and Eduard Hovy. “Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics.” In: Proceedings of HLT-NAACL 2003.

[5] George Doddington. “Automatic Evaluation of Machine Translation Quality Using N-gram Co-occurrence Statistics.” Morgan Kaufmann Publishers Inc., 2002.

[6] Satanjeev Banerjee and Alon Lavie. “METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments.” In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan, June 2005.

[7] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. “Efficient Estimation of Word Representations in Vector Space.” arXiv:1301.3781, 2013.

[8] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. “BLEU: a Method for Automatic Evaluation of Machine Translation.” In: Proceedings of ACL 2002, Philadelphia, pp. 311–318.
