Two minutes NLP — Four different approaches to Text Summarization

Word frequencies, TextRank, Sentence embeddings clustering, and seq-to-seq models

Fabio Chiusano
NLPlanet
3 min read · Jan 18, 2022


Small taxonomy of Text Summarization methods. Image by the author.

Text summarization is a technique for generating a concise and precise summary of long texts, without losing the overall meaning.

Text summarization approaches

Broadly, there are two approaches to summarizing texts in NLP: extraction-based and abstraction-based.

  • Extraction-based summarization: a subset of words or sentences that represent the most important points is pulled from the long text and combined to make a summary. Because pieces of the original text are stitched together verbatim, the results may not be grammatically fluent or coherent.
  • Abstraction-based summarization: advanced deep learning techniques (mainly seq-to-seq models) are applied to paraphrase and shorten the original document, just like humans do. Since abstractive algorithms can generate new phrases and sentences that represent the most important information from the source text, they can overcome the grammatical inaccuracies of extraction-based techniques.

Although abstraction-based summarization performs better, developing its algorithms requires complicated deep learning techniques and sophisticated language modeling. As such, extraction-based approaches are still widely used.

Let’s look at concrete ways to implement extraction-based and abstraction-based summarization models.

Extraction-based summarization with word frequencies

This is probably the easiest way to implement an extraction-based summarizer; a minimal code sketch follows the steps below.

  1. Clean the document by removing stopwords, numbers, punctuation, and other special characters.
  2. Split the document into sentences.
  3. Count how many times each word appears in the document and divide each count by the count of the most frequent word, obtaining a normalized frequency for each word.
  4. Sum the frequencies of the words that appear in each sentence to obtain a score for that sentence.
  5. Keep the sentences with a score higher than a certain threshold and use them as the summary.
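
Here is a minimal sketch of these steps in plain Python. The regex-based sentence splitter, the toy stopword list, and the threshold value are all simplifications; a real implementation would use a proper tokenizer and a full stopword list (e.g. from NLTK or spaCy).

```python
import re
from collections import Counter

# Toy stopword list; use a full one (e.g. NLTK's) in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in",
             "is", "are", "it", "that", "for", "on", "with", "as"}

def summarize_by_frequency(document: str, threshold: float = 1.0) -> str:
    # 2. Split into sentences (naive split on end punctuation).
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())

    # 1. Clean: lowercase, keep alphabetic tokens, drop stopwords.
    words = [w for w in re.findall(r"[a-z]+", document.lower())
             if w not in STOPWORDS]

    # 3. Word frequencies, normalized by the most frequent word.
    counts = Counter(words)
    max_count = max(counts.values(), default=1)
    freqs = {w: c / max_count for w, c in counts.items()}

    # 4. Score each sentence as the sum of its word frequencies.
    def score(sentence: str) -> float:
        tokens = [w for w in re.findall(r"[a-z]+", sentence.lower())
                  if w not in STOPWORDS]
        return sum(freqs.get(w, 0.0) for w in tokens)

    # 5. Keep sentences scoring above the threshold, in original order.
    return " ".join(s for s in sentences if score(s) > threshold)
```

Note that summing raw frequencies favors long sentences; dividing each score by the sentence length is a common variant.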

Extraction-based summarization with TextRank

This is how the TextRank algorithm works (a code sketch follows the steps):

  1. Split the document into sentences.
  2. Get a sentence embedding for each sentence.
  3. Build a graph where nodes are the sentences and edge weights are the similarities (e.g. cosine similarity) of the sentence embeddings.
  4. Run the PageRank algorithm on the graph to obtain a PageRank score for each sentence. A high PageRank score means that the node is important for the network.
  5. Keep the sentences with a score higher than a certain threshold (or simply the top-ranked ones) and use them as the summary of the document.
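
Here is a sketch of the algorithm using networkx for PageRank. The sentence-transformers library and the all-MiniLM-L6-v2 model are assumptions for the embedding step (any sentence encoder works), and this version keeps the top-N sentences instead of thresholding.

```python
import re
import networkx as nx
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summarize(document: str, top_n: int = 3) -> str:
    # 1. Split the document into sentences (naive split).
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())

    # 2. Get a sentence embedding for each sentence.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)

    # 3. Build a graph whose edge weights are cosine similarities.
    similarity_matrix = cosine_similarity(embeddings)
    graph = nx.from_numpy_array(similarity_matrix)

    # 4. Run PageRank to score each sentence (node).
    scores = nx.pagerank(graph)

    # 5. Keep the top-N sentences, restored to document order.
    best = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return " ".join(sentences[i] for i in sorted(best))
```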

Extraction-based summarization with sentence embeddings and clustering

You can easily use this method with the bert-extractive-summarizer library (a usage snippet follows the steps); see its accompanying paper for more in-depth details.

  1. Resolve coreferences in the document (see Coreference Resolution).
  2. Split the document into sentences.
  3. Get a sentence embedding (e.g. using BERT) for each sentence.
  4. Run K-Means on the sentence embeddings and get K clusters. K will be the number of sentences in the summary.
  5. Find the closest sentence to the centroid of each cluster and use them to compose the summary.
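
With the library, the whole pipeline reduces to a few lines. A minimal usage sketch, assuming the library's num_sentences argument, which plays the role of K:

```python
# pip install bert-extractive-summarizer
from summarizer import Summarizer

document = (
    "Your long document goes here. It should contain several "
    "sentences so that the clustering step has material to work with."
)

model = Summarizer()  # loads a pre-trained BERT model by default
# num_sentences plays the role of K: the number of clusters,
# and hence the number of sentences in the summary.
summary = model(document, num_sentences=3)
print(summary)
```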

Abstraction-based summarization with seq-to-seq models

You can leverage pre-trained models from Hugging Face out of the box, or train your own; an inference and evaluation snippet follows the steps.

  1. Get a dataset of documents paired with summaries, such as the Cornell Newsroom Dataset or the NewSHead Dataset.
  2. Choose a suitable metric for text summarization, such as ROUGE.
  3. Train a seq-to-seq model (usually a Transformer-based model) to produce summaries from texts in a supervised way.
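
For example, with the transformers pipeline you can run inference with a pre-trained checkpoint and score the output with ROUGE via the evaluate library. The checkpoint name, generation parameters, and reference summary below are illustrative:

```python
from transformers import pipeline
import evaluate

# Load a pre-trained abstractive summarization model.
# facebook/bart-large-cnn is one publicly available checkpoint.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

document = "Your long article text goes here ..."
result = summarizer(document, max_length=130, min_length=30, do_sample=False)
prediction = result[0]["summary_text"]
print(prediction)

# Evaluate against a human-written reference summary with ROUGE.
rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=[prediction],
                       references=["A human-written reference summary ..."])
print(scores)  # rouge1, rouge2, rougeL, rougeLsum
```

Fine-tuning on your own paired dataset (step 3) follows the standard supervised seq-to-seq training workflow in transformers.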
