Towards Automatic Text Summarization: Extractive Methods

Jan 23, 2019 · 8 min read

For anyone who has done academic writing, summarization — the task of producing a concise and fluent summary while preserving key information and overall meaning — was, if not a nightmare, then a constant challenge close to guesswork: what would the professor find important? Though the basic idea looks simple (find the gist, cut out all opinions and detail, and write a couple of perfect sentences), the task inevitably ended in toil and turmoil.

On the other hand, in real life we are perfect summarizers: we can describe the whole War and Peace in one word, be it “masterpiece” or “rubbish”. We can read tons of news about state-of-the-art technologies and sum them up in “Musk sent Tesla to the Moon”.

We would expect the computer to be even better. Where humans are imperfect, artificial intelligence, deprived of emotions and opinions of its own, would do the job.

The story began in the 1950s. An important piece of research from those days introduced a method to extract salient sentences from a text using features such as word and phrase frequency. In this work, Luhn proposed to weight the sentences of a document as a function of high-frequency words, ignoring very high-frequency common words, an approach that became one of the pillars of NLP.

Word-frequency diagram. The abscissa represents individual words arranged in order of frequency
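Luhn's idea is easy to sketch in a few lines of Python. The snippet below is a toy illustration under simplifying assumptions — a tiny hand-picked stop-word list standing in for the "very high frequency common words", naive tokenization, and an average-frequency sentence score — not Luhn's original algorithm:

```python
from collections import Counter
import re

# Toy stop-word list standing in for "very high frequency common words".
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "are", "that", "it"}

def luhn_scores(text):
    """Score each sentence by the average frequency of its significant words."""
    sentences = [s.strip() for s in re.split(r"[.!?]", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOP_WORDS)
    scores = []
    for s in sentences:
        tokens = [w for w in re.findall(r"[a-z']+", s.lower()) if w not in STOP_WORDS]
        score = sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0
        scores.append((score, s))
    return scores

text = ("Summarization selects important sentences. "
        "Important sentences contain frequent words. "
        "The weather is nice.")
best = max(luhn_scores(text))[1]
```

The first sentence wins here because its significant words ("important", "sentences") are among the most frequent in the text, while the off-topic last sentence scores lowest.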

By now, the whole branch of natural language processing dedicated to summarization emerged, covering a variety of tasks:

· headlines (from around the world);

· outlines (notes for students);

· minutes (of a meeting);

· previews (of movies);

· synopses (soap opera listings);

· reviews (of a book, CD, movie, etc.);

· digests (TV guide);

· biography (resumes, obituaries);

· abridgments (Shakespeare for children);

· bulletins (weather forecasts/stock market reports);

· sound bites (politicians on a current issue);

· histories (chronologies of salient events).

The approaches to text summarization vary depending on the number of input documents (single or multiple), purpose (generic, domain specific, or query-based) and output (extractive or abstractive).

Extractive summarization means identifying important sections of the text and reproducing them verbatim, producing a subset of the sentences from the original text. Abstractive summarization, by contrast, reproduces the important material in a new way: after interpreting and examining the text, it uses advanced natural language techniques to generate a new, shorter text that conveys the most critical information from the original.

Obviously, abstractive summarization is more advanced and closer to human-like interpretation. Though it has more potential (and is generally more interesting for researchers and developers), so far the more traditional extractive methods have yielded better results.

That is why in this blog post we’ll give a short overview of such traditional approaches that have beaten a path to advanced deep learning techniques.

By now, the core of all extractive summarizers consists of three independent tasks:

1) Construction of an intermediate representation of the input text

There are two types of representation-based approaches: topic representation and indicator representation. Topic representation transforms the text into an intermediate representation and interprets the topic(s) discussed in it. The techniques used for this differ in complexity and are divided into frequency-driven approaches, topic-word approaches, latent semantic analysis, and Bayesian topic models. Indicator representation describes every sentence as a list of formal features (indicators) of importance, such as sentence length, position in the document, the presence of certain phrases, etc.
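As a rough illustration of indicator representation, the sketch below turns a sentence into a small feature dictionary. The particular indicators and cue phrases are hypothetical choices made for the example, not a fixed standard set:

```python
import re

def indicator_features(sentence, position, total,
                       cue_phrases=("in conclusion", "in summary")):
    """Represent a sentence as a dictionary of importance indicators."""
    tokens = re.findall(r"\w+", sentence)
    return {
        "length": len(tokens),                           # longer sentences often carry more content
        "position": 1.0 - position / max(total - 1, 1),  # earlier sentences score higher
        "has_cue_phrase": any(p in sentence.lower() for p in cue_phrases),
        "has_numbers": bool(re.search(r"\d", sentence)),
    }

feats = indicator_features("In conclusion, accuracy rose by 12%.", position=4, total=5)
```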

2) Scoring the sentences based on the representation

When the intermediate representation is generated, an importance score is assigned to each sentence. In topic representation approaches, the score of a sentence represents how well the sentence explains some of the most important topics of the text. In indicator representation, the score is computed by aggregating the evidence from different weighted indicators.
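The indicator-based variant can be sketched as a weighted sum of the indicator values. The weights below are made-up values for illustration; in practice they would be tuned or learned:

```python
def score_sentence(features, weights):
    """Aggregate evidence from weighted indicators into one importance score."""
    return sum(weights[name] * float(value) for name, value in features.items())

# Hypothetical weights and a feature dictionary for one sentence.
weights = {"length": 0.05, "position": 1.0, "has_cue_phrase": 2.0, "has_numbers": 0.5}
features = {"length": 6, "position": 0.0, "has_cue_phrase": True, "has_numbers": True}
score = score_sentence(features, weights)  # 0.05*6 + 0.0 + 2.0 + 0.5 ≈ 2.8
```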

3) Selection of a summary consisting of a number of sentences

The summarizer system selects the top k most important sentences to produce a summary. Some approaches use greedy algorithms to select the important sentences; others convert sentence selection into an optimization problem in which a collection of sentences is chosen subject to the constraints of maximizing overall importance and coherence while minimizing redundancy.
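The greedy variant can be sketched as follows. The word-overlap redundancy penalty here is an assumed, simplified stand-in for the coherence and redundancy terms a real system would optimize:

```python
def select_summary(scored_sentences, k, redundancy_penalty=0.6):
    """Greedily pick k sentences, down-weighting those that overlap the summary so far."""
    candidates = list(scored_sentences)  # [(score, sentence), ...]
    chosen, chosen_words = [], set()
    for _ in range(min(k, len(candidates))):
        def adjusted(item):
            score, sent = item
            words = set(sent.lower().split())
            overlap = len(words & chosen_words) / len(words) if words else 0.0
            return score - redundancy_penalty * overlap
        best = max(candidates, key=adjusted)
        candidates.remove(best)
        chosen.append(best[1])
        chosen_words |= set(best[1].lower().split())
    return chosen

scored = [(1.5, "Cats are small animals."),
          (1.4, "Cats are small pets."),
          (1.0, "Dogs bark loudly.")]
summary = select_summary(scored, k=2)
```

Without the penalty the two near-duplicate cat sentences would both be selected; with it, the lower-scoring but non-redundant third sentence makes the cut.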

Let’s have a closer look at the approaches we mentioned and outline the differences between them:

Topic Representation Approaches

Topic words

Frequency-driven approaches

Latent Semantic Analysis

Discourse Based Method

Bayesian Topic Models

Indicator representation approaches

Graph Methods

Machine Learning

Figure 1: Summary Extraction Markov Model to Extract 2 Lead Sentences and Additional Supporting Sentences
Figure 2: Summary Extraction Markov Model to Extract 3 Sentences

Yet, the problem with classifiers is that if we use supervised learning methods for summarization, we need a set of labeled documents to train the classifier, which means developing a corpus. A possible way out is to apply semi-supervised approaches that combine a small amount of labeled data with a large amount of unlabeled data during training.
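One common semi-supervised recipe is self-training: a model fitted on the few labeled sentences repeatedly labels its most confident predictions on unlabeled data and adds them to the training set. The sketch below uses a deliberately trivial word-overlap "classifier" and a made-up confidence threshold, purely to show the loop structure:

```python
def train(examples):
    """Toy 'classifier': remember which words appear in summary-worthy sentences."""
    summary_words = set()
    for sentence, label in examples:
        if label == 1:
            summary_words |= set(sentence.lower().split())
    return summary_words

def confidence(model, sentence):
    """Fraction of the sentence's words seen in summary-worthy training sentences."""
    words = set(sentence.lower().split())
    return len(words & model) / len(words) if words else 0.0

labeled = [("important results are reported", 1), ("the weather was fine", 0)]
unlabeled = ["important results follow", "lunch was fine"]

for _ in range(2):  # a couple of self-training rounds
    model = train(labeled)
    confident = [(s, 1) for s in unlabeled if confidence(model, s) > 0.5]
    labeled += confident                      # promote confident predictions to labels
    unlabeled = [s for s in unlabeled if (s, 1) not in confident]
```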

Overall, machine learning methods have proved to be very effective and successful in both single- and multi-document summarization, especially in class-specific summarization such as producing scientific paper abstracts or biographical summaries.

Though abundant, the summarization methods we have mentioned cannot produce summaries similar to human-created ones. In many cases, the soundness and readability of the resulting summaries are unsatisfactory: they fail to cover all the semantically relevant aspects of the data effectively, and they fail to connect sentences in a natural way.

In our next post, we’ll talk more about ways to overcome these problems and new approaches and techniques that have recently appeared in the field.


Sciforce Blog