A Qualitative Introduction to Automatic Text Summarization

Published in

The AI Herald

8 min readJun 4, 2018

In the era of Information explosion (IE), we are surrounded by tons of information. In the simple case of reading the daily news, it is just impractical to go through tens of pages of daily news articles talking about the same thing. We all have been bugged and frustrated by this experience at some point, right? Like all the other things in the world, Information Explosion is both a boon and a bane.

Graph depicting plot of quantity of data against the year, available on the World Wide Web(picture courtesy: IDC)

Research from IDC (International Data Corporation) shows that the volume of digital data is expected to touch 40,000 Exabytes in 2020(for reference, 1 Exa Byte=1 million TeraBytes!). With so much information available on the internet, it is just not possible to go through every information source in complete detail.

So what can we do about it?

Automatic Summarization is our answer to Information Explosion. Let us see how!

Definition

Before knowing the nits and grits of automatic text summarization, let us understand what exactly is meant by “automatic” summarization? According to Wikipedia:

“ Automatic summarization is the process of shortening a text document with software, in order to create a summary with the major points of the original document”

Pretty lucidly defined! Anyway, here is my version of the definition:

The task of automatic summarization is to reduce the information to be consumed by humans by only extracting the core content of the information, which could give us the overall gist of the complete information.

Document summarization (picture courtesy: sflscientific.com)

Methods of Summarization

There are two methods for doing automatic text summarization:-

Extractive
Abstractive

Extractive summarization works by directly extracting the important sentences as it is from the document by using statistic, linguistic, and/or graphical based approaches.

Abstractive summarization, on the other hand, uses different models to deduce the crux of the document, and then present a summary consisting of words and phrases that were not there in the actual document.

Diagrammatic explanation of Extractive and Abstractive Text Summarization. The red lines are extracted as it is from the text, whereas the blue lines are curated by using advanced NLP Techniques.

Abstractive summarization is closer to how a human would summarize a large document. However, it is not widely used because implementing it is hard. This is because abstractive summarization methods use advanced natural language processing techniques, and therefore cope with problems such as semantic representation, inference and natural language generation which is relatively harder than merely extracting sentences from the document.

Preprocessing

Before we can apply different summarization approaches on a dataset, we need to do certain preprocessing to make the data “usage-ready” for our summarizer. The importance of preprocessing procedure is evident because of its use in almost every developed system related to text processing. Following are some of the widely used preprocessing steps for text-processing:-

Sentence Tokenization, also known as Sentence Boundary Disambiguation, is the process of identifying where the sentences start and end. It is used to treat individual sentences as separate entities and makes further processing of the text relatively easy.

Tokenization helps in treating individual sentences as separate entities

2. Cleaning is done to remove special characters from the text and to replace them with spaces. As a result, it simplifies the text for analysis purposes.

3.Case Conversion is the process of changing all the characters in the document to either lower-case or upper-case for uniformity purposes.

4. Word tokenizer tokenizes words into separate entities within a sentence. This step is especially important if you want to calculate the feature scores of individual words of a sentence for deducing important sentences in a document.

Tokenization of words is a necessary step for feature score calculation. Here s_i represents the ith sentence

5. Stop word removal is the process of removing stop words. (words which do not convey any information, such as “and”, “the”, “it” etc. which are insignificant in feature score calculation). Since they are deemed unnecessary and have no significance on their own, they must be removed to simplify the task of the summarizer.

6. Stemming is the process of changing the linguistically derived word to its base/root word. For example: “walking”, “walks” and “walked” essentially refer to the same root word, “walk”. So a stemming algorithm will reduce all derivatives of the word to “walk”. Converting the derived words into the root words further simplifies the task deducing the crux of the document.

7. POS(Part-of-Speech) Tagging is used to identify part-of-speech of words such as adjective, noun, adverb, verb, etc. although generally computational applications use more fine-grained POS tags like ‘noun-plural’.

Basic Architecture of a Text Summarization System

Approaches To Extractive Summarization

Basically put, there are four main approaches to extractive text summarization:-

Statistical approach
Semantic approach
Graph-Based approach
Fuzzy Logic Based Approach

Statistical approach uses different statistical models to represent information. It does not try to understand the meaning behind the sentences, and chooses the top k sentences for summary (k is the number of sentences appearing in summary) using some statistical measures and gives us the summarized text. The following statistical methods may be used to produce automatic summaries:-

Length of a sentence
Position of a sentence
Presence of keywords in a sentence
Term Frequency/Inverse Document Frequency(TF-IDF)
Number of anaphors in a sentence, and many more…..

Each feature is assigned a particular weight(which is decided by training the summarizer over some dataset) such that the weights decide the contribution of each of the features to the cumulative feature vector. The cumulative feature vector is calculated, and then the sentences are ranked according to its score. The top k sentences using this method become the part of the summary.

Scoring measures like the above are used to decide the relative importance of sentences in statistical methods

Semantic approach, on the other hand, tries to understand the “meaning” of the sentences instead of just analyzing them statistically. It uses the vectorial representation of sentences, i.e., it represents each sentence in the form of a vector in a representation space, placing the words which have similar semantic meaning near to each other in the space. It then compares different sentences and rank them in terms of their importance using similarity measures like the Jaccard and Cosine similarities and then using clustering techniques to rank sentences in the order of importance and use the top k sentences to give the summary. Using Glove model and LSA (Latent Semantic Analysis)for text summarization is an example of this approach.

Vectorial representation of words in the GloVe model. Words with similar meaning are placed close together

Graph-based approach makes a graph from a document using a certain set of rules, and then summarize by using the relation between the nodes. TextRank, which is based on the PageRank algorithm used in the google search engine, is a typical example of graph-based approach. Wordnet is also occasionally used for this approach.

Wordnet represents words in a hierchial manner[graph-based manner](pic courtesy: ResearchGate)

Figure explaining how sentences and similarities are represented in a graph based approach

Fuzzy Logic based approach, as the name suggests, uses fuzzy logic (which is a kind of multi-valued logic) sets and systems to summarize the text. The basis behind using fuzzy logic is that like most of the things in the world, sentences cannot be classified in a binary sense, i.e. whether a sentence should be included in the summary(the binary digit 1) or not(the binary digit 0). The input to the fuzzy system are the features calculated using the statistical methods. Here, instead of giving weights to every feature and calculating the cumulative feature vector, we calculate the truth values(or more formally, the grade of membership),which range anywhere between 0 and 1, and is determined by the membership function of the sentences(this means that a sentence can be partially included in the summary as well!). We then rank the sentences in the ascending order of their truth values and then choose the top k sentences to appear in our automated summary.

Image describing the workflow of a Fuzzy Logic System (pic courtesy: TutorialsPoint)

Evaluation of Automatic Summaries

For evaluating the accuracy of any automated text summary, the industry gold standard is to use ROUGE-N metric.

ROUGE-N metric compares an automatic summary with a reference summary(a.k.a. Human-produced summary) using the n-gram overlap between the two documents. If we use 1-gram to compare the documents, then the metric is called ROUGE-1, for 2-gram it is called ROUGE-2 and so on.

The formula used to evaluate Rouge-N metric

BLEU(Bilingual Evaluation Understudy) metric is another popular technique for evaluation of machine translation(MT), and can also be applied to automatic summarization.The cornerstone of this metric is the familiar precision measure. It is basically calculated on the n-gram co-occurance between the generated summary and the human-produced summary.

A Qualitative Introduction to Automatic Text Summarization

Definition

Methods of Summarization

Preprocessing

Approaches To Extractive Summarization

Evaluation of Automatic Summaries

Further Reading

Published in The AI Herald

Written by Anukarsh Singh