Text Summarization

Gauravi Dungarwal
9 min read · Jan 15, 2022

In today's digital world, producing an accurate summary of the data we need is a challenging natural language processing task. Whenever we open a website, we are presented with links to far more articles than we could read, and most of them may not interest us. So we glance at a short news summary and only then read the story in detail. Summarization is equally necessary when viewing multiple web documents. The method of extracting important information from huge volumes of data without losing the crucial parts is called "text summarization".

Text summarization is a technique that condenses long pieces of text into short summaries of a few sentences containing only the key points of the text.

Types of Text Summarization –

Text summarization methods are classified along the following dimensions:

Fig.1 Types of Text Summarization, Source: https://devopedia.org/text-summarization

Based on input type, it is classified as:

• Single Document — This type of summarization condenses a single text into a shorter version without losing key information.

• Multi Document — This technique condenses a set of documents into a short piece of text by preserving key information and filtering out the unnecessary parts.

Based on purpose, it is classified as:

• Generic — In this technique, the model makes no assumptions about the content or domain of the text to be summarized and treats all inputs alike.

• Domain specific — In this type of summarization, the model uses domain-specific knowledge to produce a more accurate summary.

• Query based — In this type, the summary contains only information that answers natural language questions about the input text.

Based on output type, it is classified as:

• Extractive — This technique extracts only the important sentences or phrases from the text and stacks them together to form a complete summary.

• Abstractive — In this type, the model generates its own sentences and phrases, not necessarily present in the original text, to form a more coherent summary.

Extractive Summarization –

Fig. 2 Extractive Summarization, Source: https://broutonlab.com/blog/summarization-of-medical-texts-machine-learning

An extractive summarizer pulls the crucial and meaningful sentences out of the document and combines them to form the summary. Every line and every word of the summary therefore comes from the source document itself.

Extractive summarization techniques perform some basic tasks:

1. Construct an intermediate representation of the input text (text to be summarized)

2. Score the sentences based on the constructed intermediate representation

3. Select a summary consisting of the top k most important sentences

The first task, intermediate representation, could use further elaboration.

Extractive summarizers first produce an intermediate representation whose main task is to bring out the crucial information in the text to be summarized.

The two families, topic representation and indicator representation, are briefly defined below along with their sub-categories.

1. Topic representations:

This family focuses on representing the topics contained in the text. There are several methods to obtain such a representation; here we discuss two of them. Others include Latent Semantic Analysis and Bayesian models such as latent Dirichlet allocation (LDA).

Frequency driven approaches:

In this method, we assign weights to words: if a word is related to the topic, we assign it 1, otherwise 0. Depending on the implementation, the weights may also be continuous. Two simple approaches for topic representation are:

Word Probability:

It uses the frequency of a word as a sign of the word's significance. The probability of a word w is given by its number of occurrences, f(w), divided by the total number of words N in the input: P(w) = f(w) / N.

To derive sentence importance from word probabilities, the importance of a sentence is taken as the average probability of the words it contains.
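As a minimal sketch of this scoring scheme (plain Python with hypothetical function names; the article prescribes no particular implementation):

```python
from collections import Counter

def word_probabilities(words):
    # P(w) = f(w) / N: occurrences of each word divided by total word count
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def sentence_importance(sentence_words, probs):
    # Importance of a sentence = average probability of the words it contains
    if not sentence_words:
        return 0.0
    return sum(probs.get(w, 0.0) for w in sentence_words) / len(sentence_words)
```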

TFIDF (Term Frequency Inverse Document Frequency): This technique is an advancement over the word probability method and is likewise used to assign weights. TFIDF assigns low weights to words that occur very frequently across most documents, on the intuition that they are stop words or filler words like "the". Conversely, if a word appears in a document uniquely and with high frequency, the term frequency component gives it a high weight.
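As an illustration, scikit-learn's TfidfVectorizer is one common way to compute these weights (the library choice and the toy sentences are mine, not the article's):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "The cat sat on the mat.",
    "The dog chased the cat across the yard.",
    "Stock prices rose sharply after the earnings report.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences)  # one row of TF-IDF weights per sentence

# A simple sentence score: the sum of the TF-IDF weights of its words
print(tfidf.sum(axis=1))
```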

Topic word Approaches: This approach is comparable to Luhn's approach. The topic word technique is one of the simplest topic representation methods; it aims to find words that describe the subject of the input document. It calculates the word frequencies and uses a frequency threshold to find the words that most likely describe the topic, then scores the importance of a sentence as a function of the number of topic words it contains, as in the sketch below.
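A minimal sketch of the topic word idea (the threshold value and the function names are illustrative assumptions):

```python
from collections import Counter

def topic_words(words, threshold=2):
    # Treat words whose frequency reaches the threshold as topic words
    counts = Counter(words)
    return {w for w, c in counts.items() if c >= threshold}

def sentence_score(sentence_words, topics):
    # Importance of a sentence = number of topic words it contains
    return sum(1 for w in sentence_words if w in topics)
```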

2. Indicator representation:

This kind of representation relies on features of the sentences themselves and ranks sentences on the basis of those features. Here the importance of a sentence does not depend on the words it contains, as in topic representations, but directly on particular features of the sentence.

A system that uses a set of features to represent and rank the text can implement either of two overarching indicator representation approaches: graph methods and machine learning approaches.

Graph-Based Methods:

These methods are based on the PageRank algorithm. A text document is represented as a connected graph: the sentences are the nodes, and the edges between nodes indicate the similarity between the two sentences. A sketch follows.
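A TextRank-style sketch, assuming networkx for PageRank and scikit-learn for sentence similarity (these library choices are mine; the article names only the general method):

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(sentences, k=2):
    # Nodes are sentences; edge weights are pairwise cosine similarity
    tfidf = TfidfVectorizer().fit_transform(sentences)
    graph = nx.from_numpy_array(cosine_similarity(tfidf))
    # PageRank scores each sentence by its centrality in the similarity graph
    ranks = nx.pagerank(graph)
    top = sorted(ranks, key=ranks.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # keep original document order
```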

Machine-Learning Methods:

Machine learning methods approach summarization as a classification problem: a model classifies sentences, based on their features, into summary or non-summary sentences. For training, a set of documents together with their corresponding human-written reference extractive summaries is needed. Naive Bayes, decision trees, and support vector machines are commonly used here; see the sketch below.
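A sketch of this classification view using a Naive Bayes model; the three features and the tiny labeled set are invented purely for illustration:

```python
from sklearn.naive_bayes import GaussianNB

# Hypothetical per-sentence features: [position in document, length, topic-word count]
X_train = [[0, 12, 3], [5, 8, 0], [1, 15, 4], [7, 6, 1]]
y_train = [1, 0, 1, 0]  # 1 = summary sentence, 0 = non-summary (human labels)

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict([[2, 10, 2]]))  # classify an unseen sentence's features
```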

Scoring and Sentence Selection:

Once we have the intermediate representations, we assign a score to every sentence to quantify its importance. For topic representations, a sentence's score depends on the topic words it contains; for an indicator representation, it depends directly on the sentence's features. Finally, the top-scoring sentences are picked to generate the summary, as in the sketch below.
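The final selection step might look like this minimal sketch (any of the scoring functions above could supply the scores):

```python
def select_summary(sentences, scores, k=3):
    # Pick the k highest-scoring sentences, then restore document order
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(top))
```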

Abstractive Summarization –

Fig. 3 Abstractive Summarization, Source: https://broutonlab.com/blog/summarization-of-medical-texts-machine-learning

Abstractive summarization is classified as:

1. Structure-based abstractive summarization:

This includes different methods such as the tree-based method, rule-based method, ontology-based method, graph-based method, lead-and-body-phrase method, etc.

Graphs are used to show the bigram relationships between the words in the text. The graph helps reduce words that occur more than once in the text by mapping them onto the same vertex.

The score of a path in the graph is based on the redundancy of the overlapping sentences, where redundancy depends on the difference between the positions of words in the sentences.

Fusing sentiments:

A vertex in the graph is considered a point for fusing sentences if it is a verb. Sentence fusion is needed to calculate the sentiment of the sentences. Once the score for every path has been calculated and the sentences have been fused, we can rank the sentences in decreasing order of score. After that, duplicate sentences can be removed from the summary using a distance measure such as the Jaccard index; a sketch follows. Finally, the topmost sentences are selected for the summary.
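A minimal sketch of Jaccard-based duplicate removal (the 0.7 threshold is an illustrative assumption):

```python
def jaccard(a, b):
    # Jaccard index: |intersection| / |union| of the two sentences' word sets
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def drop_duplicates(sentences, threshold=0.7):
    kept = []
    for s in sentences:
        # Keep a sentence only if it is not too similar to any already kept
        if all(jaccard(s, t) < threshold for t in kept):
            kept.append(s)
    return kept
```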

2. Semantic-based abstractive summarization:

It includes several methods such as the information item-based method, semantic graph-based method, and semantic text representation model. The steps involved are as follows:

Text preprocessing:

In this step the text is cleaned through operations such as the following (a consolidated code sketch appears after the list):

- Lower casing:

The idea is to convert all input text into the same case, which helps in counting duplicate words correctly.

- Removal of punctuation marks:

This removes punctuation marks from the text, which helps treat 'awesome' and 'awesome!' in the same way.

"string.punctuation" in Python contains the following characters:

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

- Removal of stop words:

Words in natural language like "the", "a", "an", and "is" are called stop words. They carry little valuable information and so can be removed.

In Python, the NLTK package provides a list of all English stop words.

- Removal of frequent words:

Here, words that occur very frequently in the text are removed.

e.g., “I understand. I would like to assist you….” will be converted to “understand would like assist….”.

- Removal of rare words

Removes rare words from the text.

- Stemming

It is the process of reducing a word to its root or base form.

e.g., the words "walks" and "walking" are both stemmed to "walk".

Python provides a well-known stemmer, the Porter stemmer, in the NLTK package.

- Lemmatization

It is similar to stemming, but it additionally makes sure that the root word actually exists in the language.

Python provides WordNetLemmatizer (in the NLTK package) for this.

- Removal of emojis

On social media, emojis are often used in place of words. While processing the text, these are either removed or converted into their textual descriptions.

- Removal of URLs

- Removal of HTML tags

- Spelling correction: Python packages such as pyspellchecker provide this.
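Putting the steps above together, a minimal preprocessing sketch with NLTK (which the article mentions) might look like this; the exact regexes and the choice of stemming over lemmatization are my assumptions:

```python
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time download of the English stop word list

def preprocess(text):
    text = text.lower()  # lower casing
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # removal of URLs
    text = re.sub(r"<.*?>", "", text)  # removal of HTML tags
    # Removal of punctuation marks
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Removal of stop words
    stops = set(stopwords.words("english"))
    words = [w for w in text.split() if w not in stops]
    # Stemming (WordNetLemmatizer could be substituted here for lemmatization)
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in words]

print(preprocess("I would like to assist you! See <b>this</b>: https://example.com"))
```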

Text Summarization Models Used in Industry -

1. BERT (Bidirectional Encoder Representations from Transformers)

BERT is a pre-trained NLP model developed by Google. It uses the Transformer, a neural network architecture based on a self-attention mechanism, originally designed to tackle sequence-to-sequence problems such as neural machine translation. That makes it well suited to any task that converts an input sequence to an output sequence, such as speech recognition and text-to-speech.

In its vanilla form, the Transformer combines two distinct mechanisms: an encoder, which reads the text input, and a decoder, which produces a prediction for the task. Since the goal of BERT is to produce a language model, only the encoder mechanism is required.

BERT has been shown to perform well on 11 NLP tasks. It was trained on 2,500 million words from Wikipedia and 800 million words from the BooksCorpus dataset. Google Search is one of the best-known examples of BERT in production, and other Google applications, such as Gmail Smart Compose, use BERT to predict text.
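As an illustration of how such pre-trained Transformer models are used in practice, Hugging Face's transformers library exposes a one-line summarization pipeline. Note that its default checkpoint is a distilled BART model (an encoder-decoder cousin of BERT), not BERT itself:

```python
from transformers import pipeline

# The default summarization checkpoint is a BART-style encoder-decoder model,
# a close relative of BERT rather than BERT itself.
summarizer = pipeline("summarization")

article = (
    "Text summarization condenses long documents into short summaries. "
    "Extractive methods select existing sentences, while abstractive "
    "methods generate new sentences that capture the key points."
)
print(summarizer(article, max_length=40, min_length=10, do_sample=False))
```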

2. OpenAI’s GPT-2

GPT-2 is a transformer-based NLP model that can translate, answer questions, compose poems, complete cloze tasks, and perform tasks that require on-the-fly reasoning, such as unscrambling words. With its more recent developments, GPT-2 has also been used for news writing and coding.

GPT-2 can handle long-range dependencies between different tokens. It was trained with 1.5 billion parameters on roughly 40 GB of text scraped from across the internet, making it one of the largest pre-trained NLP models available at its release.

What distinguishes GPT-2 from many other language models is that it does not need to be fine-tuned to perform downstream tasks. With a simple 'text-in, text-out' interface, developers can repurpose the model using textual prompts and instructions.
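For example, GPT-2 can be nudged toward summarization without fine-tuning by appending "TL;DR:" to the input, a prompting trick reported in the GPT-2 paper. Here is a sketch using the Hugging Face transformers library (my choice of tooling, not named by the article):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

text = (
    "Researchers have developed techniques that condense long articles into "
    "short summaries, saving readers time and effort."
)
# "TL;DR:" often precedes summaries in web text, so it prompts summarization
out = generator(text + "\nTL;DR:", max_new_tokens=40, num_return_sequences=1)
print(out[0]["generated_text"])
```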

Conclusion

The advantages of automatic text summarization extend beyond resolving immediate issues. Among them are the following:

Saves Time and Effort:

Text summarization helps content editors save time and effort by generating summaries automatically, effort that would otherwise be spent manually creating article summaries.

Instant Reaction:

It decreases the amount of work a user needs to do to find relevant information. With automatic text summarization, an article can be summarized in a matter of seconds, reducing reading time.

Increases Productivity Level:

Text summarization allows the user to examine the contents of a text for accurate, short, and exact information, which increases productivity. The tool relieves the user of work by reducing the size of the text, freeing their energy for more important tasks.

Ensures that all critical information is included:

Unlike the human eye, automatic software does not miss important subtleties. Every reader wants to extract the information that is most useful to them from a piece of text, and automatic text summarization makes it simple to acquire all of the important information in a document.

References:

· https://broutonlab.com/blog/summarization-of-medical-texts-machine-learning

· https://www.machinelearningplus.com/nlp/text-summarization-approaches-nlp-example/

· https://www.impelsys.com/an-overview-of-text-summarization-in-natural-language-processing/

· https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/

· https://openai.com/blog/better-language-models/

· https://arxiv.org/abs/1810.04805
