TLDR !! Summarise articles and content with NLP

Sachin Patel
Practical Data Science and Engineering
6 min read · Oct 15, 2020

“Our intelligence is what makes us human, and AI is an extension of that quality.” –Yann LeCun

Intro

We are living in the internet age. We create 2.5 quintillion bytes of data every day

(90% of the data in the world today was created in the last two years). This data comes from everywhere: sensors used to gather shopper information, posts on social media sites, digital pictures and videos, purchase transactions, and cell phone GPS signals, to name a few.

We generate a lot of text data daily, e.g. around 277,000 tweets are posted every minute and 600 new wiki pages are created every minute.

So, understandably, most people would rather read a short summary of the latest trend, news story or article.

It’s impossible for a human to read thousands of wiki or news articles and come up with a couple of lines of summary that capture the gist of the event.

This is where automatic summarization comes into the picture.

What is Automatic Text Summarisation?

Automatic summarization is the process of shortening a text document with software, to create a summary with the major points of the original document. Technologies that can make a coherent summary take into account variables such as length, writing style and syntax.

Why Automatic text summarization?

  • Reduce reading time.
  • Remove human bias factor.
  • Improve the effectiveness of indexing (making it easier for search engines like Google to find documents).

Some use cases of Automatic summarization

  • News-Letters
    Many weekly newsletters take the form of an introduction followed by a curated selection of relevant articles. Summarization would allow organizations to further enrich newsletters with a stream of summaries (versus a list of links).
  • Internal Knowledge flow
    Large companies are constantly producing internal knowledge, which frequently gets stored and under-used in databases as unstructured data. These companies should embrace tools that let them re-use already existing knowledge. Summarization can enable analysts to quickly understand everything the company has already done in a given subject, and quickly assemble reports that incorporate different points of view.
  • Books and literature
    Google has reportedly worked on projects that attempt to understand novels. Summarization can help consumers quickly understand what a book is about as part of their buying process.
  • Science and R&D
    Academic papers typically include a human-made abstract that acts as a summary. However, when you are tasked with monitoring trends and innovation in a given sector, it can become overwhelming to read every abstract. Systems that can group papers and further compress abstracts can become useful for this task.

So how can we make a machine summarize for humans?

We humans are great at summarizing: we read and understand the context of a text, then distil the key facts into a summary. Using technologies like machine learning, we can teach a computer to summarise text in a similar way.

Types of text summarization

Extractive

In extractive summarization, the machine extracts a subset of phrases and sentences that best describe the article or text.

This is the more mature approach to summarization; it was first introduced in 1958.

Abstractive

The abstractive approach is to understand the context of a text in depth and write a summary using words and sentences different from the original. Humans are really good at this, and it was long very difficult to teach a machine to do the same. Only recently have researchers shown it is possible, thanks to cutting-edge deep learning, a sub-branch of machine learning.

Extractive — Deeper explanation on how this is done

Extractive summarisation picks out the main points of the text with little or no modification, rearranging them in an order that makes grammatical sense as a summary.

We can think of this method as a smart, automated highlighting pen: it highlights important sections or sentences of an article, applies numerical weights to them, and then rearranges and stitches them together based on their relevance and the context of the article.

The problem with this technique is that the output is not always grammatically accurate, and it sometimes loses references to things described in an earlier part of the article. Because of these problems, the summary can read as ambiguous.

Extractive — Various methods

Extractive summarization is formed of three independent tasks:

Constructing an intermediate representation of input sentences or text.

  • Topic word approach

In this approach, the aim is to identify the words that describe the topic of the input text.

There are two ways to compute the importance of a sentence: as a function of the number of topic signatures it contains, or as the proportion of the topic signatures in the sentence. While the first method gives higher scores to longer sentences with more words, the second one measures the density of the topic words.
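The two scoring schemes can be sketched in a few lines of Python. The topic-signature set below is hand-picked for illustration; in practice it would be derived statistically from a corpus.

```python
# Two ways to score a sentence by its topic signatures: raw count vs density.

def score_by_count(sentence, topic_words):
    """Number of topic signatures in the sentence (favours longer sentences)."""
    return sum(1 for w in sentence.lower().split() if w in topic_words)

def score_by_density(sentence, topic_words):
    """Proportion of topic signatures in the sentence (measures density)."""
    words = sentence.lower().split()
    return sum(1 for w in words if w in topic_words) / len(words)

topic_words = {"summarization", "extractive", "sentence"}  # illustrative set
s = "extractive summarization selects one sentence at a time"
count_score = score_by_count(s, topic_words)      # 3 topic words present
density_score = score_by_density(s, topic_words)  # 3 of 8 words
```

A long sentence padded with filler words keeps its count score but loses density, which is exactly the trade-off described above.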

  • Frequency driven approaches

This approach uses the frequency of words as an indicator of their importance, as in TF-IDF (Term Frequency–Inverse Document Frequency).

A word’s importance increases proportionally with the number of times it appears in the document, but is offset by how common the word is across the whole corpus.
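A self-contained TF-IDF scoring sketch, under the simplifying assumption that each sentence of the article is treated as its own "document":

```python
import math
from collections import Counter

sentences = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "summarization extracts key sentences",
]
tokenized = [s.split() for s in sentences]
n = len(tokenized)

# idf(w) = log(N / number of sentences containing w): rarer words weigh more
df = Counter(w for sent in tokenized for w in set(sent))
idf = {w: math.log(n / df[w]) for w in df}

def sentence_score(tokens):
    """Average TF-IDF weight over the unique words of a sentence."""
    tf = Counter(tokens)
    return sum((tf[w] / len(tokens)) * idf[w] for w in tf) / len(tf)

scores = [sentence_score(t) for t in tokenized]
best = sentences[max(range(n), key=scores.__getitem__)]
```

Common words like "the" get an idf near zero, so the sentence made of distinctive words wins.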

  • Latent semantic analysis (LSA)

It is an unsupervised technique used to generate a matrix from the words present in the text or article. The rows of the matrix represent the unique words in the text, and the columns represent the paragraphs; a singular value decomposition of this matrix then uncovers the latent topics that link words and paragraphs.
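A minimal LSA sketch with NumPy (assumption: short sentences stand in for the paragraphs described above):

```python
import numpy as np

# Build a term-sentence count matrix: rows = unique words, columns = sentences.
sentences = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "stocks fell on wall street",
]
vocab = sorted({w for s in sentences for w in s.split()})
A = np.zeros((len(vocab), len(sentences)))
for j, s in enumerate(sentences):
    for w in s.split():
        A[vocab.index(w), j] += 1

# SVD factorises the matrix into latent "topics"; the rows of Vt show how
# strongly each sentence expresses each latent topic.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
top_sentence_per_topic = np.abs(Vt).argmax(axis=1)
```

An LSA-based summarizer would keep, for each of the strongest latent topics, the sentence that expresses it most.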

  • Bayesian Topic Models

In topic modelling of text documents, the goal is to infer the words related to a certain topic and the topics discussed in a certain document, based on the prior analysis of a corpus of documents.

Scoring the sentences based on the constructed intermediate representation

  • Graph method

One of the most popular algorithms is TextRank. It is based on Google’s PageRank algorithm, which is heavily used in the Google search engine to rank web pages. TextRank creates a graph where sentences are vertices, connects every pair of sentences with an edge, weights each edge by how similar the two sentences are, and then runs the PageRank algorithm to find the highest-scoring sentences (vertices).
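The graph construction and ranking above can be sketched compactly: sentences are vertices, edges are weighted by word overlap, and a power-iteration PageRank ranks the vertices. The similarity here is the length-normalised overlap from the original TextRank formulation, and d = 0.85 is PageRank's usual damping factor.

```python
import math

def similarity(a, b):
    """Word overlap between two sentences, normalised by their lengths."""
    wa, wb = set(a.split()), set(b.split())
    overlap = len(wa & wb)
    if overlap == 0:
        return 0.0
    return overlap / (math.log(len(wa) + 1) + math.log(len(wb) + 1))

def textrank(sentences, d=0.85, iterations=50):
    n = len(sentences)
    # weighted adjacency matrix (no self-loops)
    W = [[similarity(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    scores = [1.0] * n
    for _ in range(iterations):  # power iteration of weighted PageRank
        scores = [(1 - d) + d * sum(W[j][i] / sum(W[j]) * scores[j]
                                    for j in range(n)
                                    if W[j][i] > 0 and sum(W[j]) > 0)
                  for i in range(n)]
    return scores

sentences = [
    "the cat sat on the mat",
    "the cat is on the mat",
    "stocks fell sharply today",
]
scores = textrank(sentences)  # the isolated third sentence scores lowest
```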

  • Machine learning approach

In this approach, we treat text summarization as a text classification problem and use techniques like Naive Bayes or decision trees to extract the summary. This is done by labelling a small amount of data (sentences) and then predicting whether each sentence or phrase is part of the summary or not. It works well for fixed-format summaries like scientific papers, rental agreements, etc.
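A toy sketch of summarization-as-classification with scikit-learn's Naive Bayes; the training sentences and their "in summary" labels below are invented purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Label a handful of sentences: 1 = belongs in the summary, 0 = does not.
train_sentences = [
    "revenue grew 20 percent year over year",
    "the meeting started at nine",
    "net profit doubled in the last quarter",
    "coffee was served in the lobby",
]
labels = [1, 0, 1, 0]

vec = CountVectorizer()
X = vec.fit_transform(train_sentences)
clf = MultinomialNB().fit(X, labels)

# Predict whether unseen sentences should be part of the summary.
new = ["quarterly profit rose sharply", "lunch was at noon"]
pred = clf.predict(vec.transform(new))
```

A real system would use far more labelled sentences and richer features (position, length, cue phrases) than raw word counts.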

Selecting a summary consisting of the top k most important sentences.

This step uses a greedy algorithm to select the most important sentences based on the scores computed earlier, and stitches them together in a meaningful order. In this step we can also fix a few things after stitching the sentences, such as checking grammar and fixing coreference.
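The selection step can be sketched as: take the k highest-scoring sentences, then restore their original document order so the summary reads naturally.

```python
def select_top_k(sentences, scores, k):
    """Pick the k best-scoring sentences, stitched back in document order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(ranked))

sentences = ["First point.", "Filler sentence.", "Second point.", "More filler."]
scores = [0.9, 0.1, 0.8, 0.2]  # scores from the previous step
summary = select_top_k(sentences, scores, k=2)  # "First point. Second point."
```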

Some popular algorithms & implementations

Text Summarization in Gensim

Gensim’s automatic text summarizer uses TextRank, an unsupervised algorithm based on weighted graphs. It is built on top of the popular PageRank algorithm that Google used for ranking web pages.

PyTeaser

PyTeaser is a Python implementation of the Scala project TextTeaser, a heuristic approach to extractive text summarization. TextTeaser associates a score with every sentence; this score is a linear combination of features extracted from that sentence.

PyTextRank

PyTextRank is a Python implementation of the original TextRank algorithm with a few enhancements: it uses lemmatization instead of stemming, incorporates part-of-speech tagging and named entity resolution, and extracts meaningful key phrases from the article alongside the summary sentences based on them.

LexRank

LexRank is an unsupervised graph-based approach similar to TextRank. LexRank uses IDF-modified Cosine as the similarity measure between two sentences. This similarity is used as the weight of the graph edge between two sentences.
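The idf-modified cosine from the LexRank paper (Erkan & Radev, 2004) can be sketched directly: word co-occurrence between two sentences, weighted by idf squared and normalised by the idf-weighted length of each sentence. The idf table below is made up for illustration.

```python
import math
from collections import Counter

def idf_modified_cosine(x, y, idf):
    """LexRank-style similarity between sentences x and y, given an idf table."""
    tx, ty = Counter(x.split()), Counter(y.split())
    num = sum(tx[w] * ty[w] * idf.get(w, 0.0) ** 2
              for w in tx.keys() & ty.keys())
    norm_x = math.sqrt(sum((tx[w] * idf.get(w, 0.0)) ** 2 for w in tx))
    norm_y = math.sqrt(sum((ty[w] * idf.get(w, 0.0)) ** 2 for w in ty))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return num / (norm_x * norm_y)

idf = {"cat": 2.0, "the": 0.1, "sat": 2.0, "ran": 2.0}  # illustrative values
sim = idf_modified_cosine("the cat sat", "the cat ran", idf)  # between 0 and 1
```

Because "the" has a tiny idf, sharing it barely increases similarity, while sharing a rare word like "cat" does.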

Evaluation of the quality

We can use a technique like ROUGE (Recall-Oriented Understudy for Gisting Evaluation). It is a set of metrics for evaluating automatic summarization, and it works by comparing an automatically produced summary against a set of reference summaries (human-produced).
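The simplest ROUGE variant, ROUGE-1 recall, can be sketched as the fraction of the reference summary's unigrams that also appear in the system summary:

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """Fraction of reference unigrams (with multiplicity) found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / sum(ref.values())

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
score = rouge_1_recall(candidate, reference)  # 5 of 6 reference words matched
```

Full ROUGE toolkits also report precision, F-score, and n-gram and longest-common-subsequence variants (ROUGE-2, ROUGE-L).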

Conclusion

Extractive summarization is easy to implement and was pretty much the gold standard of automated text summarisation until a few years ago. It is an older technique compared to modern deep learning-based approaches, but it is simple to build, and in some cases the quality of the summaries it generates outperforms that of other techniques.
