How can you summarize a document even without reading it?

Kajol Shah
7 min read · Mar 23, 2021


What do you do when you have too many documents to read but not enough time? How do you decide which ones are worth reading? How can you know what each document contains without reading it?


Well, there are two possible solutions. One is the traditional method: read every document, and likely waste a lot of time on insignificant content. The better solution is to use a text summarizer.

Text Summarizer — A tool which generates a condensed version of the original document

But how do you create a Text Summarizer?

In this blog, we will see how to develop a query-oriented extractive text summarizer using unsupervised deep learning techniques. Query-oriented means the summary is generated according to the query/topic the user enters, reflecting their interests. Python is a good fit for this, as it provides an extensive set of libraries such as Pdfminer, NLTK, and NumPy whose functions cover most of the steps we need.


Extractive text summarization — In this method, sentences are picked directly from the document, based on a set of features, to form a brief summary

Unsupervised learning — learning in which a machine looks for patterns in a dataset that has no labels or outcomes

Deep learning — learning that uses multiple processing layers to extract features from data

Steps to create a Text Summarizer

A] Input — In the first step, the user provides the single PDF or text document they want summarized. Read the contents of this file and store them in a variable.
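Reading the input might look like this. The helper name `load_document` is my own; for PDFs, pdfminer.six's high-level `extract_text()` function can replace the plain-text read:

```python
from pathlib import Path

def load_document(path):
    """Read the input document into a single string.

    Plain-text files are read directly; for a PDF you could instead
    call pdfminer.six's high-level extract_text(path).
    """
    return Path(path).read_text(encoding="utf-8")
```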

B] Pre-processing — In this step, unimportant words are removed in order to structure the document and reduce its density, which makes further processing easier. The four techniques used are:-

  1. Stop-word removal — Stop-words, i.e. words that are not important and carry no meaning on their own, are filtered out in this step. Articles, prepositions, etc. are usually treated as stop-words; here, we consider words such as a, an, the, is, are, on, etc. The NLTK corpus provides a ready-made set of stop-words that you can use to remove them from your text.
  2. Part-of-Speech Tagging — Categorizing the words of a text by part of speech (noun, adverb, verb, adjective, etc.) is called Part-of-Speech Tagging. In this step, each word is marked with its corresponding part of speech.
  3. Stemming — All words, except those that are proper nouns, are reduced to their base or root form. For example, ‘boys’ is reduced to ‘boy’ and ‘swimming’ to ‘swim’. Note that the stemmed word will not always be a valid dictionary form of the word.
  4. Punctuation removal — Punctuation marks such as , . : ; ? / are removed in this step, making the document light-weight and further simplifying the summary generation process.

The NLTK library provides functions which can be used to execute the above pre-processing techniques.
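As a rough, standard-library-only sketch of this phase (NLTK's `stopwords.words('english')`, `pos_tag()` and `PorterStemmer` do each step properly; the tiny stop-word set and the trailing-'s' stemmer here are toy stand-ins):

```python
import string

# illustrative subset; NLTK's stop-word list is far larger
STOP_WORDS = {"a", "an", "the", "is", "are", "on", "in", "of", "and"}

def preprocess(sentence):
    """Toy pre-processing pass: lowercase, strip punctuation,
    drop stop-words, and crudely stem plural nouns.
    POS tagging (NLTK's pos_tag) is omitted in this sketch."""
    words = sentence.lower().translate(
        str.maketrans("", "", string.punctuation)).split()
    words = [w for w in words if w not in STOP_WORDS]
    # naive stemming stand-in: strip a trailing 's' ('boys' -> 'boy')
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w for w in words]
```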

C] Feature Vector extraction — The document, made light-weight in the pre-processing phase, is now structured into a matrix. A sentence matrix ‘M’ of order n × v contains the features for every sentence of the document, where ‘n’ is the number of sentences and ‘v’ is the number of features. The five features we will consider are:-

  1. Title Similarity — A sentence is considered important for the summary if it is similar to the title of the document. Title similarity is calculated as a sentence score: the ratio of the number of words the sentence has in common with the title to the total number of words in the document. A sentence scores well on this feature when it shares the maximum number of words with the title.

2. Sentence Position — The position of a sentence can determine its relevance for the summary. Usually, sentences that appear at the beginning and end of the text are more important, so the sentence score is based on this. The positional score in our case is calculated by considering the following conditions:-

Here, we have considered the first and last 20% of the sentences of the text as important and set their p2 value to 1.

3. Term Weight — Term weight combines the term frequency with its importance. The term frequency (tf) is the number of times the term occurs in the whole document, which reflects its importance; the inverse sentence frequency (isf) indicates whether the term is common or rare across the sentences of the document.

4. Sentence Length — The length of a sentence affects its importance for summarization. Sentences that are too short carry little information about the document, while sentences that are too long tend to contain unnecessary detail that is not useful for the summary.

5. Proper Noun Score — Proper nouns play an important role in summary generation: they tell us to whom or to what the author is referring, such as the individuals involved or the locations described. Here, the number of words in the sentence that are proper nouns is counted.
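The five features above can be sketched as small scoring functions. This is a rough stand-in rather than an exact implementation: the title and query normalisation and the 20% positional rule follow the description above, while the tf·isf formula, the length peak of 20 words, and the capitalisation test for proper nouns (NLTK's `pos_tag()` would tag them 'NNP') are illustrative assumptions.

```python
import math

def title_similarity(sentence_words, title_words, total_doc_words):
    """Words shared with the title over the document's total word count."""
    return len(set(sentence_words) & set(title_words)) / total_doc_words

def position_score(index, n_sentences, edge_fraction=0.2):
    """1.0 for sentences in the first or last 20% of the text
    (the p2 value above), 0.0 otherwise."""
    edge = max(1, int(edge_fraction * n_sentences))
    return 1.0 if index < edge or index >= n_sentences - edge else 0.0

def tf_isf(term, sentence_words, all_sentences):
    """tf * isf for one term: frequency in the sentence times
    log(N / number of sentences containing the term)."""
    sf = sum(1 for s in all_sentences if term in s)
    if sf == 0:
        return 0.0
    return sentence_words.count(term) * math.log(len(all_sentences) / sf)

def length_score(sentence_words, ideal=20):
    """Peaks at `ideal` words and decays for much shorter or longer
    sentences; the ideal of 20 is an illustrative assumption."""
    return max(0.0, 1.0 - abs(len(sentence_words) - ideal) / ideal)

def proper_noun_score(sentence_words):
    """Fraction of words that look like proper nouns, using
    capitalisation after the first word as a rough stand-in for
    NLTK's 'NNP' tag."""
    if not sentence_words:
        return 0.0
    proper = sum(1 for w in sentence_words[1:] if w[:1].isupper())
    return proper / len(sentence_words)
```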

D] Feature Matrix Generation — The feature values calculated above are then stored in matrix form, where the columns represent the features and the rows represent the sentences.
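Assembling the matrix is then a straightforward stacking job with NumPy; the two toy feature functions in this example are placeholders for the five real ones:

```python
import numpy as np

def feature_matrix(sentences, feature_fns):
    """Build the n x v matrix described above: one row per sentence,
    one column per feature function."""
    return np.array([[f(s) for f in feature_fns] for s in sentences])

# toy example with two placeholder features:
# word count and number of capitalised words
sentences = [["The", "cat", "sat"], ["Alice", "met", "Bob"]]
features = [len, lambda s: sum(w[:1].isupper() for w in s)]
M = feature_matrix(sentences, features)  # shape (2, 2)
```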

Feature matrix generated for sample text

E] Algorithm for Deep Learning — Here, we use a Restricted Boltzmann Machine (RBM) for deep learning. The sentence matrix containing the set of feature vectors is fed to the RBM as its visible layer.
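A minimal NumPy sketch of an RBM trained with one-step contrastive divergence (CD-1) might look like this; the hidden-layer size, learning rate, and epoch count are illustrative assumptions, not tuned values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Minimal Restricted Boltzmann Machine trained with CD-1.
    The visible layer receives the n x v feature matrix."""

    def __init__(self, n_visible, n_hidden, lr=0.1):
        self.W = rng.normal(0, 0.1, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr = lr

    def train(self, V, epochs=50):
        for _ in range(epochs):
            # positive phase: hidden activations from the data
            h0 = sigmoid(V @ self.W + self.b_h)
            # negative phase: reconstruct visible, then hidden again
            v1 = sigmoid(h0 @ self.W.T + self.b_v)
            h1 = sigmoid(v1 @ self.W + self.b_h)
            # CD-1 parameter updates
            self.W += self.lr * (V.T @ h0 - v1.T @ h1) / len(V)
            self.b_v += self.lr * (V - v1).mean(axis=0)
            self.b_h += self.lr * (h0 - h1).mean(axis=0)

    def enhance(self, V):
        """Return the reconstructed ('enhanced') feature matrix."""
        return sigmoid(sigmoid(V @ self.W + self.b_h) @ self.W.T + self.b_v)
```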

F] Enhanced Feature Matrix — An enhanced feature matrix is obtained from the deep learning phase and used in the subsequent summary generation phase.

G] Generation of Summary — A summary can be generated in two ways: as a generalized summary of the whole document, or based on a query entered by the user.

  1. Sentence score — This step is performed only for query-based summary generation. Here, the ratio of the number of words common to the sentence and the user's query to the total number of words in the document is calculated.

2. Sentence ranking — The number of sentences to include in the summary is calculated as

N = 0.3 × total number of sentences in the text

Here, we take 30% of the sentences in the document for the summary. You can change this value if you want a shorter or longer summary.

Sentence ranking for the generalized summary: the feature values of each sentence in the enhanced matrix are summed, the sentences are sorted in descending order of this feature sum, and the top N sentences are selected.

Sentence ranking for the query-based summary: the sentences are sorted in descending order of their sentence score, and the top N sentences are selected.
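Both ranking variants reduce to the same select-top-N-then-restore-order step, which could be sketched as:

```python
def top_n_sentences(sentences, scores, ratio=0.3):
    """Rank sentences by score, keep the top N = ratio * total,
    then restore original document order for the final summary."""
    n = max(1, int(ratio * len(sentences)))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i],
                    reverse=True)
    chosen = sorted(ranked[:n])  # back to document order
    return [sentences[i] for i in chosen]
```

For the generalized summary, `scores` would be the row sums of the enhanced feature matrix; for the query-based summary, the sentence scores from the previous step.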

3. Summary generation — The N selected sentences are rearranged according to their position in the original text and then displayed to the user. In this way, the extractive summary is generated.

A key advantage of this unsupervised approach over summarization methods that use supervised learning is that no labelled training data-set is required, so the time and cost of building and training on such a data-set are saved.

You can also develop a GUI for this tool using Python's Tkinter package. Below is a basic sample GUI I developed with Tkinter.

Text Summarizer GUI

Future Applications:- This method can be improved further to generate summaries of large books, and features can be added to produce query-based summaries of medical reports. Researchers can use this tool to summarize papers, and anyone can use it to get the gist of a book or article.

Conclusion

In this blog, we have seen how to reduce the burden of reading and analyzing piles of documents by creating a smart tool called a Text Summarizer. We have also seen a few of its applications; let me know your ideas on this as well.

References

M. Yousefi-Azar, "Text summarization using unsupervised deep learning", ScienceDirect, Vol. 68, 2017, pp. 93–105

G. PadmaPriya and K. Duraiswamy, "An approach for text summarization using deep learning algorithm", Journal of Computer Science, Vol. 10, 2014, pp. 1–9



Kajol Shah

An AI/ML engineer pursuing a Master's in Artificial Intelligence at Illinois Institute of Technology in Chicago