Automatic Text Summarization made simpler using Python

Published in ACES PVGCOET 1 Hour Blog Series · 8 min read · Apr 7, 2021

What is Text Summarization?

There is an enormous amount of textual material, and it is only growing every single day.

Think of the internet, consisting of web pages, news articles, status updates, blogs and so much more. This data is unstructured, and the best we can do to navigate it is to run a search and skim the results.

There is a great need to reduce much of this text data to shorter, focused summaries that capture the salient details, both so that we can navigate it more effectively and so that we can check whether the larger documents contain the information we are looking for.

Most of this huge volume of documents is unstructured and has never been organized into traditional databases, so processing these documents is a tedious task.

In the book “Automatic Text Summarization,” released in 2014, the author gives six reasons for using automatic text summarization tools.

  1. Summaries reduce the user’s reading time.
  2. When researching documents, summaries make the selection process easier and more accurate.
  3. Automatic summarization improves the effectiveness of indexing.
  4. Automatic summarization algorithms are less biased than human summarizers, which is especially valuable where human sentiment is involved.
  5. Personalized summaries are useful in question-answering systems, as they provide personalized information.
  6. Automatic summarization systems enable commercial abstracting services to increase the number of texts they can process.

Types of Text Summarization

There are no fixed rules for generating a summary. Nevertheless, to approach the task in an organized way, summarization problems are generally categorised into the following types:

  1. Short Tail Summarization: the input content is very short and precise, and it must contain the important information related to the text.
  2. Long Tail Summarization: as the name suggests, the content can be extremely large and hard for humans to manage, since it may contain textual information drawn directly from thousands of sources.
  3. Single Entity: the input contains elements from only one source.
  4. Multiple Entities: the input contains elements from several different document sources.

Approaches for automatic summarization

Summarization algorithms are mostly either extractive or abstractive, depending on how the summary is generated. Extractive algorithms form summaries by identifying and pasting together relevant sections of the text, relying only on material extracted from the original input. For this reason, extractive methods require relatively little linguistic analysis.

In contrast, abstractive algorithms are the most human-like: they imitate the process of paraphrasing a text. This approach may generate new text that is not present in the initial document, which makes the summary more readable. Texts summarized with this technique look more natural and yield condensed summaries that are easier to read. The downside is that abstractive techniques are far more complex to build than extractive ones.

Applications of Text Summarization

  1. News: text summarization is a practical way to catch up on the news in a relatively short time frame.
  2. Scientific Research: research papers can be summarized to save readers time.
  3. Social Media Posting: content on social media is enormous in volume, but text summarization algorithms can scale it down while preserving the key information.
  4. Creating Study Notes: text summarization algorithms can serve as viable tools for creating study notes.
  5. Conversation Summary: long conversations and meeting recordings can first be converted to text, and the important information can then be extracted from them.
  6. Movie Plots and Reviews: algorithms can condense a movie plot into bullet points in a reasonable amount of time.

Automatic Text Summarization libraries in Python

1 Spacy

It is a library for advanced Natural Language Processing in Python and Cython. It’s built on the very latest research, and was designed from day one to be used in real products. It comes with pre-trained statistical models and word vectors, and currently supports tokenization for 49+ languages.

Implementation

import spacy

Text processing

from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation

Build a list of stopwords

stopwords = list(STOP_WORDS)

Document

document1 = """Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."""

document2 = """Our Father who art in heaven, hallowed be thy name. Thy kingdom come. Thy will be done, on earth as it is in heaven. Give us this day our daily bread; and forgive us our trespasses, as we forgive those who trespass against us; and lead us not into temptation, but deliver us from evil"""

# Load the English model (requires: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

Build an NLP Object

docx = nlp(document1)

Tokenization of Text

mytokens = [token.text for token in docx]
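As a quick sanity check (optional, not part of the original walkthrough), the first few tokens can be inspected; the exact output depends on the model’s tokenization:

# Inspect the first ten tokens of the document
print(mytokens[:10])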

WORD FREQUENCY TABLE

  • dictionary of words and their counts
  • How many times each word appears in the document
  • Using non-stopwords

Build Word Frequency

word.text gives the raw text of each token in spaCy.

word_frequencies = {}
for word in docx:
    if word.text not in stopwords:
        if word.text not in word_frequencies.keys():
            word_frequencies[word.text] = 1
        else:
            word_frequencies[word.text] += 1

word_frequencies
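The same table can be built more compactly with collections.Counter. This is just an optional variant, not part of the original walkthrough; it additionally lower-cases the stop-word check and filters punctuation using the punctuation import from earlier:

from collections import Counter

# Count every token that is neither a stop word nor a punctuation mark
word_frequencies = Counter(
    token.text for token in docx
    if token.text.lower() not in stopwords and token.text not in punctuation
)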

MAXIMUM WORD FREQUENCY

  • Find the weighted frequency
  • Divide each word’s count by the count of the most frequent word
  • Longer sentences accumulate higher scores than shorter ones (hence the sentence-length cap later)

Maximum Word Frequency

maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word] / maximum_frequency)

WORD FREQUENCY DISTRIBUTION

Frequency Table

word_frequencies

{'Machine': 0.4444444444444444,
 'learning': 0.8888888888888888,
 'ML': 0.1111111111111111,
 ...}

SENTENCE SCORE AND RANKING OF WORDS IN EACH SENTENCE

  • Sentence Tokens
  • scoring every sentence based on number of words
  • non stopwords in our word frequency table

Sentence Tokens

sentence_list = [ sentence for sentence in docx.sents ]

Sentence score: compare each word in a sentence against the word frequency table

sentence_scores = {}
for sent in sentence_list:
    for word in sent:
        # Note: the frequency table keys keep their original casing, so
        # capitalised words may be skipped by this lower-cased lookup.
        if word.text.lower() in word_frequencies.keys():
            if len(sent.text.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word.text.lower()]
                else:
                    sentence_scores[sent] += word_frequencies[word.text.lower()]

GET SENTENCE SCORE

Sentence Score Table

print(sentence_scores)

{Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.: 4.555555555555556,
 Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task.: 7.333333333333331,
 Machine learning is closely related to computational statistics, which focuses on making predictions using computers.: 4.111111111111112,
 The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.: 4.555555555555556,
 Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.: 5.777777777777778,
 In its application across business problems, machine learning is ...}

To print the summary, concatenate all the sentences whose sentence score is above a certain threshold.

threshold = 0.6
for sent, score in sentence_scores.items():
    if score > threshold:
        print(sent)
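Instead of a fixed threshold, a common alternative is to keep only the N highest-scoring sentences. Here is a minimal sketch using heapq.nlargest, reusing the sentence_scores table above (the choice of 3 sentences is arbitrary):

from heapq import nlargest

# Pick the 3 sentences with the highest scores and join them into one summary
summary_sentences = nlargest(3, sentence_scores, key=sentence_scores.get)
summary = ' '.join(sent.text for sent in summary_sentences)
print(summary)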

2 Gensim

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. Its target audience is the natural language processing (NLP) and information retrieval (IR) community.

Implementation

Import packages

# Note: the summarization module requires gensim < 4.0 (it was removed in gensim 4.0)
from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

word_count specifies the number of words the summary should contain.

text should contain the information that you want to summarize.

Generate a summary with word_count = 50

summary=summarize(text, word_count=50)

You can adjust how much text the summarizer outputs via the “ratio” parameter or the “word_count” parameter. Using the “ratio” parameter, you specify what fraction of sentences in the original text should be returned as output. Below we specify that we want 50% of the original text (the default is 20%).

Generate a summary with ratio = 0.5

print(summarize(text, ratio=0.5))

For extraction of keywords

keywords(text)
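Putting the pieces together, here is a minimal end-to-end sketch, assuming gensim 3.x is installed; the sample text is borrowed from the spaCy section above, and the exact output depends on the gensim version:

from gensim.summarization.summarizer import summarize
from gensim.summarization import keywords

# Sample text to summarize (any reasonably long, multi-sentence string works)
text = """Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics."""

# Summary limited to roughly 50 words
print(summarize(text, word_count=50))

# Summary limited to roughly half of the original sentences
print(summarize(text, ratio=0.5))

# Top keywords extracted from the text
print(keywords(text))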

3 Pysummarization

This library performs automatic summarization using natural language processing and a neural network language model. It enables you to create a summary with the major points of the original document, or of web-scraped text filtered by text clustering. The library uses accel-brain-base to implement an LSTM-based Encoder/Decoder, improving summarization accuracy through sequence-to-sequence (Seq2Seq) learning.

Implementation

from pysummarization.nlpbase.auto_abstractor import AutoAbstractor
from pysummarization.tokenizabledoc.simple_tokenizer import SimpleTokenizer
from pysummarization.abstractabledoc.top_n_rank_abstractor import TopNRankAbstractor

Prepare an English string argument.

document = "Natural language generation (NLG) is the natural language processing task of generating natural language from a machine representation system such as a knowledge base or a logical form. Psycholinguists prefer the term language production when such formal representations are interpreted as models for mental representations."

Then instantiate the objects and call the method.

# Object of automatic summarization.
auto_abstractor = AutoAbstractor()
# Set tokenizer.
auto_abstractor.tokenizable_doc = SimpleTokenizer()
# Set delimiter for making a list of sentences.
auto_abstractor.delimiter_list = [".", "\n"]
# Object of abstracting and filtering the document.
abstractable_doc = TopNRankAbstractor()
# Summarize the document.
result_dict = auto_abstractor.summarize(document, abstractable_doc)

For printing summary.

for sentence in result_dict["summarize_result"]:
    print(sentence)
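If a single string is more convenient than line-by-line output, the selected sentences can simply be joined. This is a small optional sketch (the strip() call just trims stray whitespace):

# Join the selected sentences into one summary string
summary = " ".join(sentence.strip() for sentence in result_dict["summarize_result"])
print(summary)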

Comparative study of Gensim, Spacy and pysummarization

Gensim

As described above, Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora, aimed at the natural language processing (NLP) and information retrieval (IR) community.

Gensim is an open-source tool with 9.65K GitHub stars and 3.52K GitHub forks.

Gensim and SpaCy belong to the “NLP / Sentiment Analysis” category of the tech stack.

Spacy

spaCy provides industrial-strength natural language processing in Python and Cython. It is built on the latest research, was designed from day one to be used in real products, and ships with pre-trained statistical models and word vectors, currently supporting tokenization for 49+ languages.

According to the StackShare community, spaCy has broader approval, being mentioned in 14 company stacks and 11 developer stacks, compared to Gensim, which is listed in 3 company stacks and 5 developer stacks.


Pysummarization

As noted earlier, pysummarization performs automatic summarization using natural language processing and a neural network language model, producing a summary of the major points of an original document or web-scraped text filtered by text clustering.

References

spaCy: https://spacy.io/ (author: Matthew Honnibal)

Gensim: https://radimrehurek.com/gensim/ (author: Radim Řehůřek)

Pysummarization: https://pypi.org/project/pysummarization/

“Automatic Text Summarization” (2014): Juan-Manuel Torres-Moreno

This article is contributed by Kushal Burad

LinkedIn Profile: https://www.linkedin.com/in/kushal-burad-8174b017a/
