Published in Analytics Vidhya

Kamal khumar · Jul 28, 2020 · 6 min read

Text summarization using spaCy

This article explains what spaCy is, its advantages, and how to perform text summarization with it.

Photo by Artichok3

What is spaCy?

spaCy is a free, open-source library for advanced natural language processing, written in Python and Cython. spaCy is mainly used in the development of production software and also supports deep learning workflows via statistical models from PyTorch and TensorFlow.

Why spaCy?

spaCy provides fast and accurate syntactic analysis, named entity recognition, and ready access to word vectors. We can use the default word vectors or replace them with our own. spaCy also offers tokenization, sentence boundary detection, POS tagging, syntactic parsing, integrated word vectors, and alignment into the original string, all with high accuracy.

Text Summarization

Fig 2: Text Summarization

Text summarization can broadly be divided into two categories — Extractive Summarization and Abstractive Summarization.

  1. Extractive Summarization: These methods rely on extracting several parts, such as phrases and sentences, from a piece of text and stacking them together to create a summary. Therefore, identifying the right sentences for summarization is of utmost importance in an extractive method.
  2. Abstractive Summarization: These methods use advanced NLP techniques to generate an entirely new summary. Some parts of this summary may not even appear in the original text.

In this article, we will be focusing on the extractive summarization technique.

Step: 1 Installation instructions

To install spaCy, simply type the following:

pip install -U spacy

To begin with, import spaCy and the other necessary modules:

import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from string import punctuation
from collections import Counter
from heapq import nlargest

Next, load the English model into spaCy.

Fig 3: Loading the model
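The loading step presumably looks like the line below (the model name is an assumption; the small English model is the usual choice, and it must have been downloaded once beforehand with `python -m spacy download en_core_web_sm`):

```python
# Load spaCy's small English pipeline (assumed model name)
nlp = spacy.load('en_core_web_sm')
```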

The text we are about to handle is “Introduction to Machine Learning” and the string is stored in the variable doc.

And the string is,

Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task. Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Machine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.

Now, pass the string doc into the nlp function.

Fig 4: Tokenization
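The call itself is a one-liner; the name docx for the resulting Doc object is an assumption:

```python
# Run the pipeline: tokenizes, tags and parses the string in one call
docx = nlp(doc)
```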

To find the number of sentences in the given string, the following code is used:

Fig 5: No. of sentence
7
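The count in Fig 5 presumably comes from spaCy's sentence iterator (a sketch, assuming the Doc object from the previous step is named docx):

```python
# Doc.sents is a generator, so materialise it into a list to count
print(len(list(docx.sents)))  # prints 7 for this text
```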

Next, two lists are created, one of part-of-speech tags and one of stop words, to validate each token; the necessary tokens are then filtered and saved in the keywords list.

Step: 2 Filtering tokens

Fig 6: Keyword filtering
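The filtering in Fig 6 is roughly the loop below; the exact POS whitelist is an assumption, with content-word tags (nouns, proper nouns, adjectives, verbs) being the usual choice:

```python
keywords = []
pos_tag = ['PROPN', 'ADJ', 'NOUN', 'VERB']
for token in docx:
    # drop stop words and punctuation
    if token.text in STOP_WORDS or token.text in punctuation:
        continue
    # keep only content-bearing parts of speech
    if token.pos_ in pos_tag:
        keywords.append(token.text)
```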

Calculate the frequency of each token using the Counter class and store the result in freq_word; to view the top five most frequent words, the most_common method can be used.

Fig 7: Token frequency
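The counting itself is plain Counter usage. A self-contained sketch with a short stand-in list (the article applies the same two lines to the keywords list built above):

```python
from collections import Counter

# stand-in for the `keywords` list built in the filtering step
keywords = ['learning', 'Machine', 'learning', 'study', 'learning', 'Machine']
freq_word = Counter(keywords)
print(freq_word.most_common(2))  # [('learning', 3), ('Machine', 2)]
```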

The desired output would be,

[(‘learning’, 8), (‘Machine’, 4), (‘study’, 3), (‘algorithms’, 3), (‘task’, 3)]

These frequencies can be normalised for better processing by dividing each token’s frequency by the maximum frequency.

Step: 3 Normalization

Fig 8: Normalising token frequency
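Normalisation divides every count by the largest one. A self-contained sketch with stand-in counts:

```python
from collections import Counter

freq_word = Counter({'learning': 8, 'Machine': 4, 'study': 3})  # stand-in counts
max_freq = freq_word.most_common(1)[0][1]  # the highest count, here 8
for word in freq_word:
    freq_word[word] = freq_word[word] / max_freq
print(freq_word.most_common(2))  # [('learning', 1.0), ('Machine', 0.5)]
```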

The normalised list is,

[(‘learning’, 1.0), (‘Machine’, 0.5), (‘study’, 0.375), (‘algorithms’, 0.375), (‘task’, 0.375)]

This is the major part, where each sentence is weighed based on the frequencies of the tokens it contains. The result is stored as key-value pairs in sent_strength, where the keys are the sentences in the string doc and the values are the weights of those sentences.

Step: 4 Weighing sentences

Fig 9: Weighing the sentence
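The weighing loop can be sketched as follows; here plain strings stand in for the spaCy sentence spans the article iterates over (docx.sents), and a direct word lookup stands in for checking word.text:

```python
# stand-in: normalised word weights and plain-string "sentences"
freq_word = {'learning': 1.0, 'machine': 0.5, 'data': 0.375}
sentences = ['machine learning is fun', 'data mining aids learning']

# weigh each sentence by summing the weights of the words it contains
sent_strength = {}
for sent in sentences:
    for word in sent.split():
        if word in freq_word:
            sent_strength[sent] = sent_strength.get(sent, 0) + freq_word[word]
print(sent_strength)
```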

And the output is,

{Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to progressively improve their performance on a specific task.: 4.125, 
Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task.: 4.625,
Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task.: 4.25,
Machine learning is closely related to computational statistics, which focuses on making predictions using computers.: 2.625,
The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning.: 3.125,
Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.: 4.25, In its application across business problems, machine learning is also referred to as predictive analytics.: 2.25}

Finally, the nlargest function is used to summarize the string. It takes three arguments:

→ The number of items to extract

→ An iterable (list/tuple/dictionary)

→ A key function used to rank the items (here, the weight of each sentence)

Step: 5 Summarizing the string

Fig 10: Finding N largest
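A self-contained sketch of the nlargest call, with short stand-in strings in place of the spaCy sentence spans:

```python
from heapq import nlargest

# stand-in weights; in the article the keys are spaCy sentence spans
sent_strength = {'sent A': 4.625, 'sent B': 4.25, 'sent C': 2.25, 'sent D': 4.125}
# top 3 items by weight, heaviest first
summarized_sentences = nlargest(3, sent_strength, key=sent_strength.get)
print(summarized_sentences)  # ['sent A', 'sent B', 'sent D']
```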

The nlargest function returns a list containing the top 3 sentences, which is stored as summarized_sentences.

[Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task., Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task., Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.]

Each sentence in this list is of spaCy’s Span type:

Fig 11: Type of token
spacy.tokens.span.Span

This can be converted to a string by the following lines of code,

Fig 12: Final result
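The conversion presumably looks like the sketch below (variable names assumed): each Span exposes its raw text via .text, and the pieces are joined with spaces.

```python
# take the text of each Span and join the pieces into one summary string
final_sentences = [sent.text for sent in summarized_sentences]
summary = ' '.join(final_sentences)
print(summary)
```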

Resulting in a final summarized output as

Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in the applications of email filtering, detection of network intruders, and computer vision, where it is infeasible to develop an algorithm of specific instructions for performing the task. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning.

Summarization using Gensim

The Gensim package has an inbuilt summarization function, but it is not as efficient as the spaCy approach. The code is,

Fig 13: Summarization using Gensim
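The Gensim call presumably looks like the sketch below. Note this assumes gensim 3.x; the gensim.summarization module was removed in gensim 4.0.

```python
from gensim.summarization import summarize

# `doc` is the same machine-learning string used above; with the default
# ratio, roughly the top 20% of sentences are returned
print(summarize(doc))
```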

The respective output is,

'Machine learning algorithms build a mathematical model of sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to perform the task.'

The full code is available on GitHub.

Conclusion

I hope you have now understood how to perform text summarization using spaCy. Thanks for reading!