Text Summarization [Part 1 — Extractive Approach and an Abstractive Library]

In this article, I will show you how you can create a text summarization tool in Python using the extractive approach, and a fast way of using the abstractive approach using a predefined library.

Lidor ES
10 min read · Mar 15, 2023
Google Image Search — Extractive Text Summarization

Table of Contents

  1. Introduction
  2. Text Summarization Approaches
  3. Goal
  4. Usage Text
  5. Extractive Approach #1 — nltk, networkx
    5.1. Approach #1 Usage
  6. Extractive Approach #2 — nltk, heapq
    6.1. Approach #2 Usage
  7. Extractive Approach #3 — spacy, pytextrank
  8. Abstractive Approach — predefined library
  9. Next Part

Introduction

We often have long textual data, such as articles, books, and more.
We can’t always read everything, and we don’t always have someone to summarize it for us; that is where text summarization comes in. In this article I will show you how to build an extractive text summarizer, and at the end you will find a link to the next article in this series, where I cover other approaches as well.

The code can be found in my GitHub repo here.

Text Summarization Approaches

In text summarization, there are mainly two approaches: the Extractive approach and the Abstractive approach:

  • Extractive text summarization involves identifying the salient information by selecting essential sentences or phrases from the original text to create a concise summary. This approach preserves the wording and structure of the original text, resulting in a summary that accurately represents the content. However, it may also include irrelevant or redundant information, as the summary is created solely from extracted portions of the original text.
  • Abstractive text summarization involves generating a summary by building an internal semantic representation of the original text and using natural language processing to rewrite it in new words.
    This approach is more complex and requires machine learning or deep learning models (such as neural networks) to produce a summary that captures the essence of the original text. While abstractive summarization can produce more concise and readable summaries that capture the meaning and nuance of the original text, it may not always be accurate or faithful to the original content. Additionally, the abstractive approach is more challenging than the extractive one, as it involves creating new phrases and terms to summarize the content.

Goal

My goal here is to show you how to create a text summarization tool in Python using the extractive approach.

Usage Text

The text I will use for the approaches in this article will be stored in a variable called summarize_me:

summarize_me = """The term Data Science was created in the early 1960s to describe a new profession that would support the understanding and interpretation of the large amounts of data which was being amassed at the time. At the time, there was no way of predicting the truly massive amounts of data over the next fifty years. Data Science continues to evolve as a discipline using computer science and statistical methodology to make useful predictions and gain insights in a wide range of fields. While Data Science is used in areas such as astronomy and medicine, it is also used in business to help make smarter decisions.
Statistics, and the use of statistical models, are deeply rooted within the field of Data Science. Data Science started with statistics and has evolved to include concepts/practices such as artificial intelligence, machine learning, and the Internet of Things, to name a few. As more and more data has become available, first by way of recorded shopping behaviors and trends, businesses have been collecting and storing it in ever greater amounts. With the growth of the Internet, the Internet of Things, and the exponential growth of data volumes available to enterprises, there has been a flood of new information or big data. Once the doors were opened by businesses seeking to increase profits and drive better decision-making, the use of big data started being applied to other fields, such as medicine, engineering, and social sciences.
The term Data Science was created in the early 1960s to describe a new profession that would support the understanding and interpretation of the large amounts of data which was being amassed at the time. At the time, there was no way of predicting the truly massive amounts of data over the next fifty years. Data Science continues to evolve as a discipline using computer science and statistical methodology to make useful predictions and gain insights in a wide range of fields. While Data Science is used in areas such as astronomy and medicine, it is also used in business to help make smarter decisions.
Statistics, and the use of statistical models, are deeply rooted within the field of Data Science. Data Science started with statistics and has evolved to include concepts/practices such as artificial intelligence, machine learning, and the Internet of Things, to name a few. As more and more data has become available, first by way of recorded shopping behaviors and trends, businesses have been collecting and storing it in ever greater amounts. With the growth of the Internet, the Internet of Things, and the exponential growth of data volumes available to enterprises, there has been a flood of new information or big data. Once the doors were opened by businesses seeking to increase profits and drive better decision-making, the use of big data started being applied to other fields, such as medicine, engineering, and social sciences.
A functional data scientist, as opposed to a general statistician, has a good understanding of software architecture and understands multiple programming languages. The data scientist defines the problem, identifies the key sources of information, and designs the framework for collecting and screening the needed data. Software is typically responsible for collecting, processing, and modeling the data. They use the principles of Data Science, and all the related sub-fields and practices encompassed within Data Science, to gain deeper insight into the data assets under review.
There are many different dates and timelines that can be used to trace the slow growth of Data Science and its current impact on the Data Management industry, some of the more significant ones are outlined below."""

Note: the original text included newlines, tabs, parentheses, and more, which were cleaned up.

Text Source: A Brief History of Data Science, By Keith D. Foote on October 16, 2021

Extractive Approach #1

In this approach, I will mainly use the nltk and networkx libraries.

Firstly import the libraries for this approach:

import numpy as np
import networkx as nx
import nltk, re
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
from nltk.tokenize import sent_tokenize

Now let’s define a function to read the text and split it into a list of sentences:

def read_text(txt: str = ""):
    # Split the text into a list of sentences (the original wording is kept for the final summary)
    sentences = sent_tokenize(txt)
    return sentences
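
A quick usage sketch (you need to download nltk’s punkt tokenizer data once before sent_tokenize will work):

nltk.download('punkt')
# Split the text into sentences and count them
sentences = read_text(summarize_me)
print(len(sentences))  # number of sentences detected in summarize_me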

Now let’s create a function to calculate the cosine similarity between two sentences:

def sentence_similarity(sentence1, sentence2, stopwords = []):
    # Keep only letters and digits, lowercase, and split each sentence into words
    sentence1 = [word.lower() for word in re.sub("[^a-zA-Z0-9]", " ", sentence1).split()]
    sentence2 = [word.lower() for word in re.sub("[^a-zA-Z0-9]", " ", sentence2).split()]
    all_words = list(set(sentence1 + sentence2))
    vector1 = [0] * len(all_words)
    vector2 = [0] * len(all_words)
    # First sentence vector
    for word in sentence1:
        if word not in stopwords:
            vector1[all_words.index(word)] += 1
    # Second sentence vector
    for word in sentence2:
        if word not in stopwords:
            vector2[all_words.index(word)] += 1
    # Vectors cosine similarity
    return 1 - cosine_distance(vector1, vector2)
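
For instance, a minimal check on two short made-up sentences (assuming the nltk English stopwords have been downloaded) could look like this:

nltk.download('stopwords')
# Compare two short example sentences; a value closer to 1 means more similar
print(sentence_similarity("Data Science uses statistical methods.",
                          "Statistics is central to Data Science.",
                          stopwords.words('english')))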

And a function to create a sentence similarity matrix:

def sentences_similarity_matrix(sentences, stopwords_):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))  # N x N
    for i in range(len(sentences)):
        for j in range(len(sentences)):
            if i != j:
                similarity_matrix[i][j] = sentence_similarity(sentences[i], sentences[j], stopwords_)
    return similarity_matrix

And now the final function for this approach, in which we also download the nltk stopwords and the punkt tokenizer (the function has self-explanatory comments):

def summarize(txt, top_n):
    nltk.download('stopwords')
    nltk.download('punkt')
    final_stopwords = stopwords.words('english')
    summarized_text = []
    # Read and tokenize txt
    sentences = read_text(txt)
    # Get similarity matrix
    sentence_similarity_matrix = sentences_similarity_matrix(sentences, final_stopwords)
    # Rank sentences in the given similarity matrix
    sentence_similarity_graph = nx.from_numpy_array(sentence_similarity_matrix)
    scores = nx.pagerank(sentence_similarity_graph)
    # Sort the rank of top sentences
    ranked_sentences = sorted(((scores[i], s) for i, s in enumerate(sentences)), reverse = True)
    # Get the top n number of sentences based on rank
    for i in range(top_n):
        summarized_text.append(ranked_sentences[i][1])
    # Output the summarized version
    summary = " ".join(summarized_text)
    return summary, len(sentences)

Approach #1 Usage:

Let’s call summarize(summarize_me, 3). The function returns both the summary and the number of sentences in the original text; a minimal call looks like this (the variable names below are just illustrative):
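
# total_sentences holds the number of sentences in the original text
summary, total_sentences = summarize(summarize_me, 3)
print(summary)

This prints the following output: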

Many different dates and timelines can be used to trace the slow growth of Data Science and its current impact on the Data Management industry, some of the more significant ones are outlined below. They use the principles of Data Science, and all the related sub-fields and practices encompassed within Data Science, to gain deeper insight into the data assets under review. The term Data Science was created in the early 1960s to describe a new profession that would support the understanding and interpretation of the large amounts of data that were being amassed at the time.

Extractive Approach #2

In this approach, I will mainly use the nltk library, along with the heapq library for the usage example.

Note: in this approach the order of the code is important because I reuse the same names (aliases); if you want a different order, you’ll need to change them.

Let’s import the second approach libraries:

import re # For extra cleaning
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
import heapq

Just to be sure, let’s apply a cleaning function to our text data:

def clean_text(txt):
    # Remove reference markers like [1], lowercase, collapse whitespace, and drop commas
    text = re.sub(r"\[[0-9]*\]", " ", txt)
    text = text.lower()
    text = re.sub(r'\s+', " ", text)
    text = re.sub(r",", " ", text)
    return text

cleaned_text = clean_text(summarize_me)

Now that the text is clean, let’s create word tokens and calculate each word’s frequency:

words_frequency = {}
word_tokens = word_tokenize(cleaned_text)
stopwords = set(stopwords.words("english"))  # Alias example
for word in word_tokens:
    if word not in stopwords:
        if word not in words_frequency.keys():
            words_frequency[word] = 1
        else:
            words_frequency[word] += 1
maximum_frequency = max(words_frequency.values())
for word in words_frequency.keys():
    words_frequency[word] = (words_frequency[word] / maximum_frequency)
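
As an optional sanity check (heapq is already imported above), you can peek at the words with the highest normalized frequency:

# Optional: the 5 words with the highest normalized frequency
print(heapq.nlargest(5, words_frequency, key = words_frequency.get))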

And now let’s create sentence tokens and calculate each sentence’s score:

sentences_score = {}
sentences_tokens = sent_tokenize(cleaned_text)
for sentence in sentences_tokens:
    for word in word_tokenize(sentence):
        if word in words_frequency.keys():
            if (len(sentence.split(" "))) < 30:
                if sentence not in sentences_score.keys():
                    sentences_score[sentence] = words_frequency[word]
                else:
                    sentences_score[sentence] += words_frequency[word]

Approach #2 Usage:

And now let’s use the heapq library to extract the top N sentences. You can set N to whatever amount you’d like; I will set it to 5:

def get_key(desired_value):
    # Helper: return the sentence whose score matches the given value
    for key, value in sentences_score.items():
        if desired_value == value:
            return key

key = get_key(max(sentences_score.values()))  # the single highest-scoring sentence
N = 5
summary = heapq.nlargest(N, sentences_score, key = sentences_score.get)
print(" ".join(summary))

Output:
data science continues to evolve as a discipline using computer science and statistical methodology to make useful predictions and gain insights into a wide range of fields. data science started with statistics and has evolved to include concepts/practices such as artificial intelligence machine learning and the internet of things to name a few. while data science is used in areas such as astronomy and medicine it is also used in business to help make smarter decisions. statistics and the use of statistical models are deeply rooted within the field of data science. at the time there was no way of predicting the truly massive amounts of data over the next fifty years.

Extractive Approach #3

In this approach, I will mainly use the spacy and pytextrank libraries.

To use the libraries you’ll need to install them first:

!pip install spacy
!python -m spacy download en_core_web_lg
!pip install pytextrank

As always, let’s import the libraries for this approach:

import spacy
import pytextrank

Load the English language model and add the textrank component to the pipeline:

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("textrank")
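
To confirm the component was added, you can list the pipeline components (optional):

# "textrank" should now appear after the built-in components
print(nlp.pipe_names)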

Process the text to be summarized:

doc = nlp(summarize_me)

Now let’s print the summary as we’d like:

n_phrases, n_sentences = 2, 4
for sent in doc._.textrank.summary(limit_phrases = n_phrases, limit_sentences = n_sentences):
    print(sent)

Output:
With the growth of the Internet, the Internet of Things, and the exponential growth of data volumes available to enterprises, there has been a flood of new information or big data. Once the doors were opened by businesses seeking to increase profits and drive better decision-making, the use of big data started being applied to other fields, such as medicine, engineering, and social sciences. With the growth of the Internet, the Internet of Things, and the exponential growth of data volumes available to enterprises, there has been a flood of new information or big data. Once the doors were opened by businesses seeking to increase profits and drive better decision-making, the use of big data started being applied to other fields, such as medicine, engineering, and social sciences.
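
If you would rather have the summary as a single string instead of printing sentence by sentence, a small optional variation is to join the spans:

# Join the ranked summary spans into a single string
summary_text = " ".join(str(sent) for sent in doc._.textrank.summary(limit_phrases = n_phrases, limit_sentences = n_sentences))
print(summary_text)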

And if you are curious about the ranks:

phrases_and_ranks = [
    (phrase.chunks[0], phrase.rank) for phrase in doc._.phrases
]
print(phrases_and_ranks[:10])

Output:

[(big data, 0.12253844453442032),
(data, 0.11412295060482798),
(data volumes, 0.1140447618028625),
(other fields, 0.08935680196072456),
(Data Science, 0.0884026387441695),
(fields, 0.08309610252687923),
(social sciences, 0.0758422375988491),
(computer science, 0.06982236727923087),
(Data Management, 0.06912312972200427),
(smarter decisions, 0.0689398682430784)]

Abstractive Approach — predefined library

For this approach, we will use a library called transformers, which is a library from Hugging Face.

First, let’s install the libraries that are needed:

!pip install transformers
# For anaconda/miniconda use: conda install -c huggingface transformers
!pip install sentencepiece # Needed for the PEGASUS tokenizer

Now import the libraries:

from transformers import PegasusForConditionalGeneration
from transformers import PegasusTokenizer
from transformers import pipeline

I will use Google’s PEGASUS-XSum model; let’s load its tokenizer and define the model:

# Pick model
model_name = "google/pegasus-xsum"
# Load pretrained tokenizer
pegasus_tokenizer = PegasusTokenizer.from_pretrained(model_name)
# Define PEGASUS model
pegasus_model = PegasusForConditionalGeneration.from_pretrained(model_name)
# Create tokens
tokens = pegasus_tokenizer(summarize_me, truncation = True, padding = "longest", return_tensors = "pt")

Now let’s summarize the text:

# Summarize the desired text
encoded_summary = pegasus_model.generate(**tokens)
# Decode the summarized text
decoded_summary = pegasus_tokenizer.decode(encoded_summary[0], skip_special_tokens = True)

The variable decoded_summary will hold the following summary: Once the doors were opened by businesses seeking to increase profits and drive better decision-making, the use of big data started being applied to other fields, such as medicine, engineering, and social sciences.
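
If you want more control over the generated summary, generate accepts the usual generation arguments; the values below are only illustrative assumptions, not the article’s settings:

# Optional: beam search with an explicit length cap (illustrative values)
encoded_summary = pegasus_model.generate(**tokens, num_beams = 5, max_length = 60, early_stopping = True)
decoded_summary = pegasus_tokenizer.decode(encoded_summary[0], skip_special_tokens = True)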

Let’s define a summarization pipeline now and create a summary with it:

summarizer = pipeline("summarization", model = model_name, tokenizer = pegasus_tokenizer, framework = "pt")
# Create summary
summary = summarizer(summarize_me, min_length = 30, max_length = 150, truncation = True)
print(summary[0]["summary_text"])

Output:
Once the doors were opened by businesses seeking to increase profits and drive better decision-making, the use of big data started being applied to other fields, such as medicine, engineering, and social sciences.

Next Part

In the next part, I will show you how to implement the Abstractive approach from scratch without using a predefined library to do it all for you (I will show you how to create the model, train it, and everything).
You can find the next part here.
