Understanding Text Summarization using K-means Clustering

Akanksha Gupta
7 min read · Sep 29, 2020


In this era, when the internet is flooded with enormous amounts of information, it has become very difficult and time-consuming to extract and consume the important and relevant parts manually. An automatic text summarizer picks the most meaningful and pertinent information and compresses it into a shorter version while preserving its original meaning. It creates a short and accurate summary that presents the most important information and keeps busy readers informed without demanding more time than necessary.

In this article, we will develop an extraction-based automatic text summarizer using Word2Vec and K-means in Python. But before we start, let's quickly understand what extractive summarization means.

In general, text summarization methods can be classified into abstractive and extractive summarization.

Extraction involves picking relevant excerpts from the corpus and concatenating them. It therefore depends on identifying important sentences or phrases that paint the entire picture of the document and stitching them together into a precise and accurate summary.

Abstraction, on the other hand, involves generating novel sentences from the information extracted from the corpus. It aims to produce more human-like summaries by interpreting and rephrasing the text, generating shorter passages that may or may not appear in the original document. Abstraction is harder to implement and relies on advanced NLP techniques to reach human-level quality.

Now, without any further ado, let's open our Jupyter notebooks and start coding!

Original Text:

Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals. Leading AI textbooks define the field as the study of "intelligent agents": any device that perceives its environment and takes actions that maximize its chance of successfully achieving its goals. Colloquially, the term "artificial intelligence" is often used to describe machines (or computers) that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem solving".

As machines become increasingly capable, tasks considered to require "intelligence" are often removed from the definition of AI, a phenomenon known as the AI effect. A quip in Tesler's Theorem says "AI is whatever hasn't been done yet." For instance, optical character recognition is frequently excluded from things considered to be AI, having become a routine technology. Modern machine capabilities generally classified as AI include successfully understanding human speech, competing at the highest level in strategic game systems (such as chess and Go), autonomously operating cars, intelligent routing in content delivery networks, and military simulations.

Artificial intelligence was founded as an academic discipline in 1955, and in the years since has experienced several waves of optimism, followed by disappointment and the loss of funding (known as an "AI winter"), followed by new approaches, success and renewed funding. For most of its history, AI research has been divided into sub-fields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g "robotics" or "machine learning"), the use of particular tools ("logic" or artificial neural networks), or deep philosophical differences. Sub-fields have also been based on social factors (particular institutions or the work of particular researchers).

The traditional problems (or goals) of AI research include reasoning, knowledge representation, planning, learning, natural language processing, perception and the ability to move and manipulate objects. General intelligence is among the field's long-term goals. Approaches include statistical methods, computational intelligence, and traditional symbolic AI. Many tools are used in AI, including versions of search and mathematical optimization, artificial neural networks, and methods based on statistics, probability and economics. The AI field draws upon computer science, information engineering, mathematics, psychology, linguistics, philosophy, and many other fields.

The field was founded on the assumption that human intelligence "can be so precisely described that a machine can be made to simulate it". This raises philosophical arguments about the mind and the ethics of creating artificial beings endowed with human-like intelligence. These issues have been explored by myth, fiction and philosophy since antiquity. Some people also consider AI to be a danger to humanity if it progresses unabated. Others believe that AI, unlike previous technological revolutions, will create a risk of mass unemployment.

In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science, software engineering and operations research.

(Source: https://en.wikipedia.org/wiki/Artificial_intelligence )

1. Split Text into Sentences

So the first step is to tokenize the text into sentences. For this, we will use the sent_tokenize function from nltk.tokenize, which relies on the pre-trained Punkt model to handle abbreviations and punctuation when marking the beginnings and ends of sentences.

import nltk
nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer
from nltk.tokenize import sent_tokenize

# `text` holds the original article text shown above
sentence = sent_tokenize(text)

2. Generate Clean Text

We need to clean the text as much as possible in order to ease the learning process of the model we will train. This involves three main steps:

  1. Replacing every character that is not a letter with a space
  2. Converting capital letters to lowercase
  3. Removing stopwords, i.e. commonly used words such as articles and prepositions that carry little information and don't help us toward our goal of summarization.

import re
nltk.download('stopwords')  # one-time download of the stopword list
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
corpus = []
for i in range(len(sentence)):
    sen = re.sub('[^a-zA-Z]', " ", sentence[i])   # keep letters only, replace the rest with spaces
    sen = sen.lower()                             # lowercase
    sen = ' '.join([word for word in sen.split() if word not in stop_words])  # drop stopwords
    corpus.append(sen)

3. Vector Representation of Sentences

Word embeddings like Word2Vec are a form of vector representation of words that bridges the gap between human understanding of language and that of a machine. You could also use the Bag-of-Words model or a TF-IDF matrix for this purpose, but these only build a sparse matrix by counting how many times each word in the vocabulary appears in the document, rather than attempting to capture a word's relation to other words.

from gensim.models import Word2Vec

all_words = [i.split() for i in corpus]
# min_count=1 keeps every word; on gensim 4.x the parameter is vector_size (older versions use size)
model = Word2Vec(all_words, min_count=1, vector_size=300)
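As an aside, if you would rather try the TF-IDF route mentioned above, a minimal sketch (not used in the rest of this article) could look like the following; the walkthrough below sticks with the Word2Vec embeddings.

from sklearn.feature_extraction.text import TfidfVectorizer

# each cleaned sentence becomes one sparse row of TF-IDF weights
tfidf = TfidfVectorizer()
tfidf_vectors = tfidf.fit_transform(corpus)
print(tfidf_vectors.shape)   # (number of sentences, vocabulary size)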

To create sentence vectors, we grab the vector representations of all the constituent words in a sentence and take their average to arrive at a single combined vector.

sent_vector = []
for i in corpus:
    plus = 0
    for j in i.split():
        plus += model.wv[j]        # sum the word vectors of the sentence
    plus = plus / len(i.split())   # average them to get the sentence vector
    sent_vector.append(plus)

4. Clustering

After forming the sentence vectors, we perform clustering to group the sentence embeddings into a pre-defined number of clusters, equal to the desired number of sentences in our summary. In this case, I have chosen five centroids.

import numpy as np
from sklearn.cluster import KMeans

n_clusters = 5   # number of sentences we want in the summary
kmeans = KMeans(n_clusters, init='k-means++', random_state=42)
y_kmeans = kmeans.fit_predict(sent_vector)   # cluster label for each sentence

Clustering semantically similar sentences together

Each cluster of sentence embeddings can be interpreted as a group of semantically similar sentences that carry more or less the same information, so its meaning can be represented by a single sentence from the cluster. The sentence vector with the minimum Euclidean distance from the cluster centroid is chosen to represent the whole group. The chosen sentences are then arranged in the same order as they appear in the original text to form a meaningful summary.

from scipy.spatial import distance

my_list = []
for i in range(n_clusters):
    my_dict = {}
    for j in range(len(y_kmeans)):
        if y_kmeans[j] == i:
            # distance of each sentence in cluster i from its centroid
            my_dict[j] = distance.euclidean(kmeans.cluster_centers_[i], sent_vector[j])
    # keep the index of the sentence closest to the centroid
    my_list.append(min(my_dict, key=my_dict.get))

# print the chosen sentences in their original order
for i in sorted(my_list):
    print(sentence[i])
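If you prefer the summary as a single string rather than printed lines (as it appears below), you could also join the selected sentences; this is a small addition, not part of the original code.

summary = " ".join(sentence[i] for i in sorted(my_list))
print(summary)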

Summarized text

Artificial intelligence (AI), sometimes called machine intelligence, is intelligence demonstrated by machines, unlike the natural intelligence displayed by humans and animals. Modern machine capabilities generally classified as AI include successfully understanding human speech, competing at the highest level in strategic game systems (such as chess and Go), autonomously operating cars, intelligent routing in content delivery networks, and military simulations. For most of its history, AI research has been divided into sub-fields that often fail to communicate with each other. These sub-fields are based on technical considerations, such as particular goals (e.g "robotics" or "machine learning"), the use of particular tools ("logic" or artificial neural networks), or deep philosophical differences. General intelligence is among the field's long-term goals. In the twenty-first century, AI techniques have experienced a resurgence following concurrent advances in computer power, large amounts of data, and theoretical understanding; and AI techniques have become an essential part of the technology industry, helping to solve many challenging problems in computer science, software engineering and operations research.

Complete source code can be found here.

I hope this article gave you a decent idea of text summarization along with a sample demonstration of the code. After generating the Word2Vec embeddings, you could also build a cosine similarity matrix and run the TextRank algorithm (a derivative of Google's PageRank) on it; the code for that approach can be found in this GitHub repository.
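For reference, a rough sketch of that TextRank variant could look like the following, assuming you reuse the sent_vector list built above (this is an illustration, not the code from the repository).

import numpy as np
import networkx as nx
from sklearn.metrics.pairwise import cosine_similarity

# build a sentence similarity graph and rank sentences with PageRank
sim_matrix = cosine_similarity(np.array(sent_vector))
np.fill_diagonal(sim_matrix, 0)          # ignore self-similarity
graph = nx.from_numpy_array(sim_matrix)
scores = nx.pagerank(graph)

# pick the five highest-ranked sentences and print them in their original order
top = sorted(sorted(scores, key=scores.get, reverse=True)[:5])
for i in top:
    print(sentence[i])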

That said, I feel K-means did a better job of extracting sentences from each of the different semantic themes in the text, which is why the summary covers the whole of the original document.

You can also scrape data from different sites and run the summarizer on it. I'll add the code for web scraping using requests and BeautifulSoup to my GitHub; a minimal sketch of the idea is shown below.
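Until then, here is a rough illustration of what that scraping step could look like (an assumption of the approach, not the final GitHub code; the URL is simply the same Wikipedia article used above).

import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Artificial_intelligence"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")

# join the paragraph text and feed it into the pipeline as the `text` variable from step 1
text = " ".join(p.get_text() for p in soup.find_all("p"))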

Till then, Happy Summarizing!!

For any questions or suggestions, feel free to ping me on my LinkedIn account!

