NLP — Sentence Extraction using NLTK: TextRank Algorithm

Easy implementation using Python and NLTK

Akash Panchal
Analytics Vidhya
Sep 5, 2019

Introduction

TextRank is an algorithm based on PageRank that is often used for keyword extraction and text summarization.

We will implement the TextRank algorithm for sentence extraction in Python. The crux of this algorithm is to fetch the most relevant sentences from a piece of text, which is one of the most important tasks in extractive text summarization.

But let’s not reinvent the wheel

The prerequisite for this article is an understanding of the PageRank algorithm.

PageRank (PR) is an algorithm used to calculate the weight of web pages, which Google Search uses to rank web pages in its search results.

Please refer to one of the following articles on Medium to get a basic understanding:

Brendan Massey explains the crux of the PageRank algorithm with lots of images.

Xu LIANG has explained it very well, with a Python implementation.

What are we doing?

We will try to extract the top sentences from a piece of text using the TextRank algorithm.

Approach:

We’ll do this in 3 steps. Yes, promise! (OK, maybe 4.)

1. Tokenize words in each sentence

This will generate a list of tokenized sentences, where each sentence is itself a list of words:
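A minimal sketch of this step, assuming the text has already been split into sentences. It uses NLTK's regex-based `wordpunct_tokenize`, which needs no downloaded models; `word_tokenize` is a common alternative but requires the punkt data. The function name `tokenize_sentences` and the sample sentences are made up for illustration.

```python
from nltk.tokenize import wordpunct_tokenize


def tokenize_sentences(sentences):
    """Lowercase each sentence and split it into a list of word tokens."""
    tokenized = []
    for sentence in sentences:
        tokens = wordpunct_tokenize(sentence.lower())
        # Keep only alphanumeric tokens; this drops punctuation such as "." or ",".
        tokenized.append([token for token in tokens if token.isalnum()])
    return tokenized


# Hypothetical sample input, just to show the shape of the result.
sentences = [
    "TextRank is based on PageRank.",
    "PageRank ranks web pages.",
]
print(tokenize_sentences(sentences))
# [['textrank', 'is', 'based', 'on', 'pagerank'], ['pagerank', 'ranks', 'web', 'pages']]
```

Removing stop words at this point is a frequent refinement, since very common words otherwise inflate the similarity between unrelated sentences.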

2. Build a Similarity matrix

We will use cosine similarity to measure how close two sentences are to each other.

Cosine Similarity: Cosine similarity is a metric used to determine how similar two documents are, irrespective of their size.

As a similarity metric, how does cosine similarity differ from the number of common words?

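When plotted in a multi-dimensional space, where each dimension corresponds to a word in the document, cosine similarity captures the orientation (the angle) of the documents and not their magnitude. Two documents that use the same words in the same proportions therefore score close to 1 even if one is much longer than the other, whereas a raw count of common words grows with document length.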

For example, consider the following pair of sentences and their cosine similarity:
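As a rough illustration, cosine similarity can be computed from plain word counts with the standard library alone. The function name `cosine_similarity`, the word-count vectors, and the sample sentences are illustrative choices and not necessarily the setup of the original code.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity of two token lists, using word counts as vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sentence is treated as similar to nothing
    return dot / (norm_a * norm_b)


short = ["cats", "chase", "mice"]
repeated = short * 3  # same words in the same proportions, three times as long
other = ["dogs", "chase", "cars"]

print(cosine_similarity(short, repeated))  # approximately 1.0: same direction, different length
print(cosine_similarity(short, other))     # approximately 0.333: one shared word out of three
```

The first pair shares nine word occurrences and the second only one, yet what the score reflects is the angle between the count vectors, not how long the sentences are.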

Similarly, we need to compute a similarity matrix covering every pair of sentences.

We will get a similarity matrix like the following:
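One way to build such a matrix is sketched below, under the assumption that sentences are already tokenized and that similarity is the word-count cosine. The helper names (`cosine_similarity`, `build_similarity_matrix`) and the three toy sentences are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Compact word-count cosine similarity (0.0 if either input is empty)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def build_similarity_matrix(tokenized_sentences):
    """Return an n x n list of lists; entry [i][j] is the similarity of sentences i and j."""
    n = len(tokenized_sentences)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            score = cosine_similarity(tokenized_sentences[i], tokenized_sentences[j])
            matrix[i][j] = matrix[j][i] = score  # cosine similarity is symmetric
    return matrix  # the diagonal stays 0.0, so a sentence does not vote for itself


tokenized = [["the", "cat", "sat"], ["the", "cat", "ran"], ["dogs", "bark"]]
for row in build_similarity_matrix(tokenized):
    print([round(value, 3) for value in row])
# [0.0, 0.667, 0.0]
# [0.667, 0.0, 0.0]
# [0.0, 0.0, 0.0]
```

Leaving the diagonal at zero is a design choice in this sketch: it keeps a sentence from reinforcing its own score in the ranking step.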

3. Run PageRank Algorithm

Now that we have the similarity matrix, we can run the PageRank algorithm on it. If you have followed the PageRank article, the following code is similar and easy to understand.

We will generate the PageRank matrix, which holds one score per sentence, with the most important sentence having the highest score.

We will get a PageRank matrix as follows:
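The ranking step can be sketched as a plain power iteration over the similarity matrix. This is an assumption-laden illustration: the function name `pagerank`, the damping factor of 0.85, the tolerance, and the toy matrix are choices made here, not values taken from the original code.

```python
def pagerank(similarity_matrix, damping=0.85, tol=1e-6, max_iter=100):
    """Power-iteration PageRank over a square similarity matrix (list of lists)."""
    n = len(similarity_matrix)
    if n == 0:
        return []
    # Column j describes how sentence j spreads its vote over the other sentences.
    col_sums = [sum(similarity_matrix[i][j] for i in range(n)) for j in range(n)]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if col_sums[j] > 0:
                    rank += similarity_matrix[i][j] / col_sums[j] * scores[j]
                else:
                    rank += scores[j] / n  # isolated sentence: spread its vote evenly
            new_scores.append((1 - damping) / n + damping * rank)
        delta = sum(abs(new - old) for new, old in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Toy matrix: sentence 0 is strongly similar to both of the others.
toy_matrix = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.1],
    [0.5, 0.1, 0.0],
]
scores = pagerank(toy_matrix)
print(scores)  # the scores sum to 1, and sentence 0 gets the highest one
```

Normalizing each column keeps the scores summing to 1, so they can be read as each sentence's share of the total importance.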

4. Extract top sentences

And voilà, we’re nearly done. You guessed this step, didn’t you?

Now we will extract the top sentences from the PageRank matrix.

And here are the top 5 sentences, as {sentence: score} pairs:
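A small sketch of the extraction step, assuming there is one score per sentence in the same order as the sentences. The function name `top_sentences`, the placeholder sentences, and the scores are invented for illustration and are not results from the article.

```python
def top_sentences(sentences, scores, k=5):
    """Return up to k (sentence, score) pairs, highest score first."""
    if len(sentences) != len(scores):
        raise ValueError("need exactly one score per sentence")
    ranked = sorted(zip(sentences, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical sentences and scores, only to show the output shape.
example_sentences = ["Sentence A", "Sentence B", "Sentence C"]
example_scores = [0.2, 0.5, 0.3]

for sentence, score in top_sentences(example_sentences, example_scores, k=2):
    print({sentence: score})
# {'Sentence B': 0.5}
# {'Sentence C': 0.3}
```

For a readable summary, the selected sentences are often re-sorted into their original document order before being joined together.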

Everything in one place: Peacefulness!

Find the full code here and play with it!

"Technology does not automatically improve; it improves when a lot of people work hard to make it better." (Elon Musk)
