NLP — Sentence Extraction using NLTK: TextRank Algorithm

Easy implementation using Python and NLTK

Akash Panchal
Analytics Vidhya

--

Photo by NORTHFOLK on Unsplash

Introduction

TextRank is an algorithm based on PageRank, which often used in keyword extraction and text summarization.

We will implement the TextRank Algorithm for Sentence Extraction in Python. The crux of this algorithm is to fetch the most relevant Sentences form the piece of the text, which is one of the most important tasks of Extractive Text Summarization.

But, Let’s not re-invent the wheel

The prerequisite for this Article is the understanding of the PageRank Algorithm, which you can read from the following article on Medium:

PageRank (PR) is an algorithm used to calculate the weight for web pages, whcih is used by Google Search to rank web pages in their search engine results.

Please refer to one of the following articles to get the basic understanding:

Brendan Massey explains the crux behind PageRank Algo with lots of images.

Xu LIANG has explained it very well with python implementation.

What are we doing?

We will try to extract top sentences from the piece of text using TextRank Algorithm.

Those Who Are Resilient Stay In The Game Longer
“On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.” — Friedrich Nietzsche
Challenges and setbacks are not meant to defeat you, but promote you. However, I realise after many years of defeats, it can crush your spirit and it is easier to give up than risk further setbacks and disappointments. Have you experienced this before? To be honest, I don’t have the answers. I can’t tell you what the right course of action is; only you will know. However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people. To a person with a Fixed Mindset failure is a blow to their self-esteem, yet to a person with a Growth Mindset, it’s an opportunity to improve and find new ways to overcome their obstacles. Same failure, yet different responses. Who is right and who is wrong? Neither. Each person has a different mindset that decides their outcome. Those who are resilient stay in the game longer and draw on their inner means to succeed.

Approach:

We’ll do this in 3 steps. Yes, Promise! (Ok, Maybe 4.)

  1. Tokenize words in each sentence

This will generate a list of list of tokenized sentences:

[['Those', 'Who', 'Are', 'Resilient', 'Stay', 'In', 'The', 'Game', 'Longer', '*', '“', 'On', 'the', 'mountains', 'of', 'truth', 'you', 'can', 'never', 'climb', 'in', 'vain', ':', 'either', 'you', 'will', 'reach', 'a', 'point', 'higher', 'up', 'today', ',', 'or', 'you', 'will', 'be', 'training', 'your', 'powers', 'so', 'that', 'you', 'will', 'be', 'able', 'to', 'climb', 'higher', 'tomorrow.', '”', '—', 'Friedrich', 'Nietzsche', 'Challenges', 'and', 'setbacks', 'are', 'not', 'meant', 'to', 'defeat', 'you', ',', 'but', 'promote', 'you', '.'], ...]

2. Build a Similarity matrix

We will use cosine similarity to find the similarity between two sentences, which will be used to measure the distance between two sentences.

Cosine Similarity: Cosine similarity is a metric used to determine how similar the documents are irrespective of their size.

As a similarity metric, how does cosine similarity differ from the number of common words?

When plotted on a multi-dimensional space, where each dimension corresponds to a word in the document, the cosine similarity captures the orientation (the angle) of the documents and not the magnitude.

ie. Consider the following pair of sentences and their cosine similarity

1. "NLP is interesting field of AI", And 
"NLP makes machine understand language"
Consine Similarity: 0.7238730093922876
2. "Alan turing proposed the imitation game", And
"Alan turing hated other games"
Consine Similarity: 0.91504896898793
3. "NLP is not just about chatbots", And
"NLP makes chatbots widely used"
Consine Similarity: 0.7764453512724155
4. "NLP is interesting field of AI", And
"NLP is interesting field of AI"
Consine Similarity: 1.0
5. "Those Who Are Resilient Stay In The Game Longer", And
"01234567890"
Consine Similarity: 0.0

Similarly, we need to measure a similarity matrix between all the sentences.

We will get a similarity matrix like the following:

[[0.         0.11395845 0.35076499 0.10679337 0.22030768 0.09455048
0.08854208 0.05757988 0.09019807 0.04691224 0.05544081 0.20244412]
[0.09867469 0. 0. 0.10985716 0.10198264 0.11671565
0.11710577 0.10661723 0.22268582 0.08686458 0.03421881 0.08820086]...

3. Run PageRank Algorithm

Now that we have the similarity matrix, we can run the PageRank algorithm on it. If you have followed the PageRank Article, the following code is similar and easy to understand.

We will generate the PageRank matrix which will be having the score of all sentences with the most important sentence having the highest score.

We will get a PageRank matrix as follows:

[1.26778728 1.09189214 0.49899918 1.24743264 1.08652989 1.24053865
1.24198464 0.99690567 0.61024186 0.81140037 0.85049162 1.05579606]

4. Extract top sentences

And Voila, we’re done. You’ve guessed this step, didn’t you?

Now we will extract top sentences from the PageRank matrix.

And here are the top 5 sentences: {Sentence: Score} pair

{'\nThose Who Are Resilient Stay In The Game Longer\n“On the mountains of truth you can never climb in vain: either you will reach a point higher up today, or you will be training your powers so that you will be able to climb higher tomorrow.” — Friedrich Nietzsche\nChallenges and setbacks are not meant to defeat you, but promote you.': 1.2712476366837975, 'However, I realise after many years of defeats, it can crush your spirit and it is easier to give up than risk further setbacks and disappointments.': 1.0914952900297563, 'However, it’s important not to be discouraged by failure when pursuing a goal or a dream, since failure itself means different things to different people.': 1.2401359537858996, 'To a person with a Fixed Mindset failure is a blow to their self-esteem, yet to a person with a Growth Mindset, it’s an opportunity to improve and find new ways to overcome their obstacles.': 1.2415147486210913,'To be honest, I don’t have the answers.': 1.2468967412932273}

# Everything in one place: Peacefulness!

Find the full code here and play with it!

Technology does not Automatically improve, It improves when a lot of people work hard to make it better — Elon Musk

--

--