Text Summarisation with Gensim (TextRank Algorithm)

There are two broad approaches to text summarisation:

  1. Extractive methods: these select phrases and sentences from the source document to make up the new summary.
  2. Abstractive methods: these generate entirely new phrases and sentences to capture the meaning of the source document.

Gensim is a free Python library designed to automatically extract semantic topics from documents. It is an open-source vector space modelling and topic modelling toolkit, implemented in Python, using NumPy, SciPy and optionally Cython for performance.

Gensim's summarisation implementation is based on the popular TextRank algorithm.

Text Summarisation with Gensim (TextRank algorithm) -

We use the summarize function from gensim's summarization.summarizer module.

The summariser ranks the sentences of a text using a variation of the TextRank algorithm and keeps the highest-ranked ones.
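
A minimal sketch of calling the summariser, assuming a Gensim 3.x installation (the summarization module was removed in Gensim 4.0, so the import fails on newer releases). The wrapper name summarize_with_gensim is a hypothetical helper for illustration; summarize and its ratio parameter come from gensim.summarization.summarizer in 3.x.

```python
def summarize_with_gensim(text, ratio=0.2):
    """Summarise `text` with gensim's TextRank-based summariser.

    Returns None when gensim.summarization cannot be imported
    (for example on Gensim 4.x, where the module no longer exists).
    """
    try:
        # Assumption: Gensim 3.x is installed.
        from gensim.summarization.summarizer import summarize
    except ImportError:
        return None
    # ratio is the fraction of the original sentences to keep.
    return summarize(text, ratio=ratio)


sample = (
    "Gensim is a Python library for topic modelling. "
    "It can also summarise text. "
    "The summariser ranks sentences with a TextRank variation. "
    "The highest-ranked sentences form the summary."
)
print(summarize_with_gensim(sample, ratio=0.5))
```

Gensim expects the input to contain more than one sentence, and it works best on texts of at least ten sentences, so a sample this short is only useful for checking that the call runs.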

TextRank is a general-purpose, graph-based ranking algorithm for NLP that is commonly used for automatic summarisation.

Graph-based ranking algorithms are a way of deciding the importance of a vertex within a graph, based on global information recursively drawn from the entire graph.

TextRank Model -

The basic idea implemented by a graph-based ranking model is that of voting or recommendation.

When one vertex links to another one, it is basically casting a vote for that vertex. The higher the number of votes cast for a vertex, the higher the importance of that vertex.
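
The voting idea can be sketched in a few lines of Python. The graph below and the function name count_votes are made up for illustration. This only counts raw votes (incoming edges); a full ranking model would also weight each vote by the importance of the vertex casting it.

```python
from collections import Counter

# Hypothetical directed graph: each (source, target) edge is a vote
# cast by `source` for `target`.
edges = [("A", "C"), ("B", "C"), ("D", "C"), ("C", "A"), ("D", "B")]


def count_votes(edges):
    """Count the incoming edges (votes) of every vertex."""
    votes = Counter(target for _, target in edges)
    # Vertices that only cast votes still appear, with zero votes.
    for source, _ in edges:
        votes.setdefault(source, 0)
    return dict(votes)


votes = count_votes(edges)
ranking = sorted(votes, key=votes.get, reverse=True)
print(votes)
print(ranking)  # ['C', 'A', 'B', 'D']
```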

Text as a graph -

We have to build a graph that represents the text and interconnects words or other text entities through meaningful relations.
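
As a rough sketch of one way to turn text into a graph, the snippet below links words that occur close together. Using co-occurrence within a small window as the "meaningful relation" is an assumption made here for illustration, and build_cooccurrence_graph and its window parameter are hypothetical names. No stop-word or part-of-speech filtering is applied.

```python
import re
from collections import defaultdict


def build_cooccurrence_graph(text, window=2):
    """Link words that fall inside the same run of `window` consecutive words.

    With window=2 only directly adjacent words are connected.
    """
    words = re.findall(r"[a-z]+", text.lower())
    graph = defaultdict(set)
    for i, word in enumerate(words):
        for neighbour in words[i + 1 : i + window]:
            if neighbour != word:
                # Undirected edge: store it in both directions.
                graph[word].add(neighbour)
                graph[neighbour].add(word)
    return {word: sorted(links) for word, links in graph.items()}


graph = build_cooccurrence_graph("graph based ranking algorithms rank graph vertices")
print(graph["graph"])  # ['based', 'rank', 'vertices']
```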

TextRank can be applied to two NLP tasks:

  1. Keyword extraction task
  2. Sentence extraction task

Keyword Extraction -

The task of a keyword extraction algorithm is to automatically identify the set of terms in a text that best describe the document.

The simplest possible approach is to use a frequency criterion, that is, to pick the terms that occur most often.

However, this generally leads to poor results.
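
A small sketch of the frequency criterion shows why it falls short. The function name top_terms_by_frequency and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def top_terms_by_frequency(text, k=3):
    """Return the k most frequent words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [word for word, _ in Counter(words).most_common(k)]


text = "The cat sat on the mat and the dog sat on the rug."
print(top_terms_by_frequency(text))  # ['the', 'sat', 'on']
```

The most frequent words here are mostly function words, and none of the nouns that describe the content (cat, dog, mat, rug) make the list.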

The TextRank keyword extraction algorithm is fully unsupervised. No training is necessary.

Sentence Extraction -

TextRank is very well suited for applications involving entire sentences, since it allows for a ranking over text units that is recursively computed based on information drawn from the entire text.

To apply TextRank, we first build a graph associated with the text, where the graph vertices represent the units to be ranked. Since the goal is to rank entire sentences, a vertex is added to the graph for each sentence in the text.
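
The sketch below builds such a sentence graph: one vertex per sentence, with an edge wherever two sentences share words. The edge weight is the word-overlap measure from the original TextRank paper (shared words divided by the sum of the logs of the sentence lengths); gensim's variation may use a different similarity function. The sentence splitter is deliberately naive, and all function names are hypothetical.

```python
import math
import re
from itertools import combinations


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(s1, s2):
    """Shared words, normalised by the log lengths of both sentences."""
    w1 = re.findall(r"[a-z]+", s1.lower())
    w2 = re.findall(r"[a-z]+", s2.lower())
    if not w1 or not w2:
        return 0.0
    denom = math.log(len(w1)) + math.log(len(w2))
    if denom <= 0:  # both sentences are a single word
        return 0.0
    return len(set(w1) & set(w2)) / denom


def build_sentence_graph(text):
    """Return the sentences (vertices) and weighted edges between them."""
    sentences = split_sentences(text)
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        weight = similarity(sentences[i], sentences[j])
        if weight > 0:
            edges[(i, j)] = weight
    return sentences, edges


sentences, edges = build_sentence_graph(
    "Cats chase mice. Dogs chase cats. Birds sing songs."
)
print(sentences)
print(edges)
```

In this example only the first two sentences share words, so the graph has a single edge between vertices 0 and 1.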


PageRank Algorithm -

PageRank is the foundation of TextRank.

  • PageRank is used by Google Search.
  • It computes the rank of web pages. It is named not after its use (ranking pages) but after its creator, Larry Page.

Fundamentals -

  • Important pages are linked to by other important pages.
  • The PageRank value of a page is the probability that a user following links at random ends up on that page.
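
These two ideas can be sketched as a simple iterative computation. This is an illustrative version, not Google's implementation: the link graph is made up, the function name pagerank is hypothetical, and the damping factor of 0.85 is the conventionally quoted value, assumed here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate PageRank.

    links maps each page to the list of pages it links to.
    The returned values sum to 1, so they can be read as probabilities.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small base share (the random jump).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical web of four pages.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C", "B"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C collects links from three pages and ends up with the highest value, and A ranks second because its single incoming link comes from the important page C.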

In TextRank, the main difference is that we rank sentences instead of pages, with the similarity between two sentences playing the role of a link.
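
A minimal sketch of that last step, assuming the sentence similarities have already been computed. The sentences and similarity values below are hypothetical, as are the names rank_sentences and extract_summary; the weighted update follows the general PageRank scheme with an assumed damping factor of 0.85 and is not gensim's exact implementation.

```python
def rank_sentences(similarity, damping=0.85, iterations=50):
    """Weighted PageRank over a symmetric sentence-similarity matrix."""
    n = len(similarity)
    scores = [1.0 / n] * n
    # Total edge weight attached to each sentence (diagonal ignored).
    out_weight = [
        sum(w for k, w in enumerate(row) if k != j)
        for j, row in enumerate(similarity)
    ]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour passes on its score in proportion to edge weight.
            incoming = sum(
                similarity[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if j != i and out_weight[j] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def extract_summary(sentences, similarity, top_n=2):
    """Pick the top_n highest-ranked sentences, kept in document order."""
    scores = rank_sentences(similarity)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(best[:top_n])]


sentences = ["Cats chase mice.", "Dogs chase cats.", "Mice fear cats.", "Birds sing songs."]
# Hypothetical pairwise similarities; the last sentence is unrelated to the rest.
similarity = [
    [0.0, 0.9, 0.6, 0.0],
    [0.9, 0.0, 0.3, 0.0],
    [0.6, 0.3, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(extract_summary(sentences, similarity))
```

Returning the chosen sentences in their original order, rather than in rank order, keeps the summary readable.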
