- Extractive methods: select phrases and sentences from the source document and assemble them into the new summary.
- Abstractive methods: generate entirely new phrases and sentences that capture the meaning of the source document.
Gensim is a free, open-source Python library for vector space modelling and topic modelling, designed to automatically extract semantic topics from documents.
It is implemented in Python and uses NumPy, SciPy and optionally Cython for performance.
Its summarisation implementation is based on the popular TextRank algorithm.
Text Summarisation with Gensim (TextRank algorithm) -
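We use the summarization.summarizer module from gensim. Note that the summarization package ships only with Gensim releases before 4.0; it was removed in 4.0.
The summariser ranks the sentences of a text using a variation of the TextRank algorithm and keeps the highest-ranked ones.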
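As a rough illustration, the call might look like the sketch below. It assumes a Gensim release before 4.0 is installed. The wrapper name `summarize_with_gensim` and the sample text are made up for this example, and the wrapper simply returns `None` when the module cannot be imported.

```python
def summarize_with_gensim(text, ratio=0.3):
    """Return an extractive summary, or None if gensim.summarization is unavailable.

    Hypothetical helper for illustration; only `summarize` itself comes from Gensim.
    """
    try:
        # The summarization package exists only in Gensim releases before 4.0.
        from gensim.summarization import summarize
    except ImportError:
        return None
    # `ratio` is the fraction of the original sentences to keep in the summary.
    return summarize(text, ratio=ratio)


sample_text = (
    "Automatic summarisation shortens a document while keeping its key points. "
    "Extractive summarisation selects existing sentences from the document. "
    "Abstractive summarisation writes new sentences that convey the same meaning. "
    "TextRank is a graph-based algorithm used for extractive summarisation. "
    "Each sentence in the document becomes a vertex in a graph. "
    "Sentences that share many words are connected by strong edges. "
    "The highest-ranked sentences in the graph form the summary. "
    "Gensim offered an implementation of this idea in its summarization module."
)

summary = summarize_with_gensim(sample_text, ratio=0.3)
print(summary if summary is not None else "gensim.summarization is not available")
```

Gensim logs a warning for very short inputs like this one, so real documents should be considerably longer.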
TextRank is a general-purpose, graph-based ranking algorithm for NLP that can be used for automatic summarisation.
Graph-based ranking algorithms are a way of deciding the importance of a vertex within a graph, based on global information recursively drawn from the entire graph.
TextRank Model -
The basic idea implemented by a graph-based ranking model is that of voting or recommendation.
When one vertex links to another, it is essentially casting a vote for that vertex. The more votes a vertex receives, the more important it is. The importance of the vertex casting the vote also determines how much that vote counts.
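The voting idea is commonly written as the score below, which follows the formulation in the original TextRank paper. Here In(V_i) is the set of vertices that link to V_i, Out(V_j) is the set of vertices that V_j links to, and d is a damping factor, usually set to 0.85. The notation is introduced here for illustration and does not appear elsewhere in these notes.

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in \mathrm{In}(V_i)} \frac{S(V_j)}{\lvert \mathrm{Out}(V_j) \rvert}
```

Each vertex V_j splits its own score equally among the vertices it votes for, so a vote from a high-scoring vertex with few outgoing links is worth the most.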
Text as a graph -
We first have to build a graph that represents the text and interconnects words or other text entities through meaningful relations.
TextRank is applied to two NLP tasks:
- Keyword extraction task
- Sentence extraction task
Keyword Extraction -
The task of a keyword extraction algorithm is to automatically identify a set of terms in a text that best describe the document.
The simplest possible approach is to use a frequency criterion, that is, to pick the most frequent words.
However, this leads to poor results.
The TextRank keyword extraction algorithm is fully unsupervised: no training is necessary.
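A minimal sketch of this idea follows, with several simplifying assumptions. Words are vertices, two words are linked when they occur within a small window of each other, and a tiny hand-written stopword list stands in for the part-of-speech filter a real system would use. The function name `extract_keywords` and all of its parameters are invented for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption; real systems filter by part of speech).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are",
             "for", "on", "that", "by", "with", "it", "as"}


def extract_keywords(text, window=2, damping=0.85, iterations=50, top_n=5):
    """Rank words with an unweighted TextRank-style score over a co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Undirected graph: link words that appear within `window` positions of each other.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    # Repeated voting: each word shares its score equally among its neighbours.
    scores = {word: 1.0 for word in neighbours}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbours[n]) for n in neighbours[word])
            for word in neighbours
        }

    # Highest score first; ties are broken alphabetically for reproducibility.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


sample = ("Graph ranking algorithms rank each vertex in a graph. "
          "TextRank builds a graph of words and applies graph ranking to the text.")
print(extract_keywords(sample))
```

No labelled data is involved at any point: the ranking comes entirely from the structure of the text itself.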
Sentence Extraction -
TextRank is very well suited for applications involving entire sentences, since it allows for a ranking over text units that is recursively computed based on information drawn from the entire text.
To apply TextRank, we first build a graph associated with the text, where the graph vertices represent the units to be ranked. Since the goal is to rank entire sentences, a vertex is added to the graph for each sentence in the text. Two sentences are connected by an edge whose weight reflects how similar they are, for example how many words they share.
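The sketch below shows one way the sentence graph could be built and ranked. It uses the word-overlap similarity from the TextRank paper (shared words divided by the sum of the logarithms of the sentence lengths). The regex-based sentence splitter and the names `split_sentences`, `similarity` and `rank_sentences` are assumptions made for this example, not part of any library.

```python
import math
import re


def split_sentences(text):
    """Naive splitter on '.', '!' and '?' (an assumption; use a real tokenizer in practice)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(tokens_a, tokens_b):
    """Number of shared words, normalised by the log of each sentence length."""
    if not tokens_a or not tokens_b:
        return 0.0
    overlap = len(set(tokens_a) & set(tokens_b))
    denom = math.log(len(tokens_a)) + math.log(len(tokens_b))
    return overlap / denom if denom > 0 else 0.0


def rank_sentences(text, damping=0.85, iterations=50):
    """Return (sentence, score) pairs, highest score first."""
    sentences = split_sentences(text)
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)

    # One vertex per sentence; edge weights are the pairwise similarities.
    weights = [
        [similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]

    # Weighted voting: sentence j passes its score on in proportion to edge weight.
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping)
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]

    order = sorted(range(n), key=lambda i: -scores[i])
    return [(sentences[i], scores[i]) for i in order]


ranked = rank_sentences("Cats chase mice. Cats chase birds. Stocks fell today.")
for sentence, score in ranked:
    print(f"{score:.3f}  {sentence}")
```

A summary would then be formed by taking the top few sentences and presenting them in their original order.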
PageRank Algorithm -
PageRank is the foundation of TextRank.
- PageRank is used by Google Search.
- It computes the rank of web pages. It is named not after its use (ranking pages) but after Larry Page, one of its creators.
- A page is important if other important pages link to it.
- The PageRank value of a page is the probability that a user who follows links at random arrives at that page.
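The points above can be sketched as a short power iteration over a toy link graph. This is a simplified illustration, not Google's production algorithm: the function name `pagerank`, the four-page graph and the handling of pages without outgoing links are all assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Compute PageRank values; `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)

    # Start from a uniform probability distribution over pages.
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # With probability (1 - damping) the user jumps to a random page.
        new_ranks = {p: (1 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Otherwise the user follows one of the page's links at random.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for target in pages:
                    new_ranks[target] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks


# Hypothetical toy web: A links to B and C, B to C, C to A, D to C.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 4))
```

Because the values are probabilities they sum to 1, and page C, which receives links from every other page, ends up with the highest value.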
In TextRank, the main difference is that we rank sentences instead of pages, with similarity between sentences playing the role of links between pages.