Exploring Different Keyword Extractors — Graph Based Approaches

Ishan Shrivastava
May 21, 2020 · 10 min read

Continuing the series of blogs on different keyword extractors, this blog brings us to the Graph Based approaches. We will cover what inspired the researchers to start exploring a graphical solution for Keyword Extraction and we will then discuss the four Graph Based approaches (TextRank, SingleRank, TopicRank and PositionRank). If you would like to read up on different Statistical Approaches, please refer to the first blog in this series.

Introduction — Graph Based Approaches

What is important to understand is how graph-based ranking models employ a voting strategy to rank the vertices of a graph. A link from one vertex to another is considered a vote for the latter. The importance of the vertex casting the vote also determines the significance of that vote, and the ranking model takes this into account. So a vertex’s score is determined not just by the number of votes it receives, but also by the scores of the vertices casting those votes.

Let us formally define the PageRank algorithm. Let G = (V,E) be a directed graph with the set of vertices V and set of edges E, where E is a subset of V x V. For a given vertex Vi, let In(Vi) be the set of vertices that point to it (predecessors), and let Out(Vi) be the set of vertices that vertex Vi points to (successors). The score of vertex Vi is determined by:

S(Vi) = (1 − d) + d * Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|

where d is a damping factor that can be set between 0 and 1. The damping factor determines the probability of jumping from a given vertex to another random vertex in the graph.

In the context of a Random Walk, a surfer standing at a vertex follows one of its links with probability d and jumps to a completely new vertex with probability 1 − d. Starting from arbitrary values assigned to each vertex, the iterative computation goes on until convergence is achieved, which is decided based on a given threshold. Upon convergence, the score associated with each vertex signifies its importance.
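To make this concrete, here is a minimal Python sketch of the iterative computation described above. The dictionary representation of the graph, the starting values and the default threshold are illustrative assumptions, not part of the original algorithm description.

```python
# Minimal sketch of the PageRank iteration described above.
# `graph` maps each vertex to the collection of vertices it points to
# (its successors); this representation is an assumption for illustration.

def pagerank(graph, d=0.85, threshold=1e-4, max_iter=100):
    vertices = list(graph)
    # Predecessors: which vertices "vote" for each vertex.
    in_links = {v: [u for u in vertices if v in graph[u]] for v in vertices}
    scores = {v: 1.0 for v in vertices}  # arbitrary starting values

    for _ in range(max_iter):
        new_scores = {}
        for v in vertices:
            # Each predecessor votes for v, weighted by its own score
            # and diluted by how many vertices it points to.
            votes = sum(scores[u] / len(graph[u]) for u in in_links[v] if graph[u])
            new_scores[v] = (1 - d) + d * votes
        # Stop once no score changes by more than the threshold.
        if max(abs(new_scores[v] - scores[v]) for v in vertices) < threshold:
            scores = new_scores
            break
        scores = new_scores
    return scores
```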

A similar line of thought can be applied to semantic or lexical graphs created from documents. This has given rise to many graph based ranking models, where knowledge drawn from an entire document is used in making local ranking decisions.

In the following section we shall look at TextRank, SingleRank, TopicRank and PositionRank algorithms in detail.

TextRank

The original PageRank algorithm assumes an unweighted graph. The graphs for TextRank, however, are built from natural language text and therefore involve many links between the tokens extracted from the text, so incorporating the strengths of these relationships between the vertices of the graph is useful. For this reason, TextRank is applied on a weighted graph.

Consequently, TextRank introduces the following modification to the PageRank algorithm (introduced above) that takes into account the weights of the edges in the graph:

WS(Vi) = (1 − d) + d * Σ_{Vj ∈ In(Vi)} [ wij / Σ_{Vk ∈ Out(Vj)} wjk ] * WS(Vj)

Here, wij indicates the strength of the relationship between the vertices Vi and Vj.
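Assuming the same dictionary-style graph representation as in the earlier sketch, the weighted score update could look roughly like this; the `weights` dictionary keyed by vertex pairs is an assumed representation, not the reference implementation.

```python
# Sketch of the weighted score update used by TextRank.
# `weights[(u, v)]` holds the strength of the edge between u and v;
# `in_links` and `out_links` map each vertex to its predecessors/successors.

def weighted_update(v, scores, in_links, out_links, weights, d=0.85):
    total = 0.0
    for u in in_links[v]:
        # Normalize u's vote by the total weight of u's outgoing edges.
        out_weight = sum(weights[(u, k)] for k in out_links[u])
        if out_weight > 0:
            total += weights[(u, v)] / out_weight * scores[u]
    return (1 - d) + d * total
```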

Preprocessing

Initially, all uni-grams are considered as candidates. To generate higher n-gram keywords, a post-processing step is involved, which we will look at a bit later.

Graph Construction

An undirected graph is built with the candidate uni-grams as vertices, and an edge connects two uni-grams that co-occur within a window of 2 words in the text. The weighted PageRank described above is then run on this graph. Once the final scores are assigned to all the vertices, the TextRank algorithm retains the top T scoring vertices for the Post Processing step. T is usually taken to be one-third of the total number of vertices in the graph, or it can range from 5 to 20.
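A rough sketch of this graph construction step, assuming the candidate uni-grams and the tokenised text are already available; counting co-occurrences as edge weights is an assumption made for illustration.

```python
# Sketch: build an undirected co-occurrence graph over candidate uni-grams.
# Two candidates are linked (and their edge weight incremented) whenever
# they co-occur within a window of `window` consecutive tokens.

def build_cooccurrence_graph(tokens, candidates, window=2):
    edges = {}  # (word_a, word_b) -> co-occurrence count
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            a, b = tokens[i], tokens[j]
            if a in candidates and b in candidates and a != b:
                key = tuple(sorted((a, b)))
                edges[key] = edges.get(key, 0) + 1
    return edges
```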

Post Processing

In the post-processing step, the retained uni-grams that appear adjacent to each other in the text are collapsed into multi-word keyphrases. Finally, the top K keywords/keyphrases among all the resulting n-grams are chosen, based on their scores, as the final output of the algorithm.
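The post-processing step could be sketched as follows; scoring a collapsed phrase by the sum of its word scores is an assumption made for illustration.

```python
# Sketch of post processing: collapse retained uni-grams that are adjacent
# in the original text into multi-word keyphrases, then rank the phrases.
# Scoring a phrase by the sum of its word scores is an assumption.

def collapse_keywords(tokens, retained_scores):
    phrases = {}
    current = []
    for tok in list(tokens) + [None]:  # sentinel flushes the last phrase
        if tok is not None and tok in retained_scores:
            current.append(tok)
        elif current:
            phrase = " ".join(current)
            phrases[phrase] = sum(retained_scores[w] for w in current)
            current = []
    # Top keyphrases first.
    return sorted(phrases.items(), key=lambda kv: kv[1], reverse=True)
```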

SingleRank

One way in which SingleRank differs from TextRank is in the number of uni-grams (vertices) it retains as potential keywords after running the PageRank algorithm. SingleRank retains all the uni-grams, unlike TextRank, which only retains the top one-third of the vertices based on their scores. The reasoning behind this is that two low-scoring uni-grams might still form a high-scoring bi-gram. Just like TextRank, SingleRank also collapses adjacent uni-grams to generate higher n-grams. Unlike TextRank, the authors of SingleRank experimented with the window size used to build the graph and found 10 to work better than 2. This also enables SingleRank to generate n-grams with n > 2.
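Putting the pieces together, SingleRank can be sketched as a small variation on the hypothetical helpers defined earlier in this post: a wider window and no pruning of low-scoring uni-grams. Note that, for brevity, this sketch ignores the edge weights when running PageRank, which is a simplification.

```python
# SingleRank sketched with the helpers defined earlier in this post:
# a co-occurrence window of 10 instead of 2, and *all* uni-grams kept.

def singlerank(tokens, candidates):
    edges = build_cooccurrence_graph(tokens, candidates, window=10)
    # Turn the weighted edge list into the successor-set form pagerank expects.
    graph = {w: set() for w in candidates}
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    scores = pagerank(graph)                  # edge weights ignored for brevity
    return collapse_keywords(tokens, scores)  # no pruning of low scorers
```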

TopicRank

The TopicRank algorithm consists of the following steps:

  1. Topic Identification
  2. Graph Based Ranking
  3. Keyphrase Selection

1. Topic Identification

Each of the tokens present in the candidate keyphrases is first stemmed to its root form, and keyphrases with at least 25% overlapping words are considered similar. In order to automate this step of grouping similar keyphrases into topics, the authors used the Hierarchical Agglomerative Clustering algorithm.
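A sketch of this grouping step is shown below, assuming each candidate keyphrase has already been reduced to a set of stemmed words (e.g. with a Porter stemmer). The exact overlap measure and the average-linkage choice are assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Fraction of stemmed words shared by two candidate keyphrases
# (the exact overlap measure used by the authors is assumed here).
def word_overlap(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / min(len(sa), len(sb))

# Group candidates into topics with Hierarchical Agglomerative Clustering,
# cutting the tree where similarity drops below 25%.
def group_into_topics(stemmed_candidates, min_similarity=0.25):
    n = len(stemmed_candidates)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = 1.0 - word_overlap(stemmed_candidates[i], stemmed_candidates[j])
            dist[i, j] = dist[j, i] = d
    tree = linkage(squareform(dist), method='average')
    return fcluster(tree, t=1.0 - min_similarity, criterion='distance')
```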

2. Graph Based Ranking

In the topic graph, the weight of the edge between two topics ti and tj is given by:

w(ti, tj) = Σ_{ci ∈ ti} Σ_{cj ∈ tj} dist(ci, cj)

dist(ci, cj) = Σ_{pi ∈ pos(ci)} Σ_{pj ∈ pos(cj)} 1 / |pi − pj|

where dist(ci, cj) refers to the sum of the reciprocal distances between the offset positions of the candidate keyphrases ci and cj in the document, and pos(ci) represents all the offset positions of the candidate keyphrase ci. Unlike TextRank or SingleRank, TopicRank doesn’t require choosing a window size, because it computes the edge weights based on the distances between offset positions instead. The next step is to apply the graph-based TextRank ranking described above to rank the topics.
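A minimal sketch of this edge-weight computation, assuming `positions` maps each candidate keyphrase to the list of offsets at which it occurs in the document (this representation is an assumption).

```python
# Sketch of the TopicRank edge weight between two topics: the sum of the
# reciprocal distances between every pair of offset positions of their
# candidate keyphrases.

def topic_edge_weight(topic_i, topic_j, positions):
    weight = 0.0
    for ci in topic_i:
        for cj in topic_j:
            for pi in positions[ci]:
                for pj in positions[cj]:
                    if pi != pj:  # guard against identical offsets
                        weight += 1.0 / abs(pi - pj)
    return weight
```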

3. Keyphrase Selection

Once the topics are ranked, one keyphrase is selected per topic: the candidate of that topic that appears first in the document is chosen as its representative, and the representatives of the top-ranked topics form the final set of keyphrases.

PositionRank

The PositionRank algorithm consists of the following steps:

  1. Graph Construction at Word Level
  2. Position Biased PageRank
  3. Formation of Candidate Phrases

1. Graph Construction at Word Level

A graph is built at the word level: the candidate words are the vertices, and an edge connects two words that co-occur within a window of contiguous positions in the document.

2. Position Biased PageRank

PageRank scores are first computed over this word-level graph, using the same weighted formulation shown above for TextRank.

The next step resembles the algorithm of Personalized PageRank where a vector of size equal to the number of vertices is selected and the Random Walk is allowed to jump or ‘teleport’ to any other node in the graph with the probability given by the vector.

S = α * M̃ * S + (1 − α) * p

where M̃ is the normalized adjacency matrix of the word graph.

Here S is the vector of PageRank scores, α is the damping factor that facilitates the ‘teleport’ operation, and p is the teleportation vector, which indicates that, being at a node vi, the random walk can teleport to any other node in the graph with equal probability.
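As a minimal numpy sketch of this matrix form: the adjacency matrix M and the teleportation vector p are assumed to be given as arrays, the column normalization is an assumed representation of M̃, and the 0.001 threshold and 100-iteration cap follow the stopping criterion mentioned later in this section.

```python
import numpy as np

# Sketch of the matrix form of the biased PageRank update described above.
# M is the (word x word) adjacency matrix and p the teleportation vector.
def biased_pagerank(M, p, alpha=0.85, tol=0.001, max_iter=100):
    # Normalize columns of M so that each non-zero column sums to 1.
    col_sums = M.sum(axis=0)
    M_norm = np.divide(M, col_sums, out=np.zeros_like(M, dtype=float),
                       where=col_sums != 0)
    S = np.full(len(p), 1.0 / len(p))  # arbitrary starting scores
    for _ in range(max_iter):
        S_next = alpha * M_norm @ S + (1 - alpha) * np.asarray(p)
        # Stop when two consecutive iterations differ by less than tol.
        if np.abs(S_next - S).sum() < tol:
            return S_next
        S = S_next
    return S
```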

This teleportation vector can be used to personalize the walk, or in this case to inject a bias, so that the random walk prefers the nodes to which the vector assigns higher probability. This is done as follows:

  1. Each candidate word is weighted by its inverse position in the document, before applying any syntactic filtering.
  2. If the same word appears multiple times in the target document, all of its inverse-position weights are summed. This gives more weight to frequently occurring words.

As an example, a word appearing at the 2nd, 5th and 10th positions will get a weight of 1/2 + 1/5 + 1/10 = 0.8.
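A sketch of how this position-based teleportation vector could be computed is shown below; normalising the summed weights into a probability distribution is an assumption made so that they can be used as teleport probabilities.

```python
# Sketch of the position-based teleportation vector used by PositionRank.
# Each candidate word is weighted by the sum of the inverses of its
# occurrence positions (1-indexed), as in the worked example above.

def position_bias(tokens, candidates):
    weights = {}
    for position, word in enumerate(tokens, start=1):
        if word in candidates:
            weights[word] = weights.get(word, 0.0) + 1.0 / position
    total = sum(weights.values())
    # Normalize so the weights form a probability distribution (assumption).
    return {word: w / total for word, w in weights.items()}
```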

Finally, the PageRank scores are computed iteratively until the difference between two consecutive iterations is less than 0.001, or until 100 iterations are reached.

3. Formation of Candidate Phrases

Finally, candidate words that appear adjacent to each other in the document are collapsed into candidate phrases. Each phrase is scored by summing the PositionRank scores of its constituent words, and the top-scoring phrases are returned as the final keyphrases.

Summary

STAY TUNED…

About Me: I graduated with a Master’s in Computer Science from ASU. I am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.

