Exploring Different Keyword Extractors — Graph Based Approaches
Continuing the series of blogs on different keyword extractors, this blog brings us to the Graph Based approaches. We will cover what inspired the researchers to start exploring a graphical solution for Keyword Extraction and we will then discuss the four Graph Based approaches (TextRank, SingleRank, TopicRank and PositionRank). If you would like to read up on different Statistical Approaches, please refer to the first blog in this series.
Introduction — Graph Based Approaches
All the graph based approaches employ a ranking algorithm like HITS or PageRank. These algorithms compute the importance of a vertex in the graph. While doing so they rely not only on local vertex-specific information but also on global information that is recursively computed from the entire graph.
What is important to understand is how graph-based ranking models employ a voting strategy to rank the vertices of a graph. A link from one vertex to another is considered a vote for the target vertex. The importance of the vertex casting the vote also determines the weight of that vote, and the ranking model takes this into account. So a vertex's score is determined not just by the number of votes it receives but also by the scores of the vertices casting those votes.
Let us formally define the PageRank algorithm. Let G = (V, E) be a directed graph with the set of vertices V and set of edges E, where E is a subset of V x V. For a given vertex Vi, let In(Vi) be the set of vertices that point to it (predecessors), and let Out(Vi) be the set of vertices that vertex Vi points to (successors). The score of vertex Vi is determined by:

S(Vi) = (1 - d) + d * Σ over Vj ∈ In(Vi) of S(Vj) / |Out(Vj)|

where d is a damping factor that can be set between 0 and 1. The damping factor determines the probability of jumping from a given vertex to another random vertex in the graph.
In the context of a Random Walk, a user follows an outgoing edge of the current vertex with a probability of d and jumps to a completely new random vertex with a probability of 1 - d. Starting from arbitrary values assigned to each node, the iterative computation continues until convergence is achieved, which is decided based on a given threshold. Upon convergence, the score associated with each vertex signifies its importance.
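The iterative computation described above can be sketched in a few lines of Python. The toy graph, damping factor, and convergence threshold below are illustrative choices, not part of any particular paper.

```python
def pagerank(graph, d=0.85, threshold=1e-4, max_iter=100):
    """graph: dict mapping each vertex to the list of vertices it points to."""
    vertices = list(graph)
    # In(Vi): the predecessors of each vertex
    incoming = {v: [u for u in vertices if v in graph[u]] for v in vertices}
    scores = {v: 1.0 for v in vertices}          # arbitrary starting values
    for _ in range(max_iter):
        new_scores = {}
        for v in vertices:
            # S(Vi) = (1 - d) + d * sum over predecessors of S(Vj) / |Out(Vj)|
            rank_sum = sum(scores[u] / len(graph[u]) for u in incoming[v])
            new_scores[v] = (1 - d) + d * rank_sum
        # stop once no score changes by more than the threshold
        if max(abs(new_scores[v] - scores[v]) for v in vertices) < threshold:
            scores = new_scores
            break
        scores = new_scores
    return scores

# toy graph: A votes for B and C, B votes for C, C votes for A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
scores = pagerank(graph)
```

Here C ends up with the highest score: it receives votes from both A and B, and B's vote is backed by only half of A's outgoing weight.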
A similar line of thought can be applied to semantic or lexical graphs created from documents. This has given rise to many graph based ranking models, where knowledge drawn from an entire document is used in making local ranking decisions.
In the following section we shall look at TextRank, SingleRank, TopicRank and PositionRank algorithms in detail.
TextRank
The TextRank algorithm is essentially the PageRank algorithm applied to a graph whose vertices correspond to words. The details lie in the preprocessing and in how the graph is constructed. An important aspect of TextRank is that it requires neither deep linguistic knowledge nor domain- or language-specific annotated corpora, which makes it highly portable to other domains, genres, or languages.
The original PageRank algorithm assumes an unweighted graph. But the graphs for TextRank are built from natural language text and involve many links between the tokens extracted from the text. Incorporating the strengths of these relationships between the vertices of the graph would therefore be useful. For this reason, TextRank is applied on a weighted graph.
Consequently, TextRank introduces the following modification to the PageRank algorithm (introduced above) that takes into account the weights of the edges in the graph:

WS(Vi) = (1 - d) + d * Σ over Vj ∈ In(Vi) of [ wji / Σ over Vk ∈ Out(Vj) of wjk ] * WS(Vj)

Here, wij indicates the strength of the relationship between the vertices Vi and Vj.
The first step is to tokenize the text and annotate the tokens with POS tags. The POS tags are used as a syntactic filter so that only words with certain tags participate in the construction of the graph. The authors of TextRank performed experiments and observed the best results when considering nouns and adjectives only.
Initially, all uni-grams are considered as candidates. To generate higher n-gram keywords there is a post-processing step, which we will look at a bit later.
The unigrams that pass the syntactic filter (a POS tag of noun or adjective) are added to the graph as vertices. An edge is assigned between unigrams that co-occur within a window of N words, where N is a hyper-parameter. The graph thus constructed is undirected and unweighted, since all edges carry the same weight of 1. The PageRank algorithm shown above is then run for several iterations until it converges; it is usually run for 20–30 iterations with the threshold set to 0.0001.
Once the final scores are assigned to all the vertices, the TextRank algorithm retains the top T scoring vertices for the post-processing step. T is usually one-third of the total number of vertices in the graph, though it can also range from 5–20.
All uni-grams (vertices) retained from the previous step are selected as potential keywords by the TextRank algorithm. Sequences of adjacent keywords are then collapsed into higher n-gram keywords. Adjacency is decided using a window size, which is a hyper-parameter; the authors found 2 to be a good window size. The scores of the individual unigrams are summed to assign a score to each n-gram.
Finally the top K keywords/key phrases among all the n-grams are chosen based on their scores to be the final output of the algorithm.
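The full TextRank pipeline described above can be condensed into a short sketch. Assumptions worth flagging: the text is already tokenized and POS-tagged (the toy tags in the usage example are hand-supplied, not from a real tagger), ties in the final ranking are broken alphabetically for determinism, and `top_frac` exposes the "top one-third" retention rule as a parameter.

```python
def textrank_keywords(tagged_tokens, window=2, d=0.85, iters=30, top_frac=1/3):
    words = [w for w, _ in tagged_tokens]
    # syntactic filter: only nouns and adjectives become vertices
    candidates = {w for w, tag in tagged_tokens if tag in {"NOUN", "ADJ"}}
    # undirected, unweighted edges between candidates co-occurring in the window
    edges = {w: set() for w in candidates}
    for i, w in enumerate(words):
        if w not in candidates:
            continue
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] in candidates and words[j] != w:
                edges[w].add(words[j])
                edges[words[j]].add(w)
    # run PageRank for a fixed number of iterations
    scores = {w: 1.0 for w in candidates}
    for _ in range(iters):
        scores = {w: (1 - d) + d * sum(scores[v] / len(edges[v]) for v in edges[w])
                  for w in candidates}
    # retain the top-scoring vertices (deterministic tie-break by word)
    k = max(1, int(len(candidates) * top_frac))
    top = set(sorted(candidates, key=lambda w: (-scores[w], w))[:k])
    # post-processing: collapse adjacent retained words into longer keyphrases
    phrases, run = {}, []
    for w in words + [""]:              # empty sentinel flushes the final run
        if w in top:
            run.append(w)
        elif run:
            phrases[" ".join(run)] = sum(scores[t] for t in run)
            run = []
    return sorted(phrases, key=phrases.get, reverse=True)

tagged = [("compatibility", "NOUN"), ("of", "ADP"), ("systems", "NOUN"),
          ("of", "ADP"), ("linear", "ADJ"), ("constraints", "NOUN")]
keywords = textrank_keywords(tagged, top_frac=0.5)
```

On this tiny example, "linear" and "constraints" reinforce each other through their shared edge and are collapsed into the single keyphrase "linear constraints".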
SingleRank
SingleRank is an extension of TextRank with two major differences. Just like TextRank, the vertices in the graph are uni-grams that pass the syntactic filter, and the edges are assigned based on the co-occurrence of words within a window. The first major difference is that these edges are assigned weights, as opposed to the unweighted graph (all edges having the same weight) in TextRank. The weights are computed based on the distance between the two words found within the predefined window.
The second way SingleRank differs from TextRank is in the number of uni-grams (vertices) it retains as potential keywords after running the PageRank algorithm. SingleRank retains all the uni-grams, unlike TextRank, which only retains the top one-third of the vertices based on their scores. The reasoning is that two low-scoring uni-grams might combine into a high-scoring bi-gram. Just like TextRank, SingleRank collapses adjacent unigrams found within a window to generate higher n-grams. Unlike TextRank, the authors of SingleRank experimented and found 10 to be a better window size than 2. This also enables SingleRank to generate n-grams with n > 2.
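The weighted-graph part of SingleRank can be sketched as follows. As a simplification, the edge weight here is the number of times two candidate words co-occur within the window (one common instantiation of co-occurrence strength); the scoring loop is the weighted PageRank update from the formula given earlier.

```python
from collections import defaultdict

def singlerank_scores(tagged_tokens, window=10, d=0.85, iters=30):
    words = [w for w, _ in tagged_tokens]
    candidates = {w for w, tag in tagged_tokens if tag in {"NOUN", "ADJ"}}
    # weighted edges: each co-occurrence within the window adds 1 to the weight
    weight = defaultdict(float)                  # (u, v) -> edge weight
    for i, w in enumerate(words):
        if w not in candidates:
            continue
        for j in range(i + 1, min(i + window, len(words))):
            v = words[j]
            if v in candidates and v != w:
                weight[(w, v)] += 1.0
                weight[(v, w)] += 1.0
    neighbors = defaultdict(set)
    out_sum = defaultdict(float)                 # total weight leaving each vertex
    for (u, v), wt in weight.items():
        neighbors[u].add(v)
        out_sum[u] += wt
    # weighted PageRank: each vote is scaled by edge weight / total outgoing weight
    scores = {w: 1.0 for w in candidates}
    for _ in range(iters):
        scores = {w: (1 - d) + d * sum(weight[(v, w)] / out_sum[v] * scores[v]
                                       for v in neighbors[w])
                  for w in candidates}
    return scores

tagged = [("data", "NOUN"), ("mining", "NOUN"), ("of", "ADP"),
          ("data", "NOUN"), ("streams", "NOUN")]
scores = singlerank_scores(tagged, window=3)
```

Unlike TextRank, every unigram keeps a score here; the collapsing of adjacent unigrams into longer phrases then proceeds exactly as in TextRank.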
TopicRank
TopicRank employs a slightly different approach than TextRank and SingleRank. Here the idea is to extract keyphrases from the most important topics present in the document. This algorithm considers a topic to be a cluster of similar keyphrase candidates. These topics are then ranked according to their importance in the document, and one keyphrase is selected from each of the most important topics to form the most significant keyphrases for the given document.
The TopicRank algorithm consists of the following steps:
- Topic Identification
- Graph Based Ranking
- Keyphrase Selection
1. Topic Identification
The authors refer to Hulth (2003) which stated that most keyphrases assigned by human readers are Noun Phrases. For this reason they claim that most important topics of a document can be identified by extracting Noun Phrases. To extract Noun Phrases the authors followed the approach by Wan and Xiao (2008) which extracts the longest sequence of Nouns and Adjectives as keyphrase candidates.
Each token in the candidate keyphrases is stemmed to its root form, and keyphrases with at least 25% overlapping words are considered similar. To automate this step of grouping similar keyphrases, the authors used the Hierarchical Agglomerative Clustering algorithm.
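A simplified stand-in for this grouping step is sketched below. The real algorithm uses Hierarchical Agglomerative Clustering and a proper stemmer; the greedy single-linkage grouping and the crude suffix-stripping "stemmer" here are simplifications for illustration only.

```python
def crude_stem(word):
    # toy stemmer: strip a few common English suffixes (a real stemmer is better)
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def overlap(a, b):
    """Fraction of shared stemmed words, relative to the smaller phrase."""
    sa = {crude_stem(w) for w in a.split()}
    sb = {crude_stem(w) for w in b.split()}
    return len(sa & sb) / min(len(sa), len(sb))

def group_topics(candidates, threshold=0.25):
    # greedily put each candidate into the first topic it overlaps with >= 25%
    topics = []
    for cand in candidates:
        for topic in topics:
            if any(overlap(cand, other) >= threshold for other in topic):
                topic.append(cand)
                break
        else:
            topics.append([cand])
    return topics

topics = group_topics(["keyword extraction", "extracting keywords", "graph model"])
```

"keyword extraction" and "extracting keywords" share the stem "keyword" and land in one topic, while "graph model" forms a topic of its own.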
2. Graph Based Ranking
The graph constructed here is a complete weighted graph where the vertices are topics and the edge between two topics ti and tj is assigned a weight based on the strength of the semantic relation between the two topics. A strong semantic relationship is determined based on how close the keyphrase candidates of the two topics appear in the document and is defined as follows:
w(ti, tj) = Σ over ci ∈ ti, cj ∈ tj of dist(ci, cj), where dist(ci, cj) = Σ over pi ∈ pos(ci), pj ∈ pos(cj) of 1 / |pi - pj|

Here, dist(ci, cj) refers to the reciprocal distances between the offset positions of the candidate keyphrases ci and cj in the document, and pos(ci) represents all the offset positions of the candidate keyphrase ci. Unlike TextRank or SingleRank, TopicRank doesn't require choosing a window size, because it computes the weights from the distances between offset positions instead. The next step is to apply the graph-based TextRank ranking algorithm to rank the topics.
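The edge weight between two topics can be sketched directly from this definition: sum the reciprocal distances between every pair of offset positions of their candidate keyphrases. The `positions` mapping below is illustrative data, not output of a real pipeline.

```python
def topic_edge_weight(topic_i, topic_j, positions):
    """Sum of reciprocal offset distances between all candidate pairs
    of two topics. positions: candidate keyphrase -> list of offsets."""
    weight = 0.0
    for ci in topic_i:
        for cj in topic_j:
            for pi in positions[ci]:
                for pj in positions[cj]:
                    weight += 1.0 / abs(pi - pj)
    return weight

positions = {"graph model": [3], "keyword extraction": [0, 10]}
w = topic_edge_weight(["graph model"], ["keyword extraction"], positions)
# contributions: 1/|3-0| + 1/|3-10|
```

Candidates that appear close together in the document contribute large reciprocal distances, so nearby topics end up strongly connected.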
3. Keyphrase Selection
Once the topic ranking is done, the top K topics are selected and the most significant keyphrase per topic is chosen, outputting K keyphrases for the given document. There are three strategies for finding the best keyphrase for a given topic. The first converts all the keyphrases back to their generic form and chooses the keyphrase that appears first in the document. The second chooses the most frequent keyphrase, while the third selects the centroid of the cluster: the candidate that is most similar to the other candidates of the cluster.
PositionRank
PositionRank is also a graph-based model; it additionally incorporates the positions of words and their frequencies in a document to compute a position-biased PageRank score for each word. This algorithm has three steps:
- Graph Construction at Word Level
- Position Biased PageRank
- Formation of Candidate Phrases
1. Graph Construction at Word Level
Just as in TextRank, a syntactic filter is applied first to select only Nouns and Adjectives as candidate words. A weighted graph is then constructed with each of these selected candidate words as vertices. Two vertices are connected by an edge if the corresponding words co-occur within a window of contiguous tokens. The weight for these edges is determined by the co-occurrence count of the two words within the window of contiguous tokens.
2. Position Biased PageRank
Let the graph constructed above be G and M be its adjacency matrix. An element mij ∈ M is set to the weight of edge (vi, vj) if there exists an edge between nodes vi and vj, and is set to 0 otherwise. Let M̃ be the normalized form of the matrix M, defined as:

m̃ij = mij / Σ over j of mij, if that row sum is non-zero, and 0 otherwise
The next step resembles the Personalized PageRank algorithm, where a vector of size equal to the number of vertices is selected and the random walk is allowed to jump, or 'teleport', to any other node in the graph with the probability given by the vector:

S = α · M̃^T · S + (1 - α) · p

Here, S is the vector of PageRank scores, α is the damping factor that facilitates the 'teleport' operation, and p is the teleportation vector. With a uniform p, the random walk at a node vi can teleport to any other node in the graph with equal probability.
This teleportation vector can be used to personalize or in this case inject bias such that the random walk prefers a node that has higher probability in the graph. The way this is done is as follows:
- Each candidate word is weighted by its inverse position in the document, computed before any syntactic filtering is applied.
- If the same word appears multiple times in the target document, all of its inverse-position weights are summed. This helps in weighting frequently occurring words.
As an example, a word at the 2nd, 5th and 10th positions will get a weight of 1/2 + 1/5 + 1/10 = 0.8.
Finally, the PageRank scores are computed recursively until the difference between two consecutive iterations is less than 0.001 or 100 iterations are reached.
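The position-biased recursion can be sketched as below. Assumptions: the word graph is given as a weighted adjacency dict, positions are 1-based (so the 2nd/5th/10th example above maps to weights 1/2 + 1/5 + 1/10), and the teleportation vector p is normalized into a probability distribution before use.

```python
def positionrank_scores(adj, positions, alpha=0.85, tol=1e-3, max_iter=100):
    """adj: word -> {neighbor: edge weight}; positions: word -> 1-based offsets.
    Iterates the position-biased recursion S = alpha*M~^T*S + (1-alpha)*p."""
    words = list(adj)
    # teleportation vector: sum of inverse positions, normalized to sum to 1
    raw = {w: sum(1.0 / pos for pos in positions[w]) for w in words}
    total = sum(raw.values())
    p = {w: raw[w] / total for w in words}
    out_sum = {w: sum(adj[w].values()) for w in words}      # row sums of M
    scores = {w: 1.0 / len(words) for w in words}
    for _ in range(max_iter):
        new = {w: alpha * sum(adj[v][w] / out_sum[v] * scores[v]
                              for v in words if w in adj[v])
               + (1 - alpha) * p[w]
               for w in words}
        # stop when consecutive iterations differ by less than the tolerance
        if sum(abs(new[w] - scores[w]) for w in words) < tol:
            scores = new
            break
        scores = new
    return scores

adj = {"graph": {"model": 1.0}, "model": {"graph": 1.0}}
positions = {"graph": [1], "model": [2]}   # "graph" appears earlier
scores = positionrank_scores(adj, positions)
```

Even though the two words are symmetrically connected, "graph" ends up with the higher score because its earlier position gives it more teleportation mass.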
3. Formation of Candidate Phrases
Candidate words that occur together are collapsed into phrases. Only noun phrases are considered, specifically those matching the regular expression (adjective)*(noun)+, up to trigrams. The scores of the individual words are summed, and the top-scoring keyphrases are returned as the final results.
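This phrase-formation step can be sketched by scanning the tagged token stream for runs matching (adjective)*(noun)+. The tags and word scores in the usage example are hand-supplied for illustration, and runs longer than three words are simply dropped in this sketch.

```python
def form_phrases(tagged_tokens, word_scores, max_len=3):
    """Collect maximal (adjective)*(noun)+ runs up to max_len words,
    scoring each phrase as the sum of its individual word scores."""
    phrases, run = {}, []

    def flush():
        # keep the run only if it ends in a noun and fits within max_len
        if run and run[-1][1] == "NOUN" and len(run) <= max_len:
            phrase = " ".join(w for w, _ in run)
            phrases[phrase] = sum(word_scores.get(w, 0.0) for w, _ in run)

    for word, tag in tagged_tokens + [("", "X")]:   # sentinel flushes last run
        if tag == "ADJ":
            if run and run[-1][1] == "NOUN":        # noun run ended; start anew
                flush()
                run = []
            run.append((word, tag))
        elif tag == "NOUN":
            run.append((word, tag))
        else:
            flush()
            run = []
    return phrases

tagged = [("efficient", "ADJ"), ("keyword", "NOUN"), ("extraction", "NOUN"),
          ("is", "VERB"), ("fun", "NOUN")]
scores = {"efficient": 1.0, "keyword": 2.0, "extraction": 3.0, "fun": 1.0}
phrases = form_phrases(tagged, scores)
```

The run "efficient keyword extraction" matches (adjective)*(noun)+ and receives the summed score of its three words.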
In this article, we covered the origins of the graph-based method and how it can be employed to build various keyword extraction algorithms. We looked at TextRank, which builds upon the existing PageRank algorithm, and SingleRank, which is a slight modification of TextRank. We also looked at TopicRank, which explores a different route by ranking the topics present in the document in order to extract relevant keywords, while PositionRank focuses on the positions of different terms in the document. In the next article we will look at how to evaluate these graph-based as well as statistical approaches. We will learn about different evaluation metrics such as F1, Mean Reciprocal Rank and NDCG, implement all of the keyword extractors, and see how they perform on various datasets based on the different evaluation metrics.
About Me: I graduated with a Master's in Computer Science from ASU and am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.