(NLP) TextRank

CHEN TSU PEI
NLP-trend-and-review-en
3 min read · Jan 18, 2018

Now, I am working on a project trying to extract keywords from comments.

To begin, I did a survey on keyword extraction. There are three well-known approaches:

  1. TF-IDF — frequency-based
  2. TextRank — graph-based
  3. Supervised learning — based on lexical and syntactic features

Due to the lack of training data in my case, I will focus on unsupervised methods. TextRank looks like a feasible choice! Let’s get started!

The idea of TextRank is similar to PageRank; if you already know something about PageRank, it will be easy to follow.

Let’s take a glance at PageRank first. Imagine each webpage as a node, with a connection between two nodes whenever one page hyperlinks to the other. If a webpage has many connections with others, it is more likely to be a crucial page. By the same reasoning, we can measure the importance of a word with TextRank.

Between nodes (or vertices) there are links, and each vertex can vote for other vertices. The score of a vertex is defined as

S(Vi) = (1 − d) + d × Σ_{Vj ∈ In(Vi)} S(Vj) / |Out(Vj)|

where d is a damping factor (usually 0.85), In(Vi) is the set of vertices pointing to Vi, and Out(Vj) is the set of vertices that Vj points to.

from TextRank By Rada Mihalcea and Paul Tarau

Following this equation, each vertex finally settles on a score after a number of iterations.
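To make the iteration concrete, here is a minimal sketch of the score update on a tiny toy graph (the word graph and the damping factor d = 0.85 are illustrative assumptions, not the paper’s data):

```python
def textrank_scores(graph, d=0.85, max_iter=30, tol=1e-4):
    """graph: dict mapping each vertex to the set of its neighbours (undirected)."""
    scores = {v: 1.0 for v in graph}          # initialize every vertex with 1
    for _ in range(max_iter):
        new_scores = {}
        for v in graph:
            # each neighbour j "votes" for v with its own score,
            # split evenly across all of j's links
            rank = sum(scores[j] / len(graph[j]) for j in graph[v])
            new_scores[v] = (1 - d) + d * rank
        # stop once the largest per-vertex change drops below the threshold
        converged = max(abs(new_scores[v] - scores[v]) for v in graph) < tol
        scores = new_scores
        if converged:
            break
    return scores

# hypothetical toy word graph
graph = {
    "compatibility": {"systems", "linear"},
    "systems": {"compatibility", "linear"},
    "linear": {"compatibility", "systems", "constraints"},
    "constraints": {"linear"},
}
scores = textrank_scores(graph)
```

Running this, the most connected word ("linear") ends up with the highest score, which is exactly the intuition behind the voting scheme.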

The paper also compares directed and undirected graphs (as well as weighted and unweighted ones). For an undirected graph, the convergence curve is more gradual than for a directed graph.

So next, how should we convert text into a graph?

Intuitively, if we want to do keyword extraction, we should use words as vertices; for text summarization, we would use sentences as vertices instead.

The biggest problem is how to define the relation (the edges) between vertices. Here, the paper uses a co-occurrence relation: if two words co-occur within a window of N words, we say they are linked. (N can be set between 2 and 10.)
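A short sketch of this co-occurrence relation: two words get an (undirected) edge when they appear within N tokens of each other. The token list below is made up for illustration:

```python
def cooccurrence_edges(tokens, window=2):
    """Link every pair of distinct words that co-occur within `window` tokens."""
    edges = set()
    for i, w1 in enumerate(tokens):
        # look ahead at the next window - 1 tokens only,
        # so each pair is considered once
        for w2 in tokens[i + 1:i + window]:
            if w1 != w2:
                edges.add(tuple(sorted((w1, w2))))  # undirected edge
    return edges

tokens = ["linear", "constraints", "natural", "numbers", "constraints"]
edges = cooccurrence_edges(tokens, window=2)
```

With window=2 only adjacent words are linked; a larger N produces a denser graph.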

So far, we have the basic concept of TextRank: use words as the vertices of a graph, and define the relation between vertices by co-occurrence.

To make it better, we can apply syntactic filters to the vertices, for example, only adding nouns and verbs to the graph. According to the paper’s experiments, keeping only nouns and adjectives gave the best results.

Let’s dive deeper into the experiment described in the paper. The following are its simplified steps.

  1. tokenize the text
  2. annotate the tokens with part-of-speech tags
  3. pass the words through the syntactic filter
  4. create links between words that co-occur within a window of size N
  5. obtain an undirected, unweighted graph
  6. initialize each vertex with a score of 1
  7. iterate 20-30 times, until the change falls below the threshold of 0.0001
  8. finally, take the top T words as keywords (where T is set to a third of the number of vertices in the graph)
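The steps above can be sketched end to end. In this sketch the POS tagging of step 2 is assumed to have happened already (the input is a list of (word, tag) pairs, using Penn Treebank-style tags), and the syntactic filter keeps only nouns ("NN") and adjectives ("JJ"); the tagged sentence is a made-up example:

```python
def extract_keywords(tagged_tokens, window=2, d=0.85,
                     max_iter=30, tol=1e-4, keep=("NN", "JJ")):
    # steps 3-4: apply the syntactic filter, then link filtered words
    # that co-occur within the window
    words = [w for w, t in tagged_tokens]
    graph = {w: set() for w, t in tagged_tokens if t in keep}
    for i, w1 in enumerate(words):
        if w1 not in graph:
            continue
        for w2 in words[i + 1:i + window]:
            if w2 in graph and w2 != w1:
                graph[w1].add(w2)
                graph[w2].add(w1)
    # steps 6-7: initialize scores with 1 and iterate until convergence
    scores = {v: 1.0 for v in graph}
    for _ in range(max_iter):
        new = {v: (1 - d) + d * sum(scores[j] / len(graph[j])
                                    for j in graph[v])
               for v in graph}
        delta = max(abs(new[v] - scores[v]) for v in graph)
        scores = new
        if delta < tol:
            break
    # step 8: keep the top third of the vertices as keywords
    top = max(1, len(scores) // 3)
    return sorted(scores, key=scores.get, reverse=True)[:top]

tagged = [("compatibility", "NN"), ("of", "IN"), ("systems", "NN"),
          ("of", "IN"), ("linear", "JJ"), ("constraints", "NN")]
keywords = extract_keywords(tagged, window=3)
```

On this toy input the graph is a four-word chain, so one of the two central words ("systems" or "linear") comes out on top.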

P.S. TextRank is an unsupervised algorithm, and we only consider single-word keywords here.

Reference

Rada Mihalcea and Paul Tarau. TextRank: Bringing Order into Texts. EMNLP 2004.

This publication is no longer updated! Please head over to https://tsupei.github.io, where more NLP articles are continually posted!