Two minutes NLP — Keyword and keyphrase extraction with PKE

PKE, TextRank, TopicRank, and YAKE!

Fabio Chiusano
NLPlanet

Hello fellow NLP enthusiasts! I recently explored several approaches to extracting keywords from documents and came across the well-known PKE library. It uses graph-based and statistical algorithms to extract keywords and keyphrases, an approach that is simpler than the omnipresent transformer-based models but no less worth knowing. Enjoy! 😄

PKE (Python Keyphrase Extraction) is an open-source, Python-based keyword and keyphrase extraction library. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models.

Unsupervised keyphrase extraction models

PKE currently implements several graph-based models, including TextRank and TopicRank, both covered below.

Graph-based models roughly follow this process:

  1. Identify text units that may be candidate keywords or keyphrases, and add them as vertices in a graph.
  2. Identify relations that connect such text units, and use these relations to draw edges between vertices in the graph.
  3. Iterate a graph-based ranking algorithm (e.g. PageRank) until convergence. A graph-based ranking algorithm gives a score to each vertex, which represents its “importance” in the graph.
  4. Sort vertices based on their score and select the best ones as keywords or keyphrases.
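
The four steps above can be sketched in plain Python. This is a minimal illustration, not PKE's implementation: the tiny hand-built graph, the `pagerank` helper, and the damping factor of 0.85 are all assumptions made for the example.

```python
def pagerank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """Score the vertices of an undirected graph given as {vertex: set_of_neighbors}."""
    n = len(graph)
    scores = {v: 1.0 / n for v in graph}
    for _ in range(max_iter):
        new_scores = {}
        for v in graph:
            # Each neighbor shares its current score equally among its own neighbors.
            incoming = sum(scores[u] / len(graph[u]) for u in graph[v])
            new_scores[v] = (1 - damping) / n + damping * incoming
        converged = max(abs(new_scores[v] - scores[v]) for v in graph) < tol
        scores = new_scores
        if converged:
            break
    return scores


# Steps 1 and 2: candidate words as vertices, relations as edges (hand-built here).
graph = {
    "python": {"language", "programming", "code"},
    "language": {"python", "programming"},
    "programming": {"python", "language"},
    "code": {"python"},
}

# Step 3: iterate the ranking until convergence.
scores = pagerank(graph)

# Step 4: sort the vertices by score, best first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy graph the best-connected vertex ends up on top, which is the intuition behind "importance" in step 3.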

PKE also implements statistical models such as YAKE! (covered below).

Statistical models rely on occurrence statistics computed over the documents, which are often combined with heuristics to obtain a score for each candidate keyword or keyphrase.

Let’s see how some of these algorithms work with code samples.

TextRank

The TextRank algorithm is a graph-based keyphrase extraction model.

It builds its graph using nouns and adjectives as vertices and adds an edge between two of them whenever they occur near each other in the documents (allowing for a certain number of intervening words between them).
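
A rough sketch of this graph construction, assuming the words have already been part-of-speech tagged. The tags below are written by hand, and `KEEP`, `cooccurrence_graph`, and `window` are illustrative names rather than PKE's API. Counting proper nouns as nouns is also an assumption.

```python
from collections import defaultdict

# Part-of-speech tags kept as vertices (proper nouns included by assumption).
KEEP = {"NOUN", "PROPN", "ADJ"}


def cooccurrence_graph(tagged_words, window=2):
    """Link kept words that appear within `window` consecutive tokens of each other."""
    graph = defaultdict(set)
    for i, (word, pos) in enumerate(tagged_words):
        if pos not in KEEP:
            continue
        # Look ahead at the next `window - 1` tokens of the original text.
        for other, other_pos in tagged_words[i + 1 : i + window]:
            if other_pos in KEEP and other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


# Hand-tagged toy sentence: "Python is a general purpose programming language"
tagged = [
    ("Python", "PROPN"), ("is", "AUX"), ("a", "DET"), ("general", "ADJ"),
    ("purpose", "NOUN"), ("programming", "NOUN"), ("language", "NOUN"),
]

adjacent_only = cooccurrence_graph(tagged, window=2)
wider = cooccurrence_graph(tagged, window=4)
```

With `window=2` only directly adjacent words are linked, so "Python" stays isolated. With `window=4` up to two intervening words are allowed, and "Python" gets connected to "general".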

Then, a graph-based ranking algorithm (derived from Google’s PageRank) is used to give scores to each candidate keyword. Finally, adjacent candidate keywords with high scores are merged into keyphrases, and the ones with the highest scores are the output of the algorithm.

Here is an example of applying TextRank to an excerpt of the Wikipedia page of Python.
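
A minimal sketch of how this might look with PKE, following the library's documented TextRank usage as I understand it. It assumes `pke` and a spaCy English model are installed. The `textrank_keyphrases` wrapper is a hypothetical helper, and the parameter values (`window=2`, `top_percent=0.33`) are illustrative choices. Exact signatures can vary between PKE releases.

```python
def textrank_keyphrases(text, n=10):
    """Return up to n (keyphrase, score) pairs extracted with PKE's TextRank."""
    import pke  # third-party; imported lazily so the helper can be defined without it

    extractor = pke.unsupervised.TextRank()
    # Load the raw text; PKE preprocesses it with spaCy.
    extractor.load_document(input=text, language="en")
    # Build the word graph from nouns, proper nouns and adjectives, rank the words,
    # and compose keyphrases from the top third of the ranked words.
    pos = {"NOUN", "PROPN", "ADJ"}
    extractor.candidate_weighting(window=2, pos=pos, top_percent=0.33)
    return extractor.get_n_best(n=n)
```

Calling `textrank_keyphrases(text)` on any English text, for instance a paragraph copied from the Wikipedia page of Python, should return a list of `(keyphrase, score)` tuples.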

TopicRank

TopicRank is another graph-based keyphrase extraction algorithm but, unlike TextRank, its candidate keyphrases are the longest noun phrases in the documents.

These noun phrases are then grouped into topics, where two phrases are grouped together if more than 25% of their words overlap.
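
The grouping step can be approximated as follows. This is a simplified sketch: measuring overlap as Jaccard similarity over whole words and merging greedily are assumptions made for the example, and the clustering PKE actually performs may differ. The helper names `overlap` and `group_into_topics` are made up.

```python
def overlap(phrase_a, phrase_b):
    """Share of distinct words two phrases have in common (Jaccard similarity)."""
    words_a, words_b = set(phrase_a.split()), set(phrase_b.split())
    return len(words_a & words_b) / len(words_a | words_b)


def group_into_topics(phrases, threshold=0.25):
    """Greedily merge phrases whose word overlap exceeds the threshold."""
    topics = []
    for phrase in phrases:
        # Find every existing topic containing a phrase similar enough to this one.
        matches = [t for t in topics if any(overlap(phrase, p) > threshold for p in t)]
        merged = [phrase]
        for topic in matches:
            merged.extend(topic)
            topics.remove(topic)
        topics.append(merged)
    return topics


phrases = [
    "programming language",
    "high-level programming language",
    "code readability",
    "readability",
    "garbage collection",
]
topics = group_into_topics(phrases)
```

Here "programming language" and "high-level programming language" share two of three distinct words, so they land in the same topic, while "garbage collection" stays on its own.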

Then, TopicRank builds a graph using topics as vertices and adds an edge between every pair of vertices, weighted according to how close the two topics appear in the document.
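
One way to turn "closeness" into an edge weight is to sum the reciprocal distances between the positions where two topics are mentioned. The sketch below does that. The word offsets are invented, `topic_edge_weights` is a hypothetical helper, and the exact weighting used by PKE is not shown here.

```python
from itertools import combinations


def topic_edge_weights(topic_positions):
    """Weight each topic pair by the summed reciprocal distance between their mentions."""
    weights = {}
    for topic_a, topic_b in combinations(topic_positions, 2):
        weights[(topic_a, topic_b)] = sum(
            1.0 / abs(i - j)
            for i in topic_positions[topic_a]
            for j in topic_positions[topic_b]
        )
    return weights


# Invented word offsets at which each topic is mentioned in a document.
positions = {
    "language": [1, 10],
    "readability": [3],
    "garbage collection": [40],
}
weights = topic_edge_weights(positions)
```

Topics mentioned close together, and mentioned often, get heavier edges: "language" and "readability" are far more strongly linked here than either is to "garbage collection".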

Next, the same graph-based ranking algorithm as in TextRank is applied, and the topics with the best scores are selected. The algorithm finally outputs one representative keyphrase from each of the selected topics.

Let’s see some sample code with TopicRank.
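
A minimal sketch with PKE, again following the library's documented usage as I understand it and assuming `pke` and a spaCy English model are installed. The `topicrank_keyphrases` wrapper is a hypothetical helper. The `threshold=0.74` and `method="average"` values are taken from PKE's TopicRank example and may differ in your version.

```python
def topicrank_keyphrases(text, n=10):
    """Return up to n (keyphrase, score) pairs extracted with PKE's TopicRank."""
    import pke  # third-party; imported lazily so the helper can be defined without it

    extractor = pke.unsupervised.TopicRank()
    extractor.load_document(input=text, language="en")
    # Candidates: longest sequences of nouns, proper nouns and adjectives.
    extractor.candidate_selection(pos={"NOUN", "PROPN", "ADJ"})
    # Cluster candidates into topics, build the topic graph, and rank the topics.
    extractor.candidate_weighting(threshold=0.74, method="average")
    return extractor.get_n_best(n=n)
```

The clustering threshold is expressed as a distance, so a value of 0.74 corresponds roughly to the 25% overlap mentioned above.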

YAKE!

YAKE! is a statistical keyphrase extraction model.

It tokenizes the documents into words and then computes some statistics for each of them, such as how often they occur, whether they appear in sentences with many different words, whether they are capitalized, etc.
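
Statistics of this kind can be gathered with a few lines of plain Python. The sketch below is only a rough illustration: YAKE! itself uses more refined features and a specific scoring formula, neither of which is reproduced here. The `word_statistics` helper and the naive regex-based tokenization are assumptions.

```python
import re
from collections import defaultdict


def word_statistics(text):
    """Collect simple per-word statistics: count, capitalization, context diversity."""
    stats = defaultdict(lambda: {"count": 0, "capitalized": 0, "context": set()})
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    for sentence in sentences:
        tokens = re.findall(r"[A-Za-z]+", sentence)
        lowered = {t.lower() for t in tokens}
        for position, token in enumerate(tokens):
            entry = stats[token.lower()]
            entry["count"] += 1
            # Distinct other words sharing a sentence with this word.
            entry["context"] |= lowered - {token.lower()}
            # A capital letter at the start of a sentence carries no signal.
            if position > 0 and token[0].isupper():
                entry["capitalized"] += 1
    return {
        word: {
            "count": s["count"],
            "capitalized": s["capitalized"],
            "context_size": len(s["context"]),
        }
        for word, s in stats.items()
    }


text = "Python is a programming language. Many developers like Python. It is readable."
stats = word_statistics(text)
print(stats["python"])
```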

These statistics are then combined into a final score, which is used to:

  • Return the best-scoring words, i.e. the extracted keywords (note that in YAKE! a lower score marks a more relevant word);
  • Combine adjacent well-scoring words into keyphrases.

Let’s see some sample code with YAKE!.
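
A minimal sketch with PKE, following the library's documented usage as I understand it and assuming `pke` and a spaCy English model are installed. The `yake_keyphrases` wrapper is a hypothetical helper, and the values `n=3`, `window=2`, and `threshold=0.8` are illustrative. Signatures can vary between PKE releases.

```python
def yake_keyphrases(text, n=10):
    """Return up to n (keyphrase, score) pairs extracted with PKE's YAKE model."""
    import pke  # third-party; imported lazily so the helper can be defined without it

    extractor = pke.unsupervised.YAKE()
    extractor.load_document(input=text, language="en")
    # Candidates: word sequences of up to three words.
    extractor.candidate_selection(n=3)
    # Compute the per-word statistics and combine them into candidate scores.
    extractor.candidate_weighting(window=2, use_stems=False)
    # The threshold filters out near-duplicate keyphrases from the result.
    return extractor.get_n_best(n=n, threshold=0.8)
```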
