Exploring Different Keyword Extractors — Evaluation Metrics and Strategies

Ishan Shrivastava · GumGum Tech Blog · Jul 14, 2020 · 10 min read

It is extremely important to understand the different evaluation metrics and when to use them. Evaluating your model on inadequate metrics and then judging it by the improvements achieved on those metrics is a huge trap. Often, especially in industry, these metrics are the indicators that decide whether a newer model makes it to production. Therefore, as a Data Scientist, one should be aware of the pros and cons of different evaluation metrics in order to avoid falling into this trap.

Evaluating a Keyword Extraction model is not as straightforward as evaluating a model for a Classification problem. Keyword extraction is fundamentally a ranking task rather than a classification task: we expect relevant keywords or key-phrases (going forward I will use these terms interchangeably) to be ranked higher than irrelevant key-phrases. Evaluating such a system therefore means comparing two ranked lists of key-phrases. A traditional classification task simply predicts which class a sample belongs to and does not consider any form of ranking during evaluation. Keyword extraction, on the other hand, requires rank-aware evaluation metrics. In the sections that follow we will explore the shortcomings of traditional evaluation metrics such as F1, Precision and Recall, and then look at the following rank-aware metrics:

  • Mean Reciprocal Rank (MRR)
  • Mean Average Precision (MAP)
  • Normalized Discounted Cumulative Gain (nDCG)

Evaluation Metric: F1, Precision and Recall

The inherent assumption when using F1, Precision and Recall is that Keyword Extraction is treated as a classification problem. Here we consider a document as a set of tokens, and the two classes indicate whether a token (or sequence of tokens) is a keyword or not.

This inherent assumption makes these metrics weaker because Keyword Extraction is primarily a Ranking problem rather than a Classification problem. F1, Precision and Recall are considered rank-less metrics because they do not address the ranks of the keywords at all.

Generally these metrics are not targeted at the top-K extracted key-phrases, but they can be bounded to the top K in the form of Precision@K and Recall@K; their harmonic mean gives F1@K. Precision@K gives the percentage of the top-K extracted key-phrases that are relevant. This introduces some sense of top-K evaluation, but these metrics still do not consider the order within the top K. To understand the shortcomings of this approach, consider the figure below, which shows list1 as the output of one of our models (model A) and list2 as the output of another model (model B).

Both models extract 6 relevant and 4 irrelevant key-phrases and therefore both score 0.6 on Precision@10. But model B seems better because it extracts the relevant key-phrases higher up the order than model A does. Therefore we need metrics that are able to capture such differences as well.
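To make this concrete, here is a minimal Python sketch of Precision@K. The 0/1 relevance lists are hypothetical stand-ins for list1 and list2 in the figure: both contain 6 relevant key-phrases in the top 10, so both score the same Precision@10 even though one ranks all the relevant key-phrases first.

```python
# Precision@K: fraction of the top-K extracted key-phrases that are relevant.
# 1 marks a relevant key-phrase, 0 an irrelevant one (illustrative lists only).
def precision_at_k(relevance, k):
    return sum(relevance[:k]) / k

model_a = [0, 1, 0, 1, 1, 0, 1, 0, 1, 1]  # relevant phrases scattered lower down
model_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # relevant phrases ranked at the top

print(precision_at_k(model_a, 10))  # 0.6
print(precision_at_k(model_b, 10))  # 0.6 -- identical, despite the better ranking
```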

These metrics tell us whether we are good at finding relevant key-phrases, but we need metrics that evaluate whether we are good at finding and ranking relevant key-phrases. In other words, given two models, we would like the metric to differentiate between them based on which model extracts relevant key-phrases higher up the order. Therefore we need metrics such as Mean Reciprocal Rank (MRR), Mean Average Precision (MAP) and nDCG that allow us to evaluate the quality of the extracted keywords based on their ranks as well. Let's look at these in the sections to follow.

Evaluation Metric: Mean Reciprocal Rank

Mean Reciprocal Rank is a measure for evaluating models that return a ranked list of key-phrases for each document. MRR only cares about the single highest-ranked relevant item. If the model returns its first relevant key-phrase in the third spot, then that is what MRR cares about; it doesn't care whether the other relevant key-phrases (assuming there are any) are ranked number 4 or number 20.

MRR gives the averaged reciprocal rank of the first correct prediction:

MRR = (1/d) · Σ_{i=1..d} (1 / rank_i)

where d is the number of documents and rank_i is the rank at which the first correct key-phrase of document i was found. Consider the example below, which shows the computation of MRR over a dataset with 3 documents:
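As a rough illustration, here is a minimal sketch of MRR following the definition above. The per-document relevance lists are made up for the example and are not the ones from the figure.

```python
# MRR: average of 1/rank of the first relevant key-phrase in each document's
# extracted list. A list with no relevant key-phrase contributes 0.
def reciprocal_rank(relevance):
    for i, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def mean_reciprocal_rank(all_docs):
    return sum(reciprocal_rank(r) for r in all_docs) / len(all_docs)

docs = [
    [0, 0, 1, 0],  # first relevant key-phrase at rank 3 -> 1/3
    [1, 0, 0, 1],  # first relevant key-phrase at rank 1 -> 1
    [0, 1, 0, 0],  # first relevant key-phrase at rank 2 -> 1/2
]
print(mean_reciprocal_rank(docs))  # (1/3 + 1 + 1/2) / 3 ≈ 0.61
```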

PROS: MRR is appropriate for judging a model when either

  • There’s only one relevant key-phrase.
  • Or, only the highest-ranked key-phrase is needed.

CONS:

  • If multiple key-phrases are expected and need to be evaluated, then this is not a good evaluation metric.
  • It gives an extracted key-phrase list with a single relevant key-phrase the same weight as one with multiple relevant key-phrases.

Evaluation Metric: MAP (Mean Average Precision)

We previously saw the Precision@K metric, which gives the percentage of relevant key-phrases among the top K extracted key-phrases. The drawback of that approach is that it does not treat the extracted key-phrases as an ordered list. In other words, it does not evaluate considering the ranks at which relevant key-phrases are extracted.

The goal is to come up with a metric that penalizes the model when irrelevant key-phrases are extracted higher up the order, and that gradually decreases the significance of such errors (extraction of an irrelevant key-phrase) as we go down the list of extracted key-phrases. Mean Average Precision does just that. Here is how Mean Average Precision is calculated:

  1. For each document:
  • For each relevant key-phrase in the list of extracted key-phrases:
    * Compute the precision of the list up to and including that relevant key-phrase.
  • Average these sub-list precision scores to get the Average Precision for the document.
  2. Finally, take the mean of the Average Precision scores across all the documents.
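Here is a minimal sketch of this procedure; the binary relevance lists are hypothetical and only serve to illustrate the computation.

```python
# Average Precision (AP): precision is computed at every position that holds a
# relevant key-phrase and those precisions are averaged. MAP is the mean of the
# per-document AP scores.
def average_precision(relevance):
    hits, precisions = 0, []
    for i, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)  # precision of the list up to rank i
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_docs):
    return sum(average_precision(r) for r in all_docs) / len(all_docs)

doc1 = [1, 0, 1, 0, 1]  # AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.756
doc2 = [0, 1, 1, 0, 0]  # AP = (1/2 + 2/3) / 2 ≈ 0.583
print(mean_average_precision([doc1, doc2]))  # ≈ 0.669
```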

Let’s visualize this process:

PROS:

  • Naturally considers the ranking aspect in the evaluation
  • Penalizes more when irrelevant key-phrases are extracted higher up the list as compared to further down the list.

CONS:

  • It works well only when there are binary ratings (relevant/irrelevant), that is, when all relevant key-phrases are equally relevant and all irrelevant ones are equally irrelevant. If fine-grained numerical ratings are involved, that is, if there are levels of relevance or irrelevance, then this metric fails to make use of that fine-grained information.
  • If fine-grained numerical ratings are involved, then in order to use this metric we have to make the relevancy binary based on some threshold.

Evaluation Metric: nDCG (Normalized Discounted Cumulative Gain)

Just like MAP, nDCG also rewards placing a relevant key-phrase higher up the predicted list. However, nDCG goes one step further and is able to use the fact that some key-phrases may be more relevant than others. Therefore it also evaluates whether highly relevant key-phrases occur before key-phrases with medium or low relevance.

The first step in calculating nDCG is calculating the Discounted Cumulative Gain (DCG). DCG aims to capture the effectiveness of an algorithm by penalizing it when highly relevant key-phrases appear lower in the result list. DCG accumulated up to a particular rank k is given by:

DCG@k = Σ_{i=1..k} rel_i / log2(i + 1)

Here rel_i is the graded relevance of the key-phrase at position i. The denominator serves as a logarithmic reduction factor that penalizes in proportion to the position of the result.

Depending on various factors, the number of predicted key-phrases may vary from document to document, which makes raw DCG scores not comparable across documents. nDCG provides a score with proper upper and lower bounds, so that we can take a mean across all the per-document scores to report a final score. This is done through normalization: we determine the ideal ranking of the key-phrases and use it to compute the Ideal DCG.

To compute nDCG, for any key-phrase prediction, we have to compute:

  • The DCG of the prediction.
  • The DCG of the gold key-phrases, which would give the maximum DCG. This would be the Ideal DCG (IDCG).

The value of nDCG ranges between 0 and 1. Let's visualize the nDCG computation in the figure below for a single document whose extracted key-phrases also come with relevancy ratings.

To compute the nDCG for multiple documents, simply take the mean of the nDCG of the individual key-phrase predictions of all the documents.
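Here is a minimal sketch of that computation, assuming graded relevance values are available for both the gold key-phrases and the predictions. The numbers are made up, and the DCG uses the standard rel_i / log2(i + 1) form described above.

```python
import math

# DCG of a list of graded relevance values, in ranked order:
# position 1 is divided by log2(2), position 2 by log2(3), and so on.
def dcg(relevances):
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

# nDCG = DCG of the prediction / DCG of the ideal ordering of the gold phrases.
def ndcg(predicted_rel, gold_rel):
    idcg = dcg(sorted(gold_rel, reverse=True))
    return dcg(predicted_rel) / idcg if idcg > 0 else 0.0

gold = [3, 2, 2, 1]          # graded relevance of the annotated key-phrases
predicted = [2, 3, 0, 1, 2]  # graded relevance of the extracted list, in rank order
print(ndcg(predicted, gold))

# Mean nDCG over several documents is simply the average of the per-document scores.
docs = [([2, 3, 0, 1, 2], [3, 2, 2, 1]), ([3, 2, 1], [3, 2, 1])]
print(sum(ndcg(p, g) for p, g in docs) / len(docs))
```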

PROS:

  • nDCG is a good choice of evaluation metric when graded relevance values are available in your dataset. The primary reason for choosing this metric is to take the graded relevance values into account.
  • It is better than MAP in terms of evaluating the position of ranked key-phrases.

CONS:

  • It does not penalize for false positives, predictions that aren’t relevant. However this can be adjusted by using negative relevance for bad key-phrases in the results.

Evaluation in Action

In this section, we will evaluate the different Statistical and Graphical models (discussed in the previous two blogs of this series). We choose Mean Average Precision as our evaluation metric because we want to evaluate considering the ranks at which relevant key-phrases are extracted, and our datasets contain binary relevance (Relevant or Irrelevant) rather than graded relevance.

Let us quickly look at the two datasets that we have considered:

  1. DUC Dataset:
    • It has a total of 307 samples.
    • For these experiments I divided it into Train and Test with a 90%/10% split.
  2. Marujo Dataset:
    • 450 Train samples
    • 50 Test samples

Before we move ahead, we first need to carefully understand the strategies for determining true positives, i.e., how we decide that an extracted key-phrase matches an annotated one. Let us look at this matching criterion before turning to the results.

Criteria for Matching Phrases

For all the evaluation metrics, the first step is to actually determine if a key-phrase extracted by the algorithm is indeed relevant or not. This is done by matching the extracted key-phrase with a set of annotated key-phrases. There can be different criteria for matching.

Suppose our model extracts “Keyword Extractor” as a key-phrase and the set of annotated key-phrases consists of “Keyword Extractors”. Even though one is singular while the other is plural we would still like to count this as a match. On the other hand, if our model only extracts “Extractor” as a key-phrase, we would not want to consider this as a match.

Therefore, it is best to compare these key-phrases after stemming them. Stemming is a process in which we remove the suffixes from a word. Once stemmed, we can compare the sequences of stemmed words to determine a match. This way "Keyword Extractor" matches "Keyword Extractors", while "Extractor" doesn't. It also goes without saying that the order of the words after stemming matters as well.
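As an illustration of this matching criterion, here is a minimal sketch using NLTK's Porter Stemmer. The helper functions are hypothetical and only demonstrate the idea of strict matching on stemmed token sequences; they are not the exact code used in the analysis.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem every token of a phrase; the tuple preserves the word order.
def stem_phrase(phrase):
    return tuple(stemmer.stem(token) for token in phrase.lower().split())

# A predicted key-phrase counts as a true positive only if its full stemmed
# token sequence exactly matches one of the stemmed gold key-phrases.
def is_match(predicted, gold_phrases):
    gold_stems = {stem_phrase(g) for g in gold_phrases}
    return stem_phrase(predicted) in gold_stems

gold = ["Keyword Extractors"]
print(is_match("Keyword Extractor", gold))  # True  -- singular/plural still match
print(is_match("Extractor", gold))          # False -- partial phrase, no match
```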

For this analysis I do strict matching after stemming. I use NLTK's Porter Stemmer to convert all the predicted and gold keywords into their base forms. I used the implementations provided by the YAKE and PKE packages for the models used in this analysis. The graph below shows the performance of the algorithms discussed in this series:

Apart from TopicRank, there seems to be no single model that performs consistently well over both datasets. On the Marujo dataset, TopicRank only slightly outperforms the simpler TFIDF and TF statistical models. On the other hand, all the graphical models perform far better than all the statistical models on the DUC dataset.

Keyword Extraction is a heavily subjective task and, as can be seen, there is no single method that works well for all the different datasets. For the Marujo dataset, there is only a slight difference between the graphical and statistical models, while for the DUC dataset, the graphical models clearly shine over the statistical models.

Summary

In this article, we closely looked at the different evaluation metrics for Keyword Extraction. We looked at how traditional evaluation metrics such as F1, Precision and Recall fail to properly evaluate the keyword extraction task, and how MRR, MAP and nDCG overcome these drawbacks by considering the order in which relevant or irrelevant key-phrases are extracted. We also covered the pros and cons of these metrics and the scenarios in which one metric would be more suitable than another, as well as what we should use as a matching criterion when evaluating the keyword extraction task. Finally, we analyzed the performance of different Statistical and Graphical approaches based on Mean Average Precision to see how subjective this task really is.

About Me: I graduated with a Master's in Computer Science from ASU and am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.

We’re always looking for new talent! View jobs.

Follow us: Facebook | Twitter | Linkedin | Instagram
