TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Evaluation of an NLP model — latest benchmarks

And why it matters

9 min read · Apr 13, 2020


Photo by Carlos Muza on Unsplash

Need for an evaluation metric

Loss calculation in other areas

In most deep learning tasks, such as classification, regression, or image generation, it is fairly easy and straightforward to evaluate the performance of a model, because the solution space is finite or at least not very large.

In most classification problems, the number of labels is fixed, which makes it possible to calculate a score for each class and hence to calculate the loss/offset from the ground truth.

In case of image generation, the resolution of the output and the ground truth is fixed. Therefore, we can calculate the loss/offset from the ground truth at a pixel level.

In case of regression, the number of values in the output is fixed, and hence a loss can be calculated for each value, even though the possibilities for each value are infinite.
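For instance, with a fixed label set the comparison with the ground truth is direct; here is a minimal sketch of cross-entropy for a made-up three-class problem (the classes and probabilities are invented purely for illustration):

import math

# Model output: a probability for each class in a fixed label set (made-up numbers)
probs = {"cat": 0.7, "dog": 0.2, "car": 0.1}
ground_truth = "cat"

# Cross-entropy loss: negative log-probability assigned to the true class
loss = -math.log(probs[ground_truth])
print(round(loss, 3))  # ≈ 0.357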

Note: I use problem and task interchangeably in this post, as each problem is more or less defined around a task that a model needs to perform, so please don’t get confused by that.

Issues in loss calculation in NLP

In case of NLP, even if the output format is predetermined, the dimensions cannot be fixed. If we want our model to output a single sentence, it would be counterintuitive to restrict the number of words in that sentence, because we know that there are multiple ways to express the same information.

An Example

Photo by Olav Tvedt on Unsplash

If I ask you to choose the correct caption for the image above from the following options:

  • A car in front of a house.
  • A car on the road in front of a house.
  • A gray car on the road in front of a house.
  • A gray sports car in front of a house.

You can choose any of them and still be correct; none of them is incorrect, only the details vary. If the ground truth for this image in our dataset is “A gray car on the road”, how will you teach a model, using that ground truth label, that all 4 outputs are correct? Not so easy, is it?

Why does it matter to have good evaluation metrics?

Before we dive into the details and nuances of various metrics out there, I want to talk here not only about why it matters to have a good metric, but also what a good metric is — in my opinion — just in case you don’t make it till the end of the article.

The main purpose of developing these AI solutions is to apply them to real-world problems and make our lives easier and better. But our real world is not a simple one. So how do we decide which model to use for a particular problem? That is when these metrics come in handy.

If we are trying to solve two problems with one model, we would want to know the model’s performance on both of these tasks, to make an informed decision and to be aware of the trade-offs we are making. This is also where the “goodness” of a metric comes in. The real world is full of biases, and we don’t want our solutions to be biased, as that can have serious, hard-to-foresee consequences.

A quick example: suppose we are translating text from language X to English. If a particular sentence about Group A is translated to “They did a good job.” while the same sentence about Group B is translated to “They did a great job.”, that is a crystal-clear sign that our model is biased towards Group B. Such biases should be known before a model is deployed in the real world, and metrics should help us surface them.

Even though learning biases has more to do with the training data and less to do with the model architecture, I feel that having a metric for capturing biases, or a standard for them, would be a good practice to adopt.

BLEU Score — BiLingual Evaluation Understudy

As the name suggests, it was originally used to evaluate translations from one language to another.

How to calculate BLEU score?

Calculating unigram precision:

Step 1: Look at each word in the output sentence and assign it a score of 1 if it shows up in any of the reference sentences and 0 if it doesn’t.
Step 2: Normalize that count, so that it’s always between 0 and 1, by dividing the number of words that showed up in one of the reference translations by the total number of words in the output sentence.

Continuing with the example above:
Ground Truth: “A gray car on the road”
S1: “A car on the road” will get a score of 5/5,
S2: “A gray car in front of a house on the road” will get a score of 7/11,
S3: “on car gray the car road” will get a perfect score of 6/6!
In fact, S4: “car car car car car car” will also get a score of 6/6!!
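A minimal sketch of Steps 1 and 2 (naive unigram precision) that reproduces these toy scores; the helper below is my own illustration, not part of any library:

def unigram_precision(candidate, reference):
    ref_words = set(reference.lower().split())
    cand_words = candidate.lower().split()
    # Step 1: a word scores 1 if it appears in the reference, 0 otherwise
    matches = sum(w in ref_words for w in cand_words)
    # Step 2: normalize by the number of words in the output sentence
    return f"{matches}/{len(cand_words)}"

reference = "A gray car on the road"
candidates = [
    "A car on the road",                           # S1
    "A gray car in front of a house on the road",  # S2
    "on car gray the car road",                    # S3
    "car car car car car car",                     # S4
]
for c in candidates:
    print(c, "->", unigram_precision(c, reference))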

This does not seem right; we should not be scoring S3 and S4 so highly. To penalize such outputs, we use a combination of unigram, bigram, trigram, up to n-gram precisions, multiplying them together. Using n-grams captures the ordering of a sentence to some extent (the S3 scenario). We also cap the number of times each word is counted at the highest number of times it appears in any reference sentence, which avoids rewarding unnecessary repetition of words (the S4 scenario).
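A sketch of the clipped counting and the n-gram idea described above (again, my own illustrative helpers); clipping alone already brings S4 down, and bigrams penalize the scrambled S3:

from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def clipped_ngram_precision(candidate, reference, n):
    cand = ngrams(candidate.lower().split(), n)
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    # Each n-gram counts at most as many times as it appears in the reference
    clipped = sum(min(count, ref_counts[g]) for g, count in Counter(cand).items())
    return f"{clipped}/{len(cand)}"

reference = "A gray car on the road"
print(clipped_ngram_precision("car car car car car car", reference, 1))  # 1/6, not 6/6
print(clipped_ngram_precision("on car gray the car road", reference, 2))  # 0/5: word order matters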

Lastly, since precision alone rewards outputs that are too short and miss details, we apply a brevity penalty. We do this by comparing the output’s length to the length of the reference sentence that is closest to it in length, and penalizing outputs that come up short.

import math

def brevity_penalty(output_len, closest_ref_len):
    # No penalty when the output is at least as long as the closest reference
    if output_len >= closest_ref_len:
        return 1.0
    # Output is shorter: penalize it exponentially (formula from the BLEU paper)
    return math.exp(1 - closest_ref_len / output_len)

# BLEU score = unigram_precision * bigram_precision * ... * ngram_precision * brevity_penalty
# (the original paper combines the n-gram precisions as a weighted geometric mean)
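In practice you rarely compute this by hand; one off-the-shelf option (not mentioned above, but widely used) is NLTK’s sentence-level implementation. A quick check on the toy example, with the approximate score in the comment:

from nltk.translate.bleu_score import sentence_bleu

reference = "A gray car on the road".lower().split()
candidate = "A car on the road".lower().split()

# Default weights give equal importance to 1- to 4-gram precision; the brevity penalty is included
print(sentence_bleu([reference], candidate))  # ≈ 0.58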

The problem(s) with BLEU:

  • It doesn’t consider meaning: In reality, the words in a sentence contribute unequally to its meaning, but BLEU treats every word as equally important.
    Ground truth: I have a maroon car.
    S1: I have a blue car.
    S2: I have a red car.
    S3: I have a maroon boat.
    => All will get the same score, even though S1 and S3 are conveying the wrong information here.
  • It doesn’t directly consider sentence structure: It can’t capture syntax. The order of words doesn’t contribute much to the score of a sentence, nor, in the case of translations, does the order of sentences.
    For example, if I’m trying to evaluate the translations of a chapter in a novel, and I swap the first and second halves of the translation, it will only affect the BLEU score a tad bit. Meanwhile, the translation’s storyline will be completely distorted, which is only acceptable in Nolan movies.
  • It doesn’t handle morphologically rich languages well. If you want to say “I am tall” in French, “Je suis grand” and “Je suis grande” are both correct translations; the only difference is the gender marking on “grand(e)”. But to BLEU, these words are as different as night and day: if a model predicts one while the reference contains the other, BLEU simply counts it as a mismatch.
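A tiny illustration of that last point using NLTK’s implementation (an assumption on my part, since the post doesn’t name a library), with only 1- and 2-gram weights so these short sentences aren’t zeroed out; a single morphological variant already drops the score noticeably:

from nltk.translate.bleu_score import sentence_bleu

reference = [["je", "suis", "grand"]]
print(sentence_bleu(reference, ["je", "suis", "grand"], weights=(0.5, 0.5)))   # 1.0
print(sentence_bleu(reference, ["je", "suis", "grande"], weights=(0.5, 0.5)))  # ≈ 0.58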

SQuAD — Stanford Question Answering Dataset

In this benchmark, a question and a passage of text are given, and the model needs to predict the answer from the given text.

[In case you aren’t familiar with the fantasy reference, here is all you need to know to understand my analogy:
Harry Potter is a book series by JK Rowling;
Percy Jackson & the Olympians is an entirely different and unrelated series by Rick Riordan;
In the Harry Potter series, the protagonist, Harry Potter, is either in trouble or on his way into trouble for, like, 90% of the time;
Percy Jackson is the protagonist of the other book series I mentioned, and shares Harry’s tendency to get into trouble;
There is a character named Percy, albeit not a very important one, in the Harry Potter series.
Yep, that is all you need to know for this. :D]

If I give you a text from the Harry Potter series and ask you “Why were Percy Jackson and his friends in trouble?”, a human would be able to tell that the question and the text aren’t contextually related and hence that the question is not answerable; a Potter-head would just disown me. But an NLP model might just try to predict the most probable answer, which could be about some other character getting in trouble (hint: Harry Potter) or about something a character named Percy or Jackson did.

To address these weaknesses, SQuAD 2.0 combined existing SQuAD data with over 50,000 unanswerable questions written adversarially by crowd workers to look similar to answerable ones. To do well on SQuAD 2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.
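For intuition, a SQuAD 2.0-style record looks roughly like the sketch below; the passage and questions are made up to match the analogy, and answer_start is the character offset of the answer span in the context:

# A made-up, SQuAD 2.0-style entry: one answerable and one unanswerable question
paragraph = {
    "context": "Harry spent most of the year getting in and out of trouble at school.",
    "qas": [
        {
            "id": "q1",
            "question": "Where did Harry get into trouble?",
            "is_impossible": False,
            "answers": [{"text": "at school", "answer_start": 59}],
        },
        {
            "id": "q2",
            "question": "Why were Percy Jackson and his friends in trouble?",
            "is_impossible": True,  # the passage does not support any answer
            "answers": [],
        },
    ],
}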

You can check out the SQuAD leaderboard here.

MS MARCO — MAchine Reading COmprehension Dataset

It is a large-scale dataset focused on machine reading comprehension. It consists of the following tasks:

  • Question Answering — generate a well-formed answer (if possible), based on the context passages, that can be understood given the question and passage context.
  • Passage Ranking — rank a set of retrieved passages given a question.
  • Key Phrase Extraction — predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would.

The dataset started off focusing on QnA but has since evolved to focus on any problem related to search.

The dataset comprises 1,010,916 anonymized questions (sampled from Bing’s search query logs), each with a human-generated answer, plus 182,669 completely human-rewritten answers. In addition, the dataset contains 8,841,823 passages (extracted from 3,563,535 web documents retrieved by Bing) that provide the information necessary for curating the natural language answers.
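To make that concrete, here is a rough, made-up sketch of what a single QnA-style entry conceptually contains; the field names are illustrative and do not follow the official dataset schema:

# Illustrative only: field names are not the official MS MARCO schema
entry = {
    "query": "what color is maroon",
    "passages": [
        {"passage_text": "Maroon is a dark brownish-red color.", "is_relevant": True},
        {"passage_text": "Sports cars come in many colors and body styles.", "is_relevant": False},
    ],
    "answer": "Maroon is a dark brownish-red color.",
    "well_formed_answer": "Maroon is a dark brownish-red color.",
}

The passage-ranking task then uses this kind of relevance signal to order retrieved passages for a query, while the question-answering task generates the answer from the relevant ones.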

GLUE and SuperGLUE — General Language Understanding Evaluation

GLUE and SuperGLUE evaluate the performance of a model on a collection of tasks rather than a single one, to give a holistic view of its performance. They consist of single-sentence tasks, similarity and paraphrase tasks, and inference tasks.
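To make the three categories concrete, here are made-up examples in the spirit of SST-2 (single-sentence), MRPC (similarity/paraphrase), and MNLI (inference); the exact formats and label sets are described in the GLUE paper:

# Single-sentence task (SST-2 style): sentence -> sentiment
sst2_like = {"sentence": "The movie was a delight.", "label": "positive"}

# Similarity / paraphrase task (MRPC style): sentence pair -> paraphrase or not
mrpc_like = {
    "sentence1": "The company reported higher profits this quarter.",
    "sentence2": "Quarterly profits at the company went up.",
    "label": "paraphrase",
}

# Inference task (MNLI style): premise + hypothesis -> entailment / neutral / contradiction
mnli_like = {
    "premise": "A gray car is parked in front of a house.",
    "hypothesis": "There is a vehicle near the house.",
    "label": "entailment",
}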

I went through the papers for both of them and collected a brief overview of the tasks in each, organized into the tables below for a quick overview. If you want to understand each task in detail, please go through the papers, linked in the reference section at the end of the post.

Tasks in GLUE.
Tasks in SuperGLUE.
Examples for each task, given in the SuperGLUE paper. SOURCE.

Do check out the respective leaderboards for GLUE and SuperGLUE.

Knowing about these metrics will not help you improve the performance of your model by any means. However, it will certainly help you understand the bigger picture and the problems we are currently trying to solve in this field! I hope you found it useful.

References and other good reads:

I’m glad you made it till the end of this article. 🎉
I hope your reading experience was as enriching as the one I had writing this.💖

Do check out my other articles here.

If you want to reach out to me, my medium of choice would be Twitter.



Written by Ria Kulshrestha
AI enthusiast currently exploring SE @Google. Claps/Shares/Comments are appreciated 💖 https://twitter.com/Ree_____Ree