Two minutes NLP — GLUE Tasks and 2022 Leaderboard

Single-sentence tasks, similarity and paraphrase tasks, and inference tasks

Fabio Chiusano
NLPlanet
5 min read · Feb 28, 2022

Hello fellow NLP enthusiasts! Today I want to delve into one of the most popular NLP benchmarks in use today: GLUE. Having a good understanding of GLUE helps in understanding the strengths and weaknesses of popular NLP models, as many of them are evaluated with it. Enjoy! 😄

GLUE, the General Language Understanding Evaluation benchmark, is a collection of resources for training, evaluating, and analyzing natural language understanding systems.

GLUE is centered on nine English sentence understanding tasks, which cover a broad range of domains, data quantities, and difficulties. As the goal of GLUE is to spur the development of generalizable NLU systems, the benchmark is designed such that good performance should require a model to share substantial knowledge across all tasks, while still maintaining some task-specific components.

The final GLUE score is obtained by averaging the scores across all nine tasks.
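To make the aggregation concrete, here is a tiny Python sketch with made-up numbers (not real results). Tasks reported with two metrics (MRPC, QQP, STS-B) and MNLI's matched/mismatched test sets are first averaged within the task, and the nine task scores are then averaged into the final number.

```python
# Illustrative only: dummy per-task scores (not real results).
# Tasks with two metrics and MNLI's two test sets are averaged
# within the task before the nine-way mean is taken.
task_scores = {
    "CoLA": 60.0,                  # Matthews correlation
    "SST-2": 95.0,                 # accuracy
    "MRPC": (90.0 + 88.0) / 2,     # F1 / accuracy
    "QQP": (72.0 + 89.0) / 2,      # F1 / accuracy
    "STS-B": (88.0 + 87.0) / 2,    # Pearson / Spearman correlation
    "MNLI": (86.0 + 85.5) / 2,     # matched / mismatched accuracy
    "QNLI": 92.0,                  # accuracy
    "RTE": 70.0,                   # accuracy
    "WNLI": 65.0,                  # accuracy
}

glue_score = sum(task_scores.values()) / len(task_scores)
print(f"Final GLUE score: {glue_score:.1f}")
```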

Let’s see what these tasks are about.

Single-Sentence Tasks

1. CoLA (Corpus of Linguistic Acceptability)

  • Goal: determine if a sentence is grammatically correct or not.
  • Dataset: it consists of English acceptability judgments drawn from books and journal articles. Each example is a sequence of words annotated with whether it is a grammatically correct English sentence or not (see the quick sketch below).
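
As a minimal sketch of what the data looks like and how the task is scored, assuming you have the Hugging Face datasets library and scikit-learn installed (neither is part of GLUE itself), CoLA is evaluated with the Matthews correlation coefficient:

```python
from datasets import load_dataset
from sklearn.metrics import matthews_corrcoef

# Load the CoLA portion of GLUE; label 1 = acceptable, 0 = unacceptable
cola = load_dataset("glue", "cola")
print(cola["train"][0])  # {'sentence': ..., 'label': ..., 'idx': ...}

# CoLA is scored with the Matthews correlation coefficient;
# the all-ones "predictions" here are just a placeholder to show the call.
gold = cola["validation"]["label"]
predictions = [1] * len(gold)
print(matthews_corrcoef(gold, predictions))
```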

2. SST-2 (Stanford Sentiment Treebank)

  • Goal: determine if the sentence has a positive or negative sentiment.
  • Dataset: it consists of sentences from movie reviews and binary human annotations of their sentiment (see the quick look below).
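
A quick look at the data, again assuming the Hugging Face datasets library (an external tool, not part of GLUE itself):

```python
from datasets import load_dataset

# SST-2: single movie-review sentences with binary sentiment labels
sst2 = load_dataset("glue", "sst2")

# Map the integer label back to its name (0 = negative, 1 = positive)
label_names = sst2["train"].features["label"].names
example = sst2["train"][0]
print(example["sentence"], "->", label_names[example["label"]])
```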

Similarity and Paraphrase Tasks

3. MRPC (Microsoft Research Paraphrase Corpus)

  • Goal: determine if two sentences are paraphrases of one another.
  • Dataset: it’s a corpus of sentence pairs automatically extracted from online news sources, with human annotations indicating whether the sentences in the pair are semantically equivalent (i.e. paraphrases); see the scoring sketch below.
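
MRPC is reported with both accuracy and F1, which are averaged into the task score. A minimal sketch, assuming the Hugging Face datasets library and scikit-learn:

```python
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score

# MRPC: pairs of news sentences, label 1 if they are paraphrases, else 0
mrpc = load_dataset("glue", "mrpc")
print(mrpc["train"][0])  # {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}

# The task is reported with both accuracy and F1;
# the all-paraphrase "predictions" just show the computation.
gold = mrpc["validation"]["label"]
predictions = [1] * len(gold)
print("accuracy:", accuracy_score(gold, predictions), "F1:", f1_score(gold, predictions))
```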

4. QQP (Quora Question Pairs)

  • Goal: determine if two questions are semantically equivalent or not.
  • Dataset: it’s a collection of question pairs from the community question-answering website Quora, with human annotations indicating whether the questions in the pair are actually the same question.

5. STS-B (Semantic Textual Similarity Benchmark)

  • Goal: determine the similarity of two sentences with a score from one to five.
  • Dataset: it’s a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is annotated by humans with a similarity score from one to five (see the sketch below).
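
Unlike the other tasks, STS-B is a regression problem, and it is scored with Pearson and Spearman correlation. A sketch assuming the Hugging Face datasets library and SciPy (the "predictions" below are just the reversed gold scores, used only to show the metric calls):

```python
from datasets import load_dataset
from scipy.stats import pearsonr, spearmanr

# STS-B: sentence pairs with a continuous similarity score as the label
stsb = load_dataset("glue", "stsb")
print(stsb["train"][0])  # {'sentence1': ..., 'sentence2': ..., 'label': ..., 'idx': ...}

# STS-B is scored with Pearson and Spearman correlation;
# reversing the gold scores gives throwaway "predictions" for the demo.
gold = stsb["validation"]["label"]
predictions = gold[::-1]
print("Pearson:", pearsonr(gold, predictions)[0])
print("Spearman:", spearmanr(gold, predictions)[0])
```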

Inference Tasks

6. MNLI (Multi-Genre Natural Language Inference)

  • Goal: determine if a sentence entails, contradicts, or is neutral with respect to another sentence.
  • Dataset: it’s a crowdsourced collection of sentence pairs with textual entailment annotations. The premise sentences are gathered from ten different sources, including transcribed speech, fiction, and government reports. The dataset has two test sets: a matched (in-domain) one and a mismatched (cross-domain) one. The scores on the two test sets are averaged together to give the final score on the MNLI task (see the sketch below).
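
The matched/mismatched distinction shows up directly in the data. A sketch assuming the Hugging Face datasets library (the accuracies at the end are made-up placeholders):

```python
from datasets import load_dataset

# MNLI: premise/hypothesis pairs labeled entailment (0), neutral (1) or contradiction (2)
mnli = load_dataset("glue", "mnli")

# In-domain and cross-domain evaluation data live in separate splits
print(mnli["validation_matched"][0])
print(mnli["validation_mismatched"][0])

# The MNLI task score averages the matched and mismatched accuracies
matched_acc, mismatched_acc = 0.86, 0.855  # made-up numbers
print("MNLI score:", (matched_acc + mismatched_acc) / 2)
```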

7. QNLI (Question-answering Natural Language Inference)

  • Goal: determine if the answer to a question is contained in a second sentence or not.
  • Dataset: it’s a question-answering dataset consisting of question-paragraph pairs, where one of the sentences in the paragraph (drawn from Wikipedia) contains the answer to the corresponding question (written by an annotator).

8. RTE (Recognizing Textual Entailment)

  • Goal: determine if a sentence entails a given hypothesis or not.
  • Dataset: it is a combination of data from annual textual entailment challenges (i.e. from RTE1, RTE2, RTE3, and RTE5). Examples are constructed based on news and Wikipedia text.

9. WNLI (Winograd Natural Language Inference)

  • Goal: determine whether a sentence containing an ambiguous pronoun entails a second sentence in which the pronoun has been replaced by a candidate referent.
  • Dataset: it is built from the Winograd Schema Challenge, a reading comprehension task in which a system must read a sentence with a pronoun and select the referent of that pronoun from a list of choices. To convert the problem into sentence pair classification, the authors of the benchmark construct sentence pairs by replacing the ambiguous pronoun with each possible referent (see the illustration below). The examples are manually constructed to foil simple statistical methods: each one is contingent on contextual information provided by a single word or phrase in the sentence.
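
To make this construction concrete, here is a hand-written illustration (using a classic Winograd schema, not a verbatim WNLI record):

```python
# A classic Winograd-style schema: the pronoun "it" is ambiguous.
premise = "The trophy doesn't fit in the suitcase because it is too big."
candidates = ["The trophy", "The suitcase"]

# WNLI-style sentence pairs are built by replacing the pronoun
# with each candidate referent; only one pairing is entailed.
for referent in candidates:
    hypothesis = f"{referent} is too big."
    entailed = referent == "The trophy"
    print(f"Premise:    {premise}")
    print(f"Hypothesis: {hypothesis}  (entailed: {entailed})")
```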

Tasks recap

The following image recaps the train and test set sizes across the GLUE tasks, as well as the metrics used and the domains involved.

Recap of the train and test set sizes across the GLUE tasks, as well as the metrics used and the domains involved. Image from https://arxiv.org/pdf/1804.07461.pdf.

GLUE Leaderboard

You can find the best scores on the GLUE leaderboard, which I report in the images below. The AX column contains the score on the diagnostic dataset (which we’ll talk about in the next section) and is not used in the final GLUE score.

Leaderboard on the GLUE benchmark, showing the scores of the top 11 submissions. Image from https://gluebenchmark.com/leaderboard.
Leaderboard on the GLUE benchmark, showing the models of the top 11 submissions. Image from https://gluebenchmark.com/leaderboard.

The Diagnostic Dataset

GLUE also includes a diagnostic dataset, which is not intended as a benchmark but as a tool for error analysis, qualitative model comparison, and the development of adversarial examples. It mainly deals with textual entailment.
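
If you want to inspect it yourself, the diagnostic set is distributed alongside the main tasks; assuming the Hugging Face datasets library, it is exposed as the "ax" configuration, with a test split only:

```python
from datasets import load_dataset

# The diagnostic set is the "ax" configuration of the GLUE dataset;
# it contains premise/hypothesis pairs and ships only a test split.
ax = load_dataset("glue", "ax")
print(ax)             # DatasetDict with a single "test" split
print(ax["test"][0])  # {'premise': ..., 'hypothesis': ..., 'label': ..., 'idx': ...}
```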

Samples from the diagnostic dataset. Image from https://arxiv.org/pdf/1804.07461.pdf.

Conclusions and next steps

In this article, we saw what the GLUE benchmark is and what its nine tasks are. Then, we peeked at the current GLUE leaderboard and looked at some samples from its diagnostic dataset.

Possible next steps are:

  • Training a model on a GLUE task and comparing its performance against the GLUE leaderboard.
  • Learning about SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, improved resources, and a new public leaderboard.
  • Learning about the top-performing models in the GLUE tasks.
