Two minutes NLP — Popular NLP Benchmarks [Part 1]

Billion Word Benchmark, SQuAD, WMT, DBpedia, SST, etc.

Fabio Chiusano
NLPlanet
6 min read · Apr 1, 2022


Hello fellow NLP enthusiasts! New NLP models are always evaluated against benchmarks. Benchmarks are necessary for research: without them, we wouldn't be able to compare models efficiently. Enjoy! 😄

As there are several benchmarks for each NLP task, it's difficult to cover them all in a single article. For this reason, this article focuses only on the following NLP tasks and benchmarks:

  • Language Modeling: WikiText-103, Billion Word Benchmark, and LAMBADA.
  • Question Answering: SQuAD (Stanford Question Answering Dataset), HotpotQA, and TriviaQA.
  • Machine Translation: WMT 2014.
  • Text Classification: AG News (AG’s News Corpus) and DBpedia.
  • Sentiment Analysis: SST (Stanford Sentiment Treebank).

Other tasks like Text Summarization, Natural Language Inference, Named Entity Recognition, Relation Extraction, and Reading Comprehension will be covered in a later article. Explore the paperswithcode NLP area to find out about other benchmarks.

Language Modeling

Language modeling is the task of predicting the next word or character in a document. It can be used to train language models that can be applied to a variety of natural language tasks including text generation, text classification, question answering, and many others.

The metrics commonly used in language modeling are cross-entropy and perplexity.
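As a quick illustration of how the two metrics relate, here is a minimal sketch that computes perplexity as the exponential of the average per-token cross-entropy, using the Hugging Face transformers library and GPT-2 purely as an example (neither is part of the leaderboards discussed here):

```python
# A minimal sketch: perplexity is the exponential of the average
# cross-entropy per token. GPT-2 is used here only as an example model.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "Language modeling is the task of predicting the next word."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    outputs = model(**inputs, labels=inputs["input_ids"])

cross_entropy = outputs.loss           # average negative log-likelihood per token
perplexity = torch.exp(cross_entropy)  # perplexity = exp(cross-entropy)
print(f"Cross-entropy: {cross_entropy.item():.2f}, Perplexity: {perplexity.item():.2f}")
```

Lower perplexity means the model is less "surprised" by the test text, which is why leaderboards for this task are sorted in ascending order.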

WikiText-103

The WikiText-103 dataset is made of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia.
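If you want to experiment with it, WikiText-103 is available on the Hugging Face Hub; a minimal sketch, assuming the datasets library and the "wikitext-103-raw-v1" config name:

```python
# A minimal sketch of loading WikiText-103 with the Hugging Face
# datasets library (config name assumed to be "wikitext-103-raw-v1").
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-103-raw-v1")
print(wikitext)                       # train/validation/test splits
print(wikitext["train"][10]["text"])  # a raw line of Wikipedia text
```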

Language Modelling test perplexity on WikiText-103. Image from https://paperswithcode.com/sota/language-modelling-on-wikitext-103.

Some of the best models are Megatron-LM, GLM-XXLarge, and kNN-LM.

One Billion Word Benchmark

The One Billion Word dataset is a dataset for language modeling, produced from the WMT 2011 News Crawl data with some cleaning and normalization postprocessing.

Language Modelling test perplexity on Billion Word Benchmark. Image from https://paperswithcode.com/sota/language-modelling-on-one-billion-word.

Some of the best models are OmniNet and Transformer-XL Large.

LAMBADA

LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) is a benchmark whose task is a variant of language modeling: the goal is to recover a missing word from a portion of text, where the missing word is always the last word of its sentence.

In LAMBADA, text samples share the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word.

Since the goal is to recover missing words, the performance of a model is measured with accuracy.
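A minimal sketch of how such last-word accuracy could be measured, using GPT-2 greedily generating a continuation and two toy passages as stand-ins for the real LAMBADA data (both are illustrative assumptions, not part of the benchmark):

```python
# A minimal sketch of last-word prediction accuracy on LAMBADA-style data.
# GPT-2 and the toy passages below are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def predicts_last_word(passage: str) -> bool:
    """Split off the last word, greedily generate a few tokens, check the match."""
    context, target = passage.rsplit(" ", 1)
    inputs = tokenizer(context, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            **inputs, max_new_tokens=5, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    continuation = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:])
    return continuation.strip().startswith(target)

samples = ["The keys are on the table", "She opened the door"]  # toy passages
accuracy = sum(predicts_last_word(s) for s in samples) / len(samples)
print(f"Last-word accuracy: {accuracy:.2f}")
```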

Language Modelling test accuracy on LAMBADA. Image from https://paperswithcode.com/sota/language-modelling-on-lambada.

Some of the best models are GPT-3 and GLM-XXLarge.

Question Answering

Question Answering models retrieve the answer to a question from a given text, which is useful for searching documents. Depending on the model used, the answer is either extracted directly from the text (extractive QA) or generated from scratch (generative QA).

The performance of a question answering model is typically measured with Exact Match (EM) and F1 with respect to the ground-truth answers.
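As an example, the official SQuAD evaluation computes Exact Match on normalized answers (lowercased, with punctuation and articles removed); a minimal sketch of that logic:

```python
# A minimal sketch of Exact Match (EM), following the normalization used
# by the official SQuAD evaluation script: lowercasing, removing
# punctuation and articles, collapsing whitespace.
import re
import string

def normalize_answer(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, ground_truths: list[str]) -> int:
    return int(any(normalize_answer(prediction) == normalize_answer(gt)
                   for gt in ground_truths))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # -> 1
```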

SQuAD (Stanford Question Answering Dataset)

The most widely used academic benchmark for extractive question answering is SQuAD (The Stanford Question Answering Dataset). SQuAD is a reading comprehension dataset, consisting of questions posed by crowd-workers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. It contains 100,000+ question-answer pairs on 500+ articles.

There is also a harder SQuAD v2 benchmark, which includes questions that don’t have an answer. It combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowd-workers to look similar to answerable ones.
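Here is a minimal sketch of running an off-the-shelf extractive QA model on a SQuAD 2.0 example; the "squad_v2" dataset id and the pipeline's default model are assumptions for illustration:

```python
# A minimal sketch: extractive QA with a Hugging Face pipeline on a
# SQuAD 2.0 validation example. Note that the pipeline does not handle
# unanswerable questions by default.
from datasets import load_dataset
from transformers import pipeline

squad = load_dataset("squad_v2", split="validation")
qa = pipeline("question-answering")

example = squad[0]
prediction = qa(question=example["question"], context=example["context"])
print(example["question"])
print(prediction)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}
```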

Question Answering test Exact Match on SQuAD2.0. Image from https://paperswithcode.com/sota/question-answering-on-squad20.

Some of the best models are Retro-Reader and ALBERT.

HotpotQA

HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed so that answering them requires reasoning over the introductory paragraphs of two Wikipedia articles.

Models are evaluated on their answer accuracy and explainability, where the former is measured as the overlap between the predicted and gold answers with Exact Match (EM) and unigram F1, and the latter concerns how well the predicted supporting fact sentences match human annotation (Supporting Fact EM/F1). A joint metric is also reported on this dataset, which encourages systems to perform well on both tasks simultaneously.
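The unigram F1 mentioned above is the harmonic mean of token-level precision and recall between the predicted and gold answers; a minimal sketch:

```python
# A minimal sketch of the unigram F1 used for answer overlap: precision
# and recall over tokens shared between prediction and gold answer.
from collections import Counter

def unigram_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the Eiffel Tower in Paris", "Eiffel Tower"))  # partial credit
```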

Question Answering test joint-F1 on HotpotQA. Image from https://paperswithcode.com/sota/question-answering-on-hotpotqa.

Some of the best models are BigBird, AISO, and HopRetriever.

TriviaQA

TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. TriviaQA is challenging because the answer to a question may not be obtainable by span prediction alone, and the contexts are very long.

Question Answering test F1 on TriviaQA. Image from https://paperswithcode.com/sota/question-answering-on-triviaqa.

Some of the best models are SpanBERT, BigBird, and LinkBERT.

Machine Translation

Machine translation is the task of translating a sentence in a source language to a different target language. Recently, attention-based encoder-decoder architectures like the Transformer have attained major improvements in machine translation.

Some of the most commonly used evaluation metrics for machine translation systems include BLEU, METEOR, NIST, and others.
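For instance, corpus-level BLEU can be computed with sacrebleu, the de facto scoring tool for WMT submissions; a minimal sketch:

```python
# A minimal sketch of corpus-level BLEU with sacrebleu
# (pip install sacrebleu).
import sacrebleu

hypotheses = ["The cat sat on the mat."]
references = [["The cat is sitting on the mat."]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```

BLEU rewards n-gram overlap with the reference translations, which is why imperfect but fluent paraphrases still receive partial credit.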

WMT 2014

WMT 2014 is a collection of datasets from the Ninth Workshop on Statistical Machine Translation, featuring translation tasks.

Machine Translation test BLEU score on WMT2014 English-German. Image from https://paperswithcode.com/sota/machine-translation-on-wmt2014-english-german.

Some of the best models are Transformer Cycle, Noisy Back-translation, and Transformer+Rep.

Text Classification

Text classification is the task of assigning a sentence or document an appropriate category. Text classification problems include emotion classification, news classification, and citation intent classification, among others.

In recent years, pre-trained language models like XLNet and RoBERTa have attained some of the biggest performance jumps for text classification problems.

AG News (AG’s News Corpus)

AG News (AG’s News Corpus) is made of titles and description fields of articles from the classes “World”, “Sports”, “Business”, and “Sci/Tech”. AG News contains 30,000 training and 1,900 test samples per class (120,000 training and 7,600 test samples in total).
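A minimal sketch of a simple AG News baseline, assuming the Hugging Face datasets library and a TF-IDF + logistic regression classifier from scikit-learn (not one of the leaderboard models):

```python
# A minimal sketch: load AG News from the Hugging Face Hub and train a
# simple TF-IDF + logistic regression baseline with scikit-learn.
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

ag_news = load_dataset("ag_news")
train, test = ag_news["train"], ag_news["test"]

vectorizer = TfidfVectorizer(max_features=50_000)
X_train = vectorizer.fit_transform(train["text"])
X_test = vectorizer.transform(test["text"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train["label"])
print(f"Test accuracy: {clf.score(X_test, test['label']):.3f}")
```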

Text Classification test errors on AG News. Image from https://paperswithcode.com/sota/text-classification-on-ag-news.

Some of the best models are XLNet, BERT-ITPT-FiT, L MIXED, and ULMFiT.

DBpedia

DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets. The text classification benchmark built on it (often called DBpedia-14) asks models to assign one of 14 ontology classes to Wikipedia abstracts, with 560,000 training and 70,000 test samples.

Text Classification test errors on DBPedia. Image from https://paperswithcode.com/sota/text-classification-on-dbpedia.

Some of the best models are XLNet and BERT-large.

Sentiment Analysis

Sentiment analysis is the task of classifying the polarity of a given text. For instance, a text-based tweet can be categorized as either “positive”, “negative”, or “neutral”.

Recently, deep learning models, such as RoBERTa and T5, have been used to train high-performing sentiment classifiers that are evaluated using metrics like F1, recall, and precision.

SST (Stanford Sentiment Treebank)

The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus consists of 11,855 single sentences extracted from movie reviews.

Each phrase is labeled as either negative, somewhat negative, neutral, somewhat positive, or positive. The corpus with all 5 labels is referred to as SST-5 or SST fine-grained. Binary classification experiments on full sentences refer to the dataset as SST-2 or SST binary.
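A minimal sketch of running a sentiment classifier on SST-2 validation sentences (the "sst2" dataset id is an assumption; the pipeline's default model happens to be a DistilBERT fine-tuned on SST-2):

```python
# A minimal sketch: classify SST-2 validation sentences with an
# off-the-shelf sentiment-analysis pipeline.
from datasets import load_dataset
from transformers import pipeline

sst2 = load_dataset("sst2", split="validation")
classifier = pipeline("sentiment-analysis")

for example in sst2.select(range(3)):
    prediction = classifier(example["sentence"])[0]
    print(example["sentence"][:60], "->",
          prediction["label"], f"{prediction['score']:.2f}")
```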

Sentiment Analysis test accuracy on SST-2 Binary classification. Image from https://paperswithcode.com/sota/sentiment-analysis-on-sst-2-binary.

Some of the best models are SMART-RoBERTa Large, T5-3B, and MUPPET Roberta Large.

Conclusions and next steps

In this article, we saw some of the commonly used benchmarks for Language Modeling, Question Answering, Machine Translation, Text Classification, and Sentiment Analysis.

Possible next steps are to explore the benchmarks for other tasks such as Text Summarization, Natural Language Inference, Named Entity Recognition, Relation Extraction, and Reading Comprehension, which will be covered in the next article of this series.

Thank you for reading! If you are interested in learning more about NLP, remember to follow NLPlanet on Medium, LinkedIn, and Twitter!
