Two minutes NLP — Popular NLP Benchmarks [Part 1]
Billion Word Benchmark, SQuAD, WMT, DBpedia, SST, etc
Hello fellow NLP enthusiasts! New NLP models are always evaluated against benchmarks. Benchmarks are necessary for research: without them, we wouldn’t be able to efficiently compare models. Enjoy! 😄
As there are several benchmarks for each NLP task, it’s impractical to cover all of them in a single article. For this reason, this article focuses only on the following NLP tasks and benchmarks:
- Language Modeling: WikiText-103, Billion Word Benchmark, and LAMBADA.
- Question Answering: SQuAD (Stanford Question Answering Dataset), HotpotQA, and TriviaQA.
- Machine Translation: WMT 2014.
- Text Classification: AG News (AG’s News Corpus) and DBpedia.
- Sentiment Analysis: SST (Stanford Sentiment Treebank).
Other tasks like Text Summarization, Natural Language Inference, Named Entity Recognition, Relation Extraction, and Reading Comprehension will be covered in a later article. Explore the paperswithcode NLP area to find out about other benchmarks.
Language Modeling
Language modeling is the task of predicting the next word or character in a document. It can be used to train language models that can be applied to a variety of natural language tasks including text generation, text classification, question answering, and many others.
The metrics commonly used in language modeling are cross-entropy and perplexity.
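The two metrics are tightly linked: perplexity is simply the exponential of the cross-entropy. As a minimal sketch (the token probabilities below are made up for illustration):

```python
import math

def cross_entropy(probs):
    """Average negative log-likelihood (in nats) that the model
    assigned to the words that actually occurred."""
    return -sum(math.log(p) for p in probs) / len(probs)

def perplexity(probs):
    """Perplexity is the exponential of the cross-entropy."""
    return math.exp(cross_entropy(probs))

# Probabilities a hypothetical model assigned to each true next word:
token_probs = [0.25, 0.1, 0.5, 0.05]
print(round(perplexity(token_probs), 2))  # → 6.32
```

Intuitively, a perplexity of 6.32 means the model is, on average, as uncertain as if it were choosing uniformly among about six words at each step; lower is better.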
WikiText-103
The WikiText dataset is made of over 100 million tokens extracted from high-quality Wikipedia articles.
Some of the best models are Megatron-LM, GLM-XXLarge, and kNN-LM.
One Billion Word Benchmark
The One Billion Word dataset is a dataset for language modeling, produced from the WMT 2011 News Crawl data with some cleaning and postprocessing applied.
Some of the best models are OmniNet and Transformer-XL Large.
LAMBADA
LAMBADA (LAnguage Modeling Broadened to Account for Discourse Aspects) is a benchmark whose task is very similar to language modeling. The assignment is to recover a missing word from a portion of text, where the missing word is always the last word of its sentence.
In LAMBADA, text samples share the characteristic that human subjects are able to guess their last word if they are exposed to the whole passage, but not if they only see the last sentence preceding the target word.
Since the goal is to recover missing words, the performance of a model is measured with accuracy.
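Scoring is therefore straightforward: a prediction counts as correct only if it matches the held-out final word. A minimal sketch (the example predictions and targets are made up):

```python
def lambada_accuracy(predictions, targets):
    """Fraction of passages whose predicted final word exactly
    matches the true final word (after trivial normalization)."""
    correct = sum(p.strip().lower() == t.strip().lower()
                  for p, t in zip(predictions, targets))
    return correct / len(targets)

preds = ["queen", "dragon", "letter"]
golds = ["queen", "sword", "letter"]
print(lambada_accuracy(preds, golds))  # 2 of 3 correct
```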
Some of the best models are GPT-3 and GLM-XXLarge.
Question Answering
Question Answering models retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. Depending on the model, the answer is either extracted verbatim from the text (extractive QA) or generated from scratch (abstractive QA).
The performance of a question answering model is commonly measured with Exact Match (EM) and token-level F1 with respect to the ground-truth answers.
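EM is a strict metric: the prediction scores 1 only if, after light normalization, it equals one of the ground-truth answers. A sketch along the lines of the normalization used by the official SQuAD evaluation script (lowercasing, dropping punctuation and the articles a/an/the):

```python
import re
import string

def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, ground_truths):
    """1 if the normalized prediction equals any normalized
    ground-truth answer, else 0."""
    return int(any(normalize(prediction) == normalize(gt)
                   for gt in ground_truths))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # article is ignored → 1
```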
SQuAD (Stanford Question Answering Dataset)
The most widely used academic benchmark for extractive question answering is SQuAD (the Stanford Question Answering Dataset). SQuAD is a reading comprehension dataset consisting of questions posed by crowd-workers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. It contains 100,000+ question-answer pairs on 500+ articles.
There is also a harder SQuAD v2 benchmark, which includes questions that don’t have an answer. It combines the 100,000 questions in SQuAD 1.1 with over 50,000 unanswerable questions written adversarially by crowd-workers to look similar to answerable ones.
Some of the best models are Retro-Reader and ALBERT.
HotpotQA
HotpotQA is a question answering dataset collected on the English Wikipedia, containing about 113K crowd-sourced questions that are constructed to require the introduction paragraphs of two Wikipedia articles to answer.
Models are evaluated on their answer accuracy and explainability, where the former is measured as the overlap between the predicted and gold answers with Exact Match (EM) and unigram F1, and the latter concerns how well the predicted supporting fact sentences match human annotation (Supporting Fact EM/F1). A joint metric is also reported on this dataset, which encourages systems to perform well on both tasks simultaneously.
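The unigram F1 used here is the harmonic mean of precision and recall over overlapping tokens. A minimal sketch (the joint metric in HotpotQA actually multiplies the answer and supporting-fact precisions and recalls before combining; multiplying the two F1 scores, as done below, is a simplified stand-in):

```python
from collections import Counter

def unigram_f1(prediction, gold):
    """Token-level F1 between a predicted and a gold answer string."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

answer_f1 = unigram_f1("the Apollo program", "Apollo program")
sp_f1 = 1.0  # assume supporting facts were predicted perfectly
joint_f1 = answer_f1 * sp_f1  # simplified joint score
print(round(answer_f1, 2))  # → 0.8
```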
Some of the best models are BigBird, AISO, and HopRetriever.
TriviaQA
TriviaQA is a realistic text-based question answering dataset which includes 950K question-answer pairs from 662K documents collected from Wikipedia and the web. The difficulty in TriviaQA is that the answer to a question may not be directly obtained by span prediction and the context is very long.
Some of the best models are SpanBERT, BigBird, and LinkBERT.
Machine Translation
Machine translation is the task of translating a sentence in a source language into a different target language. Recently, attention-based encoder-decoder architectures such as the Transformer have attained major improvements in machine translation.
Some of the most commonly used evaluation metrics for machine translation systems include BLEU, METEOR, NIST, and others.
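BLEU, the most common of these, combines clipped n-gram precisions with a brevity penalty. A simplified sentence-level sketch (single reference, no smoothing; real evaluations use corpus-level implementations such as sacrebleu):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a
    brevity penalty, for a single candidate/reference pair."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        clipped = sum((cand_counts & ref_counts).values())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(clipped / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # perfect match → 1.0
```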
WMT 2014
WMT 2014 is a collection of datasets from the Ninth Workshop on Statistical Machine Translation, featuring translation tasks.
Some of the best models are Transformer Cycle, Noisy Back-translation, and Transformer+Rep.
Text Classification
Text classification is the task of assigning a sentence or document an appropriate category. Text classification problems include emotion classification, news classification, and citation intent classification, among others.
In recent years, deep learning techniques like XLNet and RoBERTa have attained some of the biggest performance jumps for text classification problems.
AG News (AG’s News Corpus)
AG News (AG’s News Corpus) is made of titles and description fields of articles from the classes “World”, “Sports”, “Business”, and “Sci/Tech”. The dataset contains 30,000 training and 1,900 test samples per class.
Some of the best models are XLNet, BERT-ITPT-FiT, L MIXED, and ULMFiT.
DBpedia
DBpedia is a project aiming to extract structured content from the information created in the Wikipedia project. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.
Some of the best models are XLNet and BERT-Large.
Sentiment Analysis
Sentiment analysis is the task of classifying the polarity of a given text. For instance, a text-based tweet can be categorized as either “positive”, “negative”, or “neutral”.
Recently, deep learning techniques such as RoBERTa and T5 have been used to train high-performing sentiment classifiers, which are evaluated using metrics like F1, recall, and precision.
SST (Stanford Sentiment Treebank)
The Stanford Sentiment Treebank is a corpus with fully labeled parse trees that allows for a complete analysis of the compositional effects of sentiment in language. The corpus consists of 11,855 single sentences extracted from movie reviews.
Each phrase is labeled as either negative, somewhat negative, neutral, somewhat positive, or positive. The corpus with all 5 labels is referred to as SST-5 or SST fine-grained. Binary classification experiments on full sentences refer to the dataset as SST-2 or SST binary.
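In the released SST data, each phrase carries a real-valued sentiment score in [0, 1], and the five classes correspond to equal-width bins over that score. A sketch of that mapping (the exact cut-off values are an assumption based on the released data):

```python
def sst5_label(score):
    """Map a sentiment score in [0, 1] to one of the five SST
    classes using equal-width bins of 0.2."""
    labels = ["negative", "somewhat negative", "neutral",
              "somewhat positive", "positive"]
    if score <= 0.2:
        return labels[0]
    if score <= 0.4:
        return labels[1]
    if score <= 0.6:
        return labels[2]
    if score <= 0.8:
        return labels[3]
    return labels[4]

print(sst5_label(0.9))   # positive
print(sst5_label(0.45))  # neutral
```

For SST-2, the neutral band is discarded and the remaining scores collapse to a binary positive/negative label.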
Some of the best models are SMART-RoBERTa Large, T5-3B, and MUPPET RoBERTa Large.
Conclusions and next steps
In this article, we saw some of the commonly used benchmarks for Language Modeling, Question Answering, Machine Translation, Text Classification, and Sentiment Analysis.
Possible next steps are:
- Explore the paperswithcode NLP area to find out about other benchmarks.
- Learn about the GLUE set of benchmarks.
- Learn about the SuperGLUE set of benchmarks.