Over the past year, machine learning models have dramatically improved scores across many language understanding tasks in NLP. ELMo, BERT, ALICE, the model formerly known as BigBird (now MT-DNN), and OpenAI GPT have advanced a surprisingly effective recipe that combines language modeling pretraining on huge text datasets with simple multitask and transfer learning techniques that adapt the resulting model to downstream applications.
GLUE, released a year ago, is a benchmark and toolkit for evaluating recipes like these (think the Great British Baking Show meets Sesame Street). GLUE is a collection of nine (English) language understanding tasks — things like textual entailment, sentiment analysis, and grammaticality judgments — and was meant to cover a big enough swath of NLP so that the only way to do well at it was to build tools so general that they would help with most new language understanding problems that might come along.
Progress on GLUE
The best models on GLUE now come very close to our estimate for how good humans are at these tasks.
Model performance jumped sharply with the introduction of GPT and BERT, and has steadily gained on human performance as researchers continue to develop better algorithms for adapting BERT to these tasks. On three GLUE tasks (QNLI, MRPC, and QQP), the best models already outperform human baselines, though this hardly means that machines have mastered English. For example, the WNLI task involves determining whether a sentence like “John couldn’t fit the trophy in the suitcase because it was too big.” implies that the sentence “The trophy was too big.” is true. Humans can solve the task perfectly, while machines have yet to improve over random guessing.
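To make the task format concrete, here is a minimal sketch of a WNLI-style premise/hypothesis pair and what "random guessing" means for such a binary task. The field names and label encoding are illustrative, not the official data schema:

```python
import random

# A WNLI-style entailment pair: the model must decide whether
# the premise implies that the hypothesis is true.
# (Field names and label encoding are illustrative, not the official schema.)
example = {
    "premise": "John couldn't fit the trophy in the suitcase because it was too big.",
    "hypothesis": "The trophy was too big.",
    "label": 1,  # 1 = entailment, 0 = not entailment
}

def random_baseline(examples, seed=0):
    """Accuracy of uniform random guessing on a binary task."""
    rng = random.Random(seed)
    correct = sum(rng.choice([0, 1]) == ex["label"] for ex in examples)
    return correct / len(examples)

# On a balanced binary task, random guessing hovers around 50% accuracy,
# which is roughly where current models sit on WNLI.
dataset = [example] * 1000
acc = random_baseline(dataset)
```

The difficulty is that resolving "it" requires world knowledge (trophies go inside suitcases), which is exactly what makes Winograd-style tasks resistant to surface-level statistical cues.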
It’s clear there’s still progress to be made in getting machines to understand natural language, but GLUE won’t be the right benchmark for measuring that progress much longer.
Introducing SuperGLUE
Like GLUE, SuperGLUE is a benchmark for evaluating general-purpose NLP models on a diverse set of language understanding tasks.
To discover a new set of challenging tasks, we sent out a call for task proposals to the broader NLP community, who responded with gusto, giving us a list of around 30 diverse NLP tasks. In selecting tasks for SuperGLUE, we wanted tasks that involve language understanding, are not yet solvable by existing methods, but are easily solvable by people. To check for this, we ran BERT-based baselines on many candidate tasks and collected data for human baselines. The final result is a set of seven tasks that we believe are challenging for existing models.
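The selection criterion above can be sketched as a simple filter: keep tasks where humans score well but the BERT baseline leaves substantial headroom. The task names, scores, and thresholds below are hypothetical, chosen only to illustrate the logic:

```python
# Hypothetical (task, BERT baseline score, human baseline score) triples;
# these are NOT the actual figures from the SuperGLUE selection process.
candidates = [
    ("task_a", 0.92, 0.93),  # nearly solved: little headroom, rejected
    ("task_b", 0.65, 0.95),  # large human-machine gap: good candidate
    ("task_c", 0.55, 0.60),  # hard for machines AND humans: rejected
]

def select_tasks(candidates, min_gap=0.10, min_human=0.85):
    """Keep tasks that humans do well on but current models do not."""
    return [
        name
        for name, model_score, human_score in candidates
        if human_score >= min_human and human_score - model_score >= min_gap
    ]

selected = select_tasks(candidates)  # → ["task_b"]
```

Requiring both a high human baseline and a large model-human gap rules out tasks that are merely noisy or ill-defined, not genuinely hard.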
We’ve kept two of GLUE’s tasks that still have substantial headroom: Recognizing Textual Entailment and the Winograd Schema Challenge. In addition, we add new tasks that test a model’s ability to answer questions, do coreference resolution, and perform commonsense reasoning.
If you’re curious about SuperGLUE, have a look at our paper for a preview! The full benchmark, including the data, an evaluation server, and a software toolkit including our baselines, should be available around the beginning of May. If you want to stay in the loop about SuperGLUE, join our discussion group.
The SuperGLUE team