ACL 2020: Overview and New Benchmarks

Maria Khvalchik
Semantic Tech Hotspot
10 min read · Jul 30, 2020

ACL 2020 went fully virtual, and so did I 🚀 I would like to start with a quote by Mihail Eric from my ACL 2019 post:

Going Beyond the Pretrain-Finetune Paradigm

In general my feeling is that the bulk of models today are still solving datasets rather than the tasks. We are building models that have become surprisingly effective at picking up and exploiting dataset-specific biases. In the process, our evaluation metrics paint fairly misleading pictures. This reminds me of Goodhart’s law: When a measure becomes a target, it ceases to be a good measure. So how do we move forward?

This gave a direction for changing existing benchmarks and creating new ones. It looks like researchers took it seriously: this year ACL was flourishing with such papers 👏

We will jump into 6 papers targeting critical issues in NLP. Here is an outline:

  1. Brief overview and statistics on ACL 2020
  2. How to test NLP models? Best paper “Beyond Accuracy: Behavioral Testing of NLP Models with CheckList” by Microsoft Research
  3. 🔎 3️⃣ 🔍 summaries on new QA benchmarks: TYDI QA by Google Research, MLQA by Facebook AI Research, and QDMR by AI2
  4. How to address QA for a specific domain if you don’t have a lot of labeled data? by AWS AI Labs
  5. How to make my QA system run faster? by Stony Brook University

Overview

The organizers did a really good job: coverage of different time zones, easy website navigation, and a sophisticated search with similar-paper suggestions.

All authors pre-recorded their videos, so after watching those one could jump into one of the Q&A Zoom sessions. I must say that this pre-recorded-videos thing is extremely convenient and finally brought the much-needed flexibility to conference attendees 🛋️

Let me know if this screenshot violates any rules. If I delete it, this fantastic bow tie will lose its magic!

The downside, no doubt, was the lack of networking. There were chats and video conferencing, but hey, where is my good old trick of spilling coffee on a person to finally start my small talk 🤷‍♀️ Well, you got it: networking in person is still an essential thing for Homo sapiens.

The tracks with the highest number of submissions were Machine Learning for NLP, Dialogue and Interactive Systems, Machine Translation, Information Extraction, and NLP Applications. As the song says, these are a few of my favorite things.

There were 3,429 submissions and 779 accepted papers (571 long papers and 208 short papers). The acceptance rate was 22.7%, which is actually lower than in the previous three years.

Best Paper 🙌

1. Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

by Marco Tulio Ribeiro et al. was inspired by principles of behavioral testing in software engineering. They introduced CheckList, a task-agnostic methodology for testing NLP models.

The essential question here is What to test? 🤔

CheckListing a commercial sentiment analysis model

The following linguistic capabilities should be tested:

  • Minimum Functionality Test (MFT), akin to unit tests;
  • Perturbation test: Invariance Test (INV), which expects the model predictions not to change if, for instance, we change a location in the example from Chicago to Dallas (a toy check is sketched below);
  • Another perturbation test: Directional Expectation Test (DIR), for example adding more information to the example and expecting the prediction to change in a specific direction.

In the example above, the tests are structured as a conceptual matrix with capabilities as rows and test types as columns.
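As an illustration of the INV test above, here is a toy invariance check in plain Python. The predict interface (a list of texts in, a list of labels out) is an assumption made for this sketch; the actual CheckList library (pip install checklist) ships far richer perturbation and reporting tooling.

```python
# Toy INV check: predictions should not change under a label-preserving perturbation.
# `predict` is an assumed interface: predict(list_of_texts) -> list_of_labels.
def invariance_test(predict, texts, perturb):
    failures = []
    for text in texts:
        perturbed = perturb(text)
        before, after = predict([text])[0], predict([perturbed])[0]
        if before != after:  # INV expects identical predictions
            failures.append((text, perturbed, before, after))
    return failures

# e.g. swapping one city for another should not flip the sentiment
swap_city = lambda t: t.replace("Chicago", "Dallas")
# failures = invariance_test(my_model.predict, ["The flight to Chicago was awful."], swap_city)
```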

What tools do they have?

  1. Templating + RoBERTa. Templating with masked language models: “I really {mask} the flight.” yields verbs that the user can interactively filter into positive, negative, and neutral fill-in lists (a minimal sketch follows this list).
  2. Lexicons
  3. Perturbation libraries
  4. Visualizations
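Here is a minimal sketch of the templating idea, using the Hugging Face fill-mask pipeline as a stand-in for CheckList’s own RoBERTa-backed editor (the model choice and output formatting are my assumptions, not the authors’ exact tooling):

```python
# Fill the {mask} slot of a template with a masked language model.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="roberta-base")
# RoBERTa's mask token is <mask>
for s in unmasker("I really <mask> the flight."):
    print(f"{s['token_str'].strip():>10}  score={s['score']:.3f}")

# The user then sorts the suggested verbs (enjoyed, hated, missed, ...) into
# positive / negative / neutral fill-in lists to generate labeled test cases.
```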

In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it. Check it out 👇

Here We Roll To The New Benchmarks 🎢

2. TYDI QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

by Jonathan Clark et al. from Google Research created a QA dataset covering 11 typologically diverse languages (i.e., languages that together express a diverse set of linguistic features) with 204K question-answer pairs. This work addresses multilingual modeling, transfer learning, and question answering.

Basically, the authors want to answer realistic questions. Traditional SQuAD-like dataset creation looks as follows: begin with the passage and ask a question that can be answered from it; hence we get a lot of lexical overlap and unnatural questions.

Instead, we can ask a question first, then find an article that could be a good match, and ask a reviewer to read the article and answer the question. In the example, the reviewer is not puzzled by the fact that “pilot” and “commanded” refer to the same thing 🧩

Furthermore, they highlight the issues with human translation. The translationese issue means that translated text is not as natural as text written in the target language from scratch. Lexical overlap means that if similar terms are used in the source language, the vocabulary of the target text will not be rich enough, as it is narrowed by the translator.

Above we can see the quality on the TYDI QA primary tasks (passage answer and minimal answer) using a naïve first-passage baseline, mBERT, and a human predictor.

3. MLQA: Evaluating Cross-lingual Extractive Question Answering

Patrick Lewis et al. from Facebook AI Research and University College London created an extractive QA dataset. It is purely an evaluation dataset, meaning there is no training data, only dev and test data, in order to encourage researchers to build zero-shot 👌 cross-lingual models. That is to say, one can train a model on English SQuAD and test it on other languages using MLQA.
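As a rough sketch of that zero-shot setup (assuming the mlqa dataset card on the Hugging Face Hub; the model name below is a placeholder, not one from the paper):

```python
from datasets import load_dataset
from transformers import pipeline

# German questions over German contexts; MLQA ships dev and test splits only.
mlqa_de = load_dataset("mlqa", "mlqa.de.de", split="test")

# Placeholder name: any multilingual reader fine-tuned on English SQuAD goes here.
qa = pipeline("question-answering", model="my-xlmr-finetuned-on-english-squad")

sample = mlqa_de[0]
pred = qa(question=sample["question"], context=sample["context"])
print(pred["answer"], "| gold:", sample["answers"]["text"][0])
```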

MLQA is built with a novel technique of mining parallel texts from Wikipedia articles. The dataset is in SQuAD format in seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese. It is highly parallel: there are over 12K instances in English and 5K in each of the other languages, with each instance parallel across 4 languages on average.

(a) MLQA example parallel for En-De-Ar-Vi and (b) MLQA example parallel for En-Es-Zh-Hi

First, N-way parallel sentences are identified in Wikipedia articles on the same topic, and the paragraphs that contain them are extracted. Second, workers formulate questions for which the answer is a span within the paragraph. Third, the English questions are translated by professional translators into all target languages, and the answer is annotated in the target-language context.

MLQA annotation pipeline

In the example above, only one target language is shown for clarity, but many target languages can be used at once.

The N-way parallel instance approach was relaxed to 4-way parallelism: each instance is parallel between English and a combination of 3 of the other 6 languages.

Above, the evaluation results on MLQA are shown. In all cases, transfer results are significantly behind the training-language performance.

4. BREAK It Down: A Question Understanding Benchmark

by Tomer Wolfson et al. from the Allen Institute for AI introduce a Question Decomposition Meaning Representation (QDMR) for questions. A QDMR constitutes the ordered list of steps necessary for answering a question 🧾

The motivation for this paper is the growing demand for reasoning over multiple pieces of evidence across such modalities as text, images, and relational databases. Questions usually share a common structure regardless of the modality. However, question understanding is currently learned separately within each end task.

Their solution is to treat question understanding as a standalone task: to answer questions the way humans do, by breaking them down into simpler sub-questions. QDMR expresses a question through atomic questions and is a formalism that works over KBs, images, and text. In the example, natural language questions from multiple sources (top) are annotated with the QDMR formalism (middle) and deterministically mapped into a pseudo-formal language (bottom).
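To make the formalism concrete, here is an illustrative decomposition in the spirit of QDMR; the question and steps below are my own toy example, not an instance from the dataset:

```python
# Each QDMR step is an atomic question; later steps refer back via #1, #2, ...
question = "What is the population of the largest city in Texas?"
qdmr_steps = [
    "return cities in Texas",        # #1
    "return #1 that is largest",     # #2
    "return population of #2",       # #3  <- yields the final answer
]
```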

Authors used QDMR to construct the BREAK dataset, containing over 83K pairs of questions and their QDMRs.

Why QDMR is useful for open-domain QA 💡

  • it can be used to improve open-domain QA on the HotpotQA dataset (multi-hop QA over Wikipedia using black-box IR): substituting questions with QDMR boosts IR performance from 46% to 59% (43 to 52 F1);
  • it can be deterministically converted to a pseudo-SQL formal language (sketched below), which can alleviate annotation in semantic parsing applications;
  • BREAK can be used to train a sequence-to-sequence model with copying that parses questions into QDMR structures; the paper shows that it substantially outperforms several natural baselines.
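Continuing the toy example above, the same steps map deterministically onto a small set of operators, which is the pseudo-formal (pseudo-SQL-like) view mentioned in the list; the operator names follow my reading of the paper’s examples rather than its exact grammar:

```python
# (operator, argument) view of the toy QDMR above
pseudo_formal = [
    ("SELECT",      "cities in Texas"),
    ("SUPERLATIVE", "#1, largest"),
    ("PROJECT",     "population of #2"),
]
```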

5. Template-Based Question Generation from Retrieved Sentences for Improved Unsupervised Question Answering

by Alexander Fabbri et al. from Yale University and AWS AI Labs suggested Template-Based Question Generation from Retrieved Sentences. A common approach to QA has been to fine-tune a pre-trained language model on a task-specific labeled dataset; however:

  • it is difficult to find labeled data for new domains
  • manually annotating new data is slow and costly

One approach is to use domain adaptation and transfer learning methods; instead, the authors present an approach to train a QA system on new-domain data with unsupervised learning. Their focus is factoid QA with answers that are named entities. The main challenge is to create a question from a <context, answer> pair, so the unsupervised QA task becomes a question generation task.

The basic idea is to retrieve a sentence from a corpus (an Elasticsearch index of customer data) that is similar to the context. The reason for not using the context itself is to prevent the model from doing exact matching and thereby avoid creating bias.
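Here is a minimal sketch of that idea, using TF-IDF retrieval over a tiny in-memory corpus as a stand-in for the paper’s Elasticsearch index, and a single crude “Wh + clause” template; the corpus, template, and answer typing are all simplifications I made for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The first flight was operated by Qantas in 1935.",
    "Qantas began flying the Brisbane-Singapore route a few years later.",
]
context, answer = "Qantas operated the inaugural service in 1935.", "1935"  # answer is a DATE

# Retrieve a sentence similar to the context (but not the context itself).
vec = TfidfVectorizer().fit(corpus + [context])
sims = cosine_similarity(vec.transform([context]), vec.transform(corpus))[0]
retrieved = corpus[int(sims.argmax())]

# Apply a crude template: DATE answers get a "When ...?" question.
# Template questions need not be fluent; they only supply <question, answer> training pairs.
clause = retrieved.replace(answer, "").strip(" .")
synthetic_question = f"When {clause[0].lower() + clause[1:]}?"
print(synthetic_question, "->", answer)
```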

In the paper, you can find more details on how to select the right Elasticsearch top hits and which template components worked best. Here is a glimpse of what it looks like:

Basically, the authors have shown that generating questions for QA training by applying a simple template to a related, retrieved sentence rather than the original context sentence improves downstream QA performance, raising the F1 score on the SQuAD dataset by about 14% 📈

6. DeFormer: Decomposing Pre-trained Transformers for Faster Question Answering

by Qingqing Cao et al. from Stony Brook University introduced DeFormer, a decomposed transformer 🔪 which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers.

Decomposing Transformers through the layers enables encoding each segment independently. Auxiliary supervision of the upper layer information from the original model further helps the decomposed model to compensate for information loss in the lower layers.

KD: Knowledge Distillation loss, LRS: Layerwise Representation Similarity loss.

This allows for question-independent processing of the input text representations, which in turn enables pre-computing passage representations, reducing runtime computation drastically.
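Here is a minimal PyTorch sketch of the decomposition idea (my own simplification, not the authors’ released code): the lower k layers run over the question and the passage separately, so passage representations can be cached offline, and only the upper layers perform full self-attention over the joined sequence.

```python
import torch
import torch.nn as nn

class DecomposedEncoder(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_layers=12, k=9):
        super().__init__()
        make = lambda: nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.lower = nn.ModuleList([make() for _ in range(k)])             # segment-wise
        self.upper = nn.ModuleList([make() for _ in range(n_layers - k)])  # joint

    def encode_passage(self, p):
        # can be precomputed and cached for every passage in the collection
        for blk in self.lower:
            p = blk(p)
        return p

    def forward(self, q, cached_p):
        for blk in self.lower:
            q = blk(q)                       # question-only self-attention
        h = torch.cat([q, cached_p], dim=1)  # join question and passage
        for blk in self.upper:
            h = blk(h)                       # full self-attention in the top layers
        return h

enc = DecomposedEncoder()
q = torch.randn(1, 16, 768)      # toy question embeddings
p = torch.randn(1, 200, 768)     # toy passage embeddings
cached = enc.encode_passage(p)   # offline
out = enc(q, cached)             # online: only the question touches the lower layers
```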

As DeFormer is largely similar to the original model, it can be initialized with the pre-trained weights of a standard transformer and directly fine-tuned on the target QA dataset. DeFormer versions of BERT and XLNet speed up QA by over 4.3x, and with simple distillation-based losses they incur only a 1% drop in accuracy 💪

To summarize, in this post we took a look at papers that try to answer some of the most pressing questions in the NLP community:

1️⃣ we saw how to test NLP models from a software engineering perspective, 2️⃣ we can now use new benchmarks in different languages to at long last look at our models from a different angle, 3️⃣ we can try the approach by Yale and AWS AI Labs when we need to address a specific domain without having a lot of labeled data, and finally 4️⃣ we saw significant progress on how to make QA systems run faster.

⌛️

If you reached this part, thank you for your attention!

Hopefully this post is of some help, and I will see you offline at ACL 2021! 🍀
