Robustness of Modern Deep Learning Systems with a special focus on NLP

Neeraj Varshney
12 min read · Jan 20, 2021

Modern neural network models have achieved outstanding performance on many tasks, ranging from computer vision to language processing. This high performance should mean that they have become exceptionally good at solving these tasks, but recent work has shown that “these models often solve a dataset, not the task”: they fail surprisingly when given inputs that differ slightly from the ones seen during training or when the inputs are perturbed (adversarial inputs). Researchers in NLP will readily concede that while we have made good progress, we are far from having machines that can truly understand natural language, i.e., far from AI that understands language at a human level. Ideally, models should not be highly sensitive to discrepancies between training-time assumptions and reality. Robustness in statistical learning can be associated with different properties; for instance, a model’s performance should not degrade due to a change in domain (robustness to domain shift) or due to perturbations of its inputs (adversarial robustness). In this article, we will discuss important topics pertinent to the robustness of modern deep learning systems.

Outline:

  • Diagnostic Datasets
  • Dynamic Benchmarks
  • Data Quality
  • Adversarial Robustness
  • Out-of-Distribution Reliability
  • Other Prominent Robustness-related Works in NLP

My Papers on Robustness:

  • Post-Abstention: Towards Reliably Re-Attempting the Abstained Instances in QA — ACL 2023
  • Investigating Selective Prediction Approaches Across Several Tasks in IID, OOD, and Adversarial Settings — Findings of ACL 2022
  • Towards Improving Selective Prediction Ability of NLP Systems — Repl4NLP @ ACL 2022
  • A Unified Evaluation Framework for Novelty Detection and Accommodation in NLP with an Instantiation in Authorship Attribution — Findings of ACL 2023
  • “John is 50 years old, can his son be 65?” Evaluating NLP Models’ Understanding of Feasibility — EACL 2023
  • Model Cascading: Towards Jointly Improving Efficiency and Accuracy of NLP Systems — EMNLP 2022

Diagnostic Datasets

We’ll start with the important concept of diagnostic datasets. The motivation behind creating a diagnostic dataset is to uncover the weaknesses of models and pinpoint their shortcomings precisely. Each question in a diagnostic dataset should have detailed annotations describing the abilities/skills a solver requires to answer it. Modern neural network-based models have been shown to be adept at exploiting dataset artifacts and statistical biases present in the examples. For instance, in Visual Question Answering (given an image, answer a natural language question about it), a statistical learner may correctly answer the question “What covers the ground?” not because it understands the scene but because biased datasets often ask questions about the ground when it is snow-covered.

In another instance, models rely on surface-level texture or clues in the image’s background to recognize foreground objects, even when that seems both unnecessary and somehow wrong: “the beach is not what makes a seagull a seagull.” Yet researchers struggle to articulate precisely why models should not rely on such patterns.

Hence, a diagnostic dataset should have minimal biases enabling fair evaluation of models. Some diagnostic datasets synthetically create questions to test a specific aspect of the models. Furthermore, benchmarks should always provide a detailed breakdown of accuracy by task and linguistic/visual phenomenon.

Examples of diagnostic datasets:

  • CLEVR (VQA) — Questions test aspects of visual reasoning such as attribute identification, counting, comparison, multiple attention, and logical operations.
Figure: The CLEVR dataset from Stanford.
    — The CLEVR universe contains three object shapes (cube, sphere, and cylinder) that come in two absolute sizes (small and large), two materials (shiny “metal” and matte “rubber”), and eight colors.
    — Objects are spatially related via four relationships: “left”, “right”, “behind”, and “in front”.
    — It also includes one non-spatial relationship type (the same-attribute relation): two objects are in this relationship if they have the same value for a specified attribute.
    — Images and questions are synthetically generated in CLEVR:
    — A scene can be represented by a scene graph whose nodes are objects annotated with attributes and whose edges connect spatially related objects. CLEVR images are generated by randomly sampling a scene graph and rendering it with Blender.
    — A question is associated with a functional program that can be executed on an image’s scene graph, yielding the answer to the question. Functional programs are built from simple basic functions that correspond to elementary operations of visual reasoning, such as querying object attributes or counting sets of objects (see the sketch after this list).
  • HANS (Heuristic Analysis for NLI Systems) — This test set diagnoses syntactic heuristics in the Natural Language Inference (NLI) task. Specifically, it tests three heuristics: lexical overlap, subsequence, and constituent. Examples of all three heuristics are shown below:
Figure: Examples from the HANS dataset.
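
To make the scene-graph and functional-program idea behind CLEVR concrete, here is a minimal sketch in Python. The toy scene, the relation table, and the helper functions (`filter_attr`, `relate`, `count`) are illustrative stand-ins under my own assumptions, not the actual CLEVR generation code.

```python
# Minimal sketch of the CLEVR idea: a scene graph plus a functional program
# executed against it. All names here are illustrative, not real CLEVR code.

scene = [
    {"id": 0, "shape": "cube", "size": "large", "material": "rubber", "color": "red"},
    {"id": 1, "shape": "sphere", "size": "small", "material": "metal", "color": "blue"},
    {"id": 2, "shape": "cylinder", "size": "large", "material": "rubber", "color": "gray"},
]
# Spatial relations: for each object id, the ids of the objects to its left.
relations = {"left": {0: [], 1: [0], 2: [0, 1]}}

# Elementary functions from which a question's functional program is built.
def filter_attr(objects, attr, value):
    return [o for o in objects if o[attr] == value]

def relate(obj, relation):
    return [o for o in scene if o["id"] in relations[relation][obj["id"]]]

def count(objects):
    return len(objects)

# "How many rubber things are left of the small metal sphere?"
sphere = filter_attr(filter_attr(scene, "shape", "sphere"), "size", "small")[0]
answer = count(filter_attr(relate(sphere, "left"), "material", "rubber"))
print(answer)  # -> 1 (the large red rubber cube)
```

Because both the scene graph and the program are known, the ground-truth answer is available by construction, which is what makes the synthetic generation and the fine-grained diagnostic breakdown possible.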

Some recent works critique the paradigm of fine-tuning on a training set and evaluating on a test set drawn from the same distribution. This paradigm favors models that capture the fine-grained statistical properties of a particular dataset, regardless of whether those properties are likely to generalize to examples of the task outside the dataset’s distribution. It is also very different from how humans learn, who need several orders of magnitude less data than the models favored by this evaluation paradigm. These works support the creation of test-only benchmarks because, despite our best efforts, we may never be able to create a benchmark that is free of unintended statistical regularities. We’ll turn to dynamic benchmarks in the next section.

Dynamic Benchmarks

Static benchmarks that evaluate systems on fixed datasets suffer from the following major problems:

  • Static benchmarks saturate quickly — The field is progressing so fast that static benchmarks can saturate within a short time. Once a benchmark is saturated, researchers start looking for new benchmarks and the older ones become obsolete.
  • Static benchmarks are susceptible to overfitting and can contain exploitable annotator artifacts — In striving to achieve state-of-the-art (SOTA) performance on a static benchmark, models often overfit to the particular examples present in the benchmark. While the high accuracy may look impressive, it can be misleading: near-human performance on a QA “benchmark” does not entail that the QA “task” is solved. It has been shown that SOTA models, when tested on controlled examples, make mistakes that a human would rarely make. Models, unable to discern the intentions of the dataset’s designers, exploit any statistical patterns they find in the training data. With a random training/test split, any correlation observed in the training set will hold approximately for the test set, so a system that learned it could achieve high benchmark accuracy.
  • Static benchmarks do not often emulate realistic scenarios — Ultimately, we intend to create systems that can work together with humans, but a static benchmark, even if collected via crowd-sourcing, often represents only a toy task.

Facebook AI recently introduced Dynabench, a platform for dynamic data collection and benchmarking. It employs a dynamic adversarial data collection technique that involves both humans and models: humans are asked to craft examples that fool the current SOTA models. The benefits of this technique are twofold: it evaluates SOTA models on real-world inputs produced by humans (yielding insights into the mistakes that current models make), and it yields data that can be used to train future models.

How does it overcome the problems of static benchmarks?

  • Since this is a cycle, the process cannot saturate.
  • It can automatically fix annotation artifacts and other biases over time.
  • It allows us to measure performance in ways that are closer to real-world applications since humans are involved in the loop.

Dynabench offers a more accurate and sustainable way to evaluate progress in AI. It puts humans and models together “in the loop” to create challenging new datasets that will lead to better, more flexible AI.
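
For intuition, here is a highly simplified sketch of one round of the human-and-model-in-the-loop collection cycle described above. The functions `ask_annotator_for_example` and `current_model.predict` are hypothetical placeholders, not the Dynabench API.

```python
# Simplified sketch of dynamic adversarial data collection (not the Dynabench API).
# `ask_annotator_for_example(model)` is a placeholder that returns a human-written
# example and its human-assigned gold label.

def collect_round(current_model, ask_annotator_for_example, num_examples):
    fooling_examples = []
    while len(fooling_examples) < num_examples:
        text, human_label = ask_annotator_for_example(current_model)
        if current_model.predict(text) != human_label:
            # The annotator fooled the current model: keep the example both as an
            # evaluation item and as training data for the next round of models.
            fooling_examples.append((text, human_label))
    return fooling_examples

# Each round: collect fooling examples, retrain, repeat -- so the benchmark
# moves with the models instead of saturating.
```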

Data Quality

Data is the fuel of deep learning, and as the quantity of data has grown in recent years, its quality has become very important. Recent works have highlighted the following data quality issues:

  • Not all examples in a dataset contribute equally towards learning [11].
  • Dataset artifacts could lead to weak and biased models with poor generalization. Crowd-worker bias is another reason for the accumulation of dataset artifacts.
  • High performance achieved by large models is sometimes a result of statistical cues present in the training data that these models capture and leverage on the test data.
  • The test sets of some benchmarks contain examples very similar to those in the train set, so a model that memorizes the train set performs well on such datasets. Recent works have found that such datasets lead to models that perform extremely poorly on examples whose answers cannot be memorized from the training set [10].
  • Recent works have highlighted that some tasks, such as paraphrase detection and open-domain QA, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives); in question deduplication, for instance, the vast majority of question pairs from an online forum are not duplicates.
    In open-domain QA, almost any randomly sampled document will not answer a given question, and in NLI, random pairs of sentences from a diverse distribution will have no relation between them, as opposed to an entailment or contradiction relationship. Many recent datasets heuristically choose examples to ensure label balance, generally for ease of training: QQP was generated by mining non-duplicate questions that were heuristically determined to be near-duplicates, and SNLI had crowdworkers generate inputs to match a specified label distribution.
  • Quantifying the easiness/difficulty of instances in the training dataset — This can be done using the model’s prediction probabilities. The intuition is that if a model is confident (and correct) on a sample across different runs/epochs, then that sample is easy for the model; if the model is incorrect or has very low confidence, then the sample is difficult. The image below shows the distribution of training instances of the SNLI dataset, and a sketch of these statistics follows the figure.
Source: Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics.
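
As a rough illustration of the statistics behind dataset cartography [11], the sketch below computes per-example confidence and variability, assuming you have logged each training example's probability of its gold label at every epoch (the `gold_probs_per_epoch` array and the region thresholds are placeholders of my own, not values from the paper).

```python
import numpy as np

# gold_probs_per_epoch[e, i] = model probability of the gold label for training
# example i at the end of epoch e (dummy data here; in practice you log this).
gold_probs_per_epoch = np.random.rand(5, 1000)  # 5 epochs, 1000 examples

confidence = gold_probs_per_epoch.mean(axis=0)   # high -> "easy to learn"
variability = gold_probs_per_epoch.std(axis=0)   # high -> "ambiguous"

# Illustrative thresholds for the regions discussed in the paper:
easy = (confidence > 0.8) & (variability < 0.1)      # easy-to-learn
hard = (confidence < 0.2) & (variability < 0.1)      # hard-to-learn (or mislabeled)
ambiguous = variability >= 0.2
print(f"easy: {easy.sum()}, hard: {hard.sum()}, ambiguous: {ambiguous.sum()}")
```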

Adversarial Robustness

The interpretability of DNNs is still unsatisfactory: they work as black boxes, which means it is difficult to get intuitions about what each neuron has actually learned. DNNs are also vulnerable to strategically modified samples (adversarial examples). State-of-the-art models that achieve high accuracy on specific datasets often fail surprisingly on inputs that are slightly perturbed without altering their meaning. For instance, perturbing an image of a “panda” (adding imperceptible noise to the original input) has been shown to successfully fool a SOTA image classifier into incorrectly classifying it as a “gibbon”.

Figure: Perturbation in Image Classification.
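
The panda-to-“gibbon” example above comes from gradient-based attacks such as FGSM (fast gradient sign method). Below is a minimal sketch of that idea, assuming a generic differentiable PyTorch classifier `model` and a correctly labeled input tensor `x` (both placeholders introduced here, not artifacts of the original example).

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, label, epsilon=0.01):
    """Minimal FGSM-style sketch (assumes `model` is a differentiable classifier).

    x: normalized image tensor of shape (1, C, H, W); label: LongTensor of shape (1,).
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), label)
    loss.backward()
    # Step each pixel by at most epsilon in the direction that increases the loss;
    # the change is imperceptible to humans but can flip the model's prediction.
    return (x_adv + epsilon * x_adv.grad.sign()).detach()

# Hypothetical usage: x_adv = fgsm_attack(model, panda_image, panda_label)
# model(x_adv).argmax() may now be a wrong class such as "gibbon".
```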

Recently, even in Natural Language Processing (NLP) tasks, simple perturbations such as replacing words with their synonyms or inserting typos have been shown to seriously challenge models’ robustness.

Why image attack methods can’t be applied directly to text:

  • Images are continuous (pixels), whereas text is symbolic and thus discrete. When gradient-based adversarial attacks adopted from images are applied to text representations, the generated adversarial examples are invalid character or word sequences.
  • Perturbations in images (small changes in pixel values) are hard for human eyes to perceive, and humans can still correctly classify such inputs; to demonstrate the poor robustness of image models, we need to fool the model with exactly such inputs. For text, however, small perturbations are easily perceptible: replacing some characters or words can produce invalid words or syntactically incorrect sentences.
  • Perturbing text can also alter the semantics of the sentence, so perturbations can be easily perceived.

Types of adversaries in Text:

  • Concatenation Adversaries — Append distracting but irrelevant sentences at the end of the paragraph. These sentences do not change the answer to the question and can be either carefully generated informative-looking sentences or arbitrary sequences of words. For instance, in the paper “Adversarial examples for evaluating reading comprehension systems”, the authors show that appending such text to the paragraph leads the model to predict incorrectly.
Source: Adversarial examples for evaluating reading comprehension systems.
  • Edit Adversaries — There are two types of attacks in this category: “should not change” attacks and “should change” attacks. “Should not change” attacks modify the text in a way that preserves the label of the original text; examples include randomly swapping neighboring tokens, stop-word dropout, paraphrasing, and introducing grammar errors (a sketch of two of these follows below). “Should change” attacks include methods like the add-negation strategy (negating the root verb of the source input) and replacing words with antonyms.
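
To make the “should not change” edit attacks concrete, here is a minimal sketch of two of them (neighboring-token swap and stop-word dropout). The code is purely illustrative and not taken from any specific attack library; the tiny stop-word list is my own.

```python
import random

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "on"}  # illustrative

def swap_neighbors(tokens, rng=random):
    """'Should not change' attack: randomly swap two neighboring tokens."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    perturbed = list(tokens)
    perturbed[i], perturbed[i + 1] = perturbed[i + 1], perturbed[i]
    return perturbed

def drop_stop_words(tokens, p=0.3, rng=random):
    """'Should not change' attack: drop each stop word with probability p."""
    return [t for t in tokens if t.lower() not in STOP_WORDS or rng.random() > p]

tokens = "The seagull is standing on the beach".split()
print(swap_neighbors(tokens))
print(drop_stop_words(tokens))
# A robust classifier should assign the same label to these perturbed inputs.
```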

Defense against adversarial attacks:

There are two popular approaches to defend against adversarial attacks: Adversarial Training and Knowledge Distillation.

  • Adversarial Training — Use adversarial examples during training. Such data can be collected via data augmentation strategies, e.g., by generating examples synthetically or with language generation models (see the sketch after this list).
  • Knowledge Distillation with Temperature Scaling — Will add soon.
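
Below is a minimal sketch of adversarial training framed as data augmentation, as described in the first bullet above. `model`, `train_step`, and `perturb` (for example, a label-preserving edit like the `swap_neighbors` sketch earlier) are generic placeholders, not a specific published recipe.

```python
# Minimal sketch of adversarial training as data augmentation.
# `train_step(model, tokens, label)` performs one gradient update (placeholder).

def adversarial_training_epoch(model, train_step, dataset, perturb):
    for tokens, label in dataset:
        # Standard update on the clean example.
        train_step(model, tokens, label)
        # Extra update on a label-preserving ("should not change") perturbation,
        # nudging the model to be invariant to such edits.
        train_step(model, perturb(tokens), label)
```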

Out-of-Distribution Reliability

Deep neural networks are often trained under the closed-world assumption, i.e., the test data distribution is assumed to be similar to the training data distribution. However, when such systems are deployed on real-world tasks, this assumption does not hold, leading to a significant drop in performance. While this performance drop is acceptable in tolerant applications like product recommendation, it is dangerous to deploy such systems in intolerant domains like medicine and home robotics, where they can cause serious accidents. An ideal AI system should generalize to Out-of-Distribution (OOD) examples whenever possible and flag the ones that are beyond its capability so that it can seek human intervention. There is a separate dedicated article on OOD here. Here is an outline of that article:

  • A bit on OOD
    — Why is OOD detection important?
    — Why are models brittle to OOD inputs?
    — Types of Generalizations
    — Plausible Reasons for Higher Robustness of pre-trained models (like BERT) than Traditional Models
    — Other Related Problems
  • Approaches to Detect OOD instances
    — Maximum Softmax Probability
    — Ensembling of Multiple Models
    — Temperature Scaling
    — Training a Binary Classification model as a Calibrator
    — Monte-Carlo Dropout
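
As a rough illustration of the first and third approaches in the outline above, the sketch below scores an input by its maximum softmax probability, optionally with temperature scaling, and flags it as OOD when the score falls below a threshold. The threshold and temperature values are illustrative assumptions; in practice they are tuned on held-out data.

```python
import numpy as np

def max_softmax_probability(logits, temperature=1.0):
    """Confidence score: maximum softmax probability, optionally temperature-scaled."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()                              # for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return probs.max()

def is_ood(logits, threshold=0.7, temperature=1.0):
    """Flag an input as OOD when confidence is below the (illustrative) threshold."""
    return max_softmax_probability(logits, temperature) < threshold

print(is_ood([2.1, 1.9, 2.0]))   # near-uniform logits -> low confidence -> True
print(is_ood([8.0, 0.5, -1.0]))  # peaked logits -> high confidence -> False
```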

Other Prominent Robustness-related Works in NLP:

  1. BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance
    This paper explores the following question: “If the same architecture is trained multiple times on the same dataset (with different initial weights and/or a different order of training instances), will it make similar linguistic generalizations across runs?”
    To this end, they train 100 instances of BERT on MNLI and evaluate them on the MNLI and HANS datasets. They found that on MNLI the behavior of all instances was remarkably consistent, but the same models varied widely in their generalization performance on HANS. They attribute this behavior to the presence of many local minima that are equally attractive to a low-bias learner such as a neural network; decreasing the variability may therefore require models with stronger inductive biases.
  2. Pretrained Transformers Improve Out-of-Distribution Robustness
    This work compares the OOD performance of transformer models with BOW, LSTM, and word2vec models.
    They found that transformer models like BERT and RoBERTa are more robust than BOW, LSTM, and word2vec models. They attribute this to the pretraining of transformers, which includes both a self-supervised learning objective and a diversity of data. Furthermore, they found that “More diverse pretraining data can enhance robustness as RoBERTa exhibits greater robustness than BERT Large”.
  3. Selective Question Answering under Domain Shift
    This work explores the setting of selective answering, where a model can choose to abstain from answering when it is not sufficiently confident. They propose a calibration technique that lets the model decide when to answer and when to abstain (a rough sketch of this idea appears after this list). I’ve discussed this technique in more detail in another article.
  4. On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks
    Will add soon.
  5. LEARNING THE DIFFERENCE THAT MAKES A DIFFERENCE WITH COUNTERFACTUALLY-AUGMENTED DATA
    Will add soon.
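
For item 3 above, here is a rough sketch of the selective answering idea: answer only when a confidence (or learned calibrator) score clears a threshold, otherwise abstain. `model.predict`, `calibrator_score`, and the threshold are hypothetical placeholders, not the method from the paper.

```python
def selective_answer(question, model, calibrator_score, threshold=0.8):
    """Answer only when the (hypothetical) calibrator is confident enough;
    otherwise abstain so a human can step in.

    Assumes model.predict(question) -> (answer, softmax_confidence) and
    calibrator_score(...) -> a score in [0, 1]; both are placeholder assumptions.
    """
    answer, confidence = model.predict(question)
    if calibrator_score(question, answer, confidence) >= threshold:
        return answer
    return None  # abstain
```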

Check out my related articles:

References:

  1. Johnson, Justin, et al. “CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
  2. McCoy, R. Thomas, Ellie Pavlick, and Tal Linzen. “Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference.” arXiv preprint arXiv:1902.01007 (2019).
  3. Wang, Alex, et al. “SuperGLUE: A stickier benchmark for general-purpose language understanding systems.” Advances in Neural Information Processing Systems. 2019.
  4. Potts, Christopher, et al. “DynaSent: A Dynamic Benchmark for Sentiment Analysis.” arXiv preprint arXiv:2012.15349 (2020).
  5. https://dynabench.org/
  6. McCoy, R. Thomas, Junghyun Min, and Tal Linzen. “BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance.” arXiv preprint arXiv:1911.02969 (2019).
  7. Hendrycks, Dan, et al. “Pretrained transformers improve out-of-distribution robustness.” arXiv preprint arXiv:2004.06100 (2020).
  8. Kamath, Amita, Robin Jia, and Percy Liang. “Selective question answering under domain shift.” arXiv preprint arXiv:2006.09462 (2020).
  9. Linzen, Tal. “How Can We Accelerate Progress Towards Human-like Linguistic Generalization?.” arXiv preprint arXiv:2005.00955 (2020).
  10. Lewis, Patrick, Pontus Stenetorp, and Sebastian Riedel. “Question and answer test-train overlap in open-domain question answering datasets.” arXiv preprint arXiv:2008.02637 (2020).
  11. Swayamdipta, Swabha, et al. “Dataset cartography: Mapping and diagnosing datasets with training dynamics.” arXiv preprint arXiv:2009.10795 (2020).
  12. Jia, Robin, and Percy Liang. “Adversarial examples for evaluating reading comprehension systems.” arXiv preprint arXiv:1707.07328 (2017).
  13. Mussmann, Stephen, Robin Jia, and Percy Liang. “On the importance of adaptive data collection for extremely imbalanced pairwise tasks.” arXiv preprint arXiv:2010.05103 (2020).
