Summary: Pathologies of Neural Models Make Interpretations Difficult (EMNLP 2018)

Sameer Singh
Oct 8, 2018


Authors: Shi Feng, Eric Wallace, Alvin Grissom II, Mohit Iyyer, Pedro Rodriguez, Jordan Boyd-Graber

Interpretability in machine learning often focuses on identifying which parts of the input matter most to the model: that’s what gradient-based saliency methods do, that’s what LIME does, and that’s what methods for finding adversarial examples do. The underlying idea is to identify parts of the input that, if removed, change the model’s prediction.

Instead, this paper focuses on input reduction: removing as much of the input as possible without changing the prediction. Similar ideas have appeared recently under different names: I personally like “overstability”, as used in another paper at EMNLP 2018, and we used “sufficient conditions” to describe a related concept at AAAI 2018. These approaches find the parts of the input that the model is not looking at: the ones we can remove without changing the prediction.

Back to the paper: the authors study three important NLP tasks (SQuAD, VQA, and SNLI) and pick one model for each. They remove tokens iteratively, each time picking the one with the smallest gradient, until the prediction changes. This greedily approximates the largest reduction of the input that leaves the prediction unchanged. They also include a beam-search variation of this greedy approach. The examples, some shown below, are really “impressive”!

Figure: examples of reduced inputs, from the original paper.
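To make the greedy procedure concrete, here is a minimal sketch of input reduction on a toy model. It is not the authors’ implementation: the `WEIGHTS` table, `predict`, and `importance` are hypothetical stand-ins for a real neural model and its gradient-based importance scores, and the beam-search variant is left out.

```python
from typing import Callable, List, Sequence

# Hypothetical toy model: a linear bag-of-words sentiment scorer.
WEIGHTS = {"great": 2.0, "fun": 1.0, "surprisingly": 0.3, "boring": -1.5}


def predict(tokens: Sequence[str]) -> str:
    """Return "pos" if the summed token weights are positive, else "neg"."""
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    return "pos" if score > 0 else "neg"


def importance(tokens: Sequence[str]) -> List[float]:
    """Stand-in for gradient-based importance.

    For this linear model, the gradient of the score with respect to a
    token's presence indicator is just its weight, so we use |weight|.
    """
    return [abs(WEIGHTS.get(t, 0.0)) for t in tokens]


def input_reduction(
    tokens: Sequence[str],
    predict_fn: Callable[[Sequence[str]], str],
    importance_fn: Callable[[Sequence[str]], List[float]],
) -> List[str]:
    """Greedily drop the least important token while the prediction holds."""
    original = predict_fn(tokens)
    current = list(tokens)
    while len(current) > 1:
        scores = importance_fn(current)
        drop = min(range(len(current)), key=lambda i: scores[i])
        candidate = current[:drop] + current[drop + 1:]
        if predict_fn(candidate) != original:
            break  # removing this token would change the prediction
        current = candidate
    return current


reduced = input_reduction(
    "the movie was surprisingly great fun".split(), predict, importance
)
print(reduced)  # ['great']
```

On this toy model the reduced input is still sensible, because a linear scorer has no interactions between tokens. The paper’s point is that for real neural models the same loop tends to end on inputs that look meaningless to a human.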

The paper, after confirming that this reduction is pathological and in no way meaningful to humans, goes on to offer some insightful thoughts on why saliency/importance-based methods pick parts so different from the ones that reduction keeps. In particular, the authors point out how neural nets can be over-confident on arbitrary-looking input, and how gradient/saliency-based methods (or even LIME, since it makes a linear approximation) do not capture the strong second-order effects that are clearly present in neural networks. As an additional contribution, they introduce an entropy-based data-augmentation loss (“the model should not be confident on these reduced inputs!”) and show that it addresses the problem (to some extent).
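One plausible reading of the entropy-based loss is sketched below: keep the usual negative log-likelihood on the full input, and subtract a term that rewards high output entropy on the reduced input. The exact form in the paper may differ; the function names and the weight `lam` are assumptions made for illustration.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def regularized_loss(
    probs_full: Sequence[float],
    gold_index: int,
    probs_reduced: Sequence[float],
    lam: float = 1.0,  # hypothetical weight on the entropy term
) -> float:
    """Negative log-likelihood on the full input, minus lam times the
    entropy of the model's output on the reduced input."""
    nll = -math.log(probs_full[gold_index])
    return nll - lam * entropy(probs_reduced)


# A model that stays uncertain on the reduced input is penalized less
# than one that remains confident on it.
uncertain = regularized_loss([0.9, 0.1], 0, [0.5, 0.5])
confident = regularized_loss([0.9, 0.1], 0, [0.99, 0.01])
print(uncertain < confident)  # True
```

Minimizing this quantity pushes the model toward a uniform output on reduced inputs while leaving the standard objective on full inputs in place.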

Being a big fan of such tools for analyzing neural networks, I found this paper a really fun read. It provides even more examples of how brittle these systems are, but rather than just going “haha, look at this terrible model”, it gives useful insights into why this might happen, along with some potentially impactful ideas for addressing it. To continue this line of research, I am curious about characterizing these observations in a useful way. For example, in retrospect, if we had had this tool a few years ago, how would it have helped us diagnose that SQuAD models are mostly performing a version of lexical matching?