Thoughts on “Attention is Not Not Explanation”

Byron Wallace
Aug 14, 2019

Sarah Wiegreffe and Yuval Pinter have a nice new paper out with an undeniably fun title: “attention is not not explanation”, continuing an important ongoing conversation around attention and its interpretation in NLP. While the title might suggest otherwise, there actually seems to be a fair amount of agreement with our findings here. Wiegreffe and Pinter end up concluding that researchers ought to be careful in interpreting attention weights as explanations, writing that their results

should provide pause to researchers who are looking to attention distributions for one true, faithful interpretation of the link their model has established between inputs and outputs

They also offer up some diagnostics that might help decide when it may be appropriate to view attention weights as explanatory, which is an intriguing and potentially important direction. However, we’d argue that the fact that such diagnostics are necessary in the first place implies that attention is decidedly not equivalent to explanation.

“Attention is not explanation” in the same way that “correlation is not causation”. Of course, correlation can be causation, if special care is taken or key assumptions made (e.g., if randomization is in play or if one uses methods to adjust for all potential confounding), but it is not in general; hence the mantra (I suppose it is also correct to say that correlation is not not causation). So too with attention and explanation: Our argument is that one should not simply assume attention weights reveal the inputs responsible for an output. We explicitly noted in our paper that certain variants of attention may afford more faithful explanation (e.g., when the encoder does not entangle inputs, and variants of hard attention). But this should not be taken as a given, and specifically does not consistently hold in the RNN + attention variant commonly used. The results reported by Wiegreffe and Pinter would seem to agree with this, and they take the welcome next step of trying to design diagnostics to help practitioners, which is a nice contribution.

In our work we presented results interrogating two questions: (1) Are attention weights consistently correlated with established feature importance scores (e.g., leave-one-out scores)? And (2) do different attention distributions necessarily yield different predictions? Wiegreffe and Pinter find our results regarding the former compelling, and focus their work on the latter.
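To make question (1) concrete, here is a minimal sketch of the kind of comparison involved. Everything in it is an assumption for illustration: the toy model (`predict`), the random hidden states `h`, the parameters `v` and `w`, and the choice of Kendall's tau as the rank correlation. It also treats hidden states as fixed when a token is removed, whereas a real encoder would recompute them. It is not the code from our experiments.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predict(h, v, w):
    """Toy attention classifier: returns (attention weights, output probability)."""
    alpha = softmax(h @ v)      # one attention weight per token
    context = alpha @ h         # attention-weighted sum of hidden states
    y = 1.0 / (1.0 + np.exp(-(context @ w)))
    return alpha, y

def leave_one_out(h, v, w):
    """Importance of token t = |change in output when token t is removed|."""
    _, y = predict(h, v, w)
    scores = np.empty(len(h))
    for t in range(len(h)):
        _, y_t = predict(np.delete(h, t, axis=0), v, w)
        scores[t] = abs(y - y_t)
    return scores

def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    n = len(a)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(a[i] - a[j]) * np.sign(b[i] - b[j])
    return s / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))   # 8 tokens, 4-dim hidden states (random stand-ins)
v = rng.normal(size=4)        # hypothetical attention parameters
w = rng.normal(size=4)        # hypothetical output-layer parameters

alpha, y = predict(h, v, w)
loo = leave_one_out(h, v, w)
tau = kendall_tau(alpha, loo)
print(f"Kendall tau between attention and leave-one-out scores: {tau:.3f}")
```

A tau near 1 would mean attention ranks tokens the same way leave-one-out does; values near 0 mean the two rankings are largely unrelated.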

Heatmaps provide explanations for instances, not datasets

The subset of our experiments Wiegreffe and Pinter focus on and extend addressed this question: Had the model attended to different contextualized input representations, would the prediction have necessarily changed? To answer this, we manipulated attention weights on individual instances, because heatmaps highlight inputs in specific examples. In other words, heatmaps purport to explain individual predictions, i.e., why a model made this specific prediction; hence the motivation for manipulating instance-level weights. (Note that the “erasure” experiments of Serrano and Smith also perform instance-level manipulations.) One reading of heatmaps is: “this prediction was made because the model attended to these features”. Our experimental setup was intended to interrogate this, i.e., the degree to which the decoder (a dense classification layer or MLP that consumes the representation induced using attention) depends on the particular attention weights assigned, that is, those presented in a heatmap. These experiments were intended to complement the correlation results.

We ran two sets of experiments that looked at counterfactual attention distributions. First we randomly permuted observed attention weights and measured changes in output, and then we explicitly sought (via optimization) maximally different attention distributions that yielded equivalent predictions. Wiegreffe and Pinter do not discuss our permutation results much, focussing instead on the adversarial setting. However, we found that randomly permuting attention weights often (though not always) yields relatively little change in the output, which seems problematic w.r.t. providing transparency. We simply shuffled (permuted) weights and compiled the observed differences into the densities we report. This procedure does not enjoy the same flexibility as the adversarial optimization case.
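The permutation procedure is simple enough to sketch in a few lines. The version below is illustrative only: the hidden states `h`, the output weights `w`, and the single-layer sigmoid "decoder" are hypothetical stand-ins, and the function names are mine rather than anything from our released code.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def output_from_attention(alpha, h, w):
    """Feed an attention distribution (observed or counterfactual) to the output layer."""
    return 1.0 / (1.0 + np.exp(-((alpha @ h) @ w)))

def permutation_deltas(alpha, h, w, n_perm=100, seed=0):
    """Shuffle attention weights across tokens; return |change in output| per shuffle."""
    rng = np.random.default_rng(seed)
    y = output_from_attention(alpha, h, w)
    deltas = np.empty(n_perm)
    for k in range(n_perm):
        shuffled = rng.permutation(alpha)   # same weights, reassigned to other tokens
        deltas[k] = abs(y - output_from_attention(shuffled, h, w))
    return deltas

rng = np.random.default_rng(1)
h = rng.normal(size=(10, 4))               # random stand-in hidden states
alpha = softmax(h @ rng.normal(size=4))    # "observed" attention for this instance
w = rng.normal(size=4)                     # hypothetical output-layer parameters

deltas = permutation_deltas(alpha, h, w)
print(f"median |delta y| over shuffles: {np.median(deltas):.3f}")
```

Note that the hidden states are held fixed and only the weights are reassigned, so any change in output is attributable to the attention distribution alone. Collecting `deltas` over many instances gives the kind of density described above.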

Wiegreffe and Pinter specifically focus on our “adversarial attention weights” experiments (Subsection 4.2.2 in our paper). They take issue with our adversarially manipulating attention weights for instances rather than globally (via attention mechanism weight parameters), arguing that this affords a large degree of flexibility in the optimization, which is true. But the problem is not trivial: one can observe a clear trend that for “peaky” (large maximum weight) attention distributions, it becomes harder to find distributions with large JS divergence (see here, under “adversarial scatterplots”). Their extension of our setup (where they find an adversarial model trained on all instances) is interesting and “harder” than our framing. But it also addresses a different question than the one we were after in our work: we were interested in assessing local explainability, whereas Wiegreffe and Pinter seem interested in global explainability.
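For readers who want a feel for the instance-level adversarial setup, here is a rough sketch. It deliberately swaps in naive random search over Dirichlet samples for the optimization we actually used, and the toy output layer, the tolerance `tol`, and all function names are assumptions made for illustration.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def output_from_attention(alpha, h, w):
    return 1.0 / (1.0 + np.exp(-((alpha @ h) @ w)))

def random_search_adversary(alpha, h, w, tol=0.01, n_trials=5000, seed=0):
    """Among random candidates whose output stays within `tol` of the original,
    return the one with the largest JS divergence from the observed attention."""
    rng = np.random.default_rng(seed)
    y = output_from_attention(alpha, h, w)
    best, best_jsd = alpha, 0.0
    for _ in range(n_trials):
        cand = rng.dirichlet(np.full(len(alpha), 0.5))  # random point on the simplex
        if abs(output_from_attention(cand, h, w) - y) <= tol:
            jsd = js_divergence(alpha, cand)
            if jsd > best_jsd:
                best, best_jsd = cand, jsd
    return best, best_jsd

rng = np.random.default_rng(2)
h = rng.normal(size=(6, 4))                    # random stand-in hidden states
scores = h @ rng.normal(size=4)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                           # "observed" attention
w = rng.normal(size=4)                         # hypothetical output-layer parameters

adv, jsd = random_search_adversary(alpha, h, w)
print(f"max attention weight: {alpha.max():.2f}, best JSD found: {jsd:.3f}")
```

With natural logarithms the JS divergence is bounded above by ln 2 (about 0.693). One way to probe the trend mentioned above would be to run this over many instances and plot the best divergence found against `alpha.max()`.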

It is true that manipulating attention weights “as a primitive” may (and likely will) yield configurations that are not likely under the attention mechanism. This is why these experiments were conducted and reported following our results concerning correlations with feature importance scores, which again suggest that attention is not obviously highlighting “important” features. It could nonetheless be the case that the “focus” on specific contextualized representations explains the prediction made in a causal sense. If this held, then heatmaps would have a nice interpretation: Manipulating them prior to the decoder would change the prediction. We set out to assess whether this property holds, not whether a general (global) adversarial model can be found (as Wiegreffe and Pinter do). We think the latter setup is interesting; it is just not the one that we investigated.

As we wrote in our paper “these experiments tell us the degree to which a particular (attention) heat map uniquely induces an output”. Wiegreffe and Pinter propose an extension that performs a global optimization over instances to train a model that induces divergent attention distributions without much changing predictions (across all instances). This is nice because it provides global, plausible counterfactual attention distributions, addressing the issue that optimized attention weights may be unlikely under the attention module (in some sense “unlikely” counterfactual examples); our adversarial experiments only established that alternative attention distributions that yield equivalent predictions exist. But then it is also different from what we had aimed to show. We did not claim to “show the existence of an adversarial model”; only that predictions (outputs) from the decoder are at least sometimes relatively invariant to assigned attention weights.

Our case against treating attention as providing faithful explanation in general was based on this combined with the aforementioned correlation results, as well as the permutation results. None of these point to attention being reliably interpretable as explanation in the transparency sense, so there is no reason to assume that it is. (People seem to have different priors on this; it is interesting that claims that something is interpretable seem to not receive much scrutiny, despite this being the stronger claim.)

In any case, in their setting, Wiegreffe and Pinter find that one can in fact sometimes identify such adversarial models (with different attention weights yet roughly equivalent predictions over the entire dataset), and conclude that this should give pause to researchers looking for “faithful interpretation” of models from attention weights. We agree!

Explanations without performance gains are still useful

Wiegreffe and Pinter decided to exclude datasets from their analysis where attention does not offer predictive improvements, writing “attention is not explanation if you don’t need it”. However, this does not seem a priori obvious to me: If we believe attention provides transparency, then why would we not want to use it as an explanatory mechanism, even when it does not boost performance? We do not, for example, in general insist that L1 regularization boost performance in order to use it for feature selection.

If attention is (not not) explanation in general, then why should it only be when it improves performance? (Furthermore, taken to its logical end, shouldn’t we only consider attention explanatory for an instance when it improves performance on that instance?) I can see a case being made that it cannot provide meaningful explanation in such cases, but again it does not seem like a given. Perhaps this is a reasonable diagnostic, complementary to the others proposed in the work.

Higher-level thoughts

Wiegreffe and Pinter are correct to point out that alternative notions of interpretability exist. We have focussed on explainability in terms of transparency, i.e., where the model can point to why it made the prediction it did. Plausible “explanations” (that humans find intuitively agreeable) may be desirable in some cases. However, we would argue that in many high-stakes scenarios, e.g., where a model is being used in a legal or healthcare setting (ignoring for the moment whether this is appropriate to begin with), a model that provides a plausible but at the same time unfaithful explanation would be the most dangerous possible outcome. This is a particular concern with respect to fairness; a model might provide as explanation an unprotected variable when it is actually basing its prediction on a protected attribute (especially if the unprotected and protected attributes are correlated). It is also unclear what is actually being “explained” in such cases; certainly not the model output.

Given that the experiments presented in this work (along with our work and other recent, independent work by Serrano and Smith, who also conclude that attention “should not be treated as justification for a decision”, and just-published work by Brunner et al. on self-attention which argues that “attention visualizations are misleading”) highlight potential problems with assuming that attention provides faithful explanations, we join Wiegreffe and Pinter in urging researchers and practitioners to think carefully about the assumptions being made when analyzing attention weights. And we welcome the research direction they propose of designing reliable diagnostics that may allow for such careful consideration.

Finally, a shout out to Sarthak Jain (lead on the original work) for providing the source code for all of our experiments that enabled Wiegreffe and Pinter to conduct this research efficiently! It is nice to see research progress quickly through open science.

Thanks also of course to Sarah Wiegreffe and Yuval Pinter for the continued engagement with our work, and for pushing forward our understanding of attention.