Attention is not not Explanation
[Update, August 13 — December 6, 2019: Sarah Wiegreffe and I performed experiments to follow up on the points here, as well as constructive setups for detecting and claiming faithful explanation, presented at EMNLP 2019. The paper is available here.
Byron Wallace responded to the paper here.]
[This post is intended for an NLP practitioner audience, and assumes its readers know what attention modules are and how they are being used.
All feedback is welcome, either here or to firstname.lastname@example.org or to @yuvalpi on Twitter.]
An upcoming NAACL paper was uploaded to arXiv earlier this month, and has been making the rounds on social media. The title chosen for it was Attention is not Explanation; the authors are Sarthak Jain and Byron C. Wallace (from here on I will refer to them, and the paper, as J&W). Such a title sets high expectations for a rigorous, convincing proof of the claim. In this post I argue that it does not deliver on them.
Briefly, my main points are:
- Explanation can mean different things, and J&W are not clear on which interpretation they wish to disassociate from attention models;
- Their first empirical study, a correlation-based analysis of attention scores, is neither sufficient to advance the claim nor convincing in its results;
- Their second study, an attention-distribution manipulation experiment, is orthogonal to the claim, and designed with such a high degree of freedom, that its results have little to no meaning.
I write this critique in the hopes that future work treats the paper with caution, not taking the (admittedly catchy) title for its word. Attention Might Still Be Explanation.
Main Claims of J&W
J&W make the following assertions off the bat, preparing us for the experiments to come. The assertions are left unmotivated, leaving us to accept them based on common sense:
(i) Attention weights should correlate with feature importance measures (e.g., gradient-based measures);
This claim seems reasonable, although one could quibble about the level of correlation needed, given that attention scores are baked into the trained model being examined.
(ii) Alternative (or counterfactual) attention weight configurations ought to yield corresponding changes in prediction (and if they do not then are equally plausible as explanations).
This assertion I find troublesome from an inferential perspective: my explanation for why it’s raining today may involve ocean currents, atmospheric pressure, and cloud formations. An alternative explanation could cite anger from the god of thunder. It yields the same prediction, but I wouldn’t call it equally plausible.
Correlation is not Correlation
We’ll now look at the two results provided by J&W in order to support their claim. These are: (a) a correlation analysis of attention scores with those of other interpretability measures; and (b) an adversarial search for alternative attention distributions which result in the same predictions.
J&W look at the following types of tasks: seven binary prediction datasets; the ternary-prediction SNLI; bAbI, where answers are selected from a small set of candidate statements; and CNN, a QA task where the answer is a single entity appearing in a paragraph. They explicitly avoid seq2seq models, claiming that attention is not used much for explanation purposes in that literature. Underneath the attention layer, they train an LSTM variant, a ConvNet (here’s me avoiding the unfortunate ambiguous acronym), and an average-over-feedforward model.
The correlation study (section 3.1) pits score distributions produced by attention models against other, well-grounded accounts of feature importance, namely gradient analysis and leave-one-out. The correlation metric chosen by J&W is Kendall-tau, a rank correlation which does not take score values into account. I think this choice is unfavorable to contextual models with soft attention distributions: these are characterized by long tails where differences in score are inconsequential, resulting in many near-arbitrary ranks that don’t really tell us much but lower the correlation score. The averaged-feedforward baseline is likely to correlate better, since it sees no context and the individual tokens must account for the predictions all by themselves. [Incidentally, the paper does not report prediction results for the baseline model. Browsing the project’s website reveals it is often substantially inferior to LSTM and ConvNet, as much as halving accuracy on the CNN QA task.]
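To make the long-tail concern concrete, here is a toy sketch (the scores and the tail size are entirely made up for illustration) of how a rank correlation can come out middling even when two scorers agree on every salient token, simply because the inconsequential tail tokens are ranked arbitrarily:

```python
import itertools
import random

def kendall_tau(x, y):
    """Kendall rank correlation: (concordant - discordant) / total pairs."""
    pairs = list(itertools.combinations(range(len(x)), 2))
    concordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    discordant = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (concordant - discordant) / len(pairs)

random.seed(0)
# Two scorers agree perfectly on the three salient tokens...
head_a, head_b = [0.5, 0.3, 0.1], [0.45, 0.35, 0.12]
# ...but rank the 17 near-zero tail tokens in essentially arbitrary order.
tail_a = sorted((random.uniform(0, 0.01) for _ in range(17)), reverse=True)
tail_b = random.sample(tail_a, k=len(tail_a))

tau_head = kendall_tau(head_a, head_b)  # 1.0: full agreement where it matters
tau_full = kendall_tau(head_a + tail_a, head_b + tail_b)
print(tau_head, tau_full)  # the tail's arbitrary ranks drag the overall tau down
```

The tail pairs contribute roughly as many discordances as concordances, so the overall score lands well below the perfect agreement observed on the head of the distribution.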
With these reservations noted, we can turn to the actual scores reported by J&W (Table 2). To get a sense of Kendall tau, a score of 0.33 for a sentence means that two words are twice as likely to be ranked in the same order by both the attention module and the ‘principled’ analysis method as they are to disagree in order. Only one classification dataset in J&W, 20News, fails to reach this level when compared with the Gradient method. Another concern at this point is that we’re not presented with the correlation between the two non-attention techniques themselves as a baseline. Arguably, if these are also low, the message is simply that multiple distributions can be “good” without correlating well (undermining the authors’ main claim (i)).
Finally, although gradient and leave-one-out are tried and true for measuring the impact of each token on a prediction, there is another evaluation on which attention models fare well: human evaluation. Consider this study by Mullenbach et al., where (section 4.2) given possible explanations for medical code prediction by different models, expert practitioners judged the attention scores to be the most informative. A similar examination by J&W would have made a more compelling case.
Counterfactual Distributions are not Counterfactual Weights
In their second experiment (section 4.2), J&W manipulate the attention distributions of the trained models on each task, instance by instance, first at random (4.2.1) and later adversarially (4.2.2), to conclude that it is easy to “explain” model predictions in ways other than those in the original distribution. I disagree both with the power claimed for the experiments and with the conclusive reasoning, for several reasons.
Existence does not Entail Exclusivity.
On a theoretical level, attention scores are claimed to provide an explanation; not the explanation (recall the rain analogy). If the final layer of a model produces outputs which may be aggregated into the same (correct) prediction in various ways, it still makes the choice (having trained an attention component) of a specific weighting distribution. Moreover, when we consider that most tasks selected by J&W are binary, the experimental setting maps into a vast amount of model freedom: aggregating 180 scalars (the average length of an IMDb instance) to a desired prediction scalar in the (-1,1) range cannot be expected to be a difficult task when no other constraints exist.
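The degrees-of-freedom point can be illustrated directly. Under the simplifying (and hypothetical) assumption that the prediction is an attention-weighted sum of per-token output scalars, any pair of tokens whose scalars straddle the target value yields a distinct distribution producing exactly the same output:

```python
import random

random.seed(1)
n = 180  # roughly the average IMDb instance length cited above
h = [random.uniform(-1, 1) for _ in range(n)]  # hypothetical per-token output scalars

# Take a uniform "original" attention; the prediction is the weighted sum.
alpha = [1.0 / n] * n
p = sum(a * x for a, x in zip(alpha, h))

def two_token_distribution(h, p, lo, hi):
    """All mass on tokens lo and hi (with h[lo] < p < h[hi]),
    weighted so that the attention-weighted sum is exactly p."""
    w = (p - h[lo]) / (h[hi] - h[lo])
    alt = [0.0] * len(h)
    alt[lo], alt[hi] = 1.0 - w, w
    return alt

below = [i for i in range(n) if h[i] < p]
above = [i for i in range(n) if h[i] > p]
# 100 radically different distributions, every one producing the same prediction:
alternatives = [two_token_distribution(h, p, lo, hi)
                for lo in below[:10] for hi in above[:10]]
print(len(alternatives), "alternative distributions, identical output")
```

With no constraint other than reproducing one scalar, alternative distributions are abundant by construction; the trained model nonetheless committed to one of them.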
Indeed, the most open-ended task, QA over CNN data, proves considerably harder to manipulate by random permutation (Figure 3f in the paper). Similarly, the adversarial examples presented in Appendix C for the QA datasets select an incorrect token instance of the correct type (garden, entity13), which should not surprise us given that the underlying model is an LSTM: encoder hidden states are typically affected by the input word to a noticeable degree.
Given all this, if we use attention scores to explain to a user why a model picked a certain token, shouldn’t we be glad that despite being able to reach the same prediction by giving a high weight to a different token of the same type, the attention mechanism managed to focus on the correct instance?
Attention Distribution is not a Primitive.
From a modeling perspective, detaching the attention scores from the component that produced them (i.e. the attention mechanism) degrades the model itself.
The attention weights, after all, were not “assigned” in some post-hoc manner by the model as the permutation protocol assumes, but rather computed by an integral component whose parameters (e.g. the linear maps Wᵢ in the Additive attention flavor) were trained alongside the rest of the layers. These components depend on each other for their behavior. J&W provide alternative distributions which may result in similar predictions, but in the process they sever the very linkage by which attention modelers motivate the explainability of these distributions, namely the fact that the model was trained to attend to the tokens it chose.
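A minimal sketch of the additive attention flavor makes this concrete (dimensions and parameter values here are illustrative stand-ins; a real model would learn W and v by backpropagation). The weights are a deterministic function of trained parameters and hidden states, so substituting an arbitrary point on the simplex severs exactly the linkage being appealed to:

```python
import math
import random

random.seed(2)
d, T = 4, 5  # hidden size and sentence length (illustrative)

# "Trained" parameters of the additive scorer e_t = v . tanh(W h_t);
# in a real model these come from training, here they are random stand-ins.
W = [[random.gauss(0, 0.5) for _ in range(d)] for _ in range(d)]
v = [random.gauss(0, 0.5) for _ in range(d)]

def attention(hidden_states):
    """Attention weights are computed from W and v, never assigned post hoc."""
    def score(h):
        Wh = [math.tanh(sum(W[i][j] * h[j] for j in range(d))) for i in range(d)]
        return sum(vi * wi for vi, wi in zip(v, Wh))
    e = [score(h) for h in hidden_states]
    m = max(e)  # stabilized softmax over the scores
    exp_e = [math.exp(x - m) for x in e]
    z = sum(exp_e)
    return [x / z for x in exp_e]

H = [[random.gauss(0, 1) for _ in range(d)] for _ in range(T)]
alpha = attention(H)
# Overwriting alpha with some other point on the simplex leaves W and v
# explaining nothing: the substituted weights are no longer a function of the model.
```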
[This concern also applies to the correlation study, where attention is compared to word importance by gradient flow: to keep the models’ components as comparable as possible across salience distributions, J&W measure the gradient flow below the attention level only. But these are the layers that are not meant to be affected by individual words, since attention does that for them. This might be another factor lowering correlation.]
Add to this the fact that the setup does not enforce any degree of consistency across instances, and it seems that the paper offers no real evidence of a coherent adversarial model which still makes the same predictions as the originally-trained one.
Not all Labels are Alike.
Of the tasks selected by the authors, one stands out in its resilience to counterfactual manipulation: predicting Diabetes from clinical notes, using either ConvNet or LSTM. Its output difference (Figures 3c-d, 4c-d) follows a distinct pattern: the positive-label instances are very hard to manipulate, whereas the negative-label ones are not [The JS divergence distribution for the positive class in LSTM may seem impressive, but keep in mind the model achieved a lukewarm 0.79 F1 score on this task]. I suggest that this behavior emerges from the nature of a true 0–1 task: the presence of diabetes is likely to be described using a short phrase or several domain-salient words, but its absence is likely not to be mentioned at all, rendering any part of the document as (non-)informative as the rest, and thus allowing many alternative distributions to be as useful as the original one. Similar arguments could be made for the results on the different classes of SNLI (Figures 3g-h, 4e-f).
Explanation is not Explanation
Explanation is a loaded term, both in official AI terminology and in human conceptual thought. Some recent literature has taken care to tease away the various senses commonly conflated in the use of this term and some related ones such as transparency and interpretability. Most notable is Zachary Lipton’s 2016 survey, The Mythos of Model Interpretability. J&W cite it, but still use the terms somewhat interchangeably, making it hard to understand which sense they believe people are using attention as a signal for, or which it should not be used for. As an example, consider the nuance separating these two notions of attention:
- Attention as a sanity check: we, who built the (say) translation model, have an idea which words in the source text “should” map to which words in the target text, and it would be a neat demo if a component in the model shows us exactly the patterns we expect.
- Attention as a tool: the model is looking at (say) the patient visit summary, predicting which condition was diagnosed (supervised by existing annotation), and tells us through attention which part of the text caused it to make the prediction. The outputs of the model may now help in seeding a glossary for medical diagnosis, for example.
J&W appear to be targeting the first usage, whereas for example the diagnosis study I cited before is concerned with the second. Would definitive results from well-prepared experiments on J&W’s questions even have consequences towards this type of work?
Attention might be explanation. It might not be explanation. Whether or not it is depends heavily on the underlying model architecture; on the task; on the sense of “explanation” we’re after. I haven’t touched on points such as the use of attention in generation tasks like image captioning, where visual attention is quite convincingly argued to be explanatory. But one thing I believe in is that broad claims require rigorous argumentation and convincing empirical results.
I thank my labmates at Georgia Tech’s Computational Linguistics lab, and especially Sarah Wiegreffe, for the discussion that led to this post and notes on its draft; errors and tone are all mine.