Explainability Of BERT Through Attention

Mrinal Anand · Analytics Vidhya · Nov 23, 2019

In this post we take a step towards the explainability of BERT, explaining what BERT “sees” by analyzing its attention weights, following the analysis in the recent paper “What Does BERT Look At? An Analysis of BERT’s Attention” (Clark et al., 2019).

Introduction

BERT (Bidirectional Encoder Representations from Transformers) was introduced back in 2018 by Google AI Language. It achieved state-of-the-art results on a wide range of tasks such as sentence classification, question answering, and natural language inference.

BERT applies bidirectional training to the Transformer (a purely attention-based model that captures long-range dependencies). The BERT paper also introduces the masked language model (Masked LM) objective, which makes bidirectional training possible: some input tokens are replaced with a [MASK] token, and the model learns to predict them from both the left and the right context.

BERT Model Architecture
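
As a quick illustration of the Masked LM objective, here is a minimal sketch that asks a pre-trained BERT to fill in a masked token, using the Hugging Face transformers library (the checkpoint name bert-base-uncased is an assumption; any pre-trained BERT would do):

```python
# Minimal sketch of the Masked LM objective (assumes `pip install transformers torch`).
from transformers import pipeline

# Load a pre-trained BERT checkpoint with a masked-language-modeling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT fills in [MASK] using context on BOTH sides of it.
for prediction in unmasker("The man went to the [MASK] to buy a gallon of milk."):
    print(prediction["token_str"], round(prediction["score"], 3))
```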

Investigating which aspects of language BERT learns helps in verifying the robustness of the model. Since BERT is based on the attention mechanism, the easiest way is to visualize the attention weights for different input sentences. Other ways to better understand language models are to examine their outputs on carefully handcrafted sentences, or to use probing classifiers that investigate the model’s internal vector representations. Some of the findings the authors report about BERT are:

  • BERT’s attention heads exhibit patterns such as attending to delimiter tokens, to specific positional offsets, or broadly over the whole sentence.
  • Attention heads in the same layer often exhibit similar behaviors.
  • Some of the heads correspond well to linguistic notions of syntax and coreference.

From recent work, we can say that pre-training teaches BERT a lot about the structure of language, but we don’t know exactly which linguistic features the model learns. Here are a few plots that show, through the attention weights, the different linguistic features that different heads of the model have learned.

Examples of different heads attending to different parts of sentences.
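
To reproduce this kind of per-head visualization, we can ask the model to return its attention weights directly. Here is a minimal sketch using the Hugging Face transformers library (the checkpoint name and the layer/head indices are arbitrary choices for illustration):

```python
# Sketch: extract and plot the attention map of one BERT head.
# Assumes `pip install transformers torch matplotlib`.
import torch
import matplotlib.pyplot as plt
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentence = "The cat sat on the mat because it was tired."
inputs = tokenizer(sentence, return_tensors="pt")
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape [batch, num_heads, seq_len, seq_len].
layer, head = 7, 10                          # e.g. layer 8, head 11 (0-indexed)
attn = outputs.attentions[layer][0, head]    # [seq_len, seq_len]

plt.imshow(attn.numpy(), cmap="Blues")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.xlabel("attended-to token")
plt.ylabel("attending token")
plt.tight_layout()
plt.show()
```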

Attention on Separator Token

One of the key observations the authors made is that a substantial amount of BERT’s attention is focused on just a few tokens. For example, more than 50% of BERT’s attention in layers 6-10 focuses on the [SEP] token. The [SEP] and [CLS] tokens are guaranteed to be present in every input and are never masked out.

Each point corresponds to the average attention a particular BERT attention head puts towards a token
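
Here is a rough sketch of how such averages could be computed, using a couple of example sentences rather than the full corpus used in the paper (the checkpoint name is again an assumption):

```python
# Sketch: average fraction of attention each layer puts on [SEP].
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

sentences = [
    "The keys, which were on the table, are gone.",
    "After lunch, he went straight back to work.",
]

sep_fraction = torch.zeros(model.config.num_hidden_layers)
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt")
    sep_positions = inputs["input_ids"][0] == tokenizer.sep_token_id
    with torch.no_grad():
        attentions = model(**inputs).attentions    # tuple of [1, heads, seq, seq]
    for layer, attn in enumerate(attentions):
        # Attention mass landing on [SEP], averaged over heads and attending positions.
        sep_fraction[layer] += attn[0][:, :, sep_positions].sum(-1).mean()

sep_fraction /= len(sentences)
for layer, frac in enumerate(sep_fraction, start=1):
    print(f"layer {layer}: {frac:.2f} of attention goes to [SEP]")
```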

A possible reason is that these special tokens act as a “no-op”: a head attends to [SEP] when its learned function is not applicable to the current input. To validate this hypothesis, the authors computed the gradient of the loss with respect to each attention weight, which measures how much changing the attention to a token would change BERT’s output. The results showed that starting from layer 5 (the layer where attention to [SEP] becomes high), the gradients for attention to [SEP] become small, indicating that attending to [SEP] doesn’t substantially affect BERT’s output.

Gradient-based feature importance estimates for attention to [SEP], period/commas, and other tokens
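
Below is a minimal sketch of this gradient-based measure. It uses the masked-LM loss and relies on the attention tensors returned by the Hugging Face implementation being part of the autograd graph, so treat it as an approximation of the paper’s setup rather than an exact reproduction:

```python
# Sketch: magnitude of d(loss)/d(attention weight) for attention to [SEP].
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
labels = inputs["input_ids"].clone()

# Mask one token; the MLM loss is the cross-entropy of predicting it back.
masked_pos = 5                                    # position of "jumps" (illustrative)
inputs["input_ids"][0, masked_pos] = tokenizer.mask_token_id
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100   # ignore other positions

outputs = model(**inputs, labels=labels)
# Gradient of the loss w.r.t. every attention probability, one tensor per layer.
grads = torch.autograd.grad(outputs.loss, outputs.attentions)

sep_positions = inputs["input_ids"][0] == tokenizer.sep_token_id
for layer, g in enumerate(grads, start=1):
    sep_grad = g[0][:, :, sep_positions].abs().mean()
    print(f"layer {layer}: mean |d loss / d attention-to-[SEP]| = {sep_grad:.5f}")
```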

Attending to the Whole Sentence vs. Attending to Individual Tokens

To analyze how broad the attention is, the authors calculate the average entropy of each head’s attention distribution. They found that some attention heads in the lower layers have very broad attention: these heads spend at most 10% of their attention mass on any single word, so their outputs are nearly bag-of-words representations of the sentence. The middle layers, in contrast, have somewhat narrower attention.

Entropies of attention distributions
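
The entropy itself is straightforward to compute from the attention tensors extracted earlier; the sketch below averages it over heads and positions for a single example sentence (same assumed Hugging Face setup as above):

```python
# Sketch: average entropy of each layer's attention distributions (in nats).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The keys, which were on the table, are gone.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions    # tuple of [1, heads, seq, seq]

eps = 1e-12
for layer, attn in enumerate(attentions, start=1):
    p = attn[0]                                    # [heads, seq, seq]; each row sums to 1
    entropy = -(p * (p + eps).log()).sum(dim=-1)   # entropy of each attending position
    print(f"layer {layer}: mean attention entropy = {entropy.mean():.2f} nats")
```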

They also measured the entropy of attention from the [CLS] token for all attention heads. In the early layers this entropy is high; it decreases in the middle layers (6-10), where the attention weight on [SEP] increases; and in the last layer it rises again to around 3.89 nats, indicating very broad attention. As we move towards the final layers, the focus of attention shifts from redundant tokens like [CLS] and punctuation to more significant words.

Learning Syntax

Now we will investigate which aspects of language the attention heads learn. BERT uses byte-pair tokenization, and this tokenizer sometimes splits a word into multiple tokens. So, to evaluate the heads at the word level, the token-token attention maps are converted to word-word attention maps in the following ways (a code sketch follows the list):

  • Attention from a split-up word to another word: take the mean of the attention weights over its tokens.
  • Attention from a word to a split-up word: sum the attention weights over its tokens (so each attention distribution still sums to 1).
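
Here is a minimal sketch of that conversion for a single head’s attention map, given a list mapping each word to its token indices (the variable names are illustrative):

```python
# Sketch: convert a token-token attention map to a word-word attention map.
import numpy as np

def tokens_to_words(attn, word_to_tokens):
    """attn: [num_tokens, num_tokens] array whose rows sum to 1.
    word_to_tokens: entry i is the list of token indices making up word i."""
    num_words = len(word_to_tokens)
    # Attention *to* a split-up word: sum over its tokens (columns).
    to_words = np.zeros((attn.shape[0], num_words))
    for w, toks in enumerate(word_to_tokens):
        to_words[:, w] = attn[:, toks].sum(axis=1)
    # Attention *from* a split-up word: mean over its tokens (rows).
    word_attn = np.zeros((num_words, num_words))
    for w, toks in enumerate(word_to_tokens):
        word_attn[w] = to_words[toks].mean(axis=0)
    return word_attn  # rows still sum to 1

# Example: the third word is split into two tokens (indices 2 and 3).
word_to_tokens = [[0], [1], [2, 3], [4]]
attn = np.full((5, 5), 0.2)        # dummy uniform attention over 5 tokens
print(tokens_to_words(attn, word_to_tokens))
```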

The authors also show that some attention heads specialize in capturing specific dependency relations. The paper visualizes the attention maps of example sentences for head 8-11 (layer 8, head 11, which tracks noun modifiers) and head 9-6 (layer 9, head 6, which tracks prepositions); these maps show the attention weights between all pairs of words in a sentence, with darker lines indicating larger attention weights.

Clustering the Attention Heads

Another important result presented in this paper is that heads within the same layer are often fairly close to each other, meaning heads within a layer have similar attention distributions. To measure this, the distance between every pair of attention heads is computed using the Jensen-Shannon divergence between their attention distributions. To visualize these similarities, each head is then embedded into a two-dimensional space using multidimensional scaling, so that heads with a small average divergence end up close together. It turns out that the attention heads belonging to the same layer are close to each other and form a cluster.

Distance between two heads, where JS is the Jensen-Shannon divergence
BERT attention heads embedded in the 2-D plane.

From the above figure, it is evident that attention heads within the same layer cluster together and have similar attention distributions.
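
A minimal sketch of the head-to-head distance is given below, computed over a single sentence rather than the large corpus used in the paper (same assumed Hugging Face setup as above). The full pairwise distance matrix over all 144 heads could then be fed to, for example, sklearn.manifold.MDS with dissimilarity="precomputed" to obtain the two-dimensional embedding.

```python
# Sketch: Jensen-Shannon-based distance between two attention heads.
import torch
from transformers import BertTokenizer, BertModel

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two attention distributions."""
    m = 0.5 * (p + q)
    kl = lambda a, b: (a * ((a + eps) / (b + eps)).log()).sum(-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("The keys, which were on the table, are gone.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions        # tuple of [1, heads, seq, seq]

# Stack all 144 heads into one tensor: [num_layers * num_heads, seq, seq].
heads = torch.cat([layer_attn[0] for layer_attn in attentions], dim=0)

# Distance between heads i and j: JS divergence summed over attending positions.
i, j = 0, 1                                        # two heads from layer 1, for illustration
distance = js_divergence(heads[i], heads[j]).sum()
print(f"distance(head {i}, head {j}) = {distance:.3f}")
```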

Note: Jain and Wallace (2019) argued that attention often does not explain model predictions, and that attention weights frequently do not correlate with other measures of feature importance. I believe attention is just one of several factors that play an important role in deciding the outcome.

References

  • Kevin Clark, Urvashi Khandelwal, Omer Levy, Christopher D. Manning. “What Does BERT Look At? An Analysis of BERT’s Attention.” BlackboxNLP Workshop at ACL, 2019.
  • Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL, 2019.
  • Sarthak Jain, Byron C. Wallace. “Attention is not Explanation.” NAACL, 2019.
