What does BERT look at?

Viktor · Published in DAIR.AI · May 6, 2020

Attention lies at the core of the rapid improvements achieved within NLP in recent years, and of BERT's success in particular. There has never been a better time to ask why and how this has been possible. What aspects of language are captured by these attention heads, how do they utilize special pre-training tokens, and have these models actually achieved language understanding? These are some of the questions that will be discussed in this paper summary!

Introduction

The rapid success of transformer-based models within the NLP field has spawned a research field unto itself: how are these models able to achieve such high performance on so many diverse tasks?

One way to shed some light on this topic is to examine which linguistic features are encoded in the model after pre-training. Previous work has approached this by studying both the output embeddings and the internal representations. That leaves an interesting component of Transformer-based models untouched: the attention mechanism. The information encoded by this mechanism is what Clark et al. thoroughly examined in their paper What Does BERT Look At? An Analysis of BERT's Attention. In this article, I will try to present an overview of their findings and a discussion of their implications.

The results are divided into three sections, matching how Clark et al. approached their research. First, we will cover surface-level attention patterns to find the average behaviour of heads across each layer. Then, the study focuses on the linguistic features learned by each individual attention head. Finally, we will discuss an attempt to exploit these findings by combining attention heads through so-called probing classifiers.

Surface-level attention

A study of the surface-level attention patterns allowed Clark et al. to figure out what the heads at each layer of a pre-trained BERT model attend to. They approached this from a bird's-eye view to find the general behaviour, not focusing on any head in particular but rather on the average behaviour across each layer.

Relative position

They found it to be rare for a head to attend to the current token itself, while attending to the token directly before or after was much more common. This behaviour was most prominent in the early layers, which indicates local information gathering at this stage of the model.
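As a concrete illustration, here is a minimal sketch (my own, not the authors' code) of how such relative-position statistics can be extracted with the Hugging Face transformers library: for every layer, average the attention each token pays to itself and to its immediate neighbours.

```python
# Sketch: average attention to the current, previous and next token per layer.
# Special tokens ([CLS]/[SEP]) are left in for simplicity.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions   # 12 tensors of shape (1, heads, seq, seq)

for layer, att in enumerate(attentions):
    att = att[0]                                                # (heads, seq, seq)
    to_self = att.diagonal(dim1=-2, dim2=-1).mean()             # attention to the token itself
    to_prev = att.diagonal(offset=-1, dim1=-2, dim2=-1).mean()  # attention to the previous token
    to_next = att.diagonal(offset=1, dim1=-2, dim2=-1).mean()   # attention to the next token
    print(f"layer {layer:2d}  self: {to_self:.3f}  prev: {to_prev:.3f}  next: {to_next:.3f}")
```

The off-diagonals of each attention map (offset ±1) hold exactly the attention paid to the previous and next token, so no explicit loop over positions is needed.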

Special tokens

The study also highlights that many heads direct a large share of their attention, around 50%, to the special [SEP] token. Heads in layers 5 through 10 of the 12-layer model (BERT base) displayed this behaviour.

Average attention for three special tokens across layers of the BERT base model. Source
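Measuring this is straightforward once the attention maps are available. Here is a minimal sketch (my own, not from the paper), assuming a sentence-pair input so that [SEP] tokens are present:

```python
# Sketch: fraction of attention mass that lands on [SEP], per layer.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tokenizer("He bought a new car.", "It was surprisingly cheap.", return_tensors="pt")
sep_mask = inputs["input_ids"][0] == tokenizer.sep_token_id   # positions of the [SEP] tokens

with torch.no_grad():
    attentions = model(**inputs).attentions

for layer, att in enumerate(attentions):
    # attention each token sends to [SEP], averaged over heads and source tokens
    to_sep = att[0][:, :, sep_mask].sum(-1).mean()
    print(f"layer {layer:2d}  attention to [SEP]: {to_sep:.2f}")
```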

One might argue that the [SEP] tokens, which are placed between sentences during pre-training for next-sentence prediction, aggregate sentence-level features. That idea is, however, shown to be incorrect, in part by examining the average gradients of the loss with respect to the attention paid to this special token at each layer. Looking at gradients in this way answers how much the output would be affected by a change in the attention to [SEP]. The authors found that these gradients were vanishingly small for the same layers where so much attention was assigned to this particular token.

Average gradients across layers. Source
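The check itself can be sketched as follows. This is my own reconstruction: it uses the norm of the final hidden states as a stand-in scalar loss (the authors use the masked language modelling loss) and assumes the attention tensors returned by transformers participate in the autograd graph, which holds for the standard BERT implementation.

```python
# Sketch: average gradient magnitude of a scalar loss w.r.t. attention-to-[SEP], per layer.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tokenizer("He bought a new car.", "It was surprisingly cheap.", return_tensors="pt")
sep_mask = inputs["input_ids"][0] == tokenizer.sep_token_id

outputs = model(**inputs)                    # no torch.no_grad(): we need the graph
loss = outputs.last_hidden_state.norm()      # stand-in for the MLM loss used in the paper
grads = torch.autograd.grad(loss, outputs.attentions)

for layer, grad in enumerate(grads):
    grad_to_sep = grad[0][:, :, sep_mask].abs().mean()
    print(f"layer {layer:2d}  |d loss / d attention-to-[SEP]|: {grad_to_sep:.2e}")
```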

Further, heads with specific linguistic behaviour, which will be discussed in the next section, attended to [SEP] (or other delimiter tokens) in cases where their particular linguistic function did not apply. The conclusion we are led to is therefore that attending to [SEP] acts as a no-op, a default when a head has no relevant information to gather.

There are attention heads that do not provide any obvious benefit to the model. This could be an indication of what others have also alluded to: over-parameterization.

It would be interesting to study the link between this no-op behaviour and the total number of parameters in the model, for example by investigating how the phenomenon plays out in a distilled model like TinyBERT, which is forced to be more parameter-efficient by its distillation training.

Broad or narrow attention

Finally, Clark et al. studied the attention distribution of the heads at each layer: do they spread their attention broadly or focus it on a few tokens? To answer that question, the entropy of each head's attention distribution was calculated. Entropy might seem like an odd metric here, but it fits well: it measures how disordered a distribution is. Higher entropy indicates more randomness, or less order, which in our case means that the attention is spread evenly across many tokens; lower entropy means it is concentrated on a few.

Average attention entropy across layers. Source
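For a single sentence, the entropy computation looks roughly like this (a sketch under the same assumptions as the earlier snippets, averaging the Shannon entropy of every token's attention distribution per layer):

```python
# Sketch: average attention entropy per layer, in nats.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_attentions=True).eval()

inputs = tokenizer("The report was published after the markets closed.", return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions

for layer, att in enumerate(attentions):
    probs = att[0].clamp_min(1e-12)             # (heads, seq, seq); avoid log(0)
    entropy = -(probs * probs.log()).sum(-1)    # entropy of each token's attention distribution
    print(f"layer {layer:2d}  average entropy: {entropy.mean():.2f}")
```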

The authors found that the average entropy roughly traces a parabola when plotted as a function of the layer: early layers have broad attention, middle layers are narrower, and the last layers become broad again. This holds for the special [CLS] token too, whose attention was found to be especially broad in the last layer. Intuitively this makes sense, since the sequence representation is aggregated in this token and therefore needs to capture information from the whole input.

Individual attention heads

Having provided an overview of how attention differs between layers and how it behaves for special tokens, let's turn our attention to the linguistic features that individual heads capture.

To evaluate the syntactic capabilities of individual attention heads, Clark et al. use a Wall Street Journal (WSJ) corpus annotated with Stanford Dependencies. This allowed them to evaluate the heads on a multitude of dependency relation types and compare them against simple baselines.

Their findings are quite surprising! Even though no single head was able to perform exceptionally well across all relations, there are individual heads that have mastered one particular relation, often significantly outperforming the baseline! The fascinating realization is that the model reconstructed many of the linguistic relations of the English language purely through the task of masked language modelling.

Examples of linguistic features captured by heads 10 and 11 in layer 8. Source
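To make the evaluation concrete, here is a simplified sketch (my own, not the paper's exact protocol, which is per relation and also considers the opposite attention direction): score one attention head by checking, for every word, whether the word it attends to most is its gold syntactic head.

```python
# Sketch: accuracy of a single attention head at predicting syntactic heads.
import torch

def head_accuracy(attn: torch.Tensor, gold_heads: list[int]) -> float:
    """attn: (seq, seq) attention map of one head for one sentence, with special
    tokens already removed; attn[i, j] is the attention from word i to word j.
    gold_heads[i]: index of word i's syntactic head, or -1 for the root."""
    predictions = attn.argmax(dim=-1)        # most-attended word for each position
    correct = total = 0
    for i, gold in enumerate(gold_heads):
        if gold < 0:                         # the root has no head; skip it
            continue
        correct += int(predictions[i].item() == gold)
        total += 1
    return correct / max(total, 1)
```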

Coreference resolution was also used to evaluate the heads' linguistic capabilities. This task is more challenging because of the longer-range dependencies that naturally occur, which is why many neural models struggle with it compared to syntactic parsing. Clark et al. measured an attention head's capability at this task by how often a coreferent mention attends to its antecedent.

Compared to simple baselines (such as a fixed offset), there exists an attention head that performed significantly better. However, it did not match a rule-based system or models trained specifically for this task.

Example of head attending coreferent mentions to their antecedents. Source

Probing classifiers on combinations of attention heads

Given the previous findings about what individual attention heads are able to learn, it is interesting to see how well they can be combined to solve a full task. This can be achieved by training a probing classifier. Such a model utilizes the internal state of a model, in our case the attention maps, to estimate the probability of one word being the syntactic head of another (a general linguistic relation that underlies many of the Stanford Dependencies previously discussed).

Two probing classifiers were proposed. The first is a simple one that only utilizes the attention values between word i and word j, and between word j and word i. It is, however, naive to think that this information alone, the attention values and which head they come from, is enough to determine the full relation between the two words. Therefore, a second probing classifier was proposed which, in addition to the attention values, also considers word information. This information was encoded through GloVe vectors: the two words' vectors were concatenated and projected with learned parameters, producing word-aware weights that are combined with the attention values from the first model.

(Top) Probing classifier utilizing attention weights between word i and j for each attention head k to estimate the probability of word i being the syntactic head of j. (Bottom) Probing classifier also incorporating word information through GloVe embeddings v for both words.
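Here is a minimal sketch of the attention-only probe, reconstructed from the description above (not the authors' code): each candidate syntactic head i is scored for a dependent j by a learned linear combination of the attention values between the two words across all heads, and a softmax over the candidates turns the scores into probabilities.

```python
# Sketch: attention-only probing classifier for syntactic head prediction.
import torch
import torch.nn as nn

class AttentionOnlyProbe(nn.Module):
    def __init__(self, n_heads_total: int):
        super().__init__()
        # one learned weight per head for each attention direction
        self.w = nn.Parameter(torch.zeros(n_heads_total))   # weights for attention i -> j
        self.u = nn.Parameter(torch.zeros(n_heads_total))   # weights for attention j -> i

    def forward(self, attn: torch.Tensor) -> torch.Tensor:
        # attn: (n_heads_total, seq, seq); attn[k, a, b] = attention from word a to word b in head k
        score_ij = torch.einsum("k,kij->ij", self.w, attn)   # sum_k w_k * alpha_k(i, j)
        score_ji = torch.einsum("k,kji->ij", self.u, attn)   # sum_k u_k * alpha_k(j, i)
        scores = score_ij + score_ji                         # rows: candidate heads i, columns: dependents j
        return scores.log_softmax(dim=0)                     # log p(i is the head of j), normalized over i
```

For BERT base, attn would be the 144 attention maps (12 layers × 12 heads) stacked along the first dimension, and the probe would be trained with a negative log-likelihood loss against the gold head indices; the word-aware variant additionally makes the per-head weights depend on the two words' GloVe vectors.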

Again, these models are compared against simple baselines as well as against models utilizing the output vectors from BERT, to understand the effectiveness of the information encoded in the attention weights. Clark et al. found that their model performed better than the baselines and comparably to models utilizing BERT's output embeddings.

This allows them to draw the conclusion that much of the same information stored in the token embeddings can also be found in the attention maps. It also strengthens the growing body of evidence that indirect supervision from rich pre-training tasks such as masked language modelling can produce models sensitive to a language’s hierarchical structure.

Now what?

Do all their results indicate that transformer models after language modelling pre-training actually understand language? Have our models peaked?

Not really.

There are still areas where a proper understanding of language is far from achieved. A particularly illustrative case was presented by Niven and Kao in their paper Probing Neural Network Comprehension of Natural Language Arguments. They evaluate BERT on Argument Reasoning Comprehension, a task that requires the model to select, from two alternatives, the piece of world knowledge that makes a claim valid. This requires what could be called general intelligence, as the information needed to draw the conclusion does not exist within the provided text.

Their model achieves 71% accuracy (50% would be random guessing), which would have been a SOTA result for this particular dataset if Niven and Kao had not decided to deem their results invalid.

The reason was that the model had learned to exploit spurious statistical patterns in the dataset that had nothing to do with understanding the underlying argument. Ablations revealed that the presence of words such as "not", "do" or "is" was the determining factor in the model's "reasoning", a great example of the Clever Hans effect. Creating a simple adversarial dataset where these cues were no longer informative brought the performance of our beloved BERT back down to random.

So, even though BERT has mastered much of the syntax of our language, there is still a lot left for it to learn! Research within NLP lives to fight another day 💪🏼.

If you found this summary helpful in understanding the broader picture of this particular research paper, please consider reading my other articles! I’ve already written a bunch and more will definitely be added. I think you might find this one interesting 👋🏼🤖
