Do Attention Heads in BERT/RoBERTa Track Syntactic Dependencies?

tl;dr: The attention weights between tokens in BERT/RoBERTa bear similarity to some syntactic dependency relations, but the results are less conclusive than we’d like as they don’t significantly outperform linguistically uninformed baselines for all types of dependency relations.

The strong performance of BERT/RoBERTa on a broad range of NLP tasks is well-established, and various works have investigated the cause of these good results. One promising approach has been to study the structure of attention weights in the self-attention layers of BERT/RoBERTa and map it to the syntactic structure of text. In our work, we are specifically interested in how the Transformer attention weights capture dependency relations between words in a sentence. To do so, we extract implicit dependency relations from the attention matrices (methods detailed below) and compare them to the gold Universal Dependencies parses from the CoNLL 2017 shared task.

Complicating this analysis is the fact that Transformer models contain multiple layers of multiple self-attention heads, making the extraction of meaningful dependency relations from attention weights non-trivial. To simplify the problem, we focus on extracting relations on a per-layer, per-head basis.

For each layer/head, we apply two different relation extraction methods:

  1. Maximum Attention Weights (MAX): For each word, we assign a dependency relation based on the word with the highest incoming attention weight. We compute the accuracy based on correctly identifying the source/destination words for each dependency relation type.
  2. Maximum Spanning Tree (MST): Given the root of the dependency parse, we compute a full maximum spanning tree based on the attention matrix. We evaluate using the undirected unlabeled attachment score (UUAS), which is the percentage of relations in the ground-truth parse tree that are correctly recovered.
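
As a rough illustration of the MAX method, here is a minimal sketch for a single head and a single sentence. It rests on several assumptions that are not spelled out above: the attention matrix is already at the word level, `attn[i][j]` is the weight word `i` places on word `j`, self-attention is ignored, and an arc counts as correct regardless of direction. The function names and the toy data are hypothetical.

```python
from collections import defaultdict

def max_attention_arcs(attn):
    """For each word i, link it to the other word j with the largest attn[i][j]."""
    arcs = []
    for i, row in enumerate(attn):
        # Assumption: a word is never linked to itself.
        others = [j for j in range(len(row)) if j != i]
        if others:
            arcs.append((i, max(others, key=lambda j: row[j])))
    return arcs

def accuracy_by_relation(attn, gold_arcs):
    """Undirected accuracy per relation label for one head on one sentence.

    gold_arcs is a list of (head_index, dependent_index, label) triples.
    """
    predicted = {frozenset(arc) for arc in max_attention_arcs(attn)}
    correct, total = defaultdict(int), defaultdict(int)
    for head, dep, label in gold_arcs:
        total[label] += 1
        if frozenset((head, dep)) in predicted:
            correct[label] += 1
    return {label: correct[label] / total[label] for label in total}

# Toy example: "the dog barks loudly" with made-up attention weights.
toy_attn = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
toy_gold = [(1, 0, "det"), (2, 1, "nsubj"), (2, 3, "advmod")]
toy_scores = accuracy_by_relation(toy_attn, toy_gold)
```

In practice the counts would be accumulated over a whole corpus before dividing, and the procedure would be repeated for every layer/head pair.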

The former investigates whether there is some head in a given BERT model that corresponds to a specific dependency relation, whereas the latter investigates whether the BERT attention heads form complete, syntactically informative parse trees. Note that the latter method requires the root of the gold dependency parse tree to be provided.
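
To make the tree-based evaluation concrete, here is a simplified sketch. It is not necessarily the algorithm used in the paper: as a stand-in, it symmetrizes the attention matrix by summing the two directions and grows an undirected maximum spanning tree with Prim's algorithm, using the gold root only as the starting node. Since UUAS ignores direction, this is enough to show how the score is computed. All names and the toy data are hypothetical.

```python
def max_spanning_tree(attn, root):
    """Undirected maximum spanning tree over words, grown from `root` (Prim's algorithm)."""
    n = len(attn)
    # Assumption: combine the two attention directions by summing them.
    weight = [[attn[i][j] + attn[j][i] for j in range(n)] for i in range(n)]
    in_tree = {root}
    edges = set()
    while len(in_tree) < n:
        # Pick the heaviest edge that connects the tree to a word outside it.
        i, j = max(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda edge: weight[edge[0]][edge[1]],
        )
        edges.add(frozenset((i, j)))
        in_tree.add(j)
    return edges

def uuas(predicted_edges, gold_arcs):
    """Fraction of gold (head, dependent) arcs recovered, ignoring direction and labels."""
    gold = {frozenset(arc) for arc in gold_arcs}
    if not gold:
        return 0.0
    return len(predicted_edges & gold) / len(gold)

# Toy example: "the dog barks loudly", root is "barks" (index 2).
toy_attn = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.1, 0.6, 0.1],
    [0.5, 0.2, 0.2, 0.1],
    [0.6, 0.2, 0.1, 0.1],
]
toy_gold = [(1, 0), (2, 1), (2, 3)]
toy_tree = max_spanning_tree(toy_attn, root=2)
toy_uuas = uuas(toy_tree, toy_gold)
```

A directed variant would instead need a maximum spanning arborescence algorithm rooted at the gold root; the undirected version is used here only to keep the sketch short.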

In both cases, we report the highest score for each metric across all layers/heads. The goal of this exercise is to determine whether there is some structure in the attention weights of BERT models that maps to our understanding of syntactic parsing. To that end, we compare the results of each method to two baselines. For MAX, we apply a fixed-position-offset baseline for each dependency relation, based on the most common relative position of a dependency arc for that relation (for example, amod tends to occur before a noun phrase). For MST, we use a right-branching tree baseline. We also apply the same methods to a BERT-large model with randomly initialized weights as an additional baseline for both.
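
The two linguistically uninformed baselines can be sketched as follows. This is an illustration under assumptions: the offset is defined as head position minus dependent position, it is estimated from a separate set of reference arcs, and the right-branching tree is represented by its undirected edges, which simply link each word to the next one. The function names and toy arcs are hypothetical.

```python
from collections import Counter

def most_common_offset(reference_arcs):
    """Most frequent (head - dependent) offset among arcs of a single relation."""
    offsets = Counter(head - dep for head, dep in reference_arcs)
    return offsets.most_common(1)[0][0]

def offset_baseline_accuracy(reference_arcs, eval_arcs):
    """Accuracy of always predicting the head at the most common offset."""
    offset = most_common_offset(reference_arcs)
    hits = sum(1 for head, dep in eval_arcs if dep + offset == head)
    return hits / len(eval_arcs)

def right_branching_edges(num_words):
    """Undirected edges of a right-branching tree: every word attaches to its neighbor."""
    return {frozenset((i, i + 1)) for i in range(num_words - 1)}

# Toy amod arcs as (head, dependent) positions: the noun usually follows its adjective.
toy_reference = [(2, 1), (5, 4), (3, 1)]
toy_eval = [(1, 0), (4, 2)]
toy_offset = most_common_offset(toy_reference)
toy_accuracy = offset_baseline_accuracy(toy_reference, toy_eval)
toy_chain = right_branching_edges(4)
```

Neither baseline looks at the words themselves, which is what makes them useful reference points for the attention-based scores.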

We also investigate how fine-tuning on downstream tasks may influence the resulting extracted relations. We fine-tune the BERT-large model on two tasks separately: CoLA, a linguistic acceptability judgment task that is more syntax-oriented, and MNLI, an entailment task that is more semantically oriented.

Maximum Attention Weight Results

Figure 1: Undirected dependency accuracies by type based on the Maximum Attention Weight (MAX)

All pretrained models outperform the random BERT baseline, but do not consistently outperform the fixed-position baseline: In Figure 1 above, we show the accuracies of the extracted dependencies relative to the gold parses for both extraction methods for a subset of dependency relations. In general, the pretrained models far outperform the random BERT baseline. However, the margin between the pretrained models and the positional-offset baseline is much tighter. For some dependency relations, like nsubj and obj, the pretrained models still outperform the positional-offset baseline, but for others, such as advmod and amod, the performance gap is much smaller. Nevertheless, the successful identification of certain heads that appear to capture specific dependency relations echoes the finding of Clark et al., 2019 that BERT models contain highly specialized heads.

Fine-tuned BERT models perform fairly similarly to vanilla BERT, with mild evidence for MNLI improving long-distance dependencies: The difference between the vanilla BERT model and BERT fine-tuned on the CoLA/MNLI tasks is relatively small, as shown in the same figure. One pattern we observe is that the vanilla BERT and CoLA-BERT models generally outperform the MNLI-BERT model, except on advcl and csubj, which are long-distance dependencies. This suggests that fine-tuning on a semantics-oriented task could strengthen the model’s sensitivity to long-distance dependencies, with the caveat that the delta in our results is still fairly small.

Maximum Spanning Tree Results

Figure 2: Maximum UUAS across layers of dependency trees extracted based on the maximum spanning tree (MST) method

All pretrained models again outperform the random baseline, but only barely outperform the simple right-branching baseline: In Figure 2 above, we compare the trees generated from our MST method (given the gold root) with the gold trees from the UD dataset. As before, we find that BERT soundly outperforms the random BERT baseline, but does only slightly better than the simple baseline of a right-branching tree. Likewise, we do not find meaningful variation in performance when applying the method across the different layers of the BERT-style models. Given that the MST method uses the root of the gold trees, whereas the right-branching baseline does not, this should be interpreted as a somewhat negative result for this method.


In the case of MAX, our results indicate that specific heads in the BERT models may correspond to certain dependency relations, whereas for MST, we find much less support for “generalist” heads whose attention weights correspond to a full syntactic dependency structure.

In both cases, the metrics do not appear to be representative of the extent of linguistic knowledge learned by the BERT models, judging by their strong performance on many NLP tasks. Hence, our takeaway is that while we can tease out some structure from the attention weights of BERT models using the above methods, studying the attention weights alone is unlikely to give us the full picture of BERT’s strength in processing natural language.

Read our paper here for more details!

PhD Candidate at NYU Data Science
