Do Attention Heads in BERT/RoBERTa Track Syntactic Dependencies?

  1. Maximum Attention Weights (MAX): For each token, we assign a dependency relation to the word with the highest incoming attention weight. We compute accuracy as the fraction of relations, broken down by dependency relation type, for which the source/destination words are correctly identified (a minimal sketch follows this list).
  2. Maximum Spanning Tree (MST): Given the root of the gold dependency parse, we extract a full maximum spanning tree from the attention matrix. We evaluate with the undirected unlabeled attachment score (UUAS), the percentage of correctly recovered edges with respect to the ground-truth parse tree (see the second sketch below).
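
To make MAX concrete, here is a minimal sketch of the per-relation accuracy computation for a single attention head. The function name, the array layout, and the gold_edges structure are illustrative assumptions, not the authors' actual code; because the reported accuracies are undirected, a pair counts as correct in either direction.

```python
import numpy as np

def max_attention_accuracy(attn, gold_edges):
    """MAX-method accuracy for one attention head (a sketch).

    attn: (seq_len, seq_len) matrix where attn[i, j] is the attention
          weight from token i to token j (an assumed layout).
    gold_edges: dict mapping relation type -> list of (dependent, head)
                token-index pairs from the gold parse (also assumed).
    """
    # For each token, the token it attends to most strongly.
    predicted = attn.argmax(axis=-1)
    accuracy = {}
    for rel, edges in gold_edges.items():
        # Undirected: correct if either endpoint picks the other.
        correct = sum(predicted[dep] == head or predicted[head] == dep
                      for dep, head in edges)
        accuracy[rel] = correct / len(edges)
    return accuracy
```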
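And a corresponding sketch for MST. SciPy only ships a minimum spanning tree, so a standard trick, assumed here, is to symmetrize the head's attention matrix and negate it; the gold root only affects a directed tree, so this undirected UUAS computation can ignore it.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_uuas(attn, gold_edges):
    """MST-method UUAS for one attention head (a sketch).

    attn: (seq_len, seq_len) attention matrix for a single head.
    gold_edges: iterable of undirected (i, j) edges from the gold tree.
    (Names and shapes are illustrative assumptions.)
    """
    sym = attn + attn.T                 # undirected edge scores
    np.fill_diagonal(sym, 0.0)          # drop self-loops
    tree = minimum_spanning_tree(-sym)  # negate to get a *maximum* tree
    rows, cols = tree.nonzero()
    pred = {frozenset((int(r), int(c))) for r, c in zip(rows, cols)}
    gold = {frozenset(edge) for edge in gold_edges}
    # UUAS: fraction of gold edges recovered by the extracted tree.
    return len(pred & gold) / len(gold)
```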

Maximum Attention Weight Results

Figure 1: Undirected dependency accuracies by relation type based on the Maximum Attention Weight (MAX) method

Maximum Spanning Tree Results

Figure 2: Maximum UUAS across layers for dependency trees extracted with the maximum spanning tree (MST) method




