Do Attention Heads in BERT/RoBERTa Track Syntactic Dependencies?

  1. Maximum Attention Weights (MAX): For each token, we assign a dependency relation between that token and the word that receives its highest attention weight. We then compute accuracy as the fraction of gold dependency arcs whose source/destination words are correctly identified, reported separately for each dependency relation type (a minimal scoring sketch appears after this list).
  2. Maximum Spanning Tree (MST): Given the root of the gold dependency parse, we extract a full maximum spanning tree from the attention matrix. We evaluate with undirected unlabeled attachment score (UUAS), the percentage of gold parse-tree edges that the extracted tree recovers (see the MST sketch after this list).
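To make the MAX evaluation concrete, here is a minimal sketch of the per-relation scoring, assuming we already have a single head's attention matrix and a gold dependency parse. The function name `max_attention_accuracy`, the attention direction convention, and the root handling are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def max_attention_accuracy(attn, gold_heads, gold_labels):
    """Score one attention head with the MAX method.

    attn:        (seq_len, seq_len) matrix; attn[i, j] is the attention weight
                 from token i to token j (assumed direction; a simplification)
    gold_heads:  gold head index for each token, -1 for the root
    gold_labels: gold dependency relation label for each token's incoming arc
    """
    correct, total = {}, {}
    for i, (head, label) in enumerate(zip(gold_heads, gold_labels)):
        if head < 0:                      # skip the root token
            continue
        pred = int(np.argmax(attn[i]))    # word that token i attends to most
        total[label] = total.get(label, 0) + 1
        if pred == head:
            correct[label] = correct.get(label, 0) + 1
    # per-relation accuracy, as plotted in Figure 1
    return {rel: correct.get(rel, 0) / total[rel] for rel in total}
```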
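And a minimal sketch of the MST extraction and UUAS scoring, assuming strictly positive attention weights and using SciPy's `minimum_spanning_tree` on the negated, symmetrized matrix to obtain a maximum spanning tree; the paper additionally fixes the gold root, which this sketch omits:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_uuas(attn, gold_edges):
    """Score one attention head with the MST method (UUAS).

    attn:       (seq_len, seq_len) attention matrix for a single head
    gold_edges: set of undirected gold dependency edges, e.g. {(0, 2), (2, 3)}
    """
    sym = (attn + attn.T) / 2.0          # symmetrize: undirected edge weights
    # SciPy computes a *minimum* spanning tree; negating the (strictly
    # positive) weights turns it into a maximum spanning tree.
    mst = minimum_spanning_tree(-sym).toarray()
    pred_edges = {tuple(sorted(map(int, edge))) for edge in zip(*np.nonzero(mst))}
    gold = {tuple(sorted(edge)) for edge in gold_edges}
    # UUAS: fraction of gold edges recovered by the extracted tree
    return len(pred_edges & gold) / len(gold)
```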

Maximum Attention Weight Results

Figure 1: Undirected dependency accuracies by relation type, based on the Maximum Attention Weight (MAX) method

Maximum Spanning Tree Results

Figure 2: Maximum UUAS across layers for dependency trees extracted with the maximum spanning tree (MST) method

Takeaways
