Paper Review | Multi-Head Attention: Collaborate Instead of Concatenate

Tsung-Yi, Kao

Published in IM日記, Jul 8, 2020

This paper was released in June 2020, and its code is public: https://github.com/epfml/collaborative-attention

Introduction
Multi-Head Attention
Improving the Multi-Head Mechanism
Experiments
Conclusion

1. Introduction

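Since the introduction of attention and the Transformer, Multi-Head Attention (MHA) has become a very important building block. Yet our understanding of how the Transformer actually works remains very limited.

In practice, the MHA module is driven largely by heuristics, and it needs quite different hyperparameter settings for each task. A number of studies have shown that not all heads are equally informative and that some heads can be removed without hurting accuracy, e.g. Voita et al. (2019) and Michel et al. (2019). On the other hand, Cordonnier et al. (2020) show that having multiple heads matters for self-attention to behave like a convolution.

This paper examines whether some of the heads learn redundant information. The authors find that the key/query projections of different heads overlap in part of their dimensions, so when all heads are concatenated, nearly the same patterns are captured repeatedly.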

Contributions:

Introducing the collaborative multi-head attention layer
Re-parametrizing pre-trained models into a collaborative form renders them more efficient
Side contribution: identify a discrepancy between the theory and implementation of attention layers and show that by correctly modeling the biases of key and query layers, we can clearly differentiate between context and content-based attention.

2. Multi-Head Attention

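This section reviews the MHA mechanism and the notions of content and context within it.

First, attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q,\ K = YW_K,\ V = YW_V$$

If X and Y in the formula are the same, attention is applied within a single sequence; this is the self-attention of "Attention is all you need".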

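In practice, however, attention layers include biases, so the product of Q (query) and K (key) actually carries bias terms:

$$(XW_Q + \mathbf{1}b_Q^\top)(YW_K + \mathbf{1}b_K^\top)^\top = \underbrace{XW_QW_K^\top Y^\top}_{\text{context}} + \underbrace{\mathbf{1}b_Q^\top W_K^\top Y^\top}_{\text{content}} + XW_Qb_K\mathbf{1}^\top + \mathbf{1}b_Q^\top b_K\mathbf{1}^\top \qquad (3)$$

Note that the last two terms of Eq. (3) are constant across all entries of the same row, and the row-wise softmax is unaffected by a constant added to every entry of a row. The context term of Eq. (3) is the attention over all key and query pairs, while the content term is the attention driven by the key content alone. It follows that b_K (the key bias) contributes no useful information and can be dropped.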
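The claim about the key bias can be checked numerically. The snippet below is a minimal NumPy sketch with made-up toy sizes and random weights (none of these values come from the paper): it builds the biased scores, keeps only the context and content terms, and confirms that the dropped terms are constant within each row so the softmax output does not change.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax (each row sums to 1)."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, D_in, d_k = 5, 8, 4             # hypothetical toy sizes
X = rng.normal(size=(T, D_in))
Y = X                               # self-attention: X and Y coincide
W_Q = rng.normal(size=(D_in, d_k))
W_K = rng.normal(size=(D_in, d_k))
b_Q = rng.normal(size=d_k)
b_K = rng.normal(size=d_k)

# Scores with both biases (left-hand side of the expansion).
full = (X @ W_Q + b_Q) @ (Y @ W_K + b_K).T

# The two terms that matter after the softmax.
context = X @ W_Q @ W_K.T @ Y.T                 # query-key interaction
content = np.tile(b_Q @ W_K.T @ Y.T, (T, 1))    # depends on the key only
reduced = context + content

# The dropped terms are constant within each row ...
residual = full - reduced
row_constant = np.allclose(residual, residual[:, :1])
# ... so the attention probabilities are unchanged.
same_probs = np.allclose(softmax(full), softmax(reduced))
print(row_constant, same_probs)
```

Both checks should come out true, since every dropped term is a per-row constant and the softmax is invariant to such shifts.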

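Finally, Multi-Head Attention:

$$\mathrm{MultiHead}(X, Y) = \mathrm{concat}_{i \in [N_h]}\big[H^{(i)}\big]\,W_O, \qquad H^{(i)} = \mathrm{Attention}(XW_Q^{(i)},\ YW_K^{(i)},\ YW_V^{(i)})$$

In other words, the outputs of the N heads are concatenated and then projected by W_O.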
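As a concrete reference point, here is a minimal NumPy sketch of concatenation-based MHA. The sizes and random weights are placeholders of my own, biases are left out, and the function names are illustrative rather than taken from any library.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax (each row sums to 1)."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def multi_head_attention(X, Y, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V hold one projection per head along the first axis."""
    heads = [attention(X @ wq, Y @ wk, Y @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    # Concatenate the head outputs, then mix them with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
T, D_in, n_heads, d_k = 6, 16, 4, 4   # hypothetical toy sizes
X = rng.normal(size=(T, D_in))
W_Q = rng.normal(size=(n_heads, D_in, d_k))
W_K = rng.normal(size=(n_heads, D_in, d_k))
W_V = rng.normal(size=(n_heads, D_in, d_k))
W_O = rng.normal(size=(n_heads * d_k, D_in))

out = multi_head_attention(X, X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (6, 16)
```

Each head here owns its private key and query projection, which is exactly the part the paper goes on to question.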

That concludes this brief review of attention. For more detail, see the paper "Attention is all you need".

3. Improving the Multi-Head Mechanism

As the multiple heads are inherently solving similar tasks, they can collaborate instead of being independent.

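With this idea, the authors argue that since the heads are all solving similar problems, they should cooperate.

Their hypothesis is that some heads learn similar features, for example all attending to the verb of a sentence, or all picking up certain dimensions of the positional encoding.

How much do heads have in common? This is the question the authors ask.

Looking only at the similarity between query matrices or between key matrices is not enough. Suppose different heads attend to the same thing but have orthogonal column-spaces: a similarity measure would then score them as dissimilar even though the two heads actually attend to the same thing.

The authors illustrate why comparing query or key matrices alone is insufficient as follows: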

To illustrate this issue, consider the case where two heads are computing the same key/query representations up to a rotation matrix R.

Even though the two heads are computing identical attention scores, they can have orthogonal column-spaces.

So we need to look at the product of the key and query matrices!
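A small numerical experiment makes the point tangible. In the sketch below (toy sizes and random weights of my own choosing), the second head is the first one rotated by an orthogonal matrix R, i.e. W_Q2 = W_Q1 R and W_K2 = W_K1 R in my notation. A direct column-by-column comparison of the query matrices makes the heads look different, while their key/query products coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, d_k = 16, 4                      # hypothetical toy sizes
W_Q1 = rng.normal(size=(D_in, d_k))
W_K1 = rng.normal(size=(D_in, d_k))

# Random orthogonal matrix R, obtained from a QR factorization.
R, _ = np.linalg.qr(rng.normal(size=(d_k, d_k)))
W_Q2 = W_Q1 @ R
W_K2 = W_K1 @ R

def column_cosine(A, B):
    """Cosine similarity between matching columns of A and B."""
    A = A / np.linalg.norm(A, axis=0)
    B = B / np.linalg.norm(B, axis=0)
    return np.sum(A * B, axis=0)

# Comparing the query matrices directly suggests the heads differ ...
query_similarity = column_cosine(W_Q1, W_Q2)
matrices_equal = np.allclose(W_Q1, W_Q2)
# ... yet R @ R.T is the identity, so the products are the same.
same_product = np.allclose(W_Q1 @ W_K1.T, W_Q2 @ W_K2.T)
print(query_similarity.round(2), matrices_equal, same_product)
```

Because the attention scores only ever depend on the product, any redundancy analysis has to be carried out on that product rather than on the two factors separately.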

Figure 1 shows that roughly 1/3 of the dimensions are enough to capture almost all of the energy of W_Q W_K^T, which indicates that current multi-head attention is inefficient.

Figure 1 shows the captured energy by the principal components of the key, query matrices and their product.
The eigenvalues represent the distribution of the source data’s energy among each of the eigenvectors, where the eigenvectors form a basis for the data.
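The kind of measurement behind Figure 1 can be sketched with a singular value decomposition: the squared singular values play the role of the energy carried by each principal component. The example below is only an illustration under my own assumptions. The heads are synthetic and deliberately built from a shared basis of r directions, and the paper's exact preprocessing (for example centering) may differ.

```python
import numpy as np

def captured_energy(M):
    """Cumulative share of squared singular values (PCA-style energy)."""
    s = np.linalg.svd(M, compute_uv=False)
    return np.cumsum(s ** 2) / np.sum(s ** 2)

def components_needed(M, threshold=0.99):
    """Smallest number of components whose energy reaches the threshold."""
    energy = captured_energy(M)
    return int(np.searchsorted(energy, threshold) + 1)

rng = np.random.default_rng(0)
D_in, n_heads, d_k, r = 32, 4, 8, 10   # hypothetical toy sizes

# Synthetic redundancy: every head is a mix of the same r directions.
B_Q = rng.normal(size=(D_in, r))
B_K = rng.normal(size=(D_in, r))
W_Q = np.concatenate(
    [B_Q @ rng.normal(size=(r, d_k)) for _ in range(n_heads)], axis=1)
W_K = np.concatenate(
    [B_K @ rng.normal(size=(r, d_k)) for _ in range(n_heads)], axis=1)
product = W_Q @ W_K.T                  # D_in x D_in, summed over heads

for name, M in [("W_Q", W_Q), ("W_K", W_K), ("W_Q W_K^T", product)]:
    print(name, components_needed(M), "of", M.shape[1])
```

On real pre-trained weights one would load the per-head projections of a layer in place of the synthetic matrices and compare the three curves, as the figure does.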

Next comes the core of the paper:

Collaborative Multi-Head Attention

Following the observation that heads’ key/query projections learn redundant projections, we propose to learn key/query projections for all heads at once and to let each head use a re-weighting of these projections.

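The authors' view: the key and query projections should be learned once for all heads, with each head then applying its own re-weighting of these shared projections.

$$H^{(i)} = \mathrm{Attention}\big(X\tilde{W}_Q\,\mathrm{diag}(m_i),\ Y\tilde{W}_K,\ YW_V^{(i)}\big) \qquad (7)$$

As Eq. (7) shows, instead of giving every head its own copy of the key and query projections, the heads share them and each head learns a mixing vector m_i.

This approach has two benefits:

Because the mixing vectors are learned, a head can use more or fewer dimensions, which gives each head more expressive power.
Because the projections are shared between heads, they are stored and learned only once, which makes the parametrization more efficient.

Figure 2 shows that once the mixing vectors m_i are added, the previously fixed d_k can effectively become larger or smaller, giving each head a richer representation.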
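To make the shared projections and mixing vectors concrete, here is a minimal NumPy sketch of a collaborative layer. The sizes, the random weights, and the `scale` argument are my own assumptions (which normalization constant to use in the collaborative case is left open here), and the parameter count covers only the key/query part under my own counting.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax (each row sums to 1)."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def collaborative_head(X, Y, W_Q_shared, W_K_shared, m, W_V, scale=1.0):
    """One head: shared key/query projections re-weighted by the vector m."""
    q = (X @ W_Q_shared) * m       # same as X @ W_Q_shared @ np.diag(m)
    k = Y @ W_K_shared
    return softmax(q @ k.T / scale) @ (Y @ W_V)

rng = np.random.default_rng(0)
T, D_in, n_heads, d_k = 6, 16, 4, 4    # hypothetical toy sizes
D_k_shared = 8                          # smaller than n_heads * d_k = 16
X = rng.normal(size=(T, D_in))
W_Q_shared = rng.normal(size=(D_in, D_k_shared))
W_K_shared = rng.normal(size=(D_in, D_k_shared))
M = rng.normal(size=(n_heads, D_k_shared))   # row i is mixing vector m_i
W_V = rng.normal(size=(n_heads, D_in, d_k))  # values stay per-head

heads = [collaborative_head(X, X, W_Q_shared, W_K_shared, M[i], W_V[i])
         for i in range(n_heads)]
out = np.concatenate(heads, axis=-1)

# Key/query parameters only: per-head projections vs shared ones plus M.
concat_params = 2 * D_in * n_heads * d_k
collab_params = 2 * D_in * D_k_shared + n_heads * D_k_shared
print(out.shape, concat_params, collab_params)  # (6, 16) 512 288
```

Setting a coordinate of m_i to zero switches that shared dimension off for head i, which is how a head can end up using more or fewer dimensions than a fixed d_k.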

Besides learning the mixing vectors from scratch, there is also a way to apply collaborative attention directly to an existing pre-trained attention layer:

Head Collaboration as Tensor Decomposition

There is a simple way to convert any standard attention layer to collaborative attention without retraining. -> Tucker tensor decomposition (Tucker, 1966)

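Illustration of the Tucker tensor decomposition:

Put simply, a tensor is decomposed into the product of a core tensor and three matrices:

The paper uses a variant of Tucker decomposition, the CP decomposition, which is a special case of Tucker: if the core tensor is diagonal and P = Q = R, Tucker decomposition reduces to CP decomposition (Eq. 11):

The authors use CP decomposition to factor the product of the query and key matrices into three matrices:

M is the mixing matrix proposed by the authors, and the other two matrices are the key and query projection matrices.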
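To show what the three factors look like in practice, the sketch below stacks the per-head products W_Q^(i) W_K^(i)^T into a tensor of shape (heads, D_in, D_in) and fits a CP model with a hand-written alternating least squares loop. This is not the authors' implementation: the function `cp_als`, the toy sizes, the chosen rank, and the random weights are all my own, and a real setup would normally rely on a tensor-decomposition library.

```python
import numpy as np

def cp_als(T, rank, n_iter=50, seed=0):
    """Fit T[h, a, b] ~ sum_r M[h, r] * Q[a, r] * K[b, r] by ALS."""
    rng = np.random.default_rng(seed)
    n_h, d_a, d_b = T.shape
    M = rng.normal(size=(n_h, rank))
    Q = rng.normal(size=(d_a, rank))
    K = rng.normal(size=(d_b, rank))
    errors = []
    for _ in range(n_iter):
        # Each update is the least-squares solution for one factor
        # while the other two are held fixed.
        M = np.einsum('hab,ar,br->hr', T, Q, K) @ np.linalg.pinv((Q.T @ Q) * (K.T @ K))
        Q = np.einsum('hab,hr,br->ar', T, M, K) @ np.linalg.pinv((M.T @ M) * (K.T @ K))
        K = np.einsum('hab,hr,ar->br', T, M, Q) @ np.linalg.pinv((M.T @ M) * (Q.T @ Q))
        approx = np.einsum('hr,ar,br->hab', M, Q, K)
        errors.append(np.linalg.norm(T - approx) / np.linalg.norm(T))
    return M, Q, K, errors

rng = np.random.default_rng(1)
n_heads, D_in, d_k = 4, 12, 3          # hypothetical toy sizes
W_Q = rng.normal(size=(n_heads, D_in, d_k))
W_K = rng.normal(size=(n_heads, D_in, d_k))

# Stack the per-head key/query products into one 3-way tensor.
W_QK = np.stack([W_Q[i] @ W_K[i].T for i in range(n_heads)])

rank = 8                               # compressed from n_heads * d_k = 12
M, Q_shared, K_shared, errors = cp_als(W_QK, rank)
print(M.shape, Q_shared.shape, K_shared.shape)
print("relative error: first sweep", errors[0], "last sweep", errors[-1])
```

In this layout M plays the role of the mixing matrix (one row per head) and the other two factors act as the shared query and key projections. With random, non-redundant toy weights the approximation error stays noticeable; the paper's argument is that trained heads are redundant enough for a much smaller rank to work.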

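The bias term discussed with Eq. (3), on the other hand, can be handled simply:

By storing the already-trained parameters of each head, the bias term can be recovered!

The complete re-parametrization of the i-th head then becomes the following formula:

This re-parametrization can be applied directly to the attention layers of existing pre-trained Transformer architectures such as BERT, without pre-training again.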
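The sketch below checks the re-parametrization in the simplest lossless setting, where the shared dimension equals the number of heads times d_k, so no compression is involved. It assumes, in my own notation, that the content bias of head i is stored as a vector v_i = W_K^(i) b_Q^(i), which follows from the content term of Eq. (3). All sizes and weights are random placeholders, and the softmax scaling constant is omitted on both sides.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax (each row sums to 1)."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, D_in, n_heads, d_k = 5, 12, 3, 4    # hypothetical toy sizes
X = rng.normal(size=(T, D_in))
Y = X
W_Q = rng.normal(size=(n_heads, D_in, d_k))
W_K = rng.normal(size=(n_heads, D_in, d_k))
b_Q = rng.normal(size=(n_heads, d_k))
b_K = rng.normal(size=(n_heads, d_k))

# Lossless collaborative form: concatenate the per-head projections and
# let mixing vector m_i select the block that belongs to head i.
W_Q_shared = np.concatenate(list(W_Q), axis=1)    # (D_in, n_heads * d_k)
W_K_shared = np.concatenate(list(W_K), axis=1)
M = np.kron(np.eye(n_heads), np.ones(d_k))        # (n_heads, n_heads * d_k)
# Content bias folded into one vector per head (assumed form of v_i).
v = np.stack([W_K[i] @ b_Q[i] for i in range(n_heads)])

def original_probs(i):
    """Attention probabilities of the original head, with both biases."""
    return softmax((X @ W_Q[i] + b_Q[i]) @ (Y @ W_K[i] + b_K[i]).T)

def reparam_probs(i):
    """Same head in collaborative form; the key bias b_K is dropped."""
    scores = ((X @ W_Q_shared) * M[i]) @ (Y @ W_K_shared).T + Y @ v[i]
    return softmax(scores)

all_match = all(np.allclose(original_probs(i), reparam_probs(i))
                for i in range(n_heads))
print(all_match)
```

Compression enters only when the concatenated projections and block-selecting M are replaced by lower-rank factors from a tensor decomposition, at which point the match becomes approximate.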

Parameter and Computation Efficiency

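In Table 1, "train" means the mixing vectors are trained from scratch, and "re-param" means the method is applied to an existing model.

FLOPS stands for floating-point operations per second (also called peak speed per second). It is used to estimate computing performance, especially in scientific computing that relies heavily on floating-point arithmetic.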

4. Experiments

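The paper runs two experiments. The first tests whether the proposed collaborative MHA can directly replace concatenation-based MHA, using NMT (a translation task). The second re-parametrizes an existing pre-trained model such as BERT and evaluates it on Natural Language Understanding (NLU) tasks.

Experiment 1:

An encoder-decoder transformer is used, with collaborative MHA in place of concatenation-based MHA, on the WMT 2017 English-to-German translation task.

The results show that shrinking the D_k dimension by a factor of 8 still keeps the same BLEU score, and shrinking it further lowers BLEU by no more than 1.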

實驗二：

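The experiment uses the GLUE tasks (Wang et al., 2018), and each GLUE task goes through three steps:

First, take a pre-trained transformer and fine-tune it.
Second, replace the attention layers with the proposed collaborative MHA, using tensor decomposition to compute ~WQ, ~WK and M, and re-parametrize the biases into v (Eq. 12).
Finally, fine-tune the model once more, after which its performance can be evaluated.

Figure 4 shows the running time of the second step above. Table 2 shows that when D_k is reduced to 1/3, model performance drops by no more than 1.5%.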

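Additional experiment:

The authors wanted to check whether the fine-tuning in step three of Experiment 2 is really necessary.

The results show that when the compression is below 1/3, fine-tuning is hardly needed. Compressing further does affect accuracy, but up to a compression of 2/3 another round of fine-tuning can bring accuracy back to its original level.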

5. Conclusion

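The paper shows that the original concatenation-based MHA learns highly redundant key/query representations.
It proposes collaborative MHA, which can replace the original concatenation-based MHA.
With encoder-decoder transformers on the NMT task, collaborative MHA reduces the size of each head from 64 to 8 without hurting accuracy.
It can be applied directly to pre-trained transformers.
The code for the paper is public: https://github.com/epfml/collaborative-attention
It offers a way to make pre-training faster and improves the interpretability of the attention mechanism.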

Reference

arXiv page for the paper: https://arxiv.org/abs/2006.16362

Public code: https://github.com/epfml/collaborative-attention