Good Article Translation and Sharing — Attention Model

Mar 31, 2019

Update: 2019/06/22

This article was recommended by PyTorch Taiwan. The original is in Japanese, and I am going to translate it into Chinese.

Original Article ( Post by Ryobot ): http://deeplearning.hatenablog.com/entry/transformer

Translation:

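Hello everyone, I am Ryobot.
This article covers the Transformer, a neural machine translation model that uses only Attention, with no RNN or CNN. With comparatively little training it achieves overwhelming state-of-the-art results and neatly pays off the title of its paper. We generalize Attention into a simple mathematical expression and classify it into additive Attention, dot-product Attention, source-target Attention, and self-Attention. Among these, self-Attention is a versatile and powerful method that can be reused in other neural networks.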

Currently ranked first in BLEU score on WMT'14

In the figure above, the encoder is on the left and the decoder is on the right. Each is a stack of 6 of the gray blocks (N=6).
Encoder: a stack of 6 blocks, each consisting of [self-Attention, feed-forward network]
Decoder: a stack of 6 blocks, each consisting of [self-Attention with masking, source-target Attention, feed-forward network]
Every block also uses a residual connection and layer normalization.

Before explaining the Transformer in detail, I want to look more closely at what Attention is.

Attention is a dictionary object

In ordinary encoder-decoder Attention, the encoder's hidden layers serve as the Source and the decoder's hidden layers serve as the Target, and it can be written as the following expression:

More generally, the Target is treated as the query, and the Source is split into a Key and a Value:
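The two expressions referred to here can be reconstructed from the description in the paragraphs that follow. The notation below is an assumption (softmax over the inner products, with no scaling at this stage), not necessarily the author's exact typesetting:

```latex
\begin{align}
\mathrm{Attention}(\mathit{Target}, \mathit{Source})
  &= \mathrm{softmax}\left(\mathit{Target} \cdot \mathit{Source}^{\top}\right) \cdot \mathit{Source} \\
\mathrm{Attention}(\mathit{query}, \mathit{Key}, \mathit{Value})
  &= \mathrm{softmax}\left(\mathit{query} \cdot \mathit{Key}^{\top}\right) \cdot \mathit{Value}
\end{align}
```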

From here on, lowercase query, key, value (or q, k, v) denote vectors, and capitalized Query, Key, Value (or Q, K, V) denote arrays.

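Key and Value are both arrays. Each key in Key is paired with the corresponding value in Value, forming "key-value pairs", so together they behave like a dictionary object.
The inner product of the query with Key computes the similarity between the query and each key. Normalizing these similarities with softmax gives the Attention weights, which indicate the positions of the keys that match the query. Taking the inner product of the Attention weights with Value then retrieves a weighted sum of the values at those positions.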

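In other words, Attention looks up the keys that match a query and retrieves the values of those keys, which is what a dictionary object does. For example, ordinary encoder-decoder Attention retrieves, from all of the encoder's hidden layers (Value), the values relevant to the query.
Given a matrix of queries (Query), the same number of values is retrieved from the key-value pair matrices, one per query.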
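The dictionary analogy can be made concrete with a minimal NumPy sketch. The function name `attention_lookup` and the toy keys and values are assumptions chosen for illustration; they do not come from the original article.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_lookup(query, Key, Value):
    """Soft dictionary lookup: query (d,), Key (n, d), Value (n, d_v)."""
    scores = Key @ query              # similarity of the query to every key
    weights = softmax(scores)         # Attention weights, summing to 1
    return weights @ Value, weights   # weighted sum of the values

# Toy "dictionary" with three key-value pairs (illustrative values only).
Key = np.array([[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]])
Value = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
query = np.array([0.0, 1.0])          # most similar to the second key

output, weights = attention_lookup(query, Key, Value)
print(weights.argmax())               # index of the best-matching key
print(output)                         # close to the second value
```

Unlike a hard dictionary lookup, every value contributes to the result; the softmax only makes the best-matching key dominate.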

Why split Memory into Key and Value

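Key-value pair arrays first appeared in End-To-End Memory Network (Sukhbaatar, 2015). There the Key is the Input and the Value is the Output, and the two together are called Memory. At that point they were not yet described as a dictionary object.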

The dictionary-object view first appeared in Key-Value Memory Networks for Directly Reading Documents (Miller, 2016).

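The Key-Value Memory Networks paper describes storing text (for example, knowledge bases and literature) as key-value pairs as a common technique. Splitting Memory into Key and Value gives the model high expressive power for non-trivial transformations between keys and values. A non-trivial transformation here is not the kind that a simple learner mapping an input key to its value could capture; it is a more complex transformation.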

Later, (Daniluk, 2017) applied the same approach to language models.

Additive Attention and Dot-Product Attention

Attention can be divided into additive Attention and dot-product Attention according to how the Attention weights are computed.

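Additive Attention [Bahdanau, 2014] obtains the Attention weights by passing the query and key through a feed-forward network with one hidden layer.
Dot-product Attention, also called multiplicative Attention [Luong, 2015], computes the Attention weights from an inner product. Because an inner product needs no parameters, it is faster to compute and more memory-efficient. This is the variant the Transformer uses.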
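The difference between the two ways of scoring can be sketched in NumPy as follows. The additive form used here, v^T tanh(W_q q + W_k k), is one common parameterization and is an assumption; the names `W_q`, `W_k`, `v` and all sizes are hypothetical and randomly initialized only to show the shapes.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

rng = np.random.default_rng(0)
d, n, d_hidden = 4, 5, 8              # illustrative sizes
query = rng.normal(size=d)            # one query vector
Key = rng.normal(size=(n, d))         # n key vectors

# Dot-product Attention: the score is a plain inner product, no parameters.
dot_weights = softmax(Key @ query)

# Additive Attention: the score comes from one hidden layer with its own
# parameters W_q, W_k and v (hypothetical, randomly initialized here).
W_q = rng.normal(size=(d_hidden, d))
W_k = rng.normal(size=(d_hidden, d))
v = rng.normal(size=d_hidden)
hidden = np.tanh(W_q @ query + Key @ W_k.T)   # shape (n, d_hidden)
add_weights = softmax(hidden @ v)

print(dot_weights.shape, add_weights.shape)
```

Both variants produce one weight per key; only the additive one introduces extra parameters and an extra nonlinearity.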

Source-Target Attention and Self-Attention

Attention can also be divided into source-target Attention and self-Attention according to where the inputs come from.

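Source-Target Attention: Key and Value come from the encoder's hidden layers (Source), and Query comes from the decoder's hidden layers (Target). This is what is usually meant by encoder-decoder Attention. If the Source is regarded as Memory, then Key and Value are the two parts that Memory is split into.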

Self-Attention: Query, Key, and Value all come from the same place (Self). For example, in the encoder, Query, Key, and Value all come from the output of the previous hidden layer.

To compute each output position, self-Attention can refer to every position of the previous hidden layer. This is an advantage over convolution, which can only refer to local positions.

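Traditional Attention models (for example, RNNSearch and MemN2N) accept only one query at a time. If the decoder's queries are given all at once, or if a self-Attention model that accepts several queries at once is used, the model produces as many outputs as there are queries.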
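The two input arrangements, and the fact that a matrix of queries yields one output per query, can be sketched with a single function. This is a simplified illustration that omits the learned projections; the sequence lengths and dimension are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Query, Key, Value):
    # One row of weights per query, one column per key.
    weights = softmax(Query @ Key.T, axis=-1)
    return weights @ Value

rng = np.random.default_rng(0)
source = rng.normal(size=(7, 16))   # encoder hidden states, 7 positions
target = rng.normal(size=(5, 16))   # decoder hidden states, 5 positions

# Source-target Attention: Query from the Target, Key and Value from the Source.
src_tgt_out = attention(target, source, source)

# Self-Attention: Query, Key and Value all come from the same sequence.
self_out = attention(source, source, source)

print(src_tgt_out.shape, self_out.shape)   # one output row per query
```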

Transformer

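The model is relatively simple.
Encoder: a stack of 6 blocks, each consisting of [self-Attention, feed-forward network]
Decoder: a stack of 6 blocks, each consisting of [self-Attention with masking, source-target Attention, feed-forward network]
Features inside the network are represented as a matrix of shape "word sequence length x word dimension". Apart from the Attention layers, each word (each row along axis 0) is processed independently, in the same way as separate examples in a training batch.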

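During training, autoregression is not used: all target words are fed in together and predicted together. To keep information about the target words that are yet to be predicted from leaking into the decoder, its self-Attention is masked (that is, a masked decoder). During evaluation and inference, the word sequence is generated autoregressively.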

Scaled Dot-Product Attention

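The dot-product Attention in the Transformer is called scaled dot-product Attention. Like ordinary dot-product Attention, it retrieves a weighted sum of the values from the key-value pairs. The only difference is that the inner product of Q and K is divided by the scaling factor sqrt(d_k).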

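In addition, the array of queries is packed into a matrix Q so that dot-product Attention is computed for all queries at once (as before, the key and value arrays become matrices K and V). The formula for scaled dot-product Attention is as follows: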
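Written out in the standard notation, with Q, K, V as the matrices described above and d_k the dimension of each key, scaled dot-product Attention is:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```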

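When d_k is small, dot-product Attention performs about the same as additive Attention. When d_k is large, additive Attention performs better than unscaled dot-product Attention, because the inner products grow large in magnitude and push the softmax into regions where its gradient during backpropagation becomes very small. Dividing by the scaling factor sqrt(d_k) counteracts this.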
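The effect of d_k on the softmax can be checked numerically. The sketch below assumes query and key components that are independent with zero mean and unit variance, so that an inner product has variance d_k; the sample count and the example logits are arbitrary choices for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def softmax_jacobian(logits):
    # d softmax_i / d logit_j = p_i * (delta_ij - p_j)
    p = softmax(logits)
    return np.diag(p) - np.outer(p, p)

rng = np.random.default_rng(0)
d_k = 512
q = rng.normal(size=(2000, d_k))     # 2000 sample queries
k = rng.normal(size=(2000, d_k))     # 2000 sample keys
raw_scores = np.sum(q * k, axis=1)   # unscaled inner products
scaled_scores = raw_scores / np.sqrt(d_k)

raw_std = raw_scores.std()           # grows like sqrt(d_k)
scaled_std = scaled_scores.std()     # stays near 1 after scaling

# Large logits saturate the softmax, so its gradient almost vanishes.
logits = np.array([1.0, 2.0, 3.0])
grad_moderate = np.abs(softmax_jacobian(logits)).max()
grad_saturated = np.abs(softmax_jacobian(30.0 * logits)).max()
print(raw_std, scaled_std, grad_moderate, grad_saturated)
```

Keeping the scores at roughly unit scale is what keeps the softmax in the regime where gradients are still usable.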

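To keep information about the words to be predicted from leaking into the decoder, we add a mask to its self-Attention: before the softmax, the entries corresponding to positions that autoregression has not yet predicted are set to negative infinity (-∞), so their Attention weights become 0.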
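A minimal sketch of this masking, assuming the masked entries are set to negative infinity before the softmax (as in the Transformer paper) and, for brevity, using the input X directly as Query, Key and Value without learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(X):
    """X: (seq_len, d_k). Position i may only attend to positions <= i."""
    n, d_k = X.shape
    scores = X @ X.T / np.sqrt(d_k)
    # True strictly above the diagonal marks future positions.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)   # exp(-inf) = 0, so no leakage
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
output, weights = masked_self_attention(X)
print(np.round(weights, 3))   # lower-triangular: future weights are 0
```

The diagonal is never masked, so every row keeps at least one finite score and the softmax stays well defined.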

Multi-head Attention

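Scaled dot-product Attention as described so far is single-head; Multi-head Attention runs several heads in parallel. Because the number of heads (h=8) is traded off against the dimension of each head (d_model/h=64), the total number of parameters stays the same regardless of the number of heads.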

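Here, instead of performing a single Attention with 512-dimensional Q, K, V, we linearly project Q, K, V h times (h=8) into lower-dimensional spaces of 64 dimensions each, using a different learned weight matrix for each projection, and compute dot-product Attention on each projection separately. The outputs of these Attention heads are concatenated and multiplied by the weight matrix W^O, which linearly projects the result back to 512 dimensions. The formula for Multi-head Attention is as follows: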
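In the standard notation, with W_i^Q, W_i^K, W_i^V the per-head projection matrices and W^O the output projection, Multi-head Attention is:

```latex
\begin{align}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O} \\
\mathrm{head}_i &= \mathrm{Attention}\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
\end{align}
```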

Here, the output of every layer has 512 dimensions (d_model), and the projected Q, K, and V of each head have 64 dimensions.

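Experiments show that multi-head Attention performs better than single-head Attention. Each head can attend to a different representation subspace at different positions, whereas a single head averages over them, which suppresses this ability.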
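The following NumPy sketch puts the pieces of Multi-head Attention together using the sizes quoted in the text (d_model=512, h=8, 64 dimensions per head). The weights are random placeholders and biases are omitted; both are simplifying assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # Project into each head's subspace, attend, then concatenate.
    heads = [
        scaled_dot_product_attention(Q @ wq, K @ wk, V @ wv)
        for wq, wk, wv in zip(W_q, W_k, W_v)
    ]
    return np.concatenate(heads, axis=-1) @ W_o   # back to d_model

d_model, h = 512, 8
d_head = d_model // h                             # 64 dimensions per head
rng = np.random.default_rng(0)
shape = (h, d_model, d_head)
W_q, W_k, W_v = (rng.normal(scale=d_model ** -0.5, size=shape) for _ in range(3))
W_o = rng.normal(scale=d_model ** -0.5, size=(d_model, d_model))

X = rng.normal(size=(10, d_model))                # 10 positions
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)
n_params = W_q.size + W_k.size + W_v.size + W_o.size
print(out.shape, n_params)
```

Because each head has dimension d_model/h, the parameter count is 4 * d_model^2 whatever value of h is chosen, which matches the trade-off described above.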

Position-wise Feed-Forward Network, FFN

As its name suggests, the FFN processes each position of the word sequence independently. The formula for the FFN is as follows:
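In the standard notation, with x the representation at one position and W_1, b_1, W_2, b_2 the parameters of the two layers, the FFN is:

```latex
\mathrm{FFN}(x) = \max(0,\; x W_1 + b_1)\, W_2 + b_2
```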

It is a 2-layer fully connected network consisting of a ReLU-activated intermediate layer with dimension 2048 and an output layer with dimension 512.
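A small NumPy sketch of such a position-wise network, using the dimensions given in the text (512 and 2048). The weight values and the sequence length are arbitrary placeholders. The last lines check that a position processed alone gives the same result as when it is processed with the whole sequence.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """X: (seq_len, d_model). Every row is transformed independently."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU layer, (seq_len, d_ff)
    return hidden @ W2 + b2                 # output layer, (seq_len, d_model)

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(scale=0.02, size=(d_ff, d_model))
b2 = np.zeros(d_model)

X = rng.normal(size=(10, d_model))
out = position_wise_ffn(X, W1, b1, W2, b2)

# Position 3 on its own yields the same vector as within the full sequence.
single = position_wise_ffn(X[3:4], W1, b1, W2, b2)
print(out.shape, np.allclose(out[3], single[0]))
```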

Positional encoding

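Because the Transformer uses neither RNN nor CNN, we have to add word-order information, such as the relative or absolute positions of the words, so that it can handle a word sequence. This is done by adding a positional encoding matrix PE element-wise to the embedded input matrix. Each component of PE is given by the following formula: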
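In the Transformer paper, the components are PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the position and i indexes pairs of dimensions. A minimal NumPy sketch under that definition follows; the function name `positional_encoding` and the sequence length are assumptions for illustration, and d_model is assumed to be even.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]            # positions 0..max_len-1
    i = np.arange(d_model // 2)[None, :]         # index of each sin/cos pair
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions
    return pe

d_model = 512
pe = positional_encoding(50, d_model)

# The encoding is added element-wise to the embedded input matrix.
rng = np.random.default_rng(0)
embedded = rng.normal(size=(50, d_model))        # placeholder embeddings
encoder_input = embedded + pe
print(pe.shape, encoder_input.shape)
```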