Review: Attention is all you need

Guan

Published in

工人智慧

12 min read · Oct 9, 2020

Attention Is All You Need

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an…

arxiv.org

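This paper is where the Transformer began. When it was published in 2017, the stir it caused was probably comparable to that of ResNet in CV. Looking back today, the Transformer seems to have gone even further: its successors such as BERT and GPT greatly expanded what people imagine is possible for sequence tasks, and it is hard to talk about modern NLP without mentioning the Transformer. So today, let's start from one of the Transformer's sources, Attention is all you need.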

Introduction

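The most serious weakness of the LSTM-based models that preceded the Transformer is that they cannot be computed in parallel. Because the state at time t is a function of the state at time t-1, an LSTM is inherently sequential and hard to parallelize, so training time becomes unacceptable for large LSTM models, large batch sizes, or long sequences.

Beyond that, comparing the Transformer with LSTM-based models, the distance between positions in the input and output sequences matters a great deal for the latter: the further apart two elements are, the harder it is for an LSTM to connect them. The Attention mechanism of the former is designed to learn global dependencies as fully as possible, so it performs well even when the relevant tokens in the input and output are very far apart.

Both points are explained in detail in later sections.

In addition, NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE, the earlier work that proposed the Attention Model, pointed out a shortcoming of the RNN Encoder-Decoder architecture (Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation): its hidden layer produces only a single fixed-length context vector for an entire input sequence. Attention instead computes a separate context vector for each token being generated, as a weighted combination of all encoder states, so the information captured by the Encoder is better preserved. This paper naturally inherits that advantage.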

Model Architecture

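The Transformer consists of an Encoder and a Decoder. The two behave very differently and are best understood separately: the former compresses the information in the sequence, and the latter converts (decompresses) what the former extracted into the information the task requires.

A single Encoder Layer mainly performs the following operations:

Multi-Head Attention

The input is a matrix of shape (batch_size, seq_len, num_feature); for illustration, assume (8, 64, 512).

First, multiply the input by n_head sets of W^Q, W^K, W^V, each of shape (512, 512 / n_head) and shared across the batch. This gives n_head sets of Q, K, V, each of shape (8, 64, 512 / n_head).

Q and K are multiplied first (QK^T), giving n_head matrices of shape (8, 64, 64). These are scaled element-wise by 1 / sqrt(d_k), where d_k = 512 / n_head, and passed through a softmax over the last axis; the shape is unchanged.

Finally, each result is multiplied with V (8, 64, 512 / n_head), giving n_head output matrices of shape (8, 64, 512 / n_head). Concatenating them yields (8, 64, 512), the same shape as the input. In the paper the concatenated result is then multiplied by W^O; if the concatenated shape differed from the input, W^O could also serve to project it up or down.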
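The shape bookkeeping above can be checked with a minimal NumPy sketch. It is only an illustration under stated assumptions, not the paper's implementation: n_head = 8 is taken from the paper's base configuration, the weights are random stand-ins stacked per head as (n_head, 512, 512 / n_head), and the names multi_head_attention and softmax are made up for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    # x: (batch, seq_len, d_model); w_q / w_k / w_v: (n_head, d_model, d_head)
    d_head = w_q.shape[-1]
    # Project once per head -> (batch, n_head, seq_len, d_head)
    q = np.einsum("bsd,hde->bhse", x, w_q)
    k = np.einsum("bsd,hde->bhse", x, w_k)
    v = np.einsum("bsd,hde->bhse", x, w_v)
    # Scaled dot-product scores -> (batch, n_head, seq_len, seq_len)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (batch, n_head, seq_len, d_head)
    # Concatenate the heads along the feature axis, then apply W^O.
    batch, n_head, seq_len, _ = heads.shape
    concat = heads.transpose(0, 2, 1, 3).reshape(batch, seq_len, n_head * d_head)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
batch, seq_len, d_model, n_head = 8, 64, 512, 8  # n_head = 8 is an assumption
d_head = d_model // n_head
x = rng.standard_normal((batch, seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((n_head, d_model, d_head)) * 0.02 for _ in range(3))
w_o = rng.standard_normal((d_model, d_model)) * 0.02

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape, weights.shape)
```

The output has the same shape as the input, and each head carries its own (64, 64) weight matrix whose rows sum to 1.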

Padding Mask

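A mask commonly used in NLP. It is needed when a mini-batch contains sequences of different lengths, for example ten sentences of varying length. The shorter sequences are usually zero-padded to a common length, and the mask keeps attention from putting weight on the padded positions.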
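As a rough sketch of how such a mask is typically applied (a common practice, not something the paper spells out): padded key positions get a very large negative score before the softmax, so their weight becomes zero. The lengths, the all-zero stand-in scores, and the -1e9 fill value are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

lengths = np.array([2, 4, 3])   # hypothetical true lengths of three sentences
max_len = int(lengths.max())    # all sentences are zero-padded to this length

# True where a position holds a real token, False where it is padding.
keep = np.arange(max_len)[None, :] < lengths[:, None]   # (batch, max_len)

scores = np.zeros((len(lengths), max_len, max_len))      # stand-in attention scores
# Broadcast over the query axis so every query ignores the padded keys.
masked = np.where(keep[:, None, :], scores, -1e9)
weights = softmax(masked, axis=-1)

print(weights[0, 0])  # the first sentence attends only to its 2 real tokens
```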

Add (residual connection) and LayerNorm

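Because the input and output have the same shape, we can use a residual connection to add the input to the output, and then apply LayerNorm, which normalizes each instance independently (here, each position's feature vector).

Interestingly, quite a few papers examine the role of LayerNorm in the Transformer architecture. For example, On Layer Normalization in the Transformer Architecture discusses how its placement affects the gradients during training, and how Pre-LayerNorm (norm inside the residual branch, before the sub-layer) and Post-LayerNorm (the architecture in this paper) behave when trained with and without warm-up. It argues that the former does not need warm-up and works better.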
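The two placements can be contrasted in a few lines. This is a simplified sketch: the learned gain and bias of LayerNorm are omitted, toy_sublayer is a hypothetical stand-in for attention or the feed-forward block, and the function names are made up here.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean and unit variance.
    # The learned gain and bias of a full LayerNorm are left out for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Post-LayerNorm (this paper): add the residual, then normalize.
    return layer_norm(x + sublayer(x))

def pre_ln(x, sublayer):
    # Pre-LayerNorm: normalize inside the residual branch, then add.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64, 512))
toy_sublayer = lambda h: 0.5 * h   # hypothetical stand-in sub-layer

post = post_ln(x, toy_sublayer)
pre = pre_ln(x, toy_sublayer)
print(post.shape, pre.shape)
```

Note that in the Pre-LayerNorm variant the residual path carries x through unnormalized, which is the structural difference the cited paper analyzes.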

FeedForward

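This operates on the embedding at each position independently. Assume the input is (8, 64, 512): multiplying by a (512, 2048) fully connected layer gives (8, 64, 2048); after a ReLU, multiplying by a (2048, 512) fully connected layer gives the output (8, 64, 512).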
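A small sketch of this block, using the inner width 2048 given in the paper and random stand-in weights (the function name feed_forward and the initialization scale are assumptions). The last lines check that applying the block to one position alone gives the same result, which is what "position-wise" means.

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied to every position alike.
    hidden = np.maximum(0.0, x @ w1 + b1)   # ReLU, (..., 2048)
    return hidden @ w2 + b2                 # back to (..., 512)

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048
x = rng.standard_normal((8, 64, d_model))
w1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
w2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)

out = feed_forward(x, w1, b1, w2, b2)
single = feed_forward(x[0, 5], w1, b1, w2, b2)   # one position on its own
print(out.shape, np.allclose(out[0, 5], single))
```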

Position Encoding

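The Encoder-Decoder operations contain no convolution or recurrence, only dot-products. Convolution can exploit the spatial information in the matrix and recurrence the order of the sequence, but if the input matrix is randomly shuffled along dimension 1, attention produces the same result, merely shuffled in the same way. So we need to “inject” position information into the input matrix, storing it directly in the values of the embedding vectors. That is:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

pos is the element's position in the sequence; even feature indices use sine and odd ones use cosine. i indexes the sine/cosine pairs along the feature dimension, so feature index 2i uses the exponent 2i / d_model. Suppose an input sequence has shape (1, 64, 512): the value at position 32, feature index 240 (i = 120) has sin(32 / 10000^(240/512)) = 0.41389… added to it.

The reason for this form is that, for any fixed offset k, PE_(pos + k) can be written as a linear function of PE_pos. The authors hypothesize that this makes it easier for the model to learn to attend by relative position, which does to some extent match our intuition about modeling sequence data.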
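The encoding is easy to tabulate. The sketch below builds the sinusoidal table from the paper's formula, evaluates the entry whose exponent is 240/512 at position 32, and then checks the paper's stated property that the encoding at pos + k is a fixed linear function (a rotation of each sine/cosine pair) of the encoding at pos. The function name and the offset k = 5 are arbitrary choices for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # even feature indices 2i
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even feature indices
    pe[:, 1::2] = np.cos(angle)   # odd feature indices
    return pe

pe = positional_encoding(64, 512)
value = pe[32, 240]               # sin(32 / 10000^(240/512))
print(round(float(value), 5))

# Shifting by k rotates every (sin, cos) pair by a fixed angle k * w,
# so PE at pos + k is a linear function of PE at pos.
k = 5
w = 1.0 / np.power(10000.0, np.arange(0, 512, 2) / 512)
sin_k, cos_k = np.sin(k * w), np.cos(k * w)
shifted_sin = pe[:-k, 0::2] * cos_k + pe[:-k, 1::2] * sin_k
shifted_cos = pe[:-k, 1::2] * cos_k - pe[:-k, 0::2] * sin_k
print(np.allclose(shifted_sin, pe[k:, 0::2]), np.allclose(shifted_cos, pe[k:, 1::2]))
```

The table is simply added to the (1, 64, 512) input; broadcasting handles the batch axis.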

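These are the operations of an Encoder Layer. A Decoder Layer has almost the same operations, apart from an extra attention sub-layer over the Encoder output and the mask in its Scaled Dot-Product attention.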

Sequence Mask

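This is another mask commonly used in NLP. In Scaled Dot-Product attention, the Decoder must apply a mask after the scaling step. The point is that the prediction for the i-th element of the sequence may rely only on information up to position i. The Decoder must not see future information, so the vectors from position i + 1 onward have to be masked out.

Suppose we have the sequence [1, 2, 3, 4], so scaling produces a 4 × 4 matrix. For the Decoder to predict 2, 3, 4 after seeing only 1, we should mask (0, 1), (0, 2), (0, 3).
After seeing 2, (1, 2) and (1, 3) should be masked, and so on. In the end the masked entries form an upper triangle (strictly above the diagonal), which is the Sequence Mask.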
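The 4 × 4 example can be reproduced directly. In this sketch the scores are all-zero stand-ins and the masked entries are set to negative infinity before the softmax, which is one common way to apply the mask; the variable names are made up for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len = 4
# True strictly above the diagonal: query i may not look at key j > i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
print(future.astype(int))

scores = np.zeros((seq_len, seq_len))        # stand-in scaled scores
masked = np.where(future, -np.inf, scores)   # hide the future positions
weights = softmax(masked, axis=-1)
print(weights)
```

Row 0 puts all of its weight on position 0, row 1 splits it over positions 0 and 1, and so on down the lower triangle.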

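As we now know, beyond this conventional use, the later BERT goes further and randomly masks out tokens as its pre-training task, namely MLM (masked language modeling). More recently, It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners has made MLM shine even more.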

Why Self-Attention

To summarize why Self-Attention is preferable to a traditional Encoder or Decoder:

Less computation and easier to parallelize

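As noted in the Introduction, an LSTM must know the state at time t-1 (the cell state c and the hidden state h) before it can compute time t. Such an algorithm is inherently sequential and cannot be parallelized. Attention, by contrast, is essentially just matrix operations, which are easy to optimize in both software and hardware.

In addition, the per-layer computation is lower than for convolution or recurrence: the paper gives O(n^2 · d) for self-attention against O(n · d^2) for a recurrent layer, so this holds when the sequence length n is smaller than the feature dimension d. Still, I think comparing per-layer computation alone is too arbitrary. Efficiency should be compared as well; efficiency / complexity would be the meaningful comparison.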
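To get a feel for the per-layer figures, the leading terms from the paper's complexity table can be plugged in directly: n^2 · d for self-attention, n · d^2 for a recurrent layer, and k · n · d^2 for a convolution with kernel width k. This is only back-of-the-envelope arithmetic that ignores constant factors; the function name and the kernel width k = 3 are assumptions.

```python
def per_layer_cost(n, d, k=3):
    # Leading terms only; constants and lower-order terms are ignored.
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
    }

short = per_layer_cost(n=64, d=512)     # sequence shorter than feature size
long_ = per_layer_cost(n=1024, d=512)   # sequence longer than feature size
print(short)
print(long_)
```

With n = 64 and d = 512 self-attention has the smallest term, but once n exceeds d the ordering against the recurrent layer flips.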

Short path to any positions in sequence

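We mentioned in the Introduction that Attention is better at learning global dependencies, because the "distance" from any token to any other token is very short.

What is distance? My personal understanding is the number of times the operation must be applied before the interaction between embedding vector 1 and embedding vector 2 is encoded.

For example, for a (8, 64, 64) input under convolution with k = 2 and s = k and no padding, up to 6 convolution operations are needed (64 → 32 → 16 → 8 → 4 → 2 → 1) before the information at the head and tail of the sequence is encoded together.

An LSTM on (8, 64, num_feature) needs 64 steps to encode the information at the head and tail of the sequence.

In an Attention Layer, because the operation is a dot-product, the (8, 64, 64) input V is multiplied directly by the (8, 64, 64) QK^T, so the entry of QK^T at (0, 63) already encodes the interaction between the head and tail of the sequence.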
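The three counts can be written down as a tiny sketch. The helper below simply repeats the stride-k shrinking until one position remains; it assumes the kernel size equals the stride and that the length is a power of k, as in the example. The function name is hypothetical, and the recurrent and attention counts follow the reasoning in the text rather than any measurement.

```python
def conv_layers_to_connect(seq_len, k=2):
    # Kernel size = stride = k, no padding: each layer shrinks the length by k.
    layers = 0
    while seq_len > 1:
        seq_len //= k
        layers += 1
    return layers

seq_len = 64
path_length = {
    "convolution (k = s = 2)": conv_layers_to_connect(seq_len),  # 64 -> 32 -> ... -> 1
    "lstm": seq_len,     # one step per position, as counted in the text
    "attention": 1,      # QK^T relates every pair of positions at once
}
print(path_length)
```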

More interpretable

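In fact, I first met the Self-Attention mechanism through channel-wise attention in computer vision, probably via Dual Attention Network for Scene Segmentation. The philosophy is simple: channels are bound to be related to one another, so self-attention is used to compute the relationships between channels and derive new channels from them.

Under this reading, QK^T gives the weights between tokens, and their meaning is defined by what the task needs. For translation or reading comprehension, the weights can be interpreted as similarity; for Q/A tasks, a weight is how much a token helps in producing the answer. So we can certainly do more interpretation based on the values of QK^T.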

Self-Attention Thoughts Under the Hood

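Following the understanding above, Query and Key determine the weights on Value, while Key and Value together make up a token (see note). When a Query and a Key are strongly related for the task, the Value corresponding to that Key is amplified.

Apply this to the encoder-decoder attention in the Decoder, where the input processed by the Decoder serves as Query and the Encoder output serves as Key and Value. It can be read as follows: the Decoder decides, for each output position, which input tokens are most helpful for the task, while the Encoder supplies the values that are weighted into the context vector at that position.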

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
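The quoted definition maps onto a few lines of NumPy. The sketch below follows the paper's encoder-decoder attention, in which the queries come from the Decoder and the keys and values come from the Encoder output. The learned projections are left out, the tensors are random stand-ins, and the target length of 10 is an arbitrary assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Compatibility of each query with each key, scaled by sqrt(d_k).
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # The output is a weighted sum of the values.
    return weights @ v, weights

rng = np.random.default_rng(0)
enc_out = rng.standard_normal((8, 64, 512))   # Encoder output: 64 source tokens
dec_in = rng.standard_normal((8, 10, 512))    # Decoder states: 10 target tokens

context, weights = attention(dec_in, enc_out, enc_out)
print(context.shape, weights.shape)
```

Each of the 10 target positions gets its own row of 64 weights over the source tokens, and its context vector is the correspondingly weighted sum of the Encoder's values.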

Reference

[1] Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. arXiv:1406.1078v3 (2014).

[2] Neural Machine Translation by Jointly Learning to Align and Translate. arXiv:1409.0473v7 (2016).

[3] Seq2seq pay Attention to Self Attention: Part 2(中文版), credits to Ta-Chun (Bgg/Gene) Su

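The Transformer in this paper is only the beginning; the later BERT, GPT, and Transformers review papers are all worth reading.

A week before I wrote this post, ICLR 2021 released its submissions, which are still under double-blind review. One of them, AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE, claims to beat traditional convolution-based CV models using the Attention mechanism alone. Is it really the case that “Attention is all you need”? It will be worth watching.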