Transformer - Attention Matrix Operations

任書瑋
Data Scientists Playground
7 min read · Dec 26, 2019

This post walks through the attention matrix operations inside the Transformer; the additional mask operations are covered in two other articles.

Attention can be viewed as mapping a query and a set of key-value pairs to an output z. The output is a weighted sum of the values, where the weight assigned to each value is computed by a function of the query and the corresponding key.

Image source: https://arxiv.org/abs/1706.03762
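The figure corresponds to the scaled dot-product attention formula from the paper:

Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V

where d_k is the dimension of the key vectors; dividing by sqrt(d_k) keeps the dot products from growing too large before the softmax.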

In self-attention, the same sequence [x1, x2] first goes through three different transformations [WQ, WK, WV] to produce a set of Queries [q1, q2], Keys [k1, k2], and Values [v1, v2].

Image source: http://jalammar.github.io/illustrated-transformer/
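A minimal NumPy sketch of this projection step (the sizes and the random matrices below are made-up illustration values, not the ones in the figure):

import numpy as np

rng = np.random.default_rng(0)

# toy input: a sequence of 2 vectors [x1, x2], each of length 4 (hypothetical size)
X = rng.normal(size=(2, 4))

# three separate projection matrices, mapping dimension 4 -> 3
WQ = rng.normal(size=(4, 3))
WK = rng.normal(size=(4, 3))
WV = rng.normal(size=(4, 3))

Q = X @ WQ   # queries [q1, q2], shape (2, 3)
K = X @ WK   # keys    [k1, k2], shape (2, 3)
V = X @ WV   # values  [v1, v2], shape (2, 3)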

In the example below there is one set of Queries, Keys, and Values; the sequence length is 2, and each vector in the sequence has length 3.

Image source: http://jalammar.github.io/illustrated-transformer/

First, q1 is scored against the Keys [k1, k2], giving the attention vector [0.88, 0.12]; then 0.88*v1 + 0.12*v2 gives z1. Repeating the same operation with q2 against the Keys [k1, k2] yields z2.

Image source: http://jalammar.github.io/illustrated-transformer/
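Here is a minimal NumPy sketch of the same scoring and weighted-sum step. The numbers are made up, so the weights will not come out exactly as [0.88, 0.12], but the computation is identical:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy queries, keys, values: sequence length 2, vector length 3 (made-up numbers)
Q = np.array([[1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
K = np.array([[1.0, 1.0, 2.0],
              [0.0, 1.0, 1.0]])
V = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

scores = Q @ K.T / np.sqrt(Q.shape[-1])  # each q_i compared with every key
A = softmax(scores, axis=-1)             # row i holds the attention weights of q_i
Z = A @ V                                # z1 = A[0,0]*v1 + A[0,1]*v2, and likewise z2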

Multi-headed attention simply does the above with several heads computed in parallel. The figure below shows an example with 2 heads, each of size 3; the same computation is applied to every head.

Image source: http://jalammar.github.io/illustrated-transformer/

Each head produces its own zi. All the zi are concatenated and passed through one more transformation WO so that the final z returns to the desired dimension. The figure below shows an example with 8 heads.

Image source: http://jalammar.github.io/illustrated-transformer/
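Putting the pieces together, here is a minimal NumPy sketch of multi-head attention under these assumptions (2 heads of size 3, random projection matrices purely for illustration):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, num_heads, size_per_head = 2, 6, 2, 3

X = rng.normal(size=(seq_len, d_model))

zs = []
for h in range(num_heads):
    # each head has its own (hypothetical) projection matrices
    WQ = rng.normal(size=(d_model, size_per_head))
    WK = rng.normal(size=(d_model, size_per_head))
    WV = rng.normal(size=(d_model, size_per_head))
    Q, K, V = X @ WQ, X @ WK, X @ WV
    A = softmax(Q @ K.T / np.sqrt(size_per_head), axis=-1)
    zs.append(A @ V)                       # z_i of head i, shape (seq_len, size_per_head)

Z = np.concatenate(zs, axis=-1)            # concat all heads: (seq_len, num_heads*size_per_head)
WO = rng.normal(size=(num_heads * size_per_head, d_model))
Z = Z @ WO                                 # project back to the desired output dimension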

Below is the TensorFlow 1.14 code.

import tensorflow as tf

def multihead_attention(Q, K, V, size_per_head, num_heads):
    '''
    Args:
        Q: A 3d tensor with shape of [N, T_q, C_q].
        K: A 3d tensor with shape of [N, T_k, C_k].
        V: A 3d tensor with shape of [N, T_k, C_k].
        size_per_head: An int. Size of each head.
        num_heads: An int. Number of heads.
    Returns:
        A tensor with shape of (N, T_q, C_q) and the attention weights.
    '''
    T_q = tf.shape(Q)[1]  # dynamic query length (tensor)
    T_k = tf.shape(K)[1]  # dynamic key length (tensor)
    C_q = Q.get_shape().as_list()[-1]  # static query dimension (int)

    # Linear projections
    Q = tf.layers.dense(Q, num_heads * size_per_head, activation=tf.nn.relu)  # (N, T_q, num_heads * size_per_head)
    K = tf.layers.dense(K, num_heads * size_per_head, activation=tf.nn.relu)  # (N, T_k, num_heads * size_per_head)
    V = tf.layers.dense(V, num_heads * size_per_head, activation=tf.nn.relu)  # (N, T_k, num_heads * size_per_head)

    # Reshape and transpose to move the head axis in front of the time axis
    Q = tf.reshape(Q, (-1, T_q, num_heads, size_per_head))  # (N, T_q, num_heads, size_per_head)
    Q = tf.transpose(Q, [0, 2, 1, 3])                       # (N, num_heads, T_q, size_per_head)
    K = tf.reshape(K, (-1, T_k, num_heads, size_per_head))
    K = tf.transpose(K, [0, 2, 1, 3])                       # (N, num_heads, T_k, size_per_head)
    V = tf.reshape(V, (-1, T_k, num_heads, size_per_head))
    V = tf.transpose(V, [0, 2, 1, 3])                       # (N, num_heads, T_k, size_per_head)

    # Multiplication: score every query against every key
    A = tf.matmul(Q, K, transpose_b=True)  # (N, num_heads, T_q, T_k)

    # Scale by sqrt(dk); here dk = C_q
    A /= tf.sqrt(float(C_q))
    A = tf.nn.softmax(A, name='softmax', axis=-1)

    # Weighted sum of the values
    # (N, num_heads, T_q, T_k) * (N, num_heads, T_k, size_per_head) = (N, num_heads, T_q, size_per_head)
    outputs = tf.matmul(A, V)
    outputs = tf.transpose(outputs, [0, 2, 1, 3])                        # (N, T_q, num_heads, size_per_head)
    outputs = tf.reshape(outputs, (-1, T_q, num_heads * size_per_head))  # (N, T_q, num_heads * size_per_head)

    # Final projection (WO) back to the original query dimension if needed
    if num_heads * size_per_head != C_q:
        outputs = tf.layers.dense(outputs, C_q)
    return outputs, A
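A short usage sketch of the multihead_attention function above, with example sizes (model dimension 512, 8 heads of size 64) and random inputs purely for illustration:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph mode

Q_in = tf.placeholder(tf.float32, [None, None, 512])  # (N, T_q, C_q)
K_in = tf.placeholder(tf.float32, [None, None, 512])  # (N, T_k, C_k)
V_in = tf.placeholder(tf.float32, [None, None, 512])  # (N, T_k, C_k)

outputs, attention = multihead_attention(Q_in, K_in, V_in, size_per_head=64, num_heads=8)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x = np.random.randn(2, 5, 512).astype(np.float32)       # batch of 2, sequence length 5
    z, a = sess.run([outputs, attention],
                    feed_dict={Q_in: x, K_in: x, V_in: x})   # self-attention: Q = K = V
    print(z.shape)  # (2, 5, 512)
    print(a.shape)  # (2, 8, 5, 5)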
