Transformer - Attention Matrix Operations

任書瑋
Data Scientists Playground
7 min read · Dec 26, 2019

This post walks through the attention matrix operations inside the Transformer; the additional mask operations are covered in two other articles.

Attention can be viewed as mapping a query and a set of key-value pairs to an output z. The output is a weighted sum of the values, where the weight assigned to each value is computed by a function of the query and the corresponding key.

Image source: https://arxiv.org/abs/1706.03762
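The figure corresponds to the scaled dot-product attention formula from the paper:

Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V

where d_k is the dimension of the key vectors; dividing by sqrt(d_k) keeps the dot products from growing too large before the softmax.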

In self-attention, the same sequence [x1, x2] first goes through three different transformations [WQ, WK, WV] to produce a set of Queries [q1, q2], Keys [k1, k2], and Values [v1, v2].

Image source: http://jalammar.github.io/illustrated-transformer/
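A minimal NumPy sketch of this projection step (the sizes and the random matrices below are made-up illustration values, not the ones in the figure):

import numpy as np

rng = np.random.default_rng(0)

# toy input: a sequence of 2 vectors [x1, x2], each of length 4 (hypothetical size)
X = rng.normal(size=(2, 4))

# three separate projection matrices, mapping dimension 4 -> 3
WQ = rng.normal(size=(4, 3))
WK = rng.normal(size=(4, 3))
WV = rng.normal(size=(4, 3))

Q = X @ WQ   # queries [q1, q2], shape (2, 3)
K = X @ WK   # keys    [k1, k2], shape (2, 3)
V = X @ WV   # values  [v1, v2], shape (2, 3)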

In the example below there is one set of Queries, Keys, and Values; the sequence length is 2, and each vector in the sequence has length 3.

Image source: http://jalammar.github.io/illustrated-transformer/

First, q1 is scored against the Keys [k1, k2], giving the attention vector [0.88, 0.12]; then 0.88*v1 + 0.12*v2 gives z1. Repeating the same operation with q2 against the Keys [k1, k2] yields z2.

Image source: http://jalammar.github.io/illustrated-transformer/
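Here is a minimal NumPy sketch of the same scoring and weighted-sum step. The numbers are made up, so the weights will not come out exactly as [0.88, 0.12], but the computation is identical:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# toy queries, keys, values: sequence length 2, vector length 3 (made-up numbers)
Q = np.array([[1.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])
K = np.array([[1.0, 1.0, 2.0],
              [0.0, 1.0, 1.0]])
V = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

scores = Q @ K.T / np.sqrt(Q.shape[-1])  # each q_i compared with every key
A = softmax(scores, axis=-1)             # row i holds the attention weights of q_i
Z = A @ V                                # z1 = A[0,0]*v1 + A[0,1]*v2, and likewise z2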

Multi-headed attention simply does the above with several heads computed in parallel. The figure below shows an example with 2 heads, each of size 3; the same computation is applied to every head.

Image source: http://jalammar.github.io/illustrated-transformer/

Each head produces its own zi. All the zi are concatenated and passed through one more transformation WO so that the final z returns to the desired dimension. The figure below shows an example with 8 heads.

Image source: http://jalammar.github.io/illustrated-transformer/
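Putting the pieces together, here is a minimal NumPy sketch of multi-head attention under these assumptions (2 heads of size 3, random projection matrices purely for illustration):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, num_heads, size_per_head = 2, 6, 2, 3

X = rng.normal(size=(seq_len, d_model))

zs = []
for h in range(num_heads):
    # each head has its own (hypothetical) projection matrices
    WQ = rng.normal(size=(d_model, size_per_head))
    WK = rng.normal(size=(d_model, size_per_head))
    WV = rng.normal(size=(d_model, size_per_head))
    Q, K, V = X @ WQ, X @ WK, X @ WV
    A = softmax(Q @ K.T / np.sqrt(size_per_head), axis=-1)
    zs.append(A @ V)                       # z_i of head i, shape (seq_len, size_per_head)

Z = np.concatenate(zs, axis=-1)            # concat all heads: (seq_len, num_heads*size_per_head)
WO = rng.normal(size=(num_heads * size_per_head, d_model))
Z = Z @ WO                                 # project back to the desired output dimension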

Below is the TensorFlow 1.14 code.

import tensorflow as tf

def multihead_attention(Q, K, V, size_per_head, num_heads):
    '''
    Args:
        Q: A 3d tensor with shape of [N, T_q, C_q].
        K: A 3d tensor with shape of [N, T_k, C_k].
        V: A 3d tensor with shape of [N, T_k, C_k].
        size_per_head: An int. Size of each head.
        num_heads: An int. Number of heads.
    Returns:
        A tensor with shape of (N, T_q, C_q) and the attention weights.
    '''
    T_q = tf.shape(Q)[1]  # dynamic query length (tensor)
    T_k = tf.shape(K)[1]  # dynamic key length (tensor)
    C_q = Q.get_shape().as_list()[-1]  # static query dimension (int)

    # Linear projections
    Q = tf.layers.dense(Q, num_heads * size_per_head, activation=tf.nn.relu)  # (N, T_q, num_heads * size_per_head)
    K = tf.layers.dense(K, num_heads * size_per_head, activation=tf.nn.relu)  # (N, T_k, num_heads * size_per_head)
    V = tf.layers.dense(V, num_heads * size_per_head, activation=tf.nn.relu)  # (N, T_k, num_heads * size_per_head)

    # Reshape and transpose to move the head axis in front of the time axis
    Q = tf.reshape(Q, (-1, T_q, num_heads, size_per_head))  # (N, T_q, num_heads, size_per_head)
    Q = tf.transpose(Q, [0, 2, 1, 3])                       # (N, num_heads, T_q, size_per_head)
    K = tf.reshape(K, (-1, T_k, num_heads, size_per_head))
    K = tf.transpose(K, [0, 2, 1, 3])                       # (N, num_heads, T_k, size_per_head)
    V = tf.reshape(V, (-1, T_k, num_heads, size_per_head))
    V = tf.transpose(V, [0, 2, 1, 3])                       # (N, num_heads, T_k, size_per_head)

    # Multiplication: score every query against every key
    A = tf.matmul(Q, K, transpose_b=True)  # (N, num_heads, T_q, T_k)

    # Scale by sqrt(dk); here dk = C_q
    A /= tf.sqrt(float(C_q))
    A = tf.nn.softmax(A, name='softmax', axis=-1)

    # Weighted sum of the values
    # (N, num_heads, T_q, T_k) * (N, num_heads, T_k, size_per_head) = (N, num_heads, T_q, size_per_head)
    outputs = tf.matmul(A, V)
    outputs = tf.transpose(outputs, [0, 2, 1, 3])                        # (N, T_q, num_heads, size_per_head)
    outputs = tf.reshape(outputs, (-1, T_q, num_heads * size_per_head))  # (N, T_q, num_heads * size_per_head)

    # Final projection (WO) back to the original query dimension if needed
    if num_heads * size_per_head != C_q:
        outputs = tf.layers.dense(outputs, C_q)
    return outputs, A
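A short usage sketch of the multihead_attention function above, with example sizes (model dimension 512, 8 heads of size 64) and random inputs purely for illustration:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph mode

Q_in = tf.placeholder(tf.float32, [None, None, 512])  # (N, T_q, C_q)
K_in = tf.placeholder(tf.float32, [None, None, 512])  # (N, T_k, C_k)
V_in = tf.placeholder(tf.float32, [None, None, 512])  # (N, T_k, C_k)

outputs, attention = multihead_attention(Q_in, K_in, V_in, size_per_head=64, num_heads=8)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x = np.random.randn(2, 5, 512).astype(np.float32)       # batch of 2, sequence length 5
    z, a = sess.run([outputs, attention],
                    feed_dict={Q_in: x, K_in: x, V_in: x})   # self-attention: Q = K = V
    print(z.shape)  # (2, 5, 512)
    print(a.shape)  # (2, 8, 5, 5)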
