Introduction of Self-Attention Layer in Transformer

Neil Wu · Published in LSC PSD · 5 min read · Oct 3, 2019
Every article about the Transformer starts with a transformer

A Brief Introduction to the Transformer

The name 「Transformer」 in the field of Natural Language Processing (NLP) was defined by a paper published by Google in mid-2017, “Attention Is All You Need”. In short, the concept of the Transformer is to replace recurrent or convolutional layers with self-attention layers.

Since then, basically every task in the field of NLP has been re-worked with the Transformer, and unsurprisingly it outperforms previous results. The Transformer was published in 2017 and still dominates the field today. To be specific, BERT, the state-of-the-art NLP model published in 2018, is essentially an unsupervised pre-trained Transformer, and GPT-2, “the AI that’s too dangerous to release” from 2019, is a Transformer built on masked self-attention layers.

The Transformer is practically the holy grail of NLP now; it has even managed to redefine the word “attention” after Charlie Puth.

Attention is not only all you need, it is officially all that matters (as of 2019)

Self-Attention

The attention mechanism was first published in 2015, originally as part of an encoder-decoder structure. Attention is simply a matrix showing how strongly words relate to each other; for details about attention, check the article written by Synced below:

Self-attention is the attention a sentence pays toward itself. The main advantages of the self-attention layer compared to previous architectures are:

  1. Parallel computation (compared to RNNs)
  2. No need for a deep network to handle long sentences (compared to CNNs)

To be fair, Convolutional Neural Network (CNN) based NLP is also a solution to the problems of Recurrent Neural Network based models. However, the disadvantage of CNNs is obvious: without stacking many layers, they are not suitable for processing long sentences.

A self-attention layer computes attention with all words in the same sentence at once, which makes it a simple matrix calculation that can be computed in parallel across computing units. A self-attention layer can also use the multi-head architecture described below to broaden its view (the distance between associated words in a sentence).

Basic Structure

A self-attention layer accomplishes attention with itself through three parts

For every input x, the words in x are embedded into vectors a as the self-attention input. Next, the Query, Key and Value are calculated from each of them:

  • qⁱ (Query) = Wᑫ aⁱ
  • kⁱ (Key) = Wᵏ aⁱ
  • vⁱ (Value) = Wᵛ aⁱ

Wᑫ, Wᵏ and Wᵛ are the weights to be trained in the layer. Attention is a combination of Query and Key, where the Query acts as the giver and the Key acts as the receiver. The Value can be seen as an information extractor: it extracts a unique value based on the attention over the words.
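As a rough sketch (not the paper’s code, with placeholder sizes and random weights), the three projections are just matrix multiplications in Python/NumPy, with one embedded word per row:

import numpy as np

rng = np.random.default_rng(0)
n_words, d_model, d_k = 4, 8, 8          # placeholder sizes for the sketch

a = rng.normal(size=(n_words, d_model))  # embedded words a¹ .. a⁴, one per row
W_q = rng.normal(size=(d_model, d_k))    # Wᑫ, trained with the layer
W_k = rng.normal(size=(d_model, d_k))    # Wᵏ
W_v = rng.normal(size=(d_model, d_k))    # Wᵛ

q = a @ W_q   # row i is qⁱ = Wᑫ aⁱ
k = a @ W_k   # row i is kⁱ = Wᵏ aⁱ
v = a @ W_v   # row i is vⁱ = Wᵛ aⁱ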

Attention

Step-by-step flow of calculating the attentions

The attentions α are defined as the inner product of Query (giver) and Key (receiver), divided by the square root of their dimension. Every word creates its attention toward all words by providing its Query to match the Key of each target word. Since a larger dimension produces inner products with larger variance, the square root here acts as a variance balance. The attention matrix A is then generated.

Step-by-step flow of calculating the outputs of the self-attention layer

Apply a softmax function to A row by row (one row per input word). Each output b is then the sum of the attentions multiplied by the Values, i.e. the information extracted from each word being attended to.

The miracle here is that the self-attention layer constructs the relationships and extracts the information entirely by itself, using only the three quantities Query, Key and Value.

Query and Key construct the relationships; Value summarizes all of those relations and concludes an output b that contains the relations between the input x and all the other words.
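Putting the steps together, here is a self-contained sketch of one self-attention pass under the same assumptions (placeholder sizes, random weights): scaled dot-product scores, a row-wise softmax, and the attention-weighted sum of the Values:

import numpy as np

def self_attention(a, W_q, W_k, W_v):
    d_k = W_k.shape[1]
    q, k, v = a @ W_q, a @ W_k, a @ W_v        # Query, Key, Value
    scores = q @ k.T / np.sqrt(d_k)            # attention matrix A before softmax
    alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)  # softmax over each row
    return alpha @ v                           # outputs b¹ .. bⁿ, one per row

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))                                  # 4 embedded words
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))  # trainable weights
b = self_attention(a, W_q, W_k, W_v)                         # shape (4, 8)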

Matricize

All calculations that happen in a self-attention layer are matrix calculations

As mentioned above, every calculation within the self-attention layer is a matrix computation, which makes it extremely well suited to today’s GPU computing.

Also, none of the information in the calculation comes from a previous time step, so it is available for parallel computing.
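To make the contrast concrete, a toy comparison (illustrative only): an RNN has to walk through the sentence step by step because each hidden state depends on the previous one, while the self-attention output is a single matrix expression with no time-step dependency:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                                  # 4 word embeddings
W_h, W_x = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))  # toy RNN weights

# RNN style: each step needs the previous hidden state, so it is sequential.
h = np.zeros(8)
for t in range(x.shape[0]):
    h = np.tanh(h @ W_h + x[t] @ W_x)

# Self-attention style: one matrix expression, every position at once.
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
scores = (x @ W_q) @ (x @ W_k).T / np.sqrt(8)
alpha = np.exp(scores - scores.max(axis=1, keepdims=True))
alpha /= alpha.sum(axis=1, keepdims=True)
b = alpha @ (x @ W_v)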

Multi-Head

The relations between words may consist of more than one type. For instance, look at the two sentences below:

“LSC is the best!”

“It’s the best of LSC.”

In these two sentences, the relation between “LSC” and “best” is quite different, so they shouldn’t be treated the same by attention. The distance between associated words should also be considered by attention.

How Multi-Head creates different Attentions

Multi-head is a feature that creates multiple attention matrices in one layer. It simply duplicates the Query, Key and Value combinations in the self-attention layer and calculates each attention matrix independently. With multiple heads, the self-attention layer produces multiple outputs, so there is one more trainable weight Wᵒ such that O = WᵒB, where B is the stack of outputs from the different attentions.
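A rough sketch of the multi-head idea (the sizes, head count and random weights are placeholders): each head gets its own Query/Key/Value projections and its own attention matrix, the head outputs are concatenated into B, and a trainable Wᵒ maps B back to the model dimension:

import numpy as np

def softmax_rows(m):
    e = np.exp(m - m.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def multi_head_self_attention(a, heads=2, d_k=4, seed=0):
    rng = np.random.default_rng(seed)
    d_model = a.shape[1]
    head_outputs = []
    for _ in range(heads):                        # an independent attention per head
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
        q, k, v = a @ W_q, a @ W_k, a @ W_v
        alpha = softmax_rows(q @ k.T / np.sqrt(d_k))
        head_outputs.append(alpha @ v)            # this head's output
    B = np.concatenate(head_outputs, axis=1)      # concatenate all heads
    W_o = rng.normal(size=(B.shape[1], d_model))  # trainable Wᵒ
    return B @ W_o                                # O = WᵒB, in row-vector form

a = np.random.default_rng(1).normal(size=(4, 8))  # 4 embedded words
O = multi_head_self_attention(a)                  # shape (4, 8)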

Time Complexity

The time complexity of the self-attention layer is another advantage. A FLOPS comparison of different NLP structures is shown below:

  • Self-Attention: O(length²•dim)
  • RNN(LSTM): O(length•dim²)
  • Convolution: O(length•dim²•kernel-width)

Therefore, attention is cheap when the sentence length is much smaller than the dimension (length << dim).
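As a quick back-of-the-envelope check of those orders (the numbers below are illustrative only: a typical model dimension of 512, a 50-word sentence and a kernel width of 3):

# Rough per-layer operation counts from the comparison above (orders only).
length, dim, kernel = 50, 512, 3

self_attention = length**2 * dim           # O(length² · dim)       = 1,280,000
recurrent      = length * dim**2           # O(length · dim²)       = 13,107,200
convolution    = kernel * length * dim**2  # O(length · dim² · k)   = 39,321,600

print(self_attention, recurrent, convolution)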

Conclusions

Self-attention has dominated the field in recent years and brought us into a new era; its concept is what I tried to introduce here. If you want more structural detail, please read the original paper “Attention Is All You Need”. If you’re as lazy as me, I recommend this article from Towards Data Science:

if you_like(this_article):
    please(CLAPS)
    follow(LSC_PSD)
# Thanks :)

References:

  1. “Attention Is All You Need”, Dec. 2017
  2. Transformer lecture by Hung-yi Lee, NTU EE (YouTube)
  3. CS224N: NLP with Deep Learning, Stanford
  4. “Self-Attention Mechanisms in Natural Language Processing”, Alibaba Cloud
  5. “A Brief Introduction to Neural Machine Translation & English-to-Chinese Translation with Transformer and TensorFlow 2” (淺談神經機器翻譯 & 用 Transformer 與 TensorFlow 2 英翻中)
