Introduction to the Self-Attention Layer in the Transformer
The Transformer in Brief
The name "Transformer" in the field of Natural Language Processing (NLP) comes from the paper "Attention Is All You Need", published by Google in mid-2017. In short, the idea of the Transformer is to replace recurrent or convolutional layers with Self-Attention Layers.
Since then, basically every task in NLP has been re-worked with Transformers, and unsurprisingly they outperform the previous results. The Transformer was published in 2017 and still dominates the field today. Specifically, BERT, the state-of-the-art NLP model published in 2018, is essentially a Transformer pre-trained without labels, and GPT-2, "the AI that's too dangerous to release" from 2019, is a Transformer built on Masked Self-Attention Layers.
The Transformer is literally the holy grail of NLP now; it even successfully redefined the word "attention" after Charlie Puth.
Self-Attention
The attention mechanism was published in 2015, originally as part of an encoder-decoder structure. Attention is simply a matrix showing how strongly words relate to one another; for details about attention, check the article written by Synced below:
Self-Attention applies attention to the sentence itself: each word attends to the other words of the same sentence. The main advantages of the Self-Attention Layer compared to previous architectures are:
- Parallel computation (compared to RNNs)
- No deep network needed to handle long sentences (compared to CNNs)
To be fair, CNN-based NLP was itself a remedy for the weaknesses of recurrent-network-based NLP. However, the disadvantage of CNNs is vivid: each convolution only sees a local window, so without stacking many layers they are not suitable for processing long sentences.
A Self-Attention Layer computes attention between all words of the same sentence at once, which makes it a simple matrix calculation that parallelizes well across computing units. It can also use the Multi-Head architecture mentioned below to broaden its vision (the distance between associated words in a sentence).
Basic Structure
For every input x, the words in x are embedded into vectors aⁱ, which are the Self-Attention inputs. Next, the Query, Key and Value of each word are calculated:
- q ⁱ (Query) = W ᑫ a ⁱ
- k ⁱ (Key) = W ᵏ a ⁱ
- v ⁱ (Value) = W ᵛ a ⁱ
W ᑫ, W ᵏ and W ᵛ are the weights the layer learns. Attention is a combination of Query and Key, where the Query acts as the giver and the Key as the receiver. The Value can be seen as an information extractor: it extracts a unique value from each word according to the attention that word receives.
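A minimal sketch of the three projections in NumPy (the dimensions and random weights here are made up for illustration; in practice the weights are learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 4, 4                     # embedding and query/key dimensions (arbitrary)

a = rng.standard_normal(d_model)        # embedding a_i of one word

# trainable weights (randomly initialized here, learned during training)
W_q = rng.standard_normal((d_k, d_model))
W_k = rng.standard_normal((d_k, d_model))
W_v = rng.standard_normal((d_k, d_model))

q = W_q @ a   # Query: what this word asks of the others (giver)
k = W_k @ a   # Key:   what this word offers to be matched (receiver)
v = W_v @ a   # Value: the information this word carries

print(q.shape, k.shape, v.shape)  # (4,) (4,) (4,)
```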
Attention
Attentions α are defined as the inner product of Query (giver) and Key (receiver), divided by the square root of their dimension. Every word creates its attention toward all words by matching its Query against the Key of each target word. Since higher dimensions produce larger inner products, the square root acts as a variance balance. The result is the Attention Matrix A.
A softmax is then applied to A row by row (one row per input word). The output b is the sum of the attentions multiplied by the information extracted from the Value of each word paid attention to.
The miracle here is that the Self-Attention Layer constructs relationships and extracts information by itself through three learned weights: Query, Key and Value.
Query and Key construct the relationships; Value summarizes them into an output b that encodes the relations between the input word and all other words.
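The per-word computation described above can be sketched like this (toy random vectors stand in for the real Query/Key/Value projections):

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
# Queries, Keys and Values of a 3-word sentence (one row per word)
Q = rng.standard_normal((3, d))
K = rng.standard_normal((3, d))
V = rng.standard_normal((3, d))

# attention of word 0 toward every word: scaled inner products q0 . k_i / sqrt(d)
alpha = np.array([Q[0] @ K[i] / np.sqrt(d) for i in range(3)])
alpha = np.exp(alpha) / np.exp(alpha).sum()     # softmax over the row

# output b_0: attention-weighted sum of the Values
b0 = sum(alpha[i] * V[i] for i in range(3))
print(round(alpha.sum(), 6))  # 1.0 -- the attentions form a distribution
```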
Matricize
As mentioned above, every calculation within the Self-Attention Layer is a matrix computation, which is extremely suitable for modern GPU computing.
Also, none of the information in the calculation comes from the previous time step, so the whole layer can be computed in parallel.
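In matrix form, the whole layer reduces to a few multiplications (a sketch; the sentence length and dimension are arbitrary):

```python
import numpy as np

n, d = 5, 8                              # sentence length, dimension (arbitrary)
rng = np.random.default_rng(2)
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

A = Q @ K.T / np.sqrt(d)                 # all n*n attentions in one multiplication
A = np.exp(A - A.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)     # row-wise softmax
B = A @ V                                # all outputs b_i at once

print(B.shape)  # (5, 8)
```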
Multi-Head
The relations between words may consist of more than one type. For instance, check the two sentences below:
“LSC is the best!”
“It’s the best of LSC.”
In these two sentences, the relation between "LSC" and "best" is quite different, so they shouldn't be treated the same by attention. The distance between words should be considered by attention too.
Multi-Head is a feature that creates multiple Attention Matrices in one layer: the Query, Key and Value combinations in the Self-Attention Layer are simply duplicated, one set per head, and each head independently calculates its own Attention Matrix. With multiple heads, the Self-Attention Layer produces multiple outputs, so another trainable weight Wᵒ combines them: O = WᵒB, where B collects the outputs of the different attentions.
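A sketch of two-headed self-attention (head count, sizes and random weights are chosen here for illustration; note the code uses the row-vector convention B·Wᵒ):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

n, d_model, heads = 5, 8, 2
d_head = d_model // heads
rng = np.random.default_rng(3)
X = rng.standard_normal((n, d_model))             # word embeddings

outputs = []
for h in range(heads):                            # one Q/K/V set per head
    W_q, W_k, W_v = (rng.standard_normal((d_model, d_head)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    A = softmax(Q @ K.T / np.sqrt(d_head))        # independent attention matrix
    outputs.append(A @ V)

B = np.concatenate(outputs, axis=-1)              # stack the heads' outputs
W_o = rng.standard_normal((d_model, d_model))
O = B @ W_o                                       # combine the heads with W_o
print(O.shape)  # (5, 8)
```

Each head can learn a different kind of relation, e.g. one attending to nearby words and another to distant ones.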
Time Complexity
The time complexity of the Self-Attention Layer is another advantage. A per-layer FLOPs comparison of different NLP structures is shown below:
- Self-Attention: O(length²•dim)
- RNN(LSTM): O(length•dim²)
- Convolution: O(length•dim²•kernel-width)
Therefore, attention is cheap when sentence length ≪ dimension.
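A quick back-of-the-envelope check of the complexities above (the sentence length, dimension and kernel width are typical values picked for illustration):

```python
length, dim, kernel = 50, 512, 3          # a 50-word sentence, dim = 512, kernel = 3

self_attention = length**2 * dim          # O(length^2 * dim)
rnn            = length * dim**2          # O(length * dim^2)
convolution    = length * dim**2 * kernel # O(length * dim^2 * kernel)

print(self_attention, rnn, convolution)   # 1280000 13107200 39321600
# self-attention is the cheapest here because length << dim
assert self_attention < rnn < convolution
```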
Conclusions
Self-Attention has shown its dominance over the field these past years and brought us into a new era; its core concept is what I tried to introduce here. If you want more structural info, please read the original paper "Attention Is All You Need". If you're as lazy as me, I recommend this article from Towards Data Science:
if you_like(this_article):
    please(CLAPS)
    follow(LSC_PSO)
# Thanks :)