Transformer Demystified | Break your Attention and build it back
Get familiar with the ideas and operations that go into the Attention mechanism and the Transformer as a whole. Build a prototype from scratch and start translating.
Introduction
Ashish Vaswani and a team from Google published the paper ‘Attention Is All You Need’ in June 2017. They proposed a new sequence-to-sequence model that replaces recurrent and convolutional networks with Attention alone. This relatively simple model achieved better BLEU scores on English-to-French and English-to-German translation while taking much less time to train.
The key criticisms of RNNs are that they are poor at learning long-term dependencies because of relatively inefficient gradient flow, and that their directional, step-by-step nature prevents parallel computation. Attention overcomes both drawbacks.
Current state-of-the-art models such as BERT and GPT-2 use an attention mechanism similar to the Transformer's.
What’s a Transformer?
Here is a typical architecture of the Transformer. Very briefly, it is an encoder-decoder setup with the Attention mechanism as the key inference layer. First, the input (the initial embedding) is encoded into intermediate vectors with the help of Attention weights; these vectors work somewhat like contextual representations. The decoder then generates the output sequentially, given all the available information. Transformers are trained in parallel, as explained below.
First, we will break the algorithm into individual operations to understand them better, then build the relevant functions, and finally implement the Transformer in code.
For those who are totally new to the Transformer, it is very helpful to check the excellent illustration by Jay Alammar [2].
Inputs and Positional Encoding
In the context of language translation, sentence pairs are the inputs and targets. Embedding and positional encoding are key steps.
Sentence pairs are tokenized and the respective embeddings are looked up. Note that the embeddings are randomly initialized. A positional vector is then added to each embedding. For example, here is a sentence pair:
[[‘start_o_s’, ‘red’, ‘means’, ‘stop’, ‘end_o_s’], [‘एसओएस’, ‘रातो’, ‘भनेको’, ‘रोक्नु’, ‘हो’, ‘ईओएस’] ]
Tokenized pair: [[2, 30, 5, 85, 100], [8, 12, 98, 12, 27]]
Next, get the embeddings. These are just randomly initialized vectors of an arbitrarily chosen dimension.
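The lookup step can be sketched in a few lines of NumPy. This is only an illustration: the vocabulary size, the embedding dimension, and the names `embedding_table` and `source_ids` are assumptions, not values taken from the dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 128  # assumed: large enough to cover the example ids
d_model = 8       # assumed: arbitrarily chosen embedding dimension

# One randomly initialized vector per vocabulary entry.
embedding_table = rng.normal(0.0, 0.1, size=(vocab_size, d_model))

# Token ids of the English sentence from the example above.
source_ids = np.array([2, 30, 5, 85, 100])

# Looking up embeddings is plain row indexing into the table.
source_embeddings = embedding_table[source_ids]
print(source_embeddings.shape)  # (5, 8)
```

Each row of `source_embeddings` is the vector for one token, in sentence order.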
Positional encoding
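Sine values are added to the even-indexed items and cosine values to the odd-indexed items of the initial word embedding. The encoding for each item depends on both the position of the word in the sequence (pos) and the item's index in the embedding vector (2i or 2i+1). Following the original paper [1], the positional encoding is PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)), where d_model is the embedding dimension.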
Below is an illustration of encoding for different dimensions for a 100-word sequence. The function depicts a temporal relationship.
The following snippet explains more visually
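As a minimal sketch, the sinusoidal encoding from the original paper [1] can be computed as below. The function name `positional_encoding` is chosen here for illustration, and the code assumes an even embedding dimension.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding; assumes d_model is even."""
    positions = np.arange(seq_len)[:, None]        # shape (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]  # the values 2i, shape (1, d_model / 2)
    angles = positions / np.power(10000.0, even_dims / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even indices get the sine
    pe[:, 1::2] = np.cos(angles)  # odd indices get the cosine
    return pe

# Encoding for a 100-word sequence with 16-dimensional embeddings.
pe = positional_encoding(100, 16)
print(pe.shape)  # (100, 16)
```

The result has the same shape as the embedding matrix of a sentence, so the two can simply be added element-wise.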
Layer normalization and residual connection:
Layer normalization simply re-scales each vector such that it has a mean of 0 and a standard deviation of 1.
The residual connection simply adds the input of a sublayer to the output of that sublayer, so the result is x + f(x).
Both of these help the model train better and faster by improving the flow of gradients.
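Both operations fit in a short sketch. Note the simplification: the layer normalization in the original paper [1] also has a learnable gain and bias, which are left out here. The helper name `add_and_norm` is an assumption made for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Rescale each vector (last axis) to mean 0 and standard deviation 1."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)  # eps guards against division by zero

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))

# Any function that keeps the shape can stand in for the sublayer.
out = add_and_norm(x, lambda v: 0.5 * v)
print(out.shape)  # (5, 8)
```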
Feed Forward Network:
The feed-forward network sits on top of each attention layer. It’s a fully connected network that projects a given input to some other space which helps to extract relevant features.
It consists of two linear transformations, with a ReLU activation applied after the first one.
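A minimal sketch of this network is shown below. The hidden size `d_ff = 32` and the weight names are assumptions chosen for the example.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Two linear transformations with a ReLU after the first one."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # first linear layer + ReLU
    return hidden @ W2 + b2                # second linear layer

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # assumed sizes

W1 = rng.normal(0.0, 0.1, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(0.0, 0.1, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(5, d_model))  # five word vectors
out = feed_forward(x, W1, b1, W2, b2)
print(out.shape)  # (5, 8)
```

The same weights are applied to every position independently, so the output keeps the shape of the input.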
Attention Mechanism
Ultimately, the attention mechanism links every word in a sequence to every other word, with each word projected into separate vector spaces for the queries (Q), keys (K), and values (V). Q dot K_transpose computes a score for every pair of words, which can be read as a fully connected network over the words. Softmax scales each row of scores to values between 0 and 1 that sum to 1. Finally, these are used as attention weights to sum the values along each dimension of V, giving a refined representation of each word.
As seen from the equation and illustration below (by Jay Alammar), the initial word embedding is projected into the spaces K, Q, and V. The dot product of Q and K interlinks them. The products are scaled down by the square root of the dimension of these intermediate vectors, which keeps the softmax away from extreme values with very small gradients. Softmax brings the scores between 0 and 1, and finally the weighted sum of the values (V) is taken as the attention output.
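The operation described above is Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V from the original paper [1]. Here is a single-head sketch; the projection matrices `W_q`, `W_k`, `W_v` are random stand-ins, and the optional additive `mask` argument is an assumption of this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)  # one score per pair of words
    if mask is not None:
        scores = scores + mask                      # additive mask (0 or -inf)
    weights = softmax(scores, axis=-1)              # each row sums to 1
    return weights @ V, weights                     # weighted sum of the values

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

output, weights = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(output.shape, weights.shape)  # (5, 8) (5, 5)
```

Row i of `weights` shows how strongly word i attends to every word in the sequence, including itself.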
Here is a simplistic matrix operation that mimics attention, particularly masked attention. The sums are symbolic rather than numerically accurate; the point is to illustrate how each vector is related to the others. Encoder attention can be computed similarly by leaving out the non-peek mask.
The illustrated exercise above shows that attention can be interpreted as a graph. The graphs formed by the encoder, the decoder, and the whole model can be seen below.
The key feature of masked attention is the application of the non-peek mask. The non-peek mask is simply a reversed lower triangular (that is, strictly upper triangular) matrix of size sequence length by sequence length, matching the shape of the raw weights. Its entries above the diagonal are set to -infinity, and the mask is then added to the raw weights (Q dot K_transpose). After softmax, all values above the diagonal become 0, which nullifies all illegal connections to future positions at any given step in the decoder. Encoders do not require non-peek masks.
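A small sketch makes the effect of the mask visible. The function name `non_peek_mask` is chosen for this example, and random numbers stand in for the raw weights.

```python
import numpy as np

def non_peek_mask(seq_len):
    """0 on and below the diagonal, -inf strictly above it."""
    above_diagonal = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    return np.where(above_diagonal, -np.inf, 0.0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
raw_scores = rng.normal(size=(4, 4))  # stand-in for Q dot K_transpose

weights = softmax(raw_scores + non_peek_mask(4))
print(np.round(weights, 2))
```

Every entry above the diagonal comes out as 0, so position i only attends to positions 0 through i. The first row has a single allowed entry, which therefore gets a weight of 1.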
Final Softmax/getting back the words
To get the translated words back from the decoder stack, the Transformer uses a linear layer that outputs a score for every word in the vocabulary; softmax then converts these scores into probabilities. The loss is computed from the probability assigned to the target word. Once the model is trained, the index with the highest probability gives the prediction.
Another feature of the Transformer is that it computes the probabilities for the whole sequence at once, not just for one word, so all sentences in a batch can be trained in parallel. Using the non-peek mask and shifting the target right by one position is essential to get this setup right.
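The final projection, the loss, and the prediction can be sketched as follows. The decoder output is replaced by random numbers, and the vocabulary size and the names `W_vocab` and `b_vocab` are assumptions for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def output_probabilities(decoder_out, W_vocab, b_vocab):
    logits = decoder_out @ W_vocab + b_vocab  # one score per vocabulary entry
    return softmax(logits, axis=-1)

def cross_entropy(probs, target_ids):
    # Pick the probability the model assigned to each target word.
    picked = probs[np.arange(len(target_ids)), target_ids]
    return -np.mean(np.log(picked + 1e-12))

rng = np.random.default_rng(0)
seq_len, d_model, vocab_size = 5, 8, 128  # assumed sizes

decoder_out = rng.normal(size=(seq_len, d_model))  # stand-in for the decoder stack output
W_vocab = rng.normal(0.0, 0.1, size=(d_model, vocab_size))
b_vocab = np.zeros(vocab_size)
target_ids = np.array([8, 12, 98, 12, 27])  # target ids from the example pair

probs = output_probabilities(decoder_out, W_vocab, b_vocab)
loss = cross_entropy(probs, target_ids)
predictions = probs.argmax(axis=-1)  # index with the top probability per position
print(probs.shape, predictions.shape)  # (5, 128) (5,)
```

With untrained weights the predictions are meaningless; the loss is what training drives down.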
Teacher Forcing
When the decoder is trained, instead of feeding its own prediction back as the input for the next word in the sequence, the Transformer feeds in the ground truth directly, both to predict the next word and to compute the error. Because the input does not depend on the output of any previous step, all the words in a sequence can be processed at the same time. This is a key feature of training Transformer-like architectures, and it allows massively parallel processing.
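In practice, teacher forcing comes down to slicing the target sequence in two ways. In this sketch the token ids are hypothetical (1 stands for the start token and 2 for the end token), and the function name `teacher_forcing_pair` is made up for illustration.

```python
import numpy as np

def teacher_forcing_pair(target_ids):
    """Split a target sequence into decoder input and labels."""
    target_ids = np.asarray(target_ids)
    decoder_input = target_ids[:-1]  # ground truth, shifted right by one position
    labels = target_ids[1:]          # the word to predict at each position
    return decoder_input, labels

# Hypothetical ids: 1 = start token, 2 = end token.
target = [1, 45, 17, 63, 2]
decoder_input, labels = teacher_forcing_pair(target)

for step, (seen, expected) in enumerate(zip(decoder_input, labels)):
    print(f"position {step}: input {seen} -> predict {expected}")
```

Together with the non-peek mask, this lets every position be trained in one forward pass: position i sees the ground-truth tokens up to i and is scored on token i + 1.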
Implementation in code
Now that we have covered all the parts of the Transformer, we can go ahead and build one. One more thing to consider is a batch with sentences of unequal length: we use padding to bring the whole batch to the same sequence length, and the padded embeddings are masked during forward propagation.
For this example, I have used a small dataset just to build a prototype. It has English and Nepali sentence pairs, and the model learns to translate from English to Nepali. First we write all the necessary functions, then use them to get the inputs, weights, and targets, followed by training.
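Padding and the matching mask might look like the sketch below. It assumes that id 0 is reserved for padding and never used by a real token; `pad_batch` and `PAD_ID` are names chosen for this example.

```python
import numpy as np

PAD_ID = 0  # assumed: id reserved for padding

def pad_batch(sequences, pad_id=PAD_ID):
    """Pad token-id lists to the longest length and return a pad mask."""
    max_len = max(len(seq) for seq in sequences)
    batch = np.full((len(sequences), max_len), pad_id, dtype=np.int64)
    for row, seq in enumerate(sequences):
        batch[row, :len(seq)] = seq
    pad_mask = batch != pad_id  # True for real tokens, False for padding
    return batch, pad_mask

sequences = [[2, 30, 5, 85, 100], [2, 7, 100]]
batch, pad_mask = pad_batch(sequences)
print(batch)
print(pad_mask)
```

The boolean mask can then be used to zero out padded embeddings or to exclude padded positions from the attention weights and the loss.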
A. Pre-process, get vocabularies, dictionaries and embedding
B. Get Embedding, positional encoding and pad_masks
C. Initialize parameters
D. Feed forward and non-peek mask
E. Batching and Combining steps for ease of execution
F. Encoder and Decoder block
G. Load data and parameters
H. Train
I. Translate
[[‘start_o_s’, ‘get’, ‘ready’, ‘to’, ‘eat’, ‘end_o_s’]] → ’खान तयार हुनुहोस् ईओएस’
Limitations
This is a simplistic model: the embeddings are not trained and three-way weight tying is not implemented. Beam search could be used to find better translations.
Final words
The Transformer is the basis for recent architectures. Of the two most impactful language models, BERT uses the encoder and is trained with masked language modeling (MLM), while GPT-2 uses the decoder to predict the next word. Through transfer learning, these big pre-trained models are used extensively for many natural language tasks.
References:
[1] Vaswani et al. Attention Is All You Need (2017).
[2] Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/
[3] The Annotated Transformer (April 3, 2018). https://nlp.seas.harvard.edu/2018/04/03/attention.html