Transformer Demystified|Break your Attention and build it back

Published in Towards Data Science · 7 min read · Dec 8, 2019

Get familiar with the ideas and operations that go into the Attention mechanism and the Transformer as a whole. Build a prototype from scratch and start translating.

Introduction

Ashish Vaswani and his team at Google published the paper ‘Attention Is All You Need’ in June 2017. They proposed a new sequence-to-sequence model that replaced recurrent and convolutional networks with Attention alone. This relatively simple model achieved better BLEU scores in English-to-French and English-to-German translation, and it trained in much less time.

The key criticisms of RNNs are that they are poor at learning long-term dependencies because of relatively inefficient gradient flow, and that their sequential nature prevents parallel computation. Attention overcomes both drawbacks.

Current state-of-the-art models such as BERT and GPT-2 use an attention mechanism similar to the Transformer's.

What’s a Transformer?

Here is a typical architecture of the Transformer. Very briefly, it is an Encoder-Decoder setup with the Attention mechanism as its key layer. First, the input (the initial embeddings) is encoded into intermediate vectors with the help of Attention weights; these vectors work somewhat like contextual representations. The decoder then produces the output sequentially, given all the available information. Transformers are trained in parallel, as explained below.

Source: http://primo.ai/index.php?title=Transformer (https://arxiv.org/pdf/1706.03762.pdf)

First, we will break the algorithm into individual operations to understand them better, then build the relevant functions, and finally implement the Transformer in code.

For those who are totally new to the Transformer, the excellent illustrated guide by Jay Alammar [2] is very helpful.

Inputs and Positional Encoding

In the context of language translation, sentence pairs are the inputs and targets. Embedding and positional encoding are key steps.

Sentence pairs are tokenized and each token is mapped to its embedding. Note that the embeddings are randomly initialized. A positional vector is then added to each embedding. For example, here is a sentence pair:

[[‘start_o_s’, ‘red’, ‘means’, ‘stop’, ‘end_o_s’], [‘एसओएस’, ‘रातो’, ‘भनेको’, ‘रोक्नु’, ‘हो’, ‘ईओएस’] ]

Tokenized pair: [[2, 30, 5, 85, 100], [8, 12, 98, 12, 27]]

Next, look up the embeddings; these are just randomly initialized vectors of an arbitrarily chosen dimension.
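As a minimal sketch of this lookup, assuming a vocabulary of 101 entries and 8-dimensional vectors (both sizes, and the function names, are made up for illustration), the embedding table is just a list of random rows indexed by token id. Plain Python lists are used so that it runs without dependencies; a real implementation would normally use an array library.

```python
import random

def init_embeddings(vocab_size, d_model, seed=0):
    """Return a vocab_size x d_model table of small random vectors."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
            for _ in range(vocab_size)]

def embed(token_ids, table):
    """Look up one embedding vector (a row of the table) per token id."""
    return [table[t] for t in token_ids]

# Hypothetical sizes: 101 vocabulary entries, 8-dimensional embeddings.
table = init_embeddings(vocab_size=101, d_model=8)
source_ids = [2, 30, 5, 85, 100]   # English token ids from the example above
source_emb = embed(source_ids, table)
print(len(source_emb), len(source_emb[0]))  # 5 8
```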

Positional encoding

Sine and cosine values are added to the even and odd items, respectively, of the initial word embedding. The encoding for each item depends on both the position of the word in the sequence/sentence and the item's index in the word's embedding vector. The following equations are used to derive the positional encoding.

Below is an illustration of the encoding for different dimensions across a 100-word sequence. Each dimension is a sinusoid over position with its own frequency, which is how the encoding captures the order of the words.
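For reference, the encoding defined in the original paper [1] can be written as follows, where pos is the position of the word in the sequence, i indexes pairs of embedding dimensions, and d_model is the embedding size:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```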

source — https://d2l.ai/chapter_attention-mechanism/transformer.html

The following snippet explains more visually
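A minimal sketch of the sinusoidal encoding in plain Python is shown here. The function name and the sizes (100 positions, 8 dimensions) are illustrative assumptions, not the values used in the prototype.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sine on even indices, cosine on odd indices."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for j in range(d_model):
            # Dimensions 2i and 2i+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (j // 2)) / d_model))
            pe[pos][j] = math.sin(angle) if j % 2 == 0 else math.cos(angle)
    return pe

pe = positional_encoding(seq_len=100, d_model=8)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

The encoded input is then the element-wise sum of each word embedding and the row of pe for that word's position.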

Layer normalization and residual connection:

Layer normalization simply re-scales each vector such that it has a mean of 0 and a standard deviation of 1.
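One common way to write this is given below. The symbols n (the number of elements in each vector) and ε (a small constant for numerical stability) are assumptions introduced here, and the learned gain and bias used in practice are left out:

```latex
\mu_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}, \qquad
\sigma_i^2 = \frac{1}{n}\sum_{j=1}^{n} \left(x_{ij} - \mu_i\right)^2, \qquad
\hat{x}_{ij} = \frac{x_{ij} - \mu_i}{\sqrt{\sigma_i^2 + \epsilon}}
```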

Here x_ij is the j-th element of the i-th vector in a batch of size m.

The residual connection simply adds the input of a sub-layer (the vector to which the function is applied) to that sub-layer's output.

Both of these help the model train better and faster by improving the flow of gradients.
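The two operations can be sketched together as follows. This is a simplified version under stated assumptions: the learned gain and bias of layer normalization are omitted, eps is an assumed stability constant, and the sub-layer is a stand-in function rather than real attention.

```python
import math

def layer_norm(x, eps=1e-6):
    """Rescale one vector to mean 0 and (approximately) standard deviation 1."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Hypothetical sub-layer that just doubles its input.
out = add_and_norm([1.0, 2.0, 3.0, 4.0], lambda v: [2.0 * a for a in v])
print(out)
```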

Feed Forward Network:

The feed-forward network sits on top of each attention layer. It is a fully connected network that projects a given input into another space, which helps to extract relevant features.

It consists of two linear transformations, with a ReLU activation applied after the first.
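In formula form this is FFN(x) = max(0, x W1 + b1) W2 + b2. A minimal sketch for a single vector follows; the tiny weights and the helper name matvec are made up for illustration.

```python
def matvec(x, W, b):
    """Compute x @ W + b for a vector x and a matrix W stored as rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2."""
    hidden = [max(0.0, h) for h in matvec(x, W1, b1)]  # ReLU
    return matvec(hidden, W2, b2)

# Hypothetical tiny weights: 2 inputs -> 3 hidden units -> 2 outputs.
W1 = [[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, -0.1]
y = feed_forward([1.0, 1.0], W1, b1, W2, b2)
print(y)  # approximately [1.1, 0.9]; the third hidden unit is zeroed by ReLU
```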

Attention Mechanism

Ultimately, what the attention mechanism does is link every word in a sequence (projected into different vector spaces corresponding to K, Q, V) to every other word. Q dot K_transpose computes this network of pairwise word interactions. Softmax scales each row of scores to values between 0 and 1. Finally, these are used as attention weights to sum the values along each dimension of V, giving a refined representation of each word.

As seen from the equation and illustration below (by Jay Alammar), the initial word embedding is projected into the spaces K, Q, V. The dot product of Q and K interlinks the words. The products are scaled down by the square root of the dimension of these intermediate vectors, which keeps the model from getting stuck at extreme values where softmax has very small gradients. Softmax maps the scores to weights between 0 and 1, and finally the weighted sum of the values (V) is taken and used as the attention output.

source -http://jalammar.github.io/illustrated-transformer/
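Written out, the operation is Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, where d_k is the dimension of the key vectors. A minimal single-head sketch in plain Python is given below. The inputs are made-up 2-dimensional vectors, and the learned projections that produce Q, K and V are assumed to have been applied already.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # one weight per key, summing to 1
        weights.append(w)
        # Weighted sum of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out, weights

Q = K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(Q, K, V)
print(weights)  # each word attends most strongly to itself here
print(out)
```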

Here is a simplified matrix operation that mimics attention, particularly masked attention. The products and sums are written symbolically rather than as exact numeric results, because the aim is to illustrate how each vector is related to the others. Encoder attention can be computed the same way by leaving out the non-peek mask.

Simplified representation of the attention mechanism. Let's assume a sequence has 3 words a, b, c and that the embedding of each is just a 3-dimensional vector with the respective letter in all dimensions. Let the weights Wq, Wk and Wv all be equal to a 3 x 2 matrix with 1s in the first column and 2s in the second column. First, the embeddings are projected into these spaces as 2-d vectors, represented by (a1, a2), (b1, b2) and (c1, c2) respectively. The entries of the dot product of Q and K are again simplified: each item is represented by the 2 letters of the words being multiplied. This is the most important concept of attention: all words interact with each other. Because Wq and Wk are equal here, Q dot K_t is symmetric, and each item represents a cross relationship between a word and the same or a different word. For the decoder, a non-peek mask is added: a matrix with 0 on and below the diagonal and -inf above it, so that when softmax is applied all values above the diagonal become 0, cutting off the weights for words that lie in the future. Finally, these weights are used to sum the values V.

The exercise above shows that attention can be interpreted as a graph. The graphs formed by the encoder, the decoder and the whole model can be seen below.

**Top left - Source language graph**. This is a complete graph: each token s_i can attend to any other token s_j (including self-loops). **Top right - Target language graph**. The graph is half-complete, in that t_i attends only to t_j if i>j (an output token cannot depend on future words). **Bottom - Cross-language graph**. This is a bipartite graph, where there is an edge from every source token s_i to every target token t_j, meaning every target token can attend to source tokens. Source: Transformer tutorial (Zihao Ye)
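To make the symbolic exercise concrete, here is the same setup with numbers. The values a=1, b=2, c=3 are assumptions chosen for illustration; the weight matrix follows the description above. The sketch stops at the raw scores, before masking and softmax.

```python
# Assumed numeric stand-ins for the symbolic embeddings: a=1, b=2, c=3.
X = [[1.0] * 3, [2.0] * 3, [3.0] * 3]
# The shared 3 x 2 weight: 1s in the first column, 2s in the second.
W = [[1.0, 2.0]] * 3

def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(M):
    return [list(col) for col in zip(*M)]

Q = K = V = matmul(X, W)          # rows are (a1, a2), (b1, b2), (c1, c2)
scores = matmul(Q, transpose(K))  # every word paired with every word

print(Q)       # [[3.0, 6.0], [6.0, 12.0], [9.0, 18.0]]
print(scores)  # [[45.0, 90.0, 135.0], [90.0, 180.0, 270.0], [135.0, 270.0, 405.0]]
```

The score matrix is symmetric only because Wq and Wk are identical in this toy setup; with separately learned weights it generally is not.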

The key feature of masked attention is the application of the non-peek mask. The non-peek mask is simply a square matrix of size sequence length by sequence length, with 0 on and below the diagonal and -infinity above it (the complement of a lower triangular matrix). It is added to the raw weights (Q dot K_transposed) before softmax. After softmax, all values above the diagonal become 0, which nullifies all illegal connections for any given step in the decoder. Encoders do not require non-peek masks.
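A minimal sketch of building and applying the mask is shown here. The raw scores are made-up numbers for a 3-word target sentence, and the function names are illustrative.

```python
import math

def non_peek_mask(seq_len):
    """0.0 on and below the diagonal, -inf above it."""
    return [[0.0 if j <= i else float("-inf") for j in range(seq_len)]
            for i in range(seq_len)]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores (Q dot K_transposed) for a 3-word target sentence.
scores = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
mask = non_peek_mask(3)
masked = [[s + m for s, m in zip(srow, mrow)]
          for srow, mrow in zip(scores, mask)]
weights = [softmax(row) for row in masked]

print(weights[0])  # [1.0, 0.0, 0.0]: the first word can only see itself
print(weights[1])  # the third weight is 0.0: the future word is hidden
```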

Application of non-peek mask. Sources, top: http://www.peterbloem.nl/blog/transformers bottom: http://jalammar.github.io/illustrated-gpt2/

Final Softmax/getting back the words

To get back the translated words from the decoder stack, the Transformer uses a linear layer that outputs one score per vocabulary entry; softmax then converts these scores into probabilities. The loss is computed from the probability assigned to the target word. Once the model is fully trained, the index with the highest probability gives the prediction.

Another feature of the Transformer is that it computes the probabilities for every position in the sequence at once and not just for one word, so that all sentences in a batch can be trained in parallel. Using the non-peek mask and shifting the target right by one position are essential to set this up correctly.

The decoder stack being trained in parallel.
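The step for a single decoder position can be sketched as follows. The 2-dimensional decoder output, the 4-word vocabulary and the weights are all hypothetical, and cross-entropy is assumed as the loss.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def output_probs(h, W_out, b_out):
    """Linear layer to vocabulary-sized scores, then softmax."""
    logits = [sum(h[i] * W_out[i][j] for i in range(len(h))) + b_out[j]
              for j in range(len(b_out))]
    return softmax(logits)

def cross_entropy(probs, target_id):
    """Negative log-probability of the target word."""
    return -math.log(probs[target_id])

# Hypothetical decoder output (d_model = 2) and a 4-word vocabulary.
h = [1.0, -1.0]
W_out = [[2.0, 0.0, -1.0, 0.5], [0.0, 1.0, 1.0, 0.5]]
b_out = [0.0, 0.0, 0.0, 0.0]

probs = output_probs(h, W_out, b_out)
prediction = max(range(len(probs)), key=probs.__getitem__)  # argmax
loss = cross_entropy(probs, target_id=0)
print(prediction, loss)
```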

Teacher Forcing

When the decoder is trained, instead of feeding its own prediction back as the input for the next word in the sequence, the Transformer feeds in the ground truth directly and uses it both to predict the next word and to compute the error. Because the input at each position does not depend on the output of any previous step, all the words in a sequence can be processed at the same time. This is a key feature of training Transformer-like architectures, and it allows for massively parallel processing.
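In practice this amounts to a one-position shift of the target sentence. A minimal sketch, assuming hypothetical token ids in which 1 is the start token and 2 is the end token:

```python
def teacher_forcing_pairs(target_ids):
    """Split one target sentence into decoder inputs and labels.

    The decoder input is the target without its last token; the labels are
    the target without its first token, i.e. shifted by one position.
    """
    decoder_input = target_ids[:-1]
    labels = target_ids[1:]
    return decoder_input, labels

# Hypothetical ids: 1 = start token, 2 = end token.
decoder_input, labels = teacher_forcing_pairs([1, 12, 98, 27, 2])
print(decoder_input)  # [1, 12, 98, 27]
print(labels)         # [12, 98, 27, 2]
```

At position k the decoder sees the true tokens up to k and is scored against the true token at k+1, so every position can be evaluated in one pass.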

Implementation in code

Now that we have covered all the parts of the Transformer, we can go ahead and build one. One more thing to consider is sentences of unequal length in a batch: we use padding to bring the whole batch to the same sequence length, and the padded embeddings are masked during forward propagation.

For this example, I have used a small dataset just to build a prototype. It has English and Nepali sentence pairs, and the model learns to translate from English to Nepali. We first write all the necessary functions, then use them to get the inputs, weights, and targets, and finally train.
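Padding and the matching mask can be sketched like this. Reserving id 0 for padding is an assumption made for the example, as are the function name and the second sentence.

```python
PAD_ID = 0  # assumed id reserved for padding

def pad_batch(batch, pad_id=PAD_ID):
    """Pad every sentence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in batch)
    padded = [s + [pad_id] * (max_len - len(s)) for s in batch]
    # 1 marks a real token, 0 marks padding to be ignored.
    mask = [[0 if t == pad_id else 1 for t in s] for s in padded]
    return padded, mask

padded, mask = pad_batch([[2, 30, 5, 85, 100], [2, 7, 100]])
print(padded)  # [[2, 30, 5, 85, 100], [2, 7, 100, 0, 0]]
print(mask)    # [[1, 1, 1, 1, 1], [1, 1, 1, 0, 0]]
```

One common way to use such a mask is to add -infinity to the attention scores of padded positions, in the same way as the non-peek mask, and to leave those positions out of the loss.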

A. Pre-process, get vocabularies, dictionaries and embedding

B. Get Embedding, positional encoding and pad_masks

C. Initialize parameters

D. Feed forward and non-peek mask

E. Batching and Combining steps for ease of execution

F. Encoder and Decoder block

G. Load data and parameters

H. Train

I. Translate

[[‘start_o_s’, ‘get’, ‘ready’, ‘to’, ‘eat’, ‘end_o_s’]] → ’खान तयार हुनुहोस् ईओएस’

Limitations

The model is simplistic: the embeddings are not trained, three-way weight tying is not implemented, and beam search could be used to find better translations.

Final words

The Transformer is the basis for recent architectures. Of the two most impactful language models, BERT uses the encoder stack and is trained as a masked language model (MLM), while GPT-2 uses the decoder stack to predict the next word. Through transfer learning, these big pre-trained models are used extensively for many natural-language tasks.

References:

[1] Vaswani et al. Attention Is All You Need (2017). https://arxiv.org/pdf/1706.03762.pdf

[2] Jay Alammar. The Illustrated Transformer. http://jalammar.github.io/illustrated-transformer/

[3] The Annotated Transformer (April 3, 2018). https://nlp.seas.harvard.edu/2018/04/03/attention.html

Written by Munesh Lakhey