Transformers — Let’s Dive Deeeep!

Sukanya Bag · Published in Analytics Vidhya · Dec 1, 2020

WARNING !!

This blog is not for people who fear math !

(But make sure to give it a read if you get offended :P)

So, what are Transformers ?

By now, you must be thinking of the robots in Michael Bay’s series of American science fiction action films !

Well nope, this is something different.

The Transformer is a deep learning model introduced in 2017, with its architecture proposed in the paper Attention Is All You Need. It is based on a self-attention mechanism and is used primarily in the field of natural language processing (NLP).

What makes the Transformer a “Novel Neural Network Architecture for Language Understanding” ?

Unlike RNNs, Transformers do not require the sequential data to be processed in order. For example, if the input data is a natural language sentence, the Transformer does not need to process the beginning of it before the end. Due to this feature, the Transformer lends itself to far more parallelization than RNNs, and therefore to reduced training times.

When I first went to study the research paper which proposed this model (the one I mentioned above), I was too dumb to understand even a single sentence or the complex computations covered in it. Hence, after spending nights on various reference blogs, I finally managed to crack this huge beast of NLP!

Here, in this blog, I will try my best to simplify the parts of the paper which you need to know to get a decent understanding of the Transformer's architecture! Trust me, YOU'RE GONNA LOVE IT!

SO, LET’S START !

1. An Oversimplified Look (to not scare you) :

Let's not complicate things at the beginning. Look at this huge state-of-the-art model as a simple black box, which takes a sentence in one language as input (say English) and translates it to another language as output (say German).

2. Time to Crack Open the Beast ! (but slowly :”3)

So, after breaking it up, we see there are two major components inside it. An Encoder component at the left, and a Decoder component at the right. Let’s introduce you to the respective roles of an Encoder and a Decoder before we move forward.

Encoder — The encoder maps an input sequence into an abstract continuous representation that holds all the learned information of that input.

Decoder — The decoder then takes that continuous representation and generates an output step by step, while also being fed its previous outputs.

In the paper, the researchers take the Encoder component as a stack of 6 encoders, and the Decoder component is likewise composed of a stack of 6 decoders. It should be mentioned that they tried many variations in the number of encoders and decoders, and 6 gave the best results, hence 6 is the value proposed in the paper.

3. Now , let’s dive deep into Encoder Architecture !

The encoders all share an identical structure, with each one consisting of two sub-layers: the Self-Attention Layer and the Feed Forward Neural Network Layer. Look at the diagram below.

Self-Attention Layer : The self-attention layer enables the model to learn the correlation between the current word and the other words of the sentence.

Red colour indicates the current word and blue colour indicates the correlation activation level.

To put it briefly, all the self-attention layer does is allow the Encoder to look at the other words in the input sentence, correlate them, and capture their importance while it encodes a specific word.

Consider the following sentence :

“The boy was unable to play the match as he was injured.”

The intuition that “he” in this sentence refers to the boy, and not to the word “injured”, can be immediately grasped by a human, but it is definitely a harder task for a learning algorithm. So, when the model is processing the word “he”, self-attention allows it to associate “he” with “boy”.

Computing self-attention still requires a fairly deep understanding, as it has a number of steps going on inside it which need to be observed carefully. Let's look at them one by one. (You really need patience for this :0)

Step 1 : Creating Query vectors (q1, q2), Key vectors (k1, k2) and Value vectors (v1, v2).

The key/value/query concept comes from information retrieval systems. You can think of it simply as a ‘query’ being matched against a set of related ‘keys’ to retrieve the best ‘values’ for that query. These vectors are created by multiplying the embedding by three matrices that are learned during training.

So if X1 and X2 are our embedding inputs, and W(Q), W(K) and W(V) are the weight matrices for query, key and value respectively, then our query, key and value vectors can be calculated as:

X1 x W(Q) = q1 , X2 x W(Q) = q2

X1 x W(K) = k1 , X2 x W(K) = k2

X1 x W(V) = v1 , X2 x W(V) = v2
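If it helps to see this in code, here is a tiny NumPy sketch of Step 1. This is not the paper's actual implementation; the sizes are made-up toy values (instead of 512 and 64), and the weight matrices are random placeholders that would normally be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 4, 3                       # toy sizes; the paper uses 512 and 64
X = rng.normal(size=(2, d_model))         # rows are X1, X2: embeddings of a 2-word sentence

# Weight matrices for query, key and value (random here; learned during training)
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = X @ W_Q   # rows are q1, q2
K = X @ W_K   # rows are k1, k2
V = X @ W_V   # rows are v1, v2
print(Q.shape, K.shape, V.shape)          # (2, 3) each
```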

Step 2 : Calculate a score for each word

Now we need to compute a score for each word of the input sentence against the word we are currently encoding. This process is very crucial, as it establishes an attention score: basically, how much attention to put on that particular word compared to the other words while encoding the word at a certain position.

The score is calculated by —

  • Computing the dot product of the query vector with the key vectors of the respective words.

q1 . k1 = score for the word in position 1

q1 . k2 = score for the word in position 2 … and so on (while encoding the first word).

One important thing to note here is that the dimension of the query, key and value vectors is much smaller than that of the embedding vectors. In the paper, the embedding dimension is 512, whereas the q, k and v vectors have a dimension of 64.

Step 3 : Dividing the obtained scores by the square root of the dimension of the key vector (d(k)), and applying a softmax activation.

The third step involves dividing the previously obtained scores by 8 (since root(d(k)) = root(64) = 8).

Note that 1/√d(k) is a scaling factor which gives the model more stable gradients. The size of the dot product tends to grow with the dimensionality of the query and key vectors, so the Transformer rescales the dot product to prevent it from exploding into huge values.

So, the general formula for this step is as follows :

Score / root(d(k)) = (q1.k1)/8 , (q1.k2)/8 , … and so on.

Now we apply a softmax activation to the values obtained, which normalizes the scores so they add up to 1.0.

Step 4 : Calculating the Scaled Dot Product Attention

Oversimplifying this step, we can say that it merely involves multiplying each value vector by the softmax score obtained previously. The reason for this step is basically to keep the values of the word(s) we want to focus on and drown out the irrelevant words. After this, we simply sum up the weighted value vectors to obtain a new vector, say Z.

So the formula for this whole process of self-attention calculation is:

Attention(Q, K, V) = softmax( (Q x K^T) / root(d(k)) ) x V

So that's how the self-attention is calculated. Hope the equations look a bit less intimidating now :P.

The resulting vector is one we can send along to the feed-forward neural network!
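Putting Steps 2, 3 and 4 together, here is a minimal NumPy sketch of the whole scaled dot-product attention computation. The toy sizes and random Q, K, V are just for illustration; the code mirrors the arithmetic described above, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # Steps 2-3: dot products, scaled by root(d(k))
    weights = softmax(scores, axis=-1)   # Step 3: normalize so each row adds up to 1.0
    return weights @ V                   # Step 4: weighted sum of value vectors -> Z

# Toy example: a 2-word sentence with d_k = 3
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(2, 3)) for _ in range(3))
Z = scaled_dot_product_attention(Q, K, V)
print(Z.shape)   # (2, 3): one output vector per input word
```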

Feed Forward Neural Network — The outputs generated by the Self-Attention Layer are then passed to the Feed Forward Neural Network, where the same FFNN is applied to each position independently. The words in each position follow their own path through the encoder. There are dependencies between these paths in the self-attention layer, but the feed-forward layer does not have these dependencies, so the different paths can be run in parallel while passing through the feed-forward layer.
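To see what "the same FFNN applied to each position independently" means in practice, here is a small NumPy sketch. The two-layer ReLU form FFN(x) = max(0, xW1 + b1)W2 + b2 follows the paper, but the sizes below are toy values rather than the paper's 512 and 2048, and the weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_ff = 4, 8                      # toy sizes; the paper uses 512 and 2048

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def position_wise_ffn(Z):
    # The same W1, b1, W2, b2 are applied to every row (position) independently
    return np.maximum(0, Z @ W1 + b1) @ W2 + b2

Z = rng.normal(size=(2, d_model))         # self-attention output for a 2-word sentence
print(position_wise_ffn(Z).shape)         # (2, 4): one vector per position
```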

Hope that by now you have had some intuition of the major components of the Encoder.

So, now let's dive deeper and get the vectors into the frame!

If you are familiar with the basic NLP techniques of text pre-processing, you must be aware of the task of converting each input word into a vector using word-embedding techniques (Word2Vec, TF-IDF encoding, Latent Semantic Analysis encoding, binary encoding, etc.).

The embedding occurs only in the lowermost encoder. Each encoder receives a list of vectors, each of dimension 512. The size of this list is simply a hyperparameter which we need to tune; generally, it would be the length of the longest sentence in our training data.

After embedding the words in our input sequence, each of them flows through each of the two layers of the encoder specified above and finally the output of the FFNN is passed on to the next encoder layer.

Now that you got a clear picture in your mind regarding how the Encoding process is executed, let’s get into the fantastic concept of “Multi Headed Attention” developed by the researchers.

4. The Multi-Headed Beast

The attention mechanism in the Transformer is interpreted as a way of computing the relevance of a set of values (information) based on some keys and queries.

If we only computed a single attention weighted sum of the values, it would be difficult to capture various different aspects of the input. To solve this problem the researchers proposed the Multi-Head Attention block.

This computes multiple attention weighted sums instead of a single attention pass over the values — hence the name “Multi-Head” Attention.

Multi-Head Attention applies different linear transformations to the values, keys, and queries for each “head” of attention.

Don’t complicate your thoughts on “multi-headed attention” architecture, rather just think of it like this —

The self-attention calculation we discussed above just needs to be repeated 8 times with different weight matrices, where the weights are randomly initialized, thus creating multiple self-attention heads.

So, now we end up having eight different Z matrices (Z0, Z1, Z2, …, Z7), where Z0 = the output matrix of attention head 0, Z1 = the output matrix of attention head 1, and so on.

HURRAY !!

Wait! It's not done yet. We need to remember that the Feed Forward Neural Network will only accept a single matrix (a vector for each word). So we have to concatenate all of these 8 matrices into a single matrix.

How?

Well, not a big deal.

We concatenate the matrices and then multiply them by an additional weight matrix W(O), trained along with the model.

So what did we get now ?

We got a Z matrix which stores relevant information from all 8 attention heads and can now be sent to the FFNN.
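Here is a rough NumPy sketch of this whole multi-head step, reusing the scaled_dot_product_attention helper from the earlier snippet. The number of heads matches the paper (8), but the dimensions are toy values and the weights are random placeholders rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_k, n_heads, seq_len = 8, 2, 8, 2     # toy sizes; the paper uses 512, 64, 8 heads

X = rng.normal(size=(seq_len, d_model))          # word embeddings of a 2-word sentence

heads = []
for _ in range(n_heads):
    # Each head has its own randomly initialized W(Q), W(K), W(V)
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    Z_h = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
    heads.append(Z_h)                            # Z0, Z1, ..., Z7

Z_concat = np.concatenate(heads, axis=-1)        # shape: (seq_len, n_heads * d_k)
W_O = rng.normal(size=(n_heads * d_k, d_model))  # additional weight matrix W(O)
Z = Z_concat @ W_O                               # the single matrix the FFNN can accept
print(Z.shape)                                   # (2, 8)
```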

YAYY !!

A visual description of all that we discussed till now —

source — http://jalammar.github.io/illustrated-transformer/

Positional Encodings — An important concept

Unlike recurrent neural networks and LSTM-RNNs, the multi-head attention network cannot naturally make use of the position of the words in the input sequence. Without positional encodings, the output of the multi-head attention network would be the same for the sentences “I like chicken more than fish” and “I like fish more than chicken”. Positional encodings explicitly encode the relative/absolute positions of the inputs as vectors and are then added to the input embeddings.

When is Positional Encoding performed ?

Positional Encoding is a crucial step performed after the words are transformed into embedding vectors and before those embeddings are passed to the first encoder; the positional encoding vectors are added to the input embeddings.

Benefits of using Positional Encoding -

  1. It provides meaningful distances between the embedding vectors while they are projected into the Q, K and V vectors.
  2. It gives the model a sense of the order of the words, so it can learn position-dependent patterns.

Intuition — This technique is used because there is no notion of word order (1st word, 2nd word, …) in the proposed architecture. All words of the input sequence are fed to the network with no special order or position (unlike common RNN or ConvNet architectures); thus, the model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of words. Based on experiments, this addition not only avoids destroying the embedding information but also adds the vital position information.

Note — What a positional encoder does is use the cyclic nature of the sin(x) and cos(x) functions to encode information about the position of a word in a sentence.

The paper uses the following equations to compute the positional encodings:

PE(pos, 2i) = sin( pos / 10000^(2i / d(model)) )

PE(pos, 2i+1) = cos( pos / 10000^(2i / d(model)) )

where pos represents the position and i represents the dimension, with d(model) = 512 (thus i ∈ [0, 255]) in the original paper. For every even index of the input vector, a value is created using the sin function; for every odd index, using the cos function. Those positional vectors are then added to their corresponding input embeddings. This successfully gives the network information about the position of each vector.
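Here is a small NumPy sketch of these equations; the function name and the toy 10-word sentence are just for illustration.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    PE = np.zeros((seq_len, d_model))
    pos = np.arange(seq_len)[:, None]            # positions 0 .. seq_len - 1
    i = np.arange(d_model // 2)[None, :]         # dimension index i in [0, d_model/2)
    angle = pos / np.power(10000, 2 * i / d_model)
    PE[:, 0::2] = np.sin(angle)                  # even indices use sin
    PE[:, 1::2] = np.cos(angle)                  # odd indices use cos
    return PE

# The positional encodings are added (not concatenated) to the word embeddings:
embeddings = np.random.default_rng(4).normal(size=(10, 512))   # a 10-word sentence
encoder_input = embeddings + positional_encoding(10, 512)
print(encoder_input.shape)   # (10, 512)
```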

Residual Dropout

The authors applied dropout to the output of each sub-layer before adding it to the sub-layer's input. They also applied dropout to the sums of the embeddings and the positional encodings. The dropout rate was 0.1 for the base model.

Each sub-layer (self-attention, FFNN) in each encoder has a residual connection around it, and is followed by a layer-normalization step.

The basic idea is that the residual connection lets the network skip the self-attention layer if it is not required, while layer normalization keeps the resulting values well-scaled.

This step adds the X and Z matrices (the sub-layer's input and output), applies dropout at a rate of 0.1 (as used in the paper), and then layer-normalizes the sum. This goes for the sub-layers of the decoder as well.
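A minimal sketch of this "add and normalize" step, assuming the order LayerNorm(X + Dropout(Z)); the learned gain and bias that a full LayerNorm implementation would carry are left out here for brevity.

```python
import numpy as np

rng = np.random.default_rng(5)

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)       # learned gain/bias omitted for brevity

def add_and_norm(X, Z, p_drop=0.1, training=True):
    if training:                          # dropout on the sub-layer output, rate 0.1
        mask = rng.random(Z.shape) >= p_drop
        Z = Z * mask / (1.0 - p_drop)     # inverted dropout scaling
    return layer_norm(X + Z)              # residual connection, then layer normalization

X = rng.normal(size=(2, 4))               # sub-layer input
Z = rng.normal(size=(2, 4))               # sub-layer output (e.g. from self-attention)
print(add_and_norm(X, Z).shape)           # (2, 4)
```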

So, that covers everything we needed to know inside the Encoder block of the Transformer!

WAIT…, you need to finish this off before getting your coffee!

The Decoder Block is still left to discover ! Let’s jump in !

The Decoder

The Decoder architecture is similar to that of an Encoder, except for an extra "masked multi-head attention" layer; the encoder-decoder attention layer that follows it helps the Decoder put attention on the relevant places in the input sequence.

A Decoder

Remember, decoders are generally trained to predict sentences based on all the words before the current word.

So, this layer attends over the previous decoder outputs (playing a role similar to the decoder's hidden state in an RNN) and masks the decoder's inputs from future time-steps.

Great Job !

So, to sum up how the encoding and decoding work, let’s create a few easy steps -

  1. The encoder starts by processing the input sequence.
  2. The decoder's input goes through an embedding layer and a positional encoding layer to get positional embeddings.
  3. These are fed to the decoder's first "masked multi-head attention" layer, which masks the decoder's inputs from future time-steps while computing the attention scores for the decoder's input.
  4. In the second multi-headed attention layer, the encoder's outputs act as the keys and values, while the first (masked) attention layer's outputs act as the queries. This step matches the encoder's output to the decoder's input, allowing the decoder to decide which parts of the input sequence are relevant to put attention on.
  5. The output of the second multi-headed attention layer goes through a pointwise feed-forward layer for further processing. The output of each step is fed to the bottom decoder in the next time-step.
  6. These steps keep running until an end-of-string <eos> token is generated, indicating the transformer decoder has completed its output (a minimal sketch of this loop follows below).
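To make this loop concrete, here is a heavily simplified, hypothetical sketch of greedy decoding. transformer_decoder_step below is only a random stand-in stub, not a real decoder stack, and the vocabulary size and special token ids are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(6)
VOCAB_SIZE, BOS, EOS = 100, 1, 2          # invented vocabulary size and token ids

def transformer_decoder_step(encoder_output, decoded_so_far):
    """Stand-in stub returning random logits over the vocabulary for the next position.
    A real decoder would run masked self-attention over decoded_so_far and
    encoder-decoder attention over encoder_output."""
    return rng.normal(size=VOCAB_SIZE)

encoder_output = rng.normal(size=(7, 512))    # pretend output of the encoder stack
decoded = [BOS]                               # start-of-sequence token
for _ in range(20):                           # cap the output length
    logits = transformer_decoder_step(encoder_output, decoded)
    next_token = int(np.argmax(logits))       # greedy choice: highest-scoring token
    decoded.append(next_token)
    if next_token == EOS:                     # <eos> ends the generation loop
        break
print(decoded)
```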

The Final Layers- Linear Classifier and Softmax Layer

After the Decoder stack processes the outputs as vectors, the output of the final pointwise feed-forward layer goes through a final linear layer that acts as a classifier. This layer has as many cells as the size of the output vocabulary your model learned from the training dataset.

This huge output vector is then fed to a Softmax layer to generate probability values ranging between 0 and 1.

So, to simplify things for you,

index of the highest probability score (argmax) = our predicted word.
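As a tiny sketch of these final layers (the six-word vocabulary and random weights are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["<eos>", "der", "Junge", "ist", "verletzt", "heute"]   # toy output vocabulary

d_model = 512
decoder_output = rng.normal(size=d_model)             # one output vector from the decoder
W_linear = rng.normal(size=(d_model, len(vocab)))     # final linear layer ("classifier")

logits = decoder_output @ W_linear                    # one cell per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                  # softmax: probabilities sum to 1.0

predicted_word = vocab[int(np.argmax(probs))]         # index of the highest probability
print(predicted_word)
```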

And That Was All About It!

Hope this blog helps you understand the research paper and saves you hours of effort in making sense of it while reading.

Also, try to get your hands dirty by running the Tensor2Tensor notebook, where you can load a Transformer model and play with the code with some helpful interactive visualizations.

If you are a beginner in Data Science and Machine Learning and have some specific queries with regard to Data Science/ML-AI, guidance for Career Transition to Data Science, Interview/Resume Preparation or even want to get a Mock Interview before your D-Day, feel free to book a 1:1 call here. I will be happy to help!

Happy Learning !

Until next time !
