The Transformers in NLP

Jaimin Mungalpara · Published in CodeX · Mar 16, 2021

In this blog we will discuss the Transformer, which outperforms previous sequence-to-sequence methods. The Transformer is still based on attention, but it adds parallelization to the architecture by removing recurrence.

In the previous blog we discussed attention. In recent deep learning models, attention is a concept that has helped improve the performance of NLP applications such as neural machine translation, image captioning and various others. The attention mechanism works much like a human reader: when you hear a sentence, your mind picks out the important keywords and uses them to understand the context of the sentence. In the same way, on the encoder side the output of each LSTM cell is passed to the attention mechanism, which scores each position in the sequence to determine its importance. The resulting context vector, together with the decoder's previous output, is then given to the next LSTM cell in the decoder, which produces the translation.
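To make this concrete, here is a minimal NumPy sketch of that scoring step. It assumes simple dot-product scoring between the decoder state and the encoder states; the exact scoring function differs between attention variants, and the shapes here are only illustrative.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Hypothetical shapes: 4 encoder time steps, hidden size 8
encoder_states = np.random.randn(4, 8)   # one vector per source word
decoder_state = np.random.randn(8)       # decoder's previous hidden state

# Dot-product attention scores: how relevant is each source word?
scores = encoder_states @ decoder_state   # shape (4,)
weights = softmax(scores)                 # attention distribution, sums to 1

# Context vector: weighted sum of encoder states, fed to the next decoder step
context = weights @ encoder_states        # shape (8,)
print(weights, context.shape)
```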

The paper ‘Attention Is All You Need’ was published in 2017 and introduced a novel architecture called the Transformer. As the name suggests, this architecture uses an encoder-decoder structure with attention, but adds parallelization, as shown in the figure below.

Transformer Architecture / Source : https://arxiv.org/pdf/1706.03762.pdf

The figure lays out the architecture; let's take a deep dive into it. The encoder, shown on the left, is composed of a stack of N = 6 identical layers. Each layer has two sub-layers: a multi-head self-attention mechanism, and a simple, fully connected feed-forward network. A residual connection is added around each sub-layer, followed by layer normalization. The 6 encoders are stacked on top of each other, and the output of the last encoder is given to the decoder.

The decoder is also composed of a stack of N = 6 identical layers. In addition to the two sub-layers of the encoder, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. As in the encoder, a residual connection is added around each sub-layer, followed by layer normalization. The self-attention sub-layer in the decoder also differs from the traditional approach: it is a masked multi-head attention, which, combined with the positional embeddings, ensures that the prediction for a given position can depend only on the outputs at earlier positions.

The entire architecture can be seen like this.

Encoder Decoder architecture of Transformers / Source : http://jalammar.github.io/illustrated-transformer/

In a traditional encoder architecture we used an RNN/LSTM/GRU as the cell, but here the approach changes: each encoder cell is replaced by a self-attention layer and a feed-forward neural network stacked on top of each other.

Encoder architecture of Transformers / Source : http://jalammar.github.io/illustrated-transformer/

The input to the encoder is first passed through a self-attention layer, whose job is to weight the positions of the sequence against each other. The output of self-attention is then passed through a feed-forward neural network. The output of this single encoder is passed on to the next encoder layer (there are 6 encoder layers in this architecture). Finally, the output of the last encoder is sent to the decoder. The decoder consists of a similar architecture, with one extra layer of encoder-decoder attention (similar to seq2seq attention) added in between. The figure below gives an overview of this encoder-decoder structure.

Decoder architecture of Transformers / Source : http://jalammar.github.io/illustrated-transformer/
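As a rough sketch of the feed-forward half of an encoder cell, here is the position-wise feed-forward network with the sizes from the paper (512 → 2048 → 512). The randomly initialized weights only stand in for the learned parameters.

```python
import numpy as np

d_model, d_ff = 512, 2048   # sizes from the paper

# Randomly initialized weights stand in for learned parameters
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

def feed_forward(x):
    """Position-wise feed-forward network, applied to each word vector independently."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2   # ReLU between two linear layers

tokens = np.random.randn(3, d_model)   # e.g. 3 words, 512 dimensions each
print(feed_forward(tokens).shape)      # (3, 512): same shape in and out
```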

The Encoder

Let’s take a deeper look inside a single encoder cell. The very first step in any NLP problem is to convert text data into a numeric representation, so we have to convert our text into vectors of size 512 (the dimension used in the research paper). There are multiple ways to convert words into vectors, such as Word2Vec, GloVe, TF-IDF and BOW. These embeddings of size 512 are then sent to the bottom encoder's self-attention layer. The output of the self-attention layer is sent to the feed-forward layer, and from there it is passed on to the next encoder layer. The length of the input can be decided based on the longest sentence. Here we take a sequence of 3 words, X1, X2, X3, as an example.

Input to Encoder
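For illustration, here is a minimal sketch of this embedding step with a toy vocabulary and a randomly initialized lookup table; in practice the table is learned, and the words chosen here are only an assumption.

```python
import numpy as np

d_model = 512
vocab = {"<pad>": 0, "i": 1, "am": 2, "happy": 3}        # toy vocabulary for illustration
embedding_table = np.random.randn(len(vocab), d_model)    # learned in a real model

sentence = ["i", "am", "happy"]                            # the 3-word example X1, X2, X3
X = np.stack([embedding_table[vocab[w]] for w in sentence])
print(X.shape)   # (3, 512): one 512-dimensional vector per word
```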

After embedding, the words form the input sequence, and every word passes through each layer of the encoder.

Self-Attention

Let’s first look at how self-attention works when we pass in the vectors of size 512. (Note: this vector size follows the research paper; other works use different sizes.) The first step is to calculate a Query vector, a Key vector and a Value vector for each word. These vectors are calculated by multiplying each word embedding with three learned weight matrices. Put simply, the input vector (x1, x2, …, x512) is multiplied by the query weight matrix Wq to produce the Query vector, and the Key and Value vectors are calculated the same way with Wk and Wv, as represented in the figure below.

Query, Key and Value vector calculation
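A minimal NumPy sketch of these projections, with randomly initialized weight matrices standing in for the learned Wq, Wk, Wv:

```python
import numpy as np

d_model, d_k = 512, 64
X = np.random.randn(3, d_model)            # embeddings for a 3-word sentence

# Learned projection matrices (randomly initialized here for illustration)
Wq = np.random.randn(d_model, d_k) * 0.01
Wk = np.random.randn(d_model, d_k) * 0.01
Wv = np.random.randn(d_model, d_k) * 0.01

Q = X @ Wq    # query vectors, one per word, shape (3, 64)
K = X @ Wk    # key vectors,   shape (3, 64)
V = X @ Wv    # value vectors, shape (3, 64)
```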

According to the research paper, the dimension of these vectors is reduced to 64, while the output of the encoder stays at size 512. The real beauty of this model lies here: instead of computing one relation, we find multiple relations between each pair of words.

Next, we use these vectors to calculate an attention score. In our example the first word is “I”, so we need to score this word against every word in the sentence. The score is calculated as the dot product of the query vector with the key vector of the word we are scoring against. For example, when calculating scores for “I”, score 1 would be q1·k1, score 2 would be q1·k2 and score 3 would be q1·k3. We are simply taking a dot product of the Query and Key matrices, and the result is the score.

Calculation of the score for the word “I” against all other words
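The same scoring step in code; random Q and K matrices stand in for the vectors computed above.

```python
import numpy as np

d_k = 64
Q = np.random.randn(3, d_k)   # query vectors for the 3 words (from the previous step)
K = np.random.randn(3, d_k)   # key vectors

q1 = Q[0]                     # query for the first word, "I"
score_1 = q1 @ K[0]           # "I" scored against word 1
score_2 = q1 @ K[1]           # "I" scored against word 2
score_3 = q1 @ K[2]           # "I" scored against word 3

# The same thing for every word at once is just a matrix product:
scores = Q @ K.T              # shape (3, 3)
```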

After calculating the score, the research paper divides it by 8, the square root of the key vector dimension (64). The score/8 values are then passed through a softmax function, which produces values between 0 and 1 that sum to 1, as our attention criteria and the properties of softmax require. This determines the importance of each word at every position. The word itself will usually have the highest score, but we can also see the relevance of the other words to the current word. The research paper computes attention with the formula below.

Formula to calculate Attention / Source : https://arxiv.org/pdf/1706.03762.pdf
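A small sketch that follows this formula directly; random inputs stand in for the real Q, K and V.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)    # divide by 8 when d_k = 64
    weights = softmax(scores)          # each row sums to 1
    return weights @ V                 # weighted sum of the value vectors

Q = np.random.randn(3, 64); K = np.random.randn(3, 64); V = np.random.randn(3, 64)
Z = attention(Q, K, V)                 # shape (3, 64): one output vector per word
```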

The next step is to multiply each softmax value by its value vector. This step drowns out the words with tiny softmax values and keeps the focus on the most relevant words. The weighted value vectors are then summed up to produce the output of the self-attention layer for that particular word. The resulting vector is sent to the feed-forward neural network for further processing. The entire flow is shown in the figure below.

Entire flow for calculation of attention layer output

As shown in the Transformer architecture, ‘multi-head attention’ is used. So far we have discussed single-headed attention; adding multiple heads improves the performance of the attention layer. First, it expands the model's ability to focus on different positions when dealing with long sequences. Second, it uses multiple sets of Query, Key and Value matrices, each randomly initialized. Since the Transformer uses 8 attention heads, we end up with 8 different Z matrices. This flow is shown in the figure below.

Calculation of Z Matrices with 8 different Attention heads

The problem with this step is that the feed-forward neural network expects a single matrix as input, so all the attention heads need to be concatenated. The researchers therefore introduce one more weight matrix, WO, which is multiplied with the concatenation of all attention heads to get the final Z. This process is shown in the figure below.

Formula to calculate Multi-head Attention / Source : https://arxiv.org/pdf/1706.03762.pdf
Reference : http://jalammar.github.io/illustrated-transformer/
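Putting the pieces together, here is a minimal sketch of multi-head attention with 8 heads and a final WO projection; all weights are randomly initialized purely for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

d_model, num_heads = 512, 8
d_k = d_model // num_heads                       # 64 per head
X = np.random.randn(3, d_model)                  # 3 words

# One set of randomly initialized projections per head (learned in practice)
heads = []
for _ in range(num_heads):
    Wq, Wk, Wv = (np.random.randn(d_model, d_k) * 0.01 for _ in range(3))
    heads.append(attention(X @ Wq, X @ Wk, X @ Wv))   # each Z_i has shape (3, 64)

WO = np.random.randn(num_heads * d_k, d_model) * 0.01
Z = np.concatenate(heads, axis=-1) @ WO          # concatenate, then project back to 512
print(Z.shape)                                   # (3, 512)
```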

Positional Encoding

Positional encoding is the key factor that lets the model identify the order of the words in the input and the distance between two words. To create this encoding, a positional vector is added to each input embedding (the positional vectors follow a specific pattern that the model learns to use), and the result is fed into the encoder. Adding these values to the embeddings creates meaningful distances between the vectors once they are projected into the Query/Key/Value spaces, so the position of each word and the distance between words can be determined.

Formula to calculate Position vector
Reference : http://jalammar.github.io/illustrated-transformer/
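Here is a sketch of the sinusoidal positional encoding from the paper, added to the word embeddings before they enter the encoder.

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encodings from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                  # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                  # odd dimensions get cosine
    return pe

X = np.random.randn(3, 512)                       # word embeddings
X = X + positional_encoding(3)                    # inject position information
```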

The Residual Network ( Skip Connection )

In the encoder architecture, a residual connection is drawn around both the multi-head attention and the feed-forward network. This skip connection is followed by a layer normalization step. With self-attention and layer normalization, the visualization looks like this.

Entire Flow from input to output of Encoder 1
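A minimal sketch of this Add & Norm step; the learned scale and shift of layer normalization are omitted, and a dummy sublayer stands in for attention or the feed-forward network.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each word vector to zero mean and unit variance (scale/shift omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sublayer_with_residual(x, sublayer):
    """Add & Norm: LayerNorm(x + Sublayer(x)), used around attention and the FFN."""
    return layer_norm(x + sublayer(x))

x = np.random.randn(3, 512)
out = sublayer_with_residual(x, lambda v: v * 0.5)   # dummy sublayer for illustration
print(out.shape)                                     # (3, 512)
```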

Now the output of the first encoder is given to the second encoder, and the entire encoder stack works this way. A two-encoder version of the architecture would look like this.

Reference :- http://jalammar.github.io/illustrated-transformer/

The Decoder

In the architecture, the output of the encoder is given to the second sub-layer of the decoder. The first sub-layer of the decoder is masked multi-head attention, followed by add & normalization. Like the encoder, the masked multi-head attention takes an output embedding together with a positional embedding; for example, if we are working on language translation, the target language is given as input to the decoder. Masking here means that each position is prevented from attending to later positions, so the model can only use the words it has already produced when predicting the next one. This layer generates the Query matrix for the decoder, while the Key and Value matrices are taken from the encoder output, which is fed directly into the encoder-decoder attention as shown in the figure. This process is repeated for the entire sequence until the completion symbol, for example end of sentence, is reached. The entire flow can be seen in the figure below.

Reference :- http://jalammar.github.io/illustrated-transformer/
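A minimal sketch of masked self-attention, where the upper triangle of the score matrix (the future positions) is set to a large negative value before the softmax so those positions receive essentially zero weight.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V):
    """Self-attention where each position may only look at itself and earlier positions."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1).astype(bool)   # future positions
    scores = np.where(mask, -1e9, scores)   # large negative value -> ~0 after softmax
    return softmax(scores) @ V

Q = np.random.randn(4, 64); K = np.random.randn(4, 64); V = np.random.randn(4, 64)
Z = masked_attention(Q, K, V)              # shape (4, 64)
```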

Finally, the output of the stacked decoders is sent to a linear layer, which is a simple fully connected neural network. The linear layer produces a much larger vector, with one entry per word; we can think of it as covering the vocabulary of our entire dataset. This vector is then passed through a softmax layer to convert the scores into probabilities, and the cell with the highest probability is chosen and mapped to the best possible word for that time step.
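A minimal sketch of this final step, with an assumed vocabulary size of 10,000 and a randomly initialized output layer standing in for the learned one.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, vocab_size = 512, 10000           # vocab_size is an assumption for illustration
W_out = np.random.randn(d_model, vocab_size) * 0.01   # the final linear layer

decoder_output = np.random.randn(d_model)  # decoder vector for the current time step
logits = decoder_output @ W_out            # one score per word in the vocabulary
probs = softmax(logits)                    # probabilities over the vocabulary
predicted_id = int(np.argmax(probs))       # pick the most likely word (greedy decoding)
```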

In this blog we have looked at the original Transformer, which is the base of all current research in this direction. In the next article we will cover BERT (Bidirectional Encoder Representations from Transformers) and GPT.

Suggestions are always welcome.

References

  1. http://jalammar.github.io/illustrated-transformer/
  2. https://arxiv.org/pdf/1706.03762.pdf
