Demystifying Transformers: A Deep Dive into the NLP Revolution

Dhruv Pamneja
13 min read · May 11, 2024


For many natural language tasks such as machine translation, language modelling and derivative tasks (e.g. text summarisation, sentiment analysis, question answering), the encoder-decoder model coupled with attention has proved to be a powerful performer, successfully handling complex real-world problems. However, there was still room for improvement in this architecture.

Processing textual data sequentially, token by token, limited these models’ efficiency and scalability on complex tasks. Despite their superior performance over previous SOTA models, they still took a significant amount of time to converge during training.

Additionally, while the attention layer did help capture relationships between various parts of the sequence, it did not fully address the challenge of modelling long-range dependencies over the entire sequence.

Following the above, in 2017 a paper published by a team of researchers from Google and the University of Toronto proposed a model called the Transformer, which proved to be a game changer for NLP tasks. It is considered one of the most revolutionary papers in the field of NLP and language-based models, and you can read more about it here.

Introduction

Now, this model too uses an encoder-decoder architecture at its base, as we can see in the image below:

As we can see above, we give the encoder an input sentence in the source language and it outputs its translation in the target language, much like before. Now, if we look inside these boxes, we would find:

The encoding component of the model has six encoders stacked on top of one another: the first receives the input sequence and each passes its output up the stack. Similarly, the decoding component is a stack of six decoders, each of which also receives the output of the final encoder layer, and the output of the last decoder is passed onwards.

Here, an interesting thing to notice is the number of encoders and decoders inside their respective components. This can be thought of as a hyperparameter: in the research, six layers in each stack were found to yield the best results after various analyses, so we will treat this number as fixed for now.
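To make the stacking concrete, here is a minimal sketch in Python. The layer count comes from the paper; `encoder_layers` is a hypothetical list of layer functions for illustration, not the paper's actual code:

```python
num_layers = 6   # six encoder layers and six decoder layers, as chosen in the paper

# `encoder_layers` would be a list of six layer functions (hypothetical placeholders
# for the self-attention + feed-forward block described below): identical in
# structure, but each with its own weights.
def run_encoder_stack(x, encoder_layers):
    for layer in encoder_layers:
        x = layer(x)   # each encoder consumes the output of the one below it
    return x           # the final encoder output, handed to every decoder layer
```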

Let us now look inside the individual layers to understand further, starting with an encoder layer:

The input is first received by a self-attention layer, which we will expand on later. To put it simply for now, it allows the encoder to look at the other words of the input sequence as it encodes the current word.

After this, the data heads to a simple feed-forward neural network, which processes it and sends it ahead. Note that all encoders are identical in structure, although they do not share any weights among themselves.

Workings of the Encoder

First, let us understand the process. The encoder receives the words of the input sentence and converts each word into a vector to be passed to the network. This is done via the embedding layer, which maps every word to a vector of fixed size; here, each word becomes a vector of 512 dimensions. (Again, this is a hyperparameter which was found to bring optimal results.)
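A minimal sketch of this step, assuming a tiny toy vocabulary (the 512-dimensional size matches the paper; the words, ids and random values are placeholders for illustration):

```python
import numpy as np

d_model = 512                                    # embedding size used in the paper
vocab = {"i": 0, "am": 1, "a": 2, "student": 3}  # toy vocabulary for illustration

# Embedding table: one 512-dimensional row per word, learned during training
# (here initialised randomly just to show the shapes).
embedding_table = np.random.randn(len(vocab), d_model) * 0.01

sentence = ["i", "am", "a", "student"]
token_ids = [vocab[w] for w in sentence]
x = embedding_table[token_ids]                   # shape: (4, 512), one vector per word
print(x.shape)                                   # (4, 512)
```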

Once we are done with the embeddings, we pass the vectors to the self-attention layer, which does its operations and passes the result on to the feed-forward neural network. Here we see a core property of the Transformer model: each word now flows along its own path. The data still flows through the layers in order, but within a layer the words are processed in parallel.

This means that instead of the words being fed in one after the other, they all enter the self-attention layer at the same time, and it is this layer that handles the dependencies between the paths.

Since the feed-forward network has no such dependencies, the words can move through it in parallel. Let us understand the above with the diagram given below:

As we can see above, the words are embedded as vectors e1, e2, e3 and e4. They are passed to the self-attention layer in parallel, which yields the vectors z1, z2, z3 and z4. Following that, each of these is passed through the feed-forward neural network independently, so they can be processed in parallel with no dependencies formed at this level.

At the end of this layer, the encoder yields outputs r1, r2, r3 and r4, which are then passed on to the next encoder in the stack. Now, let us understand what exactly self-attention is and how it works in this model.
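To illustrate why the feed-forward step can run in parallel, here is a rough sketch of a position-wise feed-forward network applied to all four positions at once. The inner size of 2048 is taken from the paper; the random matrices are placeholders:

```python
import numpy as np

d_model, d_ff, seq_len = 512, 2048, 4   # 2048 is the inner layer size used in the paper

# Toy self-attention outputs z1..z4, stacked as the rows of a matrix.
z = np.random.randn(seq_len, d_model)

# Position-wise feed-forward network: two linear layers with a ReLU in between.
W1, b1 = np.random.randn(d_model, d_ff) * 0.01, np.zeros(d_ff)
W2, b2 = np.random.randn(d_ff, d_model) * 0.01, np.zeros(d_model)

# The same weights are applied to every position, and because no position depends
# on another here, all four words are processed in one matrix product.
r = np.maximum(0, z @ W1 + b1) @ W2 + b2
print(r.shape)   # (4, 512) -- r1..r4, passed on to the next encoder layer
```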

Self-Attention

Now, as we head into this section, let us look at the example given below:

The animal didn’t cross the street because it was too tired

Here, if we were asked the simple question of what the word “it” refers to, we would correctly answer “the animal” rather than “the street”. However, such an understanding of language requires context, and the ability to see the sentence in its totality.

In comes the concept of self-attention, which allows the model, while processing each word, to look at all the other input words, gaining insight from them and creating a better representation of the word. To visualise this, let us look at the image below:


As we can see above, with the help of self-attention the network is able to direct its focus onto all the other words as it tries to encode the word “it”. It has directed relatively greater focus to the words “The” and “animal”, shown with darker lines, and bakes a part of their representation into the encoding of “it”. Let us now understand self-attention in detail.

Firstly, let us take the vectors x1 and x2 to be the embeddings of our given input words. As we move these into the self-attention layer, three weight matrices come into play, namely:

  • Query weights (Wq)
  • Key weights (Wk)
  • Value weights (Wv)

Now, the vectors resulting from these weights are supposed to be 64-dimensional while the input is 512-dimensional, so the weight matrices are shaped accordingly (512 × 64). The weights are initialised randomly and are updated via backpropagation during training. The self-attention layer then multiplies the input vectors with each of these three weight matrices, to produce (see the sketch just after the list below):

  • Query vector (q1 and q2)
  • Key vector (k1 and k2)
  • Value vector (v1 and v2)
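A minimal sketch of these projections, with random placeholder weights standing in for the learned matrices:

```python
import numpy as np

d_model, d_k = 512, 64          # 64 = 512 / 8 attention heads, as in the paper

# Two toy input embeddings x1 and x2, one row per word.
X = np.random.randn(2, d_model)

# The three weight matrices, initialised randomly and learned via backpropagation.
Wq = np.random.randn(d_model, d_k) * 0.01
Wk = np.random.randn(d_model, d_k) * 0.01
Wv = np.random.randn(d_model, d_k) * 0.01

Q = X @ Wq   # query vectors q1, q2 -> shape (2, 64)
K = X @ Wk   # key vectors   k1, k2 -> shape (2, 64)
V = X @ Wv   # value vectors v1, v2 -> shape (2, 64)
```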

Now, once we have these vectors, we calculate a score for each word in the input using a given formula: the score is the dot product of the query vector with the key vector of the word being scored. Taking the first query vector, its scores will be as follows:

Following this, we divide the score by 8 (the square root of the dimension of the key vectors, 64), which the research found leads to more stable gradients.

To normalise the above scores, we perform a softmax operation on them. The softmax score also tells us how much of each word will be expressed at this position; often the current word itself dominates, but it is sometimes useful to attend to another word that is relevant to the current one.

We can visualise the formula as shown below:

After this step, we multiply the softmax score with the value vector we obtained earlier. The main aim of this is to keep the values of the words we actually want to focus on while drowning out irrelevant words (their values are multiplied by scores tending to zero, nullifying their effect).

Finally, we sum up the weighted value vectors to obtain the output of the self-attention layer at this word. This allows the output at that particular position to take into account the context of all the other words.
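Putting the score, scaling, softmax and weighted-sum steps together for a single word, here is a rough sketch with random placeholder vectors:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

d_k = 64
# Toy query for word 1 and keys/values for both words (random placeholders).
q1 = np.random.randn(d_k)
K = np.random.randn(2, d_k)     # k1, k2
V = np.random.randn(2, d_k)     # v1, v2

scores = K @ q1                 # dot product of q1 with every key
scores = scores / np.sqrt(d_k)  # divide by 8 = sqrt(64) for more stable gradients
weights = softmax(scores)       # how much of each word is expressed at this position
z1 = weights @ V                # weighted sum of the value vectors
print(z1.shape)                 # (64,)
```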

In the actual implementation, the embeddings of all the words are packed into one matrix and processed at the same time, effectively sending all the data through in parallel.

Along with that, this allows for matrix processing of the data, which results in faster computation. The formulation is as follows:
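The formulation referred to here is the scaled dot-product attention from the paper, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. A minimal matrix-form sketch (a rough implementation, not the paper's own code):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) matrices holding one row per word."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # (seq_len, d_k)
```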

Now, one slight issue here is that we are using a single set of weights for all the words across the sentence. Basically, what we are performing right now is “single-head” attention. As we saw in the image above, the lines at the position of “it” were darker for the words “The” and “animal”, as they should be. However, the word “tired” is also related to “it” in the given context, and we want some representation of that relationship too, which is hard to capture accurately with a single set of weights shared across all words.

Multi-Head Attention

This aims to solve the problem of having the same set of query, key and value weights for the entire input stream. As the name suggests, 8 different attention heads are used, and each input word is passed through all of them, so as to capture the representation and context of each word more accurately. While a single head may focus on one certain type of dependency, other heads may focus on different linguistic and contextual dependencies, learning to attend to various relationships simultaneously and potentially leading to richer representations.

The research referenced experimented with different configurations and found that 8 heads offered a good balance between performance and computational cost for their tasks, hence the choice. Again, this can be interpreted as a hyperparameter.

We can visualise it somewhat as shown in the image below:

With this, we obtain 8 output matrices Z0, Z1, Z2 … Z7. Now, our objective is to combine them into one, so that a single final value can represent the output of all the attention heads for a given word.

For this, we first concatenate all 8 outputs into one single vector, and then multiply it by an additional weight matrix, whose weights are also adjusted as the model trains. This results in a single final vector Z for that position, which captures information from all the attention heads and can now be passed on to the feed-forward neural network. We can see in the image below how multiple heads are able to capture relationships:


From the above, we can see that while encoding “it”, one attention head focuses more on the word “animal”, while another focuses on the word “tired”. As a result, the final representation of the word “it” carries references to both “animal” and “tired”.
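To tie the pieces together, here is a rough sketch of multi-head attention with the paper's dimensions (8 heads of 64 dimensions each); all weights and inputs are random placeholders:

```python
import numpy as np

d_model, num_heads = 512, 8
d_k = d_model // num_heads          # 64 dimensions per head, as in the paper

X = np.random.randn(4, d_model)     # toy embeddings for a 4-word sentence

# One independent, randomly initialised set of projections per head.
Wq = np.random.randn(num_heads, d_model, d_k) * 0.01
Wk = np.random.randn(num_heads, d_model, d_k) * 0.01
Wv = np.random.randn(num_heads, d_model, d_k) * 0.01
Wo = np.random.randn(num_heads * d_k, d_model) * 0.01   # the extra output projection

heads = []
for h in range(num_heads):
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
    scores = Q @ K.T / np.sqrt(d_k)                       # scaled dot-product scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                 # row-wise softmax
    heads.append(w @ V)                                   # Z0 .. Z7, each (4, 64)

# Concatenate the 8 head outputs and project back to the model dimension.
Z = np.concatenate(heads, axis=-1) @ Wo                   # (4, 512)
print(Z.shape)
```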

Positional Encoding

Now, as humans we understand that the ordering of words has an immense impact on the overall meaning of a sentence. Therefore, it would be beneficial if the machine also had some way of seeing how the words are positioned. Let us take the sentence below as an example:

I am an undergraduate student

Here, we need to communicate how close the word “I” is, positionally, to the rest of the words in our sentence, so as to get better representations when the sentence is fed to the network. To do so, we add a vector to the input embedding, a kind of time signal that bakes the position of the current word into its embedded form.

The time-signal vectors essentially indicate how far a word is from the rest of the words, which can be read off from the distance between any two such vectors. We can see a visual representation of this below:

With this, we get the final input embeddings which we then send to the encoder layer as described in the previous sections.
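The paper implements this time signal with fixed sinusoids; a minimal sketch using the paper's formula looks as follows (the toy embeddings are random placeholders):

```python
import numpy as np

def positional_encoding(seq_len, d_model=512):
    """Sinusoidal positional encoding from the original paper."""
    pos = np.arange(seq_len)[:, None]           # word positions 0 .. seq_len-1
    i = np.arange(0, d_model, 2)[None, :]       # even embedding indices
    angles = pos / np.power(10000, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)                # cosine on odd dimensions
    return pe

embeddings = np.random.randn(4, 512)            # toy word embeddings
x = embeddings + positional_encoding(4)         # position is baked into each vector
```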

Residual

One very interesting thing which plays a role before we complete the encoder process is the existence of residual connections and a normalisation step. To understand what exactly a ResNet (Residual Neural Network) is, we can view the visualisation below:

In the above network, if the network learns that it would be better to skip hidden layer 2 and pass the output directly on to hidden layer 3, the ResNet architecture lets it do so via a skip connection.

Similarly, if during training our encoder finds that the self-attention layer or the feed-forward neural network does not add value as intended, the residual connection lets the input vectors flow past that sub-layer largely unchanged. The sub-layer output is then added to the input and the sum is normalised before being passed on to the next step. We can see the image below for better understanding:
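A minimal sketch of this “Add & Norm” step, with placeholder inputs (the layer normalisation here is a bare-bones version without the learned scale and shift parameters):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalise each position's vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

x = np.random.randn(4, 512)              # input to the sub-layer
sublayer_out = np.random.randn(4, 512)   # e.g. the self-attention output (placeholder)

# Residual connection: the input skips around the sub-layer and is added back,
# then the sum is normalised before moving on.
out = layer_norm(x + sublayer_out)
```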

Decoder

Now, let us try to understand the decoder and how it integrates the output received from the encoder. The decoder is similar in architecture to the encoder; the only difference is that between the self-attention layer and the feed-forward neural network, we introduce an encoder-decoder attention layer.

This can be thought of as just another attention layer, but the difference is that it receives input both from the previous decoder layer and from whatever output the encoder has yielded. To understand this better, let us look at the image below, which shows the entire flow of the encoder and decoder.

With this, we can see how the encoder and decoder work together and how their architecture looks. In the actual model, as we have seen above, six encoders are stacked to form the encoding component, which passes its output to the decoding component with its six decoders.

The decoder produces the output words one by one, and as each word is generated it is fed back into the decoder to assist in predicting the next word, until the end of the stream is reached.
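A rough sketch of this loop is shown below; `decoder_step` is a hypothetical stand-in for the full decoder stack, not an actual function from any library:

```python
def greedy_decode(encoder_output, decoder_step,
                  start_token="<s>", end_token="</s>", max_len=50):
    # `decoder_step` is a hypothetical function: given the encoder output and the
    # words generated so far, it returns the most likely next word.
    generated = [start_token]
    for _ in range(max_len):
        next_word = decoder_step(encoder_output, generated)  # predict one word
        if next_word == end_token:                           # end of stream reached
            break
        generated.append(next_word)                          # fed back in on the next step
    return generated[1:]
```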

Final Linear and Softmax Layer

From the previous layer, we receive an output R’, which is a vector of floats. Converting this into words is the job of the final linear layer and, on top of it, the softmax layer.

The linear layer is simply a fully connected neural network that projects the vector into a much larger one, called the logits vector. Its purpose is to expand the output to the size of the dictionary the model has learnt from its training data set: the logits vector is exactly that many dimensions wide, and each cell corresponds to the score of one word in the dictionary.

The softmax layer then converts these scores into probabilities, each giving the probability that the corresponding dictionary word is the one intended by the decoder’s output, and the word with the highest probability is chosen as the final output.
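A minimal sketch of these final two steps, with a placeholder vocabulary size and random weights:

```python
import numpy as np

d_model, vocab_size = 512, 10000          # vocabulary size is just a placeholder
vocab = [f"word_{i}" for i in range(vocab_size)]

r = np.random.randn(d_model)              # decoder output for the current position
W, b = np.random.randn(d_model, vocab_size) * 0.01, np.zeros(vocab_size)

logits = r @ W + b                        # one score per word in the dictionary
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()               # softmax: scores -> probabilities
next_word = vocab[int(np.argmax(probs))]  # highest-probability word is the output
```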

We can now finally see a visual representation of the entire working of the model, as given in the paper:

Conclusion

While it is true that Transformers have revolutionised NLP tasks, there is always room for improvement. A few points to consider in this direction are as follows:

Computational Cost

  • Although transformers offer parallel processing, they can still be computationally expensive, especially for very long sequences. This limits their application on resource-constrained devices.

Interpretability

  • Understanding how Transformers arrive at their decisions can be challenging. This makes it difficult to debug errors or identify biases.

A few future directions and methodologies to tackle them could be as follows:

Efficient Transformer Architectures

  • Researchers are developing more lightweight Transformer architectures that achieve similar performance with lower computational costs. This could involve techniques like pruning or quantisation.

Explainable AI (XAI) Techniques

  • Researchers are exploring ways to make Transformers more interpretable. This could involve developing methods to visualise attention weights or understand how different parts of the model contribute to the final output.

Overall, the Transformer can be seen as a powerful and versatile architecture, and the base on which today’s wide range of generative AI and NLP systems stands.

Credits

One key contribution to this article is the blog written by Jay Alammar, which has helped me tremendously and which, along with his informative and enriching videos, has contributed to the field of machine learning and AI overall. You can check out his YouTube channel here.

I would also like to take the opportunity to thank Krish Naik for his deep learning series on YouTube, which has allowed me to learn and present the above article. You can check out his YouTube channel here. Thanks for reading!
