
A Step By Step Guide To Transformers

Shashank Vats
AI monks.io
Nov 22, 2023


In 2017, Google released the paper "Attention Is All You Need" (Vaswani et al.), which ushered in a new era in machine learning and natural language processing. This groundbreaking paper introduced the Transformer model, which revolutionized the way machines understand and process human language. The Transformer, with its novel attention mechanism, handled sequential data far more efficiently and effectively than previous models such as RNNs and LSTMs. This advancement paved the way for more sophisticated and powerful language models, such as BERT and GPT-4, and has had a profound impact on applications ranging from language translation to content generation, fundamentally changing the landscape of artificial intelligence research. So, understanding the Transformer architecture is crucial if you want to know where machine learning is heading.

Transformer Architecture

The transformer architecture is shown in the figure above. It might look daunting at first, but after delving into its components one can appreciate its elegant design and efficiency. At its core, the Transformer is composed of an encoder and a decoder. As we proceed with this article, we will explore the intricacies of these components and how they work together to revolutionize the field of natural language processing.

Encoder

Stage 1: The Embedding Stage

Let's suppose we have a translation problem and we want to translate the input sentence "I am Shashank" from English to French. The encoder takes advantage of parallelization and processes all the input data at once. The input words are first converted into a sequence of tokens, a set of integers that represent our input. Once we have this sequence of integers, we convert them into embeddings. These embeddings are multi-dimensional vectors that capture more than just a one-to-one representation of words; they encode the semantic and syntactic nuances of each word. For instance, the embedding for "I" would capture its role as a pronoun, while the embedding for "Shashank" would represent it as a proper noun. This level of detail is crucial for the model to understand and maintain the meaning of the sentence during translation.
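As a rough illustration of these first two steps, here is a sketch with a tiny made-up vocabulary and a randomly initialized embedding table (real models use learned subword vocabularies with tens of thousands of entries, and the table is learned during training):

```python
import numpy as np

vocab = {"<PAD>": 0, "I": 1, "am": 2, "Shashank": 3}    # toy vocabulary, purely illustrative
d_model = 8                                             # embedding size (512 in the original paper)
embedding_table = np.random.randn(len(vocab), d_model)  # learned during training in a real model

sentence = "I am Shashank"
token_ids = [vocab[w] for w in sentence.split()]        # [1, 2, 3]
embeddings = embedding_table[token_ids]                 # shape: (3, d_model)
```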

The Embedding Stage

Once the embeddings are created, a positional encoding is added to each one. This is a critical step, because the Transformer does not inherently understand the order of words in a sequence. The positional encoding adds information about the relative or absolute position of the tokens in the sequence. It can be either fixed (using a predefined mathematical formula) or learned (as parameters during the training process). A common approach for fixed positional encoding uses sine and cosine functions of different frequencies:
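PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)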

Here, pos is the position and i is the dimension. Each dimension of the positional encoding corresponds to a sinusoid of a different wavelength. This way, the model can distinguish between "I am Shashank" and "Shashank am I", which have the same words but different meanings due to their order.
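A minimal NumPy sketch of this fixed encoding, reusing the embeddings, token_ids, and d_model from the previous snippet (d_model must be even here; the result is simply added to the token embeddings):

```python
def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal positional encoding of shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model / 2)
    angle = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions use sine
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions use cosine
    return pe

embeddings = embeddings + positional_encoding(len(token_ids), d_model)
```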

Stage 2: The Encoder Block

The embeddings produced in the first stage are static. While they convey information about the words they represent, they lack context-specific knowledge. Consider the word "bank," which can have multiple meanings depending on its usage. In the sentence "I went to the bank at the bank of the river near the banking road," "bank" refers to three different concepts: a financial institution, the land alongside a river, and the banked angle of a road. The self-attention layer addresses this ambiguity by considering the context in which each instance of "bank" is used and incorporating that contextual information into the respective embeddings. Let's dive in and see how it works!

  • Generating Q, K, V

The embeddings coming from the embedding layer are multiplied by three different weight matrices to generate three different tensors: Query (Q), Key (K), and Value (V). These weights are learned during the model training phase and are the same for a given model irrespective of the input being sent to it.

  • Query (Q): This represents the current word or token that we’re focusing on in the sequence. The Query is used to score against all Keys.
  • Key (K): Keys are representations of all tokens in the input sequence. They interact with the Query to determine the level of focus or attention that each part of the input should receive.
  • Value (V): Values are also representations of all tokens in the input sequence. Once the interaction between Query and Key determines the attention scores, these scores are used to weigh the corresponding Values, which are then summed up to produce the final output of the attention layer.

To understand this with the help of an analogy, imagine you're in a room full of toy boxes, each box containing different kinds of toys. You're holding a picture of a toy that you want to find.

Here, the Query is the picture of the toy you're looking for. Each toy box has a label that tells you a little bit about what's inside (like a box labelled "cars" or "dolls"). These labels are like Keys: they tell you which boxes are worth opening to find the toy in the picture. The toys inside the boxes are the Values. Once you decide which boxes to open based on how closely their labels (Keys) match your picture (Query), you get to play with the toys (Values) inside.

Once Q, K, and V are obtained, they are used in the self-attention calculation. The attention scores are calculated by taking the dot product of the Query with all the Keys and dividing by a scaling factor (usually the square root of the dimension of the Key vectors). This scaling is done to prevent the softmax function, which comes next, from having extremely small gradients when dealing with large values; it helps maintain numerical stability in the model's calculations. The scaled scores are passed through the softmax function, which turns them into probabilities that add up to 1. The result is a distribution that tells us how much each word in the Key (and therefore in the sentence) should be attended to for each word in the Query. The final step involves taking these attention probabilities and using them to weight the Value (V) matrix. Each Value vector is multiplied by its attention score, which means that words deemed more relevant get more attention: their Value vectors are weighted more heavily. The weighted Value vectors are then summed up to produce the final output for each word in the Query. This output is a representation of each Query word, enriched by the context of the other words in the sentence.

This entire computation can be summarized by the scaled dot-product attention equation:
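\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V

where d_k is the dimension of the Key vectors. As a minimal NumPy sketch of this computation (toy dimensions and random, untrained weights, purely for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # dot product of each Query with every Key, scaled
    weights = softmax(scores, axis=-1)           # each row is a probability distribution
    return weights @ V                           # weighted sum of the Value vectors

# Toy example: 3 tokens ("I am Shashank"), d_model = 8, random projection weights
d_model, seq_len = 8, 3
X = np.random.randn(seq_len, d_model)            # token embeddings + positional encoding
W_q, W_k, W_v = (np.random.randn(d_model, d_model) for _ in range(3))
output = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)   # shape: (3, 8)
```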

  • Multi-Head Self-Attention

Now, if we were to use a single self-attention head, the network would be limited in the kinds of relationships it could capture. Therefore, we use multiple heads to capture as many relationships as we can. These heads work in parallel, which not only allows multiple relationships to be captured simultaneously but also contributes to the efficiency of the model. This parallel processing is a key factor in the high performance of Transformer models.

If you're familiar with CNNs, different heads can be thought of as different filters that capture various features or patterns from the input. In the context of self-attention, these heads allow the network to focus on different parts of the input sentence simultaneously, each head capturing different kinds of relationships or features. For example, one head of a multi-head attention mechanism might focus on syntactic aspects of a sentence, such as grammatical structure, while another might be concerned with semantic aspects, like the meaning of specific word combinations.

Each head computes attention independently, resulting in a separate output per head. These outputs are concatenated to form a single long vector that contains the insights and diverse perspectives of all the heads. This concatenated vector is passed through a linear layer, which is essentially a trainable feed-forward layer with its own set of weights and biases. The purpose of this linear transformation is to bring the concatenated vector back to a size that matches the model's dimensionality. This step also allows the model to learn how to best combine and utilise the information from all the heads.
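To make the mechanics concrete, here is a rough sketch of multi-head attention that reuses scaled_dot_product_attention from the previous snippet; splitting the projections into equal slices per head is one common implementation choice, and all dimensions are illustrative:

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); W_q, W_k, W_v, W_o: (d_model, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    head_outputs = []
    for h in range(num_heads):                       # each head attends over its own slice
        sl = slice(h * d_head, (h + 1) * d_head)
        head_outputs.append(scaled_dot_product_attention(Q[:, sl], K[:, sl], V[:, sl]))
    concat = np.concatenate(head_outputs, axis=-1)   # concatenate all heads: (seq_len, d_model)
    return concat @ W_o                              # final linear projection back to d_model
```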

  • Add & Norm

After the concatenation and linear transformation steps in the multi-head attention mechanism of the Transformer, the next step is the "Add & Norm" layer, which consists of a residual connection and layer normalization.

The output from the multi-head attention's linear transformation is not fed directly into the next layer. Instead, it is first added to the original input of the multi-head attention block through a residual connection. This helps mitigate the vanishing gradient problem in deep networks and allows for more effective training of deeper models by providing an alternative pathway for the gradient during backpropagation.

After the addition, the combined vector is normalized across its features. The purpose of this normalization is to stabilize the learning process and to help the model train faster and more effectively. It ensures that the distribution of the activations remains consistent across different layers, which is particularly important in deep networks like Transformers.
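In code the whole sub-layer boils down to the pattern layer_norm(x + sublayer(x)); a simplified sketch that omits layer normalization's learnable gain and bias:

```python
def layer_norm(x, eps=1e-6):
    """Normalize each token's vector across its feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

# Residual connection followed by layer normalization:
# out = layer_norm(X + multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2))
```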

  • Feed Forward Network

The typical structure of this feed-forward network consists of two linear (fully connected) layers with a non-linear activation function in between. The first linear layer expands the dimensionality of the input, often to a higher dimension than the original embedding size. This expansion allows the network to capture more complex features. The activation function, usually ReLU (Rectified Linear Unit) or a similar non-linear function, introduces non-linearity into the model. This non-linearity is crucial because it allows the network to learn and model more complex patterns in the data. The second linear layer then projects this output back to the original embedding dimension, making the output of the feed-forward network compatible with the input dimension of the layer and suitable for further processing in subsequent layers.
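A minimal sketch of this position-wise feed-forward network (the weight names and d_ff are illustrative; the original paper used d_model = 512 with an inner dimension d_ff = 2048):

```python
def feed_forward(x, W1, b1, W2, b2):
    """x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model)."""
    hidden = np.maximum(0, x @ W1 + b1)   # expand to d_ff and apply ReLU
    return hidden @ W2 + b2               # project back down to d_model
```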

The feed-forward network adds another layer of abstraction and complexity to the data. While the self-attention mechanism helps the model understand the relationships between different words in a sentence, the feed-forward network helps in learning and modeling more complex features that are not explicitly related to word-to-word interactions. This component contributes to the overall ability of the Transformer model to handle a wide range of tasks in natural language processing, as it processes each word (or sub-word) embedding with a deeper and more complex set of transformations.

The Encoder Block

The output of the feed-forward network is passed through another layer consisting of a residual connection and layer normalization. This is the output of the encoder block. The output of the first encoder block becomes the input to the second encoder block, and the process repeats. Each block builds upon the previous one, progressively refining and re-contextualizing the representation of the input sequence. As the sequence flows through multiple encoder blocks, it gains more complex and abstract features, with higher blocks capturing more sophisticated aspects of the data.

Decoder

Stage 1: The Embedding Stage

The input passed to the decoder varies depending on whether the model is in the training phase or the inference phase.

Training Phase: During training, the decoder receives the target sequence in a slightly modified form. For instance, if we’re training a model for English to French translation and the input is “I am Shashank”, the target sequence might be something like “<START> Je suis Shashank <END>”. The decoder is typically provided with the entire target sequence up to the current token but excluding the token being predicted. This is often referred to as “teacher forcing.” It means the input to the decoder at each step would be as follows:

  • When predicting “Je”, the input is “<START>”
  • When predicting “suis”, the input is “<START> Je”
  • When predicting “Shashank”, the input is “<START> Je suis”

and so on. This approach enables parallel processing of the sequence, as the decoder can handle the entire sequence at once, with each step predicting the next token based on the previous tokens.
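In practice, this "predict the next token from everything before it" setup is often implemented as a simple shift of the target sequence; a schematic sketch, not tied to any particular library:

```python
target = ["<START>", "Je", "suis", "Shashank", "<END>"]

decoder_input = target[:-1]   # ["<START>", "Je", "suis", "Shashank"]
labels        = target[1:]    # ["Je", "suis", "Shashank", "<END>"]

# Position t of the decoder output is trained to predict labels[t], while the
# causal mask (see the next stage) stops it from looking at decoder_input[t + 1:].
```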

Inference Phase: In inference mode, such as when translating new sentences, the process is sequential and iterative. The decoder generates one token at a time and uses its output as part of the input for the next step. This is a step-by-step process:

  • Initially, the decoder starts with the “<START>” token and generates the first word, “Je”.
  • This first word is then fed back into the decoder along with the “<START>” token to generate the next word, “suis”.
  • This process continues, with each newly generated word being added to the sequence of inputs for the decoder, until the “<END>” token is produced or a maximum length is reached.

In contrast to the training phase, this process is inherently sequential and cannot be parallelized, as each step’s output is needed for the subsequent step.

The remaining process of converting the decoder's input tokens into embeddings and adding positional encodings is the same as in the encoder.

Stage 2: The Decoder Block

Masked Multi-Head Self Attention

The masked multi-head self-attention stage makes sure that the prediction for a particular token depends only on the known preceding tokens, adhering to the autoregressive property. This masking ensures that, for example, when the model is predicting the third token in a sequence, it cannot 'see' the fourth token or any tokens that come after it. Let's understand how this works:

  • Training — During training, even though the model has access to the entire target sequence (up to the current token being predicted), we still need to ensure it does not use future information, i.e. tokens that come after the one currently being predicted. This is where masking comes into play, together with the teacher forcing described earlier. In teacher forcing, the entire target sequence (up to the current point) is fed into the decoder. For instance, if the target sequence is "Je suis Shashank" and we are currently training the model to predict "suis", the input to the decoder would be "<START> Je". However, since we process the sequence in parallel during training (for efficiency), the model technically has access to the entire sequence. Without masking, the self-attention mechanism would allow "Je" to attend to "suis" and "Shashank", which is undesirable because it would be using future information. The mask blocks attention to these future words by assigning a very low value (like negative infinity) to every position a token should not attend to. When the softmax is applied, these positions effectively get an attention score of zero, making them non-contributing to the final output (see the sketch after this list).
  • Inference — In the inference phase, the decoder generates the output sequence one token at a time, and each newly generated token is used along with the previous tokens to generate the next one. Even though it is naturally sequential and does not have future tokens per se, the “Masked Multihead Self-Attention” mechanism is still used. This maintains consistency with the training phase and adheres to the model’s architecture. The masking, in this case, continues to enforce the autoregressive property by ensuring that at each step, the attention mechanism only focuses on the sequence generated so far, without any ‘look-ahead’ capability. The use of masking during inference is more about maintaining the architectural integrity and the learned behavior of the model rather than a practical necessity to hide future information.
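A minimal sketch of how such a look-ahead mask can be built and applied, reusing the softmax helper from the encoder section (the additive negative-infinity trick is what drives the masked positions to zero probability):

```python
seq_len = 4
# Causal mask: position i may only attend to positions 0..i.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)   # strict upper triangle = -inf

def masked_attention(Q, K, V, mask):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k) + mask    # -inf scores become probability 0 after softmax
    return softmax(scores, axis=-1) @ V
```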

After the masked multi-head attention comes another Add & Norm step for the residual connection and layer normalization. Its functionality is the same as in the encoder.

Multi-Head Attention

This layer is often referred to as the "Encoder-Decoder Attention". It plays an important role in integrating information from the encoder into the decoder (as visible in the figure above). It allows each position in the decoder to attend to all positions in the encoder's output. In this layer, the Query (Q) vectors come from the previous layer's output (the output of the masked multi-head self-attention layer), while the Key (K) and Value (V) vectors come from the encoder's output. This configuration enables the decoder to focus on different parts of the encoder output, which is the representation of the input sequence.

In the context of machine translation, this mechanism allows each word in the translation (decoder output) to consider the entire input sentence (encoder output) and figure out which words (or parts) of the input sentence are most relevant. For example, when translating “I am Shashank” to “Je suis Shashank”, as the decoder generates each French word, this layer helps it to focus on the corresponding relevant parts of the English input.

The attention mechanism here is similar to that in the encoder and the masked self-attention layer of the decoder: computing scaled dot-product attention, combining the results of multiple ‘heads’ (multi-head attention), and then passing the combined output through a linear layer.
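A single-head sketch of this encoder-decoder attention, reusing scaled_dot_product_attention from earlier (W_q, W_k, W_v here stand for the cross-attention layer's own learned projections):

```python
def cross_attention(decoder_state, encoder_output, W_q, W_k, W_v):
    """decoder_state: (tgt_len, d_model); encoder_output: (src_len, d_model)."""
    Q = decoder_state @ W_q          # Queries come from the decoder
    K = encoder_output @ W_k         # Keys come from the encoder output
    V = encoder_output @ W_v         # Values come from the encoder output
    return scaled_dot_product_attention(Q, K, V)   # shape: (tgt_len, d_model)
```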

Because this layer allows interaction between the encoder's output and the decoder, it is also called cross-attention.

Feed Forward Network

The feed-forward network (FFN) in the Transformer’s decoder after the cross-attention (or encoder-decoder attention) layer is essentially the same as the one in the encoder. Each layer of both the encoder and decoder contains its own feed-forward network, and they function similarly in terms of architecture and purpose.

Decoder Blocks

If the Transformer model has multiple decoder layers, the output of this layer is fed into the next decoder layer as its input. This process repeats identically for each decoder layer: masked multi-head self-attention, followed by encoder-decoder attention, then the feed-forward network, each with its residual connection and normalization. Here, I would also like to point out that the number of encoder and decoder blocks is the same.

Linear Transformation

The output from the final decoder layer consists of a sequence of vectors. Each vector corresponds to a token position in the target sequence and encapsulates a rich, contextually informed representation. These vectors are then passed through a linear transformation (a fully connected layer), which projects each vector into a space whose dimensionality equals the size of the model's vocabulary. The purpose of this projection is to prepare the vectors for the subsequent probability distribution over all possible tokens.

Softmax

Following the linear transformation, a softmax function is applied to each vector. The softmax function converts the raw scores (logits) from the linear layer into a probability distribution. The output from the softmax layer for each position in the sequence is a vector where each element represents the probability of a specific token from the vocabulary being the correct next token in the output sequence.

Token Selection

For each position in the sequence, the token with the highest probability is selected as the output for that position. This step is what actually generates the sequence. During training, this generation process can leverage the entire sequence at once. But during inference, it’s typically done one token at a time, in an autoregressive manner.
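Putting the last three steps together, a schematic sketch that reuses d_model and the softmax helper from earlier (W_vocab and b_vocab are illustrative names for the output projection's parameters):

```python
vocab_size, tgt_len = 6, 4
decoder_output = np.random.randn(tgt_len, d_model)   # output of the final decoder block
W_vocab = np.random.randn(d_model, vocab_size)       # illustrative output projection weights
b_vocab = np.zeros(vocab_size)

logits = decoder_output @ W_vocab + b_vocab   # linear projection to vocabulary size
probs = softmax(logits, axis=-1)              # probability distribution over the vocabulary
next_tokens = probs.argmax(axis=-1)           # greedy selection: highest-probability token per position
```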

Sequence Generation During Inference

During inference (e.g., translating a new sentence), the model starts by inputting a start token (”<START>”) and generates one token at a time. Each newly generated token is then added to the sequence and fed back into the decoder for generating the next token. This process continues until an end token (“<END>”) is generated or a maximum sequence length is reached.
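The loop itself can be sketched as follows; decoder_step is a hypothetical placeholder for running the whole decoder stack (masked self-attention, cross-attention, feed-forward, projection, argmax) on the tokens generated so far:

```python
def decoder_step(encoder_output, generated_ids):
    """Hypothetical placeholder: run the full decoder stack and return the next token id."""
    raise NotImplementedError

def greedy_decode(encoder_output, start_id, end_id, max_len=50):
    generated = [start_id]
    for _ in range(max_len):
        next_id = decoder_step(encoder_output, generated)   # predict one token at a time
        generated.append(next_id)
        if next_id == end_id:                               # stop once <END> is produced
            break
    return generated
```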

Conclusion

In summary, the Transformer architecture marks a paradigm shift in machine learning, particularly in the realm of natural language processing. Its unique architecture, characterized by self-attention mechanisms and the absence of recurrent layers, allows for unparalleled parallelization and scalability. This has not only led to significant improvements in efficiency but also opened new avenues for tackling complex linguistic tasks that were previously intractable.

The implications of the Transformer extend beyond mere technical advancements. By enabling more nuanced and context-aware language models like BERT and GPT-4, it has bridged the gap between human and machine understanding of language, ushering in a new era of AI that is more interactive and adaptable.
