A detailed, simplified explanation of the Transformer architecture.

Abdullah Afify
7 min read · Oct 7, 2023


What follows is a detailed yet simplified explanation of the Transformer architecture, which has revolutionized the fields of Natural Language Processing (NLP) and Computer Vision (CV) since its introduction in the 2017 paper “Attention Is All You Need”.

The Transformer is a deep learning model built upon a concept known as the self-attention mechanism, which weighs each part of the input data differently depending on how relevant it is to the rest. Originally designed to process sequential input data, such as natural language, it replaces traditional techniques like Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, outperforming them by processing the entire input simultaneously rather than step by step.

The key innovation in the Transformer architecture is the attention mechanism, which enables the model to focus on the most relevant parts of the input for each output. In NLP, this means the Transformer doesn’t need to process text word by word; it can handle all the words in parallel. This parallel processing significantly reduces training time and enables the creation of powerful pre-trained systems like BERT and GPT.

In essence, the Transformer’s ability to perform self-attention and process data in parallel has led to its widespread adoption in AI applications like ChatGPT and Bard AI. It has demonstrated remarkable capabilities in understanding and generating human-like text, and its usefulness now extends beyond NLP into areas like computer vision.

Now, let’s dive into an in-depth explanation of the Transformer architecture itself.

The Transformer architecture is divided into two main sections: the Encoder and the Decoder, and it doesn’t rely on recurrence or convolutions to produce output.

1. Encoder Layer:

The Encoder is the first part of the Transformer architecture, and it consists of four sub-layers as follows:

  • Input Embedding Layer: This layer is responsible for converting sequences of tokens into a sequence of vectors. These vectors represent the semantic meaning of the tokens, allowing the machine to understand and work with them.
  • In this layer, each token is mapped to a high-dimensional vector space, where each dimension can loosely be thought of as capturing some feature of the token. For example, one dimension might relate to the part of speech, while another relates to an aspect of the semantic meaning, and so on (in practice these features are learned rather than hand-assigned). Pre-trained embedding spaces like “GloVe” are often used to save time and effort.
  • The implementation of this layer amounts to a simple linear transformation: each token in the input sequence selects (equivalently, its one-hot vector is multiplied by) a learned weight matrix to produce the corresponding embedding vector. These weights are learned during training by optimizing a suitable objective function, such as cross-entropy loss.
  • In summary, the process of this layer can be described as follows:

Token -> [embedding] -> meaning (vector)

This Input Embedding Layer is the first step in processing input data within the Transformer architecture, and it plays a crucial role in understanding the semantics of the input tokens.
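
As a rough illustration of this lookup, here is a minimal PyTorch sketch. The vocabulary size, embedding dimension, and token ids are placeholder values chosen for the example (512 happens to be the dimension used in the original paper):

import torch
import torch.nn as nn

vocab_size = 10000   # illustrative vocabulary size
d_model = 512        # embedding dimension (512 in the original paper)

embedding = nn.Embedding(vocab_size, d_model)   # the learned weight matrix

# A toy batch containing one sequence of four token ids
token_ids = torch.tensor([[5, 27, 391, 9]])

meaning_vectors = embedding(token_ids)          # shape: (1, 4, 512)

Each row of meaning_vectors is the “meaning (vector)” from the diagram above, and the embedding weights are updated along with the rest of the model during training.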

2. Positional Encoding Layer:

Unlike RNNs and CNNs, the Transformer architecture has no built-in notion of sequence order, because it relies entirely on the attention mechanism, which we will explain shortly. Since the same token can carry a different meaning depending on where it appears in a sentence, the positional encoding layer is crucial.

The task of this layer is to transform the meaning vector into a context vector by incorporating information about the distances between tokens in the sequence, known as positional information.

The implementation of this layer involves adding a set of fixed sinusoidal functions with varying frequencies and phases to the input embedding vector. These sinusoidal functions provide a unique representation for each position in the input sequence and are added to the corresponding embedding vector.

This addition ensures that the attention mechanism can differentiate between tokens based on their relative positions in the input sequence.

In summary, the process of this layer can be described as follows:

Meaning (Vector) -> [positional encoding] -> context (vector)

The Positional Encoding Layer is essential for enabling the Transformer to consider the relative positions of tokens when processing input sequences, allowing it to handle various meanings for the same token in different contexts.
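
To make the sinusoidal idea concrete, here is a short sketch following the fixed sine/cosine formulation from the original paper. The sequence length and dimensions are illustrative, and meaning_vectors stands in for the output of the embedding layer:

import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    positions = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)               # even dimensions
    freqs = torch.exp(-math.log(10000.0) * dims / d_model)                # decreasing frequencies

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(positions * freqs)   # cosine on odd dimensions
    return pe

meaning_vectors = torch.zeros(1, 4, 512)              # stand-in for the embedding output
pe = sinusoidal_positional_encoding(max_len=4, d_model=512)
context_vectors = meaning_vectors + pe.unsqueeze(0)   # add position information

Because the functions are fixed rather than learned, the same position always receives the same encoding, and nearby positions receive similar ones.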

3. Multi-Head Attention Layer:

This layer forms the foundation of the entire architecture and enables the model to selectively focus on different parts of the input sequence, allowing it to learn complex dependencies and relationships between them.

It performs self-attention, meaning that tokens examine each other within the input sequence. Its main goal is to generate a context-aware representation for each token in the input.

This layer is composed of multiple attention heads, each independently calculating a weighted sum of the input embeddings based on their relationship with a specific query vector.

We can break down the computation into three steps:

Step 1: Query, Key, and Value Computation (QKV): In this step, the input embeddings are transformed into query, key, and value vectors using learned weight matrices. These vectors are then used to calculate the attention scores between each query vector and all key vectors.

Step 2: Attention Weights Computation: Here, the attention scores for each query vector and all key vectors are normalized using the softmax function. This produces a set of attention weights for each query vector, determining the relative importance of each value vector in computing the final output.

Step 3: Weighted Sum Computation: In this step, the value vectors are multiplied by their corresponding attention weights and then summed to produce the weighted representation for each query vector. The weighted representations from all heads are then concatenated and projected back to the original dimensionality of the input embeddings using another learned weight matrix.

In summary, this layer is responsible for focusing on whichever parts of the input need attention, allowing the model to understand multiple aspects of the input sequence simultaneously. The result is a set of attention vectors for each token (one per head), which are then combined, through the concatenation and projection described above, into a single attention vector per token, since the subsequent layers accept only one vector per token.

In summary, the process of this layer can be described as follows:

Context vector -> [attention] -> attention vector per token

The Multi-Head Attention Layer is a critical component of the Transformer architecture, enabling it to capture complex relationships and dependencies within the input data.
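
As a rough sketch of the three steps above, here is a single attention head in PyTorch; a real multi-head layer runs several of these in parallel and combines their outputs with one more learned projection. All sizes and the random input are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_head = 512, 64           # illustrative dimensions
W_q = nn.Linear(d_model, d_head)    # learned projection for queries
W_k = nn.Linear(d_model, d_head)    # learned projection for keys
W_v = nn.Linear(d_model, d_head)    # learned projection for values

x = torch.randn(1, 4, d_model)      # stand-in for 4 context vectors

# Step 1: compute query, key, and value vectors
Q, K, V = W_q(x), W_k(x), W_v(x)

# Step 2: attention scores between every query and every key,
# scaled and normalized with softmax into attention weights
scores = Q @ K.transpose(-2, -1) / d_head ** 0.5   # (1, 4, 4)
weights = F.softmax(scores, dim=-1)

# Step 3: weighted sum of the value vectors
head_output = weights @ V                          # (1, 4, 64)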

4. Feed-Forward Layer:

This layer is a simple feed-forward network (in the original paper, two linear transformations with a ReLU in between) applied to each attention vector individually. Its role is to further transform each vector into a representation that is easy to work with, whether for the next encoder block or, from the final encoder block, for the decoder.

In summary, the process of this layer can be described as follows:

Attention vector (parallel) -> [Feed-forward] -> set of encoded vectors for every word

The Feed-Forward Layer serves as an intermediary step in the Transformer architecture, preparing the output for further processing in subsequent layers or blocks.
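
A minimal sketch of such a position-wise feed-forward network, with the 512/2048 sizes from the original paper used purely as examples:

import torch
import torch.nn as nn

d_model, d_ff = 512, 2048

# The same two-layer network is applied to every position independently
feed_forward = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

attention_vectors = torch.randn(1, 4, d_model)      # stand-in for the attention output
encoded_vectors = feed_forward(attention_vectors)   # (1, 4, 512)

Because each position is processed independently, all positions can be pushed through this layer in parallel.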

………

Let’s continue with the second section: the Decoder Layer. It consists of seven sub-layers, which we will discuss as follows:

1. Output Embedding Layer:

This layer works like the input embedding layer and serves the same purpose, except that it is applied to the target (output) tokens.

2. Positional Encoding Layer:

We have already explained this in detail in the previous section.

3. Masked Multi-Head Attention Layer:

Considered the first attention layer, it resembles the multi-head attention layer, but it masks the tokens that appear later in the sequence: their attention scores are set to negative infinity before the softmax, so they end up with zero attention weight and cannot be used in the mapping process. This allows the model to produce one output token at a time, considering only the previously generated tokens.
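
A small sketch of that masking idea, with an illustrative sequence length: positions that would look at a future token get a score of negative infinity before the softmax, so their attention weight becomes zero:

import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)   # stand-in for raw attention scores

# Boolean mask that is True wherever a position would attend to a future token
future = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

masked_scores = scores.masked_fill(future, float("-inf"))
weights = F.softmax(masked_scores, dim=-1)   # future positions get weight 0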

4. Multi-Head Attention Layer:

Here, we can call it the second attention layer; we explained the mechanism itself in the Encoder section. In this case, however, it is used for encoder-decoder attention: each target token attends to the source tokens coming from the encoder (the queries come from the decoder, while the keys and values come from the encoder output), rather than the tokens attending to one another within the same sequence.

5. Feed-Forward Layer:

We discussed this earlier in detail as well.

6. Linear Layer:

This layer is essentially another feed-forward layer, and its role is to expand the dimensions to match the size of the output vocabulary, producing one score per possible token.

7. Softmax Layer:

Its role is to convert those scores into a probability distribution, making the output human-interpretable. The token with the highest probability becomes the next predicted word in the sentence.
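
A sketch of these last two steps together, with placeholder sizes: the linear layer turns each decoder vector into one score per vocabulary token, and the softmax turns those scores into probabilities from which the most likely next token is read off:

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 10000            # illustrative sizes
to_vocab = nn.Linear(d_model, vocab_size)   # the final linear layer

decoder_output = torch.randn(1, d_model)    # stand-in for the last decoder vector
logits = to_vocab(decoder_output)           # one score per vocabulary token
probs = F.softmax(logits, dim=-1)           # probability distribution over the vocabulary

next_token_id = probs.argmax(dim=-1)        # the highest-probability token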

In summary, the Decoder Layer is composed of these seven sub-layers, each with its specific function, ultimately working together to generate the next predicted word in the sentence with the correct context and sequence information.

A note: in both the Encoder and the Decoder, each of these layers is followed by a normalization step (together with a residual connection) whose role is to ease optimization, especially when using larger learning rates. The original Transformer uses layer normalization rather than batch normalization.

Whereas batch normalization normalizes each feature across the samples in a batch, layer normalization normalizes across all the features of each individual sample. This is more advantageous in terms of stabilization, particularly for variable-length sequences.
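
A minimal sketch of this residual-plus-layer-normalization pattern; sub_layer here is just a stand-in for any of the attention or feed-forward layers described above:

import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)        # normalizes across the feature dimension
sub_layer = nn.Linear(d_model, d_model)   # stand-in for an attention or feed-forward layer

x = torch.randn(1, 4, d_model)
out = layer_norm(x + sub_layer(x))        # residual connection, then layer normalization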

With this, we have completed the detailed description while keeping the heavy mathematics to a minimum to avoid complexity.

By the way, to create BERT, a stack of encoders was used, which is why it is named “Bidirectional Encoder Representations from Transformers” (BERT). To create GPT, a stack of decoders was used, and it is called “Generative Pre-trained Transformer” (GPT).
