Understanding Transformer Architecture Using Simple Math

10 min read · Mar 8, 2024

Traditional models like RNNs and convolutional networks struggle to capture long-distance relationships in sequences. The Transformer model was introduced to address this challenge. By employing self-attention, it lets the model examine all parts of a sequence simultaneously, enhancing its ability to capture both short- and long-term connections. This represents a significant advancement in sequence processing.

In this article, we’ll walk through a practical example showcasing the transformative power of Transformers in natural language processing. We’ll start with a basic English sentence, see how a Transformer model turns it into Hindi, and follow the math that makes the translation work.

Example text: “Cat is sleeping on the mat”.

1. INPUT EMBEDDING:

In simple terms, Transformers, which are a type of machine learning model, don’t naturally understand human language like we do. So, to help them understand, we take each word in a sentence and turn it into a series of numbers that represent its meaning. These numeric representations, called vectors, capture what each word signifies. Then, we use these vectors as input to the part of the system that processes and understands the language, known as the Encoder.

The original paper uses big 512-dimensional vectors for each word, but we’ll keep it simple with smaller 5-dimensional vectors so it’s easier to understand how the math works visually.
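To make this concrete, here is a minimal sketch of the embedding lookup, assuming a toy vocabulary and randomly initialized 5-dimensional vectors (in a real model these embeddings are learned during training):

```python
import numpy as np

d_model = 5
sentence = "cat is sleeping on the mat".split()

# Toy embedding table: one 5-dimensional vector per word.
# Real models learn these during training; here they are random for illustration.
rng = np.random.default_rng(0)
vocab = {word: i for i, word in enumerate(sorted(set(sentence)))}
embedding_table = rng.normal(size=(len(vocab), d_model))

# Look up the embedding of each word in the sentence -> shape (6, 5)
word_embeddings = np.stack([embedding_table[vocab[w]] for w in sentence])
print(word_embeddings.shape)  # (6, 5)
```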

2. POSITIONAL ENCODING:

Positional encoding helps the Transformer model understand which word appears at which position in the sequence.

In Transformers, we use positional embeddings to help the model understand the order of words in a sentence. Unlike recurrent neural networks, which go through words one by one, Transformers process them all together, so they need help knowing the order. Positional encoding provides this information by assigning unique values or embeddings to each position in the sequence, allowing the model to accurately interpret the order of words.

There are two formulas for the positional embedding, chosen depending on whether the i-th value of the embedding vector sits at an even or an odd index:

  • PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
  • PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Here, pos is the word’s position in the sentence, i indexes the embedding dimensions, and d_model is the embedding size (5 in our example).

Since our input text is “Cat is sleeping on the mat”, let’s focus on the word “cat” and calculate its positional embedding.

Similarly, we can calculate positional embedding for all the words in our input sentence.
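To see the formulas in action, here is a small sketch that computes the positional encodings for all six positions of our sentence with a 5-dimensional embedding (the function name and values are purely illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, following the two formulas above."""
    pe = np.zeros((seq_len, d_model))
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos, i] = np.sin(angle)
            if i + 1 < d_model:          # d_model = 5 is odd, so guard the last column
                pe[pos, i + 1] = np.cos(angle)
    return pe

pos_embeddings = positional_encoding(seq_len=6, d_model=5)
print(pos_embeddings[0])  # positional embedding of "cat" at position 0
```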

3. Adding Word Embeddings and Positional Embeddings:

Now, we add the word embeddings and positional embeddings to combine both the semantic meaning of each word and its positional information. This combined representation helps the Transformer model understand not only the meaning of each word but also its position within the sequence, ensuring accurate processing and interpretation of the input text.

This combined representation serves as the input to the Encoder in the Transformer architecture.
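Continuing the sketches above, the word-embedding and positional-encoding matrices (both of shape 6×5 for our sentence) are simply added element-wise:

```python
# Both matrices have shape (6, 5), so semantic and positional
# information are combined element by element.
encoder_input = word_embeddings + pos_embeddings
print(encoder_input.shape)  # (6, 5)
```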

4. Multi-Head Attention:

Multi-head attention is a mechanism used in Transformer models to capture different aspects of relationships between words in a sequence. It enables the model to focus on different parts of the input sequence simultaneously and learn different representations of the input.

Let’s understand this with an example sentence. “I need to visit the bank tomorrow to withdraw some cash and then go to the picnic by the river bank to enjoy.”

In the context of the given sentence, “bank” has two distinct meanings: one referring to the financial institution and the other to the riverside (the riverbank).

Let’s say we have two attention heads. Each head can focus on a different aspect of the input sentence. In this case:

  • One attention head might focus on the word “bank” related to the financial institution, trying to understand the context of visiting a bank to withdraw cash.
  • Another attention head might focus on the word “bank” in the context of the riverbank, understanding the intention of going to the picnic by the river.

By utilizing multiple attention heads, the model can learn various interpretations of the word “bank” within different contexts at the same time. Thus, multi-head attention enhances the model’s ability to understand and process complex language patterns.

A multi-head attention layer contains many single-head attention units. Below is an illustrated diagram of what single-head attention looks like.

Let’s explore the concept of single-head attention. This mechanism takes a query, a key, and a value as inputs. The combined input embedding and positional encoding is multiplied with three distinct linear weight matrices to generate the query, key, and value representations, respectively. Initially, these weight matrices are randomly initialized, and their values are adjusted during training based on the model’s learning process.

The attention scores are calculated using the scaled dot-product attention formula:

Attention(Q, K, V) = softmax(Q · K^T / √d_k) · V

where d_k is the dimension of the key vectors.

Now, multiply the query matrix by the transpose of the key matrix.

For scaling, we divide the resultant matrix (query multiplied by the transpose of the key) by the square root of the word dimension. In our case, where the word dimension is 5, this means dividing by the square root of 5.

The next step of masking is optional, and we won’t be calculating it. We will be using masking in the decoder part.

Now we apply the softmax function to the scaled resultant matrix; it transforms the raw scores into values between 0 and 1 that sum to 1 across each row.

Now, we obtain the final resultant matrix of single-head attention by multiplying the softmax matrix with the value matrix.
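Putting the steps together, here is a minimal single-head attention sketch. The weight matrices W_q, W_k, and W_v are random stand-ins for the learned parameters, and the sizes follow our running 6-word, 5-dimensional example:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax, applied row-wise.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
seq_len, d_model = 6, 5

X = rng.normal(size=(seq_len, d_model))    # embeddings + positional encodings
W_q = rng.normal(size=(d_model, d_model))  # learned in practice, random stand-ins here
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q, K, V = X @ W_q, X @ W_k, X @ W_v        # query, key, value: (6, 5) each
scores = Q @ K.T / np.sqrt(d_model)        # multiply by transpose of key, scale by sqrt(5)
weights = softmax(scores)                  # rows lie between 0 and 1 and sum to 1
head_output = weights @ V                  # final single-head attention output: (6, 5)
```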

We have now calculated single-head attention. As stated earlier, multi-head attention comprises many single-head attentions. Below is a visual of what it looks like:

The resultant matrices of all the single-head attentions are concatenated together and multiplied by the linear weights to obtain the output of multi-head attention. In this process, the linear weights transform the concatenated matrix back to the original matrix size.

For example, if the output of a single-head attention matrix is 5x4, and let’s consider we have three attention heads, it produces three 5x4 matrices. These matrices are concatenated together. The size of the concatenated matrix will be 5x12. Then, this concatenated matrix is multiplied by the linear weight matrix of size 12x4 to obtain the resultant matrix of the original size, which is 5x4.
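Here is a sketch of that example with three heads, each producing a 5x4 output, and a randomly initialized 12x4 matrix standing in for the learned linear weights:

```python
import numpy as np

rng = np.random.default_rng(2)
n_heads, seq_len, d_head, d_out = 3, 5, 4, 4          # sizes taken from the example above

# Stand-ins for the outputs of three single-head attentions, each 5x4.
head_outputs = [rng.normal(size=(seq_len, d_head)) for _ in range(n_heads)]

concatenated = np.concatenate(head_outputs, axis=-1)  # 5x12
W_o = rng.normal(size=(n_heads * d_head, d_out))      # 12x4 linear weight matrix
multi_head_output = concatenated @ W_o                # back to the original 5x4
print(multi_head_output.shape)                        # (5, 4)
```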

As I mentioned earlier, the linear weights are randomly initialized at first but are learned during training.

The size of the multi-head attention (MHA) output matrix should match the size of the input embedding + positional encoding matrix, as we perform addition and normalization in the next step.

5. ADD & NORM:

The process of combining the resultant matrix of Multi-Head Attention (MHA) with the input embeddings and positional encodings through addition is known as a residual connection. This technique is crucial for tackling the vanishing gradient problem. By integrating both the original information and the attention-enhanced details (the MHA output matrix), the model gains a richer understanding of the input sequence, ultimately enhancing its performance.

We add the resultant matrix of multi-head attention to the input embedding + positional encoding matrix.

To normalize the matrix, we first calculate the mean and standard deviation of each row.

We then normalize the resultant matrix row by row: each value has its row’s mean subtracted and is divided by the row’s standard deviation plus a small error term ε.

Here, the small error value ε is added to prevent the denominator from being zero, which would make the entire term infinite.
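Here is a minimal sketch of the whole Add & Norm step as described above, with the residual addition followed by row-wise normalization and a small ε added to the standard deviation:

```python
import numpy as np

def add_and_norm(x, sublayer_output, eps=1e-6):
    """Residual connection followed by row-wise normalization."""
    y = x + sublayer_output                 # add: input + multi-head attention output
    mean = y.mean(axis=-1, keepdims=True)   # per-row mean
    std = y.std(axis=-1, keepdims=True)     # per-row standard deviation
    return (y - mean) / (std + eps)         # small eps keeps the denominator non-zero
```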

6. Feed Forward Layer:

After normalizing the matrix, it is passed through a feedforward layer, which typically consists of a linear layer followed by a ReLU activation layer. This feedforward network introduces non-linearity and captures complex patterns in the data.

In the real world, feedforward layers often include multiple linear and ReLU activation function layers. However, for the sake of understanding, we’re simplifying it to just one linear layer followed by one ReLU activation layer.

According to the formula of a linear layer, the resultant normalized matrix is multiplied by a weight matrix and then added to bias weights. Initially, both the weight matrix and bias weights are randomly initialized and then adjusted during the model training process.

  • Linear Layer = X · W + b
  • ReLU(X) = max(0, X)

After computing the linear layer, we pass its output through the ReLU layer, which replaces all negative values in the resultant matrix with zero.
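Here is a minimal sketch of this simplified feed-forward block; the weight and bias shapes are illustrative, and real models typically use two linear layers with a larger hidden dimension:

```python
import numpy as np

def feed_forward(x, W, b):
    # Linear layer followed by ReLU: max(0, X·W + b)
    linear_out = x @ W + b
    return np.maximum(0, linear_out)

rng = np.random.default_rng(3)
d_model = 5
x = rng.normal(size=(6, d_model))        # normalized output of the Add & Norm step
W = rng.normal(size=(d_model, d_model))  # randomly initialized, learned during training
b = rng.normal(size=(d_model,))
print(feed_forward(x, W, b).shape)       # (6, 5)
```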

7. ADD & NORM:

Once we obtain the resultant matrix from the feed-forward network, we add it to the matrix obtained from the previous Add & Norm step and then normalize it using the same formula as before.

Congratulations! We have completed the Encoder part of the transformer architecture.

Now, the output matrix of the encoder (the resultant matrix of the final Add & Norm) is passed to the decoder’s multi-head attention, where it provides the key and value matrices, while the query comes from the decoder itself.

“The resultant matrix of the Encoder contains all the features of the given input text, which can be understood by the Decoder”.

8. DECODER:

Now that we’ve finished the Encoder, let’s try to understand the Decoder. The good thing is that most of the Decoder’s functionality mirrors that of the Encoder, with the main difference being the inclusion of masked multi-head attention. Since many of the topics have already been covered, we’ll focus specifically on masked multi-head attention and the decoder’s output.

9. INPUT TO DECODER:

At the beginning, we provide the <start> token vector as input to the decoder. After passing through masked multi-head attention, this input supplies the query matrix to the decoder’s multi-head attention, while the resultant matrix of the encoder supplies the key and value matrices.

After that, text is generated from the decoder, and this generated text is concatenated with the start token before being passed as input to the decoder again. This process continues iteratively until the decoder generates the <end> token.

I understand that this concept may be difficult to grasp at first. To make things clearer, I have included a visualization at the end of the article; please take a look at it for the full picture.

10. MASKED MULTI-HEAD ATTENTION:

Masked multi-head attention combines the principles of multi-head attention and masking. It operates similarly to multi-head attention but with the addition of masking to ensure that during decoding, each token can only attend to previous tokens and not future ones. This prevents the model from “cheating” by looking ahead during generation.

The masking is typically achieved by applying a mask to the attention scores before softmax normalization.

To apply masking in masked multi-head attention, very large negative values (typically negative infinity) are added to the attention scores corresponding to the positions that should be masked.

This ensures that when the softmax function is applied, the masked positions effectively have zero probability of being attended to, preventing the model from attending to future tokens during decoding.

After the softmax step, the remaining operations proceed exactly as in regular multi-head attention.
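Here is a small sketch of the masking step, assuming we already have the scaled attention scores: every position above the diagonal (a future token) receives a very large negative value before softmax, so its attention weight becomes effectively zero.

```python
import numpy as np

def masked_softmax(scores):
    """Apply a causal (look-ahead) mask to scaled attention scores, then softmax."""
    seq_len = scores.shape[0]
    # Upper-triangular mask: True where a token would attend to a future position.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(future, -1e9, scores)  # -1e9 stands in for negative infinity
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.default_rng(4).normal(size=(4, 4))
print(np.round(masked_softmax(scores), 3))   # entries above the diagonal are ~0
```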

11. Flattening:

After obtaining the Add & Norm output from the decoder, the resulting matrix is flattened. These flattened values are then passed through a linear layer to compute the logits.

After computing the logits, the model typically applies a softmax function to convert these logits into probabilities. The model then selects the word with the highest probability as the next word in the generated sequence.

This generation process continues iteratively until the <end> token is generated.
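Here is a tiny sketch of that greedy decoding loop, using a hypothetical vocabulary and a stand-in `decoder_logits` function in place of the full decoder stack described above:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical target vocabulary, for illustration only.
vocab = ["<start>", "<end>", "बिल्ली", "चटाई", "पर", "सो", "रही", "है"]

def decoder_logits(tokens):
    # Stand-in for the full decoder described above; returns random logits here.
    return np.random.default_rng(len(tokens)).normal(size=len(vocab))

tokens = ["<start>"]
while tokens[-1] != "<end>" and len(tokens) < 10:
    probs = softmax(decoder_logits(tokens))      # logits -> probabilities
    tokens.append(vocab[int(np.argmax(probs))])  # pick the most probable next word
print(tokens)
```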

Finally, we’ve reached the end of our journey through the architecture of the Transformer model! It’s been an amazing exploration, and now, as promised, here’s a vibrant overview visualization of Transformer with an example.

Please clap 👏 or comment if you found this helpful.
If you have any queries, feel free to reach out to us at

sumith.madupu123@gmail.com or mulukuntlabhikshapathi@gmail.com

This blog is written by Sumith and Bhikshapathi
