Unpacking Transformers: A Deep Dive into the Full Architecture — Part 4

Luv Verma
10 min read · May 13, 2023


In Part 3 of this series, we took a deep dive into concepts such as positional encodings, multi-head attention, the addition of non-linearities, and masked multi-head attention, and looked at how they overcome the issues of naive self-attention (Part 2 of the series) that need to be resolved for building a fully functional Seq2Seq model for generation tasks based on transformers.

(link to part 1)

(link to part 2)

(link to part 3)

Our goal was to explore how we could shift from traditional Recurrent Neural Network (RNN) based Seq2Seq models to a new paradigm that leverages attention layers for generation tasks, tasks typically handled by RNNs (Part 1 of the series). Finally, in this blog (Part 4), we will look at the complete transformer architecture from the original paper ('Attention Is All You Need' by Vaswani et al.).

Let’s Dive in!!

To reiterate, in part 3 of the transformer series, we did the following 4 things:

  1. Used positional encodings (added to the inputs) to make the model aware of the relative positions of tokens (a small sketch of these encodings follows this list)
  2. Used multi-head attention
  3. Alternated self-attention layers with non-linear position-wise feed-forward networks
  4. Used masked attention when the model is meant for generation tasks (predicting future tokens)
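As a refresher on item 1, here is a minimal sketch of sinusoidal positional encodings in the spirit of the original paper; the function name and tensor shapes below are my own choices, not taken from any particular library.

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Return a (seq_len, d_model) matrix of sinusoidal positional encodings."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )                                                                    # (d_model/2,)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# The encodings are simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```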

Now, let us pick up these components and build the complete transformer, which will be the equivalent of the Seq2Seq RNN-based model (discussed in blog 1).

A brief recap of the Seq2Seq model

The seq2seq model comprises two main components: an encoder and a decoder, both of which are recurrent neural networks (RNNs).

In a seq2seq model, the encoder processes the input sequence, one element at a time, and generates a fixed-length hidden state representation. This hidden state captures the information from the input sequence. The decoder, on the other hand, takes this hidden state and generates the output sequence element by element, conditioned on the previous output and the hidden state.
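For reference, a bare-bones version of such an encoder-decoder RNN might look like the toy sketch below (a GRU-based illustration with made-up names and sizes, not the exact model from any particular paper).

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    """Toy encoder-decoder RNN: the encoder compresses the input sequence
    into a single hidden state, which conditions the decoder."""
    def __init__(self, vocab_size: int, d_hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_hidden)
        self.encoder = nn.GRU(d_hidden, d_hidden, batch_first=True)
        self.decoder = nn.GRU(d_hidden, d_hidden, batch_first=True)
        self.out = nn.Linear(d_hidden, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        _, h = self.encoder(self.embed(src_tokens))         # h: (1, batch, d_hidden)
        dec_out, _ = self.decoder(self.embed(tgt_tokens), h) # conditioned on h
        return self.out(dec_out)                             # (batch, tgt_len, vocab_size)
```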

Figure 1: An example of a sequence-to-sequence encoder-decoder RNN network.

We will build an encoder and decoder model (equivalent of the Seq2Seq model) using the different components of transformers that we have learned so far!!

Building the basic encoder layer of the transformer architecture

Stacking up all the components that originated from the problems with naive self-attention, we get the encoder equivalent of the RNN Seq2Seq encoder from Figure 1, shown in Figure 2.

Figure 2: Basic encoder layer of the transformer architecture

The following are the components:

  1. The encoder gets a sequence of inputs x_t, which it passes through some embedding function (e.g., a linear layer followed by a non-linear layer).
  2. The positional encoding for a particular time step is added to the embedding of the input x_t.
  3. The output from point 2 is then passed into multi-head attention, which has the following components:
  • There are 8 heads in one multi-head attention layer.
  • Taking one of the 8 heads as an example: it has a key, query, and value for each time step t.
  • Each head produces an attention output for each time step (for example, head 1 produces attention a(l, head), where l is the time step and head is the head number).
  • The outputs from all 8 heads for each time step l are concatenated to give the overall attention output at that time step.

4. The attention output produced at every time step by the multi-head attention layer is passed to a non-linear layer, also called the position-wise feed-forward layer.

5. The transformer paper stacks 6 such multi-head attention layers with their corresponding feed-forward layers in the encoder (a sketch of one such layer follows below).
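Putting points 1 through 5 together, one basic encoder layer (before residual connections and layer norm are added, i.e., the Figure 2 version) could be sketched roughly as follows; the class name, variable names, and default sizes are my own illustrative choices.

```python
import math
import torch
import torch.nn as nn

class BasicEncoderLayer(nn.Module):
    """Multi-head self-attention followed by a position-wise feed-forward network
    (residual connections and layer norm are added later, see Figure 5)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # mixes the concatenated heads
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        b, t, _ = x.shape
        # Project and split into heads: (batch, n_heads, seq_len, d_head)
        q = self.w_q(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        attn = scores.softmax(dim=-1) @ v         # (batch, n_heads, seq_len, d_head)
        # Concatenate the 8 heads back into a single vector per time step
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        return self.ffn(self.w_o(attn))           # position-wise feed-forward
```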

Building the basic decoder layer of the transformer architecture

Figure 3: Basic decoder layer of the transformer architecture

All the components of the decoder are similar to the encoder components, with a few key differences:

  1. Each multi-head attention layer now uses masks (as discussed in blog 3) and is therefore referred to as masked multi-head attention (a minimal sketch of such a mask follows this list).
  2. There is another layer, called the cross-attention layer, in which queries from the decoder are answered by the keys of the encoder. For example, the query at decoder time-step 1 (of cross-attention) is scored against the keys from every time-step of the last encoder layer.
  3. At the very end, after the 6 repeated blocks, a softmax is applied at every time step (t) to get the output.
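The mechanical change behind point 1 is small: attention scores for future positions are set to minus infinity before the softmax, so their attention weights become zero. A minimal sketch (the shapes and names here are assumptions for illustration):

```python
import torch

def causal_mask(scores: torch.Tensor) -> torch.Tensor:
    """Mask out future positions in a (..., seq_len, seq_len) score matrix
    so each time step can only attend to itself and earlier steps."""
    seq_len = scores.size(-1)
    future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    return scores.masked_fill(future, float("-inf"))

scores = torch.randn(1, 4, 4)                  # toy attention scores
weights = causal_mask(scores).softmax(dim=-1)  # upper triangle becomes exactly 0
```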

Finally, connecting the encoder layers with the decoder layers, we get Figure 4!!

Figure 4: Visualising connection between basic encoder and basic decoder through feed-forward network to cross-attention

Figure 4 is the transformer equivalent of Figure 1. However, a few more things need to be addressed to reach the complete transformer model from the original paper.

Cross-Attention Layer

  • The output of the non-linear (position-wise nonlinear) layer/network ‘m’ of the encoder, at time-step ‘t’, is given by equation 1.
  • The output of the non-linear (position-wise nonlinear) layer/network ‘m’ of the decoder, at time-step ‘l’, is given by equation 2.
  • The query for the decoder state at time-step ‘l’ is given by equation 3 (explained previously):
  • The key and value for the encoder state at time-step ‘t’ are given by equations 4 and 5 (explained previously):
  • The attention score is calculated between every encoder time-step (t) and decoder time-step (l), as given by equation 6 (explained previously):
  • We apply the softmax (equation 7) and then compute the attention output. This time it is called the cross-attention output, as the queries come from the decoder and the keys come from the encoder network (equation 8). The whole computation is written out explicitly just below:
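Since the numbered equations are rendered as images in the original post, here is one plausible way to write them out; the symbols below are my own notation chosen to mirror the description, so they may differ from the paper's exact lettering.

```latex
% h_t^{enc}: position-wise feed-forward output of the encoder at time-step t   (eq. 1)
% h_l^{dec}: decoder state at time-step l                                       (eq. 2)
q_l = W_q \, h_l^{dec}                            \qquad \text{(eq. 3: decoder query)}
k_t = W_k \, h_t^{enc}                            \qquad \text{(eq. 4: encoder key)}
v_t = W_v \, h_t^{enc}                            \qquad \text{(eq. 5: encoder value)}
e_{l,t} = \frac{q_l^{\top} k_t}{\sqrt{d_k}}       \qquad \text{(eq. 6: score)}
\alpha_{l,t} = \operatorname{softmax}_t(e_{l,t})  \qquad \text{(eq. 7)}
a_l = \sum_t \alpha_{l,t} \, v_t                  \qquad \text{(eq. 8: cross-attention output)}
```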

From Figure 4, it should be noted that cross-attention is also multi-headed. So, there will be 8 heads for cross-attention.

So, that’s all for cross-attention!!

The last missing component is Layer Normalization…

Layer Normalization

Main Idea: Batch normalization is very helpful, but unlike in vision models, it is hard to use with sequence models.

  • Sequences are of varied lengths, so normalizing across the batch is hard.
  • This layer is just like the batch normalization layer, but it does not normalize across the batch: it normalizes across the different activations in a layer instead of across the different samples in a batch.
  • Layer norm does not use any information across the batch. Instead of computing a mean over the different d-dimensional vectors in a batch (as done in batch normalization), layer norm computes a single number (a 1-d quantity) by averaging together the activations over every dimension of a d-dimensional vector, say ‘a’ (equation 9).
  • Similarly, the standard deviation is computed over the dimensions (d) of a single vector, instead of over the different vectors in a batch (equation 10):
  • Activations are then transformed just as in batch norm: the mean is subtracted and the result is divided by the standard deviation (both of which are scalars here), followed by element-wise multiplication by gamma and addition of beta. Nothing is shared across the batch (unlike the activations of a batch normalization layer). A minimal sketch of this computation follows below.
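Here is a toy version of the computation described above (the function signature and epsilon value are my own choices; PyTorch's built-in nn.LayerNorm differs slightly in how it estimates the standard deviation):

```python
import torch

def layer_norm(a: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor,
               eps: float = 1e-5) -> torch.Tensor:
    """Normalize each d-dimensional activation vector over its own dimensions
    (not over the batch), then rescale and shift element-wise."""
    mean = a.mean(dim=-1, keepdim=True)   # one scalar per vector   (eq. 9)
    std = a.std(dim=-1, keepdim=True)     # likewise for the spread (eq. 10)
    return gamma * (a - mean) / (std + eps) + beta

a = torch.randn(2, 5, 512)                # (batch, seq_len, d_model)
out = layer_norm(a, gamma=torch.ones(512), beta=torch.zeros(512))
```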

Assembling the complete transformer!!

Now, we have everything we need to look at the complete transformer. Starting from Figure 4, we add two more layers to get to the final architecture (Figure 5):

  1. Residual
  2. Layer Norm
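In the original paper, these two are combined into a single “Add & Norm” step applied after every sub-layer (self-attention, cross-attention, or feed-forward):

```latex
\text{output} = \operatorname{LayerNorm}\bigl(x + \operatorname{Sublayer}(x)\bigr)
```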

Figure 5: Layer Norm and Residual layers added to both encoder and decoder. (classical transformer architecture from the original paper)

Now that the architecture is complete, let's go through all the equations using Figure 5:

  • Let’s start with the inputs and outputs of the multi-headed attention. The input and output of the multi-headed attention are given by a set of equations (equation 12). ‘t’ represents the particular time step, and ‘m’ represents the multi-headed attention layer number, which runs from 1 to 6.
  • At m = 0, the input is just the input embedding plus the positional encoding. The output in equation 12 is the output of the self-attention layer with multiple (8) heads, i.e., the output of the multi-headed attention layer. For a single multi-headed attention layer, the outputs from all 8 heads are concatenated to give the output in equation 12.
  • The input to the multi-headed attention is added to its output, and LayerNorm is applied over the sum, so this is a residual connection followed by LayerNorm (equation 13).
  • Then we have a position-wise non-linear function (or feed-forward network). In the transformer paper, this position-wise non-linear function consists of a linear layer, then a ReLU, and then another linear layer (equation 14). The output h at time step ‘t’ for layer ‘m’ (which runs from 1 to 6) can be thought of as a pre-layer-norm hidden state. So this is just a 2-layer neural network applied at each position.
  • The output of the feed-forward network (equation 14) is passed through another layer norm along with a residual connection (equation 15). The output of equation 15 is passed to the next set of layers (m + 1).
  • Once all 6 blocks (the layers repeated 6 times) have been run through, the output of the last layer (equation 15 with m = 6) is passed to the decoder. Specifically, the multi-headed keys and values for that last block (m = 6) are passed to the cross-attention layer (equation 16 for the keys, equation 17 for the values). Here, 6 refers to the block number (m = 6 for the last block) and 8 refers to the number of heads in the multi-headed attention.
  • The decoder’s multi-headed self-attention is masked, so that a particular time step can only look at past time steps (specifically when we are generating a future sequence).
  • The masked multi-headed attention is followed by a layer norm and a residual connection.
  • The layer norm is followed by the cross-attention layer. Cross-attention produces queries based on the output of the masked multi-head attention; those queries are used to look up the keys and values from the encoder (equations 16 and 17).
  • Cross-attention is again followed by a layer norm and a residual connection. This step is very important since it adds the result of cross-attention to the output of the masked multi-head attention.
  • Then, as in the encoder, there is a position-wise non-linear layer (or feed-forward layer). It has the same architecture as in the encoder, but different weights.
  • It is followed by another layer norm and residual connection (the sketch after this list wires these decoder pieces together).
  • The process above repeats 6 times, passing information from one block to the next (as done in the encoder).
  • The output is derived from the last (6th) block by applying one linear layer and a softmax at every time step ‘l’, which means the decoder decodes one position at a time with masked attention.
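To make the decoder wiring concrete, here is a hedged sketch of one decoder block as described above, using PyTorch's nn.MultiheadAttention for brevity; the class name, default sizes, and exact shapes are my own assumptions rather than a faithful reimplementation of the original code.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block of Figure 5: masked self-attention, cross-attention,
    and a position-wise feed-forward network, each wrapped in Add & Norm."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, enc_out):        # x: decoder states, enc_out: last encoder layer
        # Masked multi-head self-attention + Add & Norm
        t = x.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        sa, _ = self.self_attn(x, x, x, attn_mask=causal)
        x = self.norm1(x + sa)
        # Cross-attention: queries from the decoder, keys/values from the encoder
        ca, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + ca)
        # Position-wise feed-forward + Add & Norm
        return self.norm3(x + self.ffn(x))

# Six such blocks are stacked; a final linear layer + softmax over the vocabulary
# is applied to the output of the last block at every time step.
```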

To conclude, over this series of 4 blogs, we started with the Seq2Seq model for RNNs and looked at how to assemble the building blocks for converting an RNN-based model into an attention-based model, popularly known as the transformer.

So, that completes the classical transformer architecture from the original paper!!

(link to part 1)

(link to part 2)

(link to part 3)

If you like, please read and clap!!

LinkedIn
