Transformers from Scratch — Part I

Chris Andrew
Jun 11, 2024


In this blog series, I will walk you through the original Transformer paper, “Attention Is All You Need”, while giving you some background on the motivation behind it. We will also implement the entire model in PyTorch from the ground up. By the end of this series, you should be clear on the concepts of attention, transformers, and the basic workings of an LLM (Large Language Model).

The first part of this series covers the attention mechanism, which is the building block of the transformer architecture.

Attention

The concept of attention is older than transformers: it was introduced much earlier for use with RNNs, LSTMs, and other sequence models. Its primary use was in sequence-to-sequence modelling, for tasks such as neural machine translation.

The RNN attention mechanism is used in an encoder-decoder setting for tasks such as neural machine translation, a typical example being translation from English to German. The approach is to first encode the English sentence using an encoder network, and then use that encoded representation to generate the German sentence with a decoder network. I assume here that readers are familiar with the basics of how RNNs, neural networks, and gradient descent work. If not, https://d2l.ai/chapter_recurrent-neural-networks/index.html is a great resource.

This diagram illustrates the typical encoder-decoder flow.

Typical Encoder-Decoder flow in RNNs

We see here that an input sequence [x_1, x_2, ..., x_n] is encoded into a new sequence [e_1, e_2, ..., e_n] using an encoder network. The last of these encoded outputs, e_n, then serves as the initial hidden state (context) for the decoder network. We typically use a start token, s_0, to begin generation of the output sequence. The sequence is then generated in an auto-regressive manner (o_1 is used to generate o_2, and so on) until an end token is produced.
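To make this concrete, here is a minimal PyTorch sketch of that flow. Everything here is assumed for illustration (a GRU encoder and decoder, toy dimensions, vocabulary sizes, and token ids); it is a sketch of the idea, not the implementation we build later in the series.

```python
import torch
import torch.nn as nn

# Toy dimensions, vocabulary sizes, and token ids (all assumed for illustration).
EMB, HID, SRC_VOCAB, TGT_VOCAB = 32, 64, 100, 100
SOS, EOS = 1, 2

embed_src = nn.Embedding(SRC_VOCAB, EMB)
embed_tgt = nn.Embedding(TGT_VOCAB, EMB)
encoder = nn.GRU(EMB, HID, batch_first=True)
decoder = nn.GRU(EMB, HID, batch_first=True)
to_vocab = nn.Linear(HID, TGT_VOCAB)

x = torch.randint(3, SRC_VOCAB, (1, 7))       # input sequence [x_1, ..., x_n]
enc_outputs, h_n = encoder(embed_src(x))      # enc_outputs = [e_1, ..., e_n]; h_n is e_n

hidden = h_n                                  # e_n becomes the decoder's initial hidden state
token = torch.tensor([[SOS]])                 # s_0: the start token
generated = []
for _ in range(20):                           # auto-regressive generation
    dec_out, hidden = decoder(embed_tgt(token), hidden)
    token = to_vocab(dec_out).argmax(dim=-1)  # o_t becomes the next decoder input
    generated.append(token.item())
    if token.item() == EOS:
        break
```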

The problem with this setting is that a lot of information (the entire input sequence) is squeezed into the last encoder output, e_n, which acts as an information bottleneck. Condensing all of the input sequence's information into a single context vector can lead to a loss of crucial information, especially for long and complex input sequences. These models also perform poorly on long inputs because the context vector's representational capacity degrades as the sequence grows.

Another major problem is that alignment is not inherent in these models and needs to be inferred by the model during decoding. For example, the English sentence “can you please open the door?” translates to the French “peux-tu s’il te plaît ouvrir la porte?”. The sentences are not a simple one-to-one mapping of words: the English sentence has 6 words, the French one has 8, and the ordering of corresponding words differs in places. These issues are typically solved by modelling alignment explicitly. Attention mechanisms learn alignments between input and output sequences, improving the model’s ability to generate accurate and contextually appropriate outputs.

Attention solves these problems by taking a weighted average of all the encoder outputs and using it as the context for the decoder. The question that needs to be answered is how these weights are calculated. The answer is to use a similarity score between the decoder input vector and each of the encoder outputs, and to use those scores as the weights.

The decoder's initial context would then be defined as:

c_0 = Σ_i S(s_0, e_i) · e_i

where S(s_0, e_i) is the similarity score between the starting input token and each encoder output. The input to the decoder is now the decoder output from the previous timestep together with this context vector (they are generally concatenated).

For the next input to the decoder, the values we take a weighted average of remain the same (the encoder outputs), but the weights change based on the new decoder input (o_1). We call o_1 the query vector. The vectors we use to calculate the weights ([e_1, e_2, ..., e_n]) are called the keys.

Putting this all together, let's see what the architecture looks like with the attention mechanism:

Encoder-Decoder with attention mechanism

Notice how, at each time step, we take all the encoder outputs and calculate the context vector using the decoder input. We then concatenate the context to the decoder input.
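In code, that concatenation step might look something like the following sketch (sizes are assumed for illustration; the decoder's input size then becomes the embedding size plus the context size):

```python
import torch

emb_dim, ctx_dim = 32, 64                   # assumed sizes
prev_output_emb = torch.randn(1, emb_dim)   # embedding of the previous decoder output
context = torch.randn(1, ctx_dim)           # context vector produced by the attention step
decoder_input = torch.cat([prev_output_emb, context], dim=-1)   # shape (1, 96)
```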

However, we still have not defined the similarity function S(x, y). It can be defined in many ways: people have used dot products passed through activation functions such as tanh, and in some cases small neural networks that predict the score. The most widely used approach is to take a dot product and then apply the softmax function to convert the scores into weights between 0 and 1.
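As a quick illustration with made-up scores, here is how softmax turns dot-product similarities into such weights:

```python
import torch
import torch.nn.functional as F

# Made-up dot-product scores of a decoder input against three encoder outputs.
scores = torch.tensor([2.0, 0.5, -1.0])
weights = F.softmax(scores, dim=0)

print(weights)        # ≈ tensor([0.786, 0.175, 0.039]): each weight lies between 0 and 1
print(weights.sum())  # tensor(1.): together they form a probability distribution
```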

We will now redefine attention using vector mathematics, because that makes it easier to implement with typical neural network operations. We define the decoder input as the query vector q. The encoder outputs serve as both the values and the keys used to calculate the context. Therefore the values matrix is defined as V = [e_1, e_2, ..., e_n], and similarly the keys matrix is K = [e_1, e_2, ..., e_n].

The attention operation would then be defined as:

Attention(q, K, V) = softmax(q K^T) V
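In PyTorch, this equation maps almost directly onto two matrix operations. The sketch below is mine (function name and shapes assumed): n encoder outputs of dimension d serve as both keys and values, and q is a single decoder input.

```python
import torch
import torch.nn.functional as F

def attention(q, K, V):
    """softmax(q K^T) V for a single query: q is (d,), K and V are (n, d)."""
    weights = F.softmax(q @ K.T, dim=-1)   # similarity of q to every key, normalized
    return weights @ V                     # weighted average of the values, shape (d,)

enc_outputs = torch.randn(5, 64)           # n = 5 encoder outputs of dimension d = 64 (assumed)
K = V = enc_outputs                        # encoder outputs act as both keys and values
q = torch.randn(64)                        # the current decoder input
context = attention(q, K, V)               # the context vector for this decoder step
```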

Note that during model training we already have all the decoder inputs, since they do not need to be generated autoregressively. In that case, this operation can be further vectorized by stacking all the decoder inputs into a matrix Q.

The softmax function is applied to each row vector of the resulting matrix, which gives us one probability distribution per row to use as the weights for the values.
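Here is a sketch of that training-time case, with m decoder inputs stacked into Q (all sizes assumed for illustration):

```python
import torch
import torch.nn.functional as F

n, m, d = 5, 3, 64                     # n encoder outputs, m decoder inputs, dimension d
K = V = torch.randn(n, d)              # encoder outputs as keys and values
Q = torch.randn(m, d)                  # all decoder inputs stacked, available during training

weights = F.softmax(Q @ K.T, dim=-1)   # (m, n): softmax applied to each row
contexts = weights @ V                 # (m, d): one context vector per decoder input

print(weights.sum(dim=-1))             # tensor([1., 1., 1.]): each row is a probability distribution
```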

To summarise:

  • Typical encoder-decoder architectures suffer from an information bottleneck that loses information when encoding long and complex sequences.
  • To rectify this, attention adds context to the decoder input by taking a weighted average of the encoder outputs (values) at every step.
  • These weights are calculated using the similarity of the decoder input (query) to each of the encoder outputs (keys).
  • The similarity is calculated by taking a dot product of the query and keys and then applying the softmax function.
  • Softmax gives us a probability distribution that we use to weigh the values and create a context vector.
  • The context is concatenated to the decoder input and then passed as input to the decoder.
