Demystifying the Transformer Architecture: A Comprehensive Guide

Julianpegoraro · Published in Crayon Data & AI · 25 min read · Aug 20, 2023

Well, here we go, another Transformer article… so why do I want to write this one? A few years have passed since I started using Transformer-based models in my work. Every year I went through the process of understanding the Transformer a little better, and I did, but only up to a point. Trying to explain it to colleagues helped me understand it even more.

I do not want to write an article about the historical facts of the Transformer, and I would not be the best person for that anyway. I also do not want to go too deep into the mathematical side of it, since I would not be the best person for that either. Instead, I will explain the Transformer the way I would want it explained to me, with all the small things I learned over time by reading different papers and digging through different code bases.

Attention

But let's start from the beginning. What is the main idea behind the Transformer architecture? First introduced in “Attention is all you need” [1], the idea was to present a simple Attention-based model for a translation task. One very important point of the Transformer architecture (as well as of many, many other models) is that the input Embeddings (we will come to them) lie close to each other whenever the inputs to the Embedder are related.

Let's analyze this further. Assume you have two words, e.g. King and Queen. Once we compute a numerical representation (a vector in most cases) of these words, the similarity (in our case the cosine similarity) between them should be large. If we also computed a numerical representation of the word Moon, we would expect it to be less similar to King than Queen is.

Here we show this in a small Python example:

# Let's assume Embedder returns us a matrix
# embedding for a given word in shape 1x512
emb_king = Embedder("King")
emb_queen = Embedder("Queen")
emb_moon = Embedder("Moon")

# For those who never used it, the @ operator is a matrix
# multiplication. For unit-length Embeddings this dot product
# is exactly the cosine similarity.
sim_queen_king = emb_queen @ emb_king.T # 1x512 @ 512x1 = 1x1
sim_moon_king = emb_moon @ emb_king.T # 1x512 @ 512x1 = 1x1

assert sim_queen_king > sim_moon_king

From this small example, we can see that the Embedder returns Embeddings of size 512. The Embedding can be of any size; the only important thing is that all Embeddings share the same size.

# We can also combine the two embeddings we
# query for to a 2x512
emb_query = concatenate([emb_queen, emb_moon]) # 2x512

sim_query_king = emb_query @ emb_king.T # 2x512 @ 512x1 = 2x1

assert sim_query_king[0] > sim_query_king[1]

But what exactly is this result showing us? We have a query, which in our case is two Embeddings of size 512 stacked on top of each other. These Embeddings are multiplied with another (transposed) Embedding of size 1x512. The result is a matrix of size 2x1, which shows the similarity of each Embedding in the query to the Embedding of the word King.

Of course, we are not limited to a single word on either side; theoretically we can have as many as we want. So let's do that. We can compute Embeddings for additional words and stack them on top of each other, just like we did for the query.

emb_prince = Embedder("Prince")
emb_castle = Embedder("Castle")
emb_sun = Embedder("Sun")

emb_key = concatenate([emb_prince, emb_castle, emb_sun]) # 3x512

sim_query_key = emb_query @ emb_key.T # 2x512 @ 512x3 = 2x3

We called this second matrix Key (the same name used in the attention part of the Transformer model [1]). Besides the Key, we already introduced the name Query for the first matrix we used. In general, we assume that the Key and the Query Embeddings live in the same Latent Space, which makes it possible to compute the similarity between them.

I used the name Latent Space here, and I will use it many more times in this article. It is a very important concept which comes up in different papers under different names (Embedding Space, Feature Space, Manifold, Latent Representation, Low-Dimensional Space, …). A Latent Vector is one point in this Latent Space that encodes some information. Let's assume we have an image and we want to classify it: we would first compute the Latent Vector of this image and then pass this vector through a Linear Layer and a Softmax to classify the content of the image. This means that two images which share similarities should also be similar in the Latent Space, and therefore their Latent Vectors will lie closer to each other than the Latent Vectors of images which do not share similarities. We can compute Latent Vectors for every kind of object, and in the case of Transformer architectures we will do that all the time, but let's not get ahead of ourselves.
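
As a tiny, hypothetical sketch of that image-classification picture (the encoder, layer sizes, and class count below are made up purely for illustration):

import torch
import torch.nn as nn

# a stand-in image encoder: image -> 512-dimensional Latent Vector
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
head = nn.Linear(512, 10)  # Latent Vector -> 10 class scores

image = torch.randn(1, 3, 32, 32)            # a fake 32x32 RGB image
latent = encoder(image)                      # 1x512 Latent Vector
probs = torch.softmax(head(latent), dim=-1)  # 1x10 class probabilities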

Let’s refocus. We said: In general, we assume that the Key and the Query Embeddings are in the same Latent Space which makes it possible to compute the similarity between them (it makes sense now right?).

So why do we name them differently? Key and Query come from the Database world. The Query is the value we are looking for, and the Keys are the values we have stored somewhere in the backend. For example, say we search for a Hat in a webshop. The Query would be the Embedding of Hat. Our Keys would be the Embeddings of all our products, e.g. Toilet Paper, Socks, Batteries, …

If we multiply the Query and the (transposed) Keys we get a matrix with the similarity for each Key. Now let’s assume we want to get Embeddings as output, which somehow should represent our search. We can have a second database where we store the Embeddings for each of our products. Let’s say that these Embeddings should be more representative and therefore should have a larger dimensionality, e.g. 1024. We can just take our similarity matrix and multiply it with our new matrix of Embeddings (let’s call it Value).

emb_query = Embedder("Hat") # 1x512
emb_key = concatenate([Embedder(prod) for prod in products]) # 500x512
emb_value = concatenate([GoodEmbedder(prod) for prod in products]) # 500x1024

sim = emb_query @ emb_key.T # 1x512 @ 512x500 = 1x500
representative_embedding = sim @ emb_value # 1x500 @ 500x1024 = 1x1024

This is of course just a simple example meant to convey the idea, but it already shows a lot of things that are not so easy to grasp (we will explain them later).

For now, however, we have one big problem. The entries of sim could be very small or very large for different Queries, depending on how well those are represented. To counteract this we can simply normalize the output of the matrix multiplication so that each row sums up to one. The most straightforward way is to use a softmax function. This does not change anything in the matrix-multiplication part, but it forces the new Embedding to stay in the same Latent Space as the others, since it is now just a normalized linear combination of different Latent Vectors.

sim = softmax(emb_query @ emb_key.T) # 1x512 @ 512x500 = 1x500
representative_embedding = sim @ emb_value # 1x500 @ 500x1024 = 1x1024

Now let's assume that the entries of the Query and Key Embeddings are independent random variables centered around 0 with a variance of 1. If we compute the dot product (or do the matrix multiplication with the transposed Key), we get a similarity map whose values are centered around 0 with a variance equal to the dimensionality. To counteract this, we can divide every entry by the square root of the dimensionality before applying the softmax. We will use the notation d_k to denote the dimensionality of the Keys, but note that we could just as well use d_q, since Keys and Queries need to have the same dimensionality for the matrix multiplication to work.

emb_query = Embedder("Hat") # 1x512
emb_key = concatenate([Embedder(prod) for prod in products]) # 500x512
d_k = emb_key.shape[1]
emb_value = concatenate([GoodEmbedder(prod) for prod in products]) # 500x1024

sim = softmax((emb_query @ emb_key.T)/sqrt(d_k)) # 1x512 @ 512x500 = 1x500
representative_embedding = sim @ emb_value # 1x500 @ 500x1024 = 1x1024

Now that we have understood this (hopefully; otherwise just go ahead and re-read it later, I always have the feeling that re-iterating things helps a lot), we can start to understand the attention mechanism.

Attention as defined in [1]:

Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V

Now, if we look closely at this function, we see everything we have already learned. We have a Query matrix of dimensionality NxK, a Key matrix of dimensionality MxK, and a Value matrix of size MxL.

  • Query: NxK
  • Key: MxK
  • Value MxL

So we see that the dimensionalities which need to be shared are K, the Embedding-Dimensionality of the Query and the Key, and M, the number of entries in the Key and the Value. The output dimensionality of the Attention should also come as no surprise now: it is NxL, the number of Queries times the dimensionality of the Values.
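
A quick sanity check of these shapes with random matrices (the sizes N, M, K, L below are arbitrary example values):

import numpy as np

N, M, K, L = 2, 5, 512, 1024  # arbitrary example sizes
Q = np.random.randn(N, K)     # Query: NxK
Kmat = np.random.randn(M, K)  # Key:   MxK
V = np.random.randn(M, L)     # Value: MxL

sim = Q @ Kmat.T  # NxK @ KxM = NxM similarities
out = sim @ V     # NxM @ MxL = NxL output
assert out.shape == (N, L)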

In [1] this is called “Scaled Dot-Product Attention” and a visual representation can be seen here:

Scaled Dot-Product Attention in [1]

Do not mind the Mask (opt.) step for now. This will be important later.

Up to now, we have learned that the Query, Key, and Value matrices can have different (even if related) dimensions. However, there is also such a thing as Self-Attention, where the Query, Key, and Value are the same matrix (this will make more sense later).

Now let’s implement this function in Python:

def attention(Q, K, V):
    d_k = K.shape[1]
    sim = softmax((Q @ K.T) / sqrt(d_k))
    return sim @ V

Multi-Head Attention

Now that we know how Attention works, we need to understand how the Transformer can even learn anything new (I don't know if you noticed, but up to this point we do not have any learnable weights). Spoiler alert: from here on we will rely less on nice stories like the webshop, since in my opinion it is now easier to understand things technically.

We will start to write classes instead of functions, which better reflects the real implementations. We will write these classes in PyTorch-like code, but it will still be pseudo code (it will not work out of the box since we are not adjusting it for batch sizes, not registering trainable parameters, … but it will be more readable).

class ScaledDotProductAttention(nn.Module):

    def forward(self, Q, K, V):
        d_k = K.shape[-1]
        sim = F.softmax((Q @ K.T) / sqrt(d_k))
        return sim @ V

As mentioned we do not have any learnable weights for the attention. Well, this is easy to fix right? Just add a Fully Connected layer in front. So let’s do this and call this a Single-Head Attention.

class SingleHeadAttention(nn.Module):

    def __init__(self, d_k, d_v):
        self.linear_q = nn.Linear(d_k, d_k)
        self.linear_k = nn.Linear(d_k, d_k)
        self.linear_v = nn.Linear(d_v, d_v)
        self.attention = ScaledDotProductAttention()

    def forward(self, Q, K, V):
        return self.attention(self.linear_q(Q),
                              self.linear_k(K),
                              self.linear_v(V))

Is it that easy? Well, yes it is, at least for now. But let's understand what this “Single-Head Attention” really means. As we discussed earlier, the Query and Key matrices passed to the Attention need to be in the same Latent Space. But now we have one Linear Layer for each of them before the Attention is even computed. This, in my opinion, is a genius step. We can have Query and Key Embeddings that live in different Latent Spaces, and the Linear Layers learn to transform them into the Latent Space they need. So one Single-Head Attention can bring the Query and the Key into the Latent Space it is interested in. Let's assume we have a Single-Head Attention which focuses on punctuation in a sentence; the input Latent Vectors (the Query and Key matrices) are probably in a different Latent Space, and these two Linear Layers can transform them into the space we need. Additionally, we have a Linear Layer applied to the Value vectors, which brings them into the Latent Space we want the output to be in. But why does the output matter?

Well as you maybe noticed we have called this class Single-Head Attention, but the chapter is called Multi-Head Attention. So yes we will combine multiple instances of this Single-Head Attention.

class MultiHeadAttention(nn.Module):

    def __init__(self, d_k, d_v, num_heads):
        self.heads = []
        for _ in range(num_heads):
            self.heads.append(SingleHeadAttention(d_k, d_v))
        self.linear = nn.Linear(d_v * num_heads, d_v)

    def forward(self, Q, K, V):
        heads_out = []
        for head in self.heads:
            heads_out.append(head(Q, K, V))
        # concatenate along the Embedding dimension: Nx(d_v * num_heads)
        heads_out_cat = concatenate(heads_out, dim=-1)
        return self.linear(heads_out_cat)

In this example code, we created num_heads Single-Head Attention layers. Each of them runs independently of the others, and all the outputs are concatenated. This concatenation is then passed through a Linear Layer to bring the output back to the dimensionality of the Value matrix. Now back to the earlier question of why the output of each Single-Head Attention matters: all of them should land in some representative space, so that the output of the Multi-Head Attention layer is representative too.
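
As a quick usage sketch of the pseudo class above (the shapes and head count are arbitrary, and randn stands in for real Embeddings):

mha = MultiHeadAttention(d_k=512, d_v=1024, num_heads=8)

Q = randn(2, 512)     # 2 Queries
K = randn(500, 512)   # 500 Keys
V = randn(500, 1024)  # 500 Values

out = mha(Q, K, V)    # 2x1024: one enriched Embedding per Query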

A Multi-Head Attention layer now can extract even more information than one Single-Head Attention layer. There is also a visual representation of this Multi-Head Attention from [1], with h representing the number of heads.

Multi-Head Attention [1]

Encoder

The Multi-Head Attention layer is already very nice on its own, but with only 3*num_heads (plus one output) Linear Layers to learn from, it will probably not be able to learn difficult tasks like e.g. translation. So we need to chain them, one after another. However, if we chain them, what will the input be? One idea is to just pass the same input as the Query, Key, and Value.

Part of the Encoder part of the Transformer from [1]

Since we use the same input for the Query, Key, and Value, we call this Self-Attention. This may not make much sense at first glance, so let's change perspective. Assume we get a text as input, where every word is transformed into a vector (alert! simplified!) in the same Latent Space. To extract information from these vectors we use them as the input to our Multi-Head Attention and get a single output matrix back. Now, let's recall how Multi-Head Attention works: we pass Query, Key, and Value through multiple Linear Layers, whose outputs go into a Scaled Dot-Product Attention layer. The Scaled Dot-Product Attention first computes the similarity matrix between the Query and the Key, scales it, optionally masks it, and applies the softmax, before multiplying it with the Value. The outputs of all the Scaled Dot-Product Attentions are concatenated and passed to an additional Linear Layer.

For the Self-Attention part, the only thing that changes is that Query, Key, and Value are exactly the same input. To understand what that means, let's look only at the Scaled Dot-Product Attention and assume that these Embeddings all lie on the unit sphere. By computing the similarity matrix we get the cosine similarity of each Embedding with every other Embedding. The diagonal of this matrix is one in every entry, since it holds the cosine similarity of each Embedding with itself; each word is therefore most similar to itself but also shares some similarity with the other words in the sentence. This means that the similarity matrix, scaled and normalized (with softmax), shows how similar each word Embedding is to all the other Embeddings.

Now this (scaled and normalized) similarity matrix is multiplied by the Value matrix, which again is the same input. For each word this returns an Embedding that is even more representative than before, because it not only contains the Latent Information about this one word Embedding but also some additional information about how this Embedding relates to all the others.
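
Here is a small numerical sketch of this (hypothetical) unit-sphere picture, with random vectors standing in for word Embeddings:

import numpy as np

# three made-up "word" Embeddings, normalized onto the unit sphere
X = np.random.randn(3, 512)
X = X / np.linalg.norm(X, axis=1, keepdims=True)

sim = X @ X.T  # 3x3 cosine-similarity matrix
# every Embedding is most similar to itself: the diagonal is exactly 1
assert np.allclose(np.diag(sim), 1.0)

# softmax row-wise, then mix the same vectors with those weights
weights = np.exp(sim) / np.exp(sim).sum(axis=1, keepdims=True)
enriched = weights @ X  # 3x512: each row is a weighted blend of all Embeddings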

As we already know, this is not exactly the case, since before passing the Query, Key, and Value to the Scaled Dot-Product Attention we pass them through a Linear Layer each, which not only makes the values differ from each other but would also probably move them away from a (hypothetical) unit sphere. The unit sphere was just a helpful tool to make things clearer; in this model we implemented no normalization step that would actually put the vectors on a unit sphere.

Now, we all know that just chaining layers one after another stops helping once the model gets too deep. The reason is the vanishing gradient problem. However, at least since ResNet and other models using residual connections, we know how to design a model so that backpropagation has a path as straight and direct as possible from the output to the input. This is why we also see skip connections in the Transformer, as demonstrated in the following illustration:

Skip Connection and Normalization from [1]

This also helps us in another way: the word which passed through the Multi-Head Attention layer now carries more information about the whole sentence, but we still want to keep the base information we had and merely enrich it with the newly acquired knowledge from the Multi-Head Attention layer.

Ok, we understand the Multi-Head Attention and the Add part of this image, but what about the Norm part? This is also as easy as it gets. We just pass our output through a Layer Norm, which normalizes our Embeddings to be centered around 0 with a variance of 1 (does this trigger anything?). Can we now just chain these layers after each other? Well, we could, but it would not be as good as you might think. As you now know, the attention mechanism creates more representative Latent Vectors, with attention paid to all the passed Latent Vectors (including themselves). To extract information from such a Latent Vector we need one additional Feed Forward block, which is composed of two Linear Layers with a ReLU activation between them (see [1]).

class FeedForward(nn.Module):

    def __init__(self, d_v, bottleneck):
        self.linear1 = nn.Linear(d_v, bottleneck)
        self.linear2 = nn.Linear(bottleneck, d_v)

    def forward(self, x):
        x = self.linear1(x)
        x = F.relu(x)
        x = self.linear2(x)
        return x

So the whole procedure looks like the following (for the time being; we will add a few more pieces later):

Encoder Layer from [1]

We call this one an Encoder Layer (since there is an Encoder there will also be a Decoder but we will get to this later). Let’s implement this one in our PyTorch-esque way.

class EncoderLayer(nn.Module):

    def __init__(self, num_heads, d_model, bottleneck):
        self.multi_head_attention = MultiHeadAttention(d_model, d_model, num_heads)
        self.norm_mha = nn.LayerNorm(d_model)
        self.ff = FeedForward(d_model, bottleneck)
        self.norm_ff = nn.LayerNorm(d_model)

    def forward(self, x):
        after_attention = self.multi_head_attention(x, x, x)
        x = self.norm_mha(x + after_attention)
        after_ff = self.ff(x)
        return self.norm_ff(x + after_ff)

One thing we can already see is that we dropped the separate dimensionalities for the Value and the Key. This is because we pass the same input everywhere, so Query, Key, and Value share the same Embedding-Dimensionality. When we later discuss how the Transformer Encoder/Decoder architecture is set up, we will see that the same dimensionality is used everywhere. However, just because we use it everywhere does not mean we should not be able to reuse a MultiHeadAttention or a ScaledDotProductAttention somewhere else with different dimensionalities (in fact, I have seen papers building on exactly that).

So finally we have finished implementing one Encoder Layer, but what can we do now? Well, as we already mentioned before, we can chain it. As easy as that:

Chained Encoder Layer [1]

Let’s implement the Encoder now:

class Encoder(nn.Module):

    def __init__(self, num_encoder_layers, num_heads, d_model, bottleneck):
        self.layers = []
        for _ in range(num_encoder_layers):
            self.layers.append(EncoderLayer(num_heads, d_model, bottleneck))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

Positional Encoding

I spent the most time before writing this chapter thinking about how to proceed. My problem was the following: do I explain the Decoder of the whole Transformer architecture first, or the Positional Encoding? You read the chapter title, so you know what I chose, but why? I thought it would be best to first understand the complete Encoder and what exactly comes out of it (spoiler alert: Latent Vectors).

So, what is positional encoding exactly? Up until now, we did not mention one crucial thing about the Transformer, or rather about the Attention mechanism. The model is designed in such a way that it looks at every input vector at the same time. This is very good, but it also means the model has no built-in notion of order. In sentences, proximity is not that important, but ordering certainly is:

Don’t wait or you miss the bus.
Wait or you don’t miss the bus.

Sorry for such a bad example but I think it shows what I mean.

Back to positional encoding now. We know why we need it, but how can we add this information? Easy: we just add an encoding of the input vector's position (e.g. the position of the word Embedding in the sentence) to the vector itself. This way the Latent Vector inherits the positional information, and in the Attention layers this information can be used to extract more representative Latent Vectors.

So how do we implement this now? There are two ways of doing this:

  • Trainable Positional Encoding
  • Sinusoidal Positional Encoding

Trainable Positional Encoding is the easy way. One can just have a randomly initialized matrix (with a max context length) and use this one as a positional Encoding:

class TrainablePositionalEncoding(nn.Module):

    def __init__(self, max_context_length, d_model):
        self.PE = randn(max_context_length, d_model)

    def forward(self, x):
        # add one positional vector per input position
        return x + self.PE[:x.shape[0]]

The Sinusoidal Positional Encoding uses a hardcoded approach. For each position, it computes an encoding based on sine and cosine waves of different frequencies. The paper [1] defines the positional encoding as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

This is one possibility, but there are different ways of doing it. The only important thing is that the generated Encodings are distinct for every position.

class SinusoidalPositionalEncoding(nn.Module):

    def __init__(self, max_context_length, d_model):
        self.PE = zeros(max_context_length, d_model)
        pos = arange(0, max_context_length).unsqueeze(1)  # positions
        i = arange(0, d_model, 2).unsqueeze(0)            # the even indices 2i
        # i already holds 2i, so i / d_model equals 2i / d_model from [1]
        self.PE[:, 0::2] = sin(pos / 10000 ** (i / d_model))
        self.PE[:, 1::2] = cos(pos / 10000 ** (i / d_model))

    def forward(self, x):
        return x + self.PE[:x.shape[0]]

In [1] they explain that they tested both approaches but finally selected the Sinusoidal Positional Encoding:

We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.

Now that we have the Positional Encoding sorted out, we can see how it is added to the Transformer architecture:

Encoder with Positional Encoding [1]

Tokenizer

Up to this point, we hinted a few times that this architecture is designed for text; in reality, however, this is not the only use case. It is a general-purpose architecture, as long as you can feed it those all-important Latent Vectors. In this chapter we will explain the Tokenizer, which takes in some text and puts out a numerical representation of it. The Tokenizer is important to understand in order to understand the Embedder, which we will cover in the next chapter. Now, as I said, we can also use the Transformer with other domains, e.g. images. There we would not need a Tokenizer, since we do not have tokenizable input.

A Tokenizer takes a text like “I love my cat” and outputs a series of integers (tokens), which are just a numerical representation of this text. It works in both directions, meaning you can compute the tokens from the text but also recover the text from the tokens.

Let’s write a very (very) simple Tokenizer on our own. The idea is to just use the lowercase alphabet plus some special characters.

class SimpleTokenizer:

    def __init__(self):
        # sorted so that the token ids are deterministic
        self.vocab = sorted(set(
            "abcdefghijklmnopqrstuvwxyz" +
            "0123456789" +
            "-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{} "))
        self.vocab_size = len(self.vocab)
        self.char2idx = {u: i for i, u in enumerate(self.vocab)}
        self.idx2char = list(self.vocab)

    def tokenize(self, text):
        return [self.char2idx[c] for c in text.lower().strip() if c in self.vocab]

    def detokenize(self, tokens):
        return ''.join([self.idx2char[t] for t in tokens])

So by tokenizing our text: “I love my cat” we will get the following tokens:
[54, 37, 31, 23, 8, 53, 37, 30, 42, 37, 57, 58, 47]
If we detokenize these values, we will get the text “i love my cat”.

Now, do state-of-the-art Transformers work with such a Tokenizer? No. But the idea is the same. Current state-of-the-art Tokenizers take a subword approach, where (except for some frequent words) sub-parts of a word are tokenized. This is a good tradeoff between having enough distinct tokens and not having too many of them. Our Tokenizer lives at one end of this spectrum (not enough distinct tokens). At the other end, a word-based Tokenizer, which assigns one token to each word, would generate far too many distinct tokens and make training infeasible.
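
To make the subword idea a bit more concrete, here is a toy, hand-written sketch of greedy longest-match subword tokenization over a made-up vocabulary (real subword Tokenizers such as BPE or WordPiece learn their vocabulary from data instead):

# Toy greedy longest-match subword tokenizer over a made-up vocabulary.
subword_vocab = ["trans", "form", "er", "ing", "walk", "ed", "s"]

def subword_tokenize(word):
    pieces = []
    while word:
        # always take the longest vocabulary piece the word starts with
        for piece in sorted(subword_vocab, key=len, reverse=True):
            if word.startswith(piece):
                pieces.append(piece)
                word = word[len(piece):]
                break
        else:
            raise ValueError("cannot tokenize this word with the toy vocabulary")
    return pieces

print(subword_tokenize("transformer"))  # ['trans', 'form', 'er']
print(subword_tokenize("walking"))      # ['walk', 'ing']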

Embedder

Now back to our Transformer. We have talked about Latent Vectors all along, and now we have started talking about tokens, which are single numbers. This is where an Embedder comes in handy. It is “just” a trainable matrix with one row per token in the Tokenizer's vocabulary and one column per Embedding dimension. This matrix is trainable and should learn to produce the first Latent Vector for each token generated by the Tokenizer.

class Embedder(nn.Module):

    def __init__(self, vocab_size, d_model):
        self.embedding = randn(vocab_size, d_model)

    def forward(self, tokens):
        return self.embedding[tokens]

Whole Encoder Part from [1]
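
Before moving on to the Decoder, here is a small pseudo-code sketch of how the pieces we have so far fit together on the Encoder side (Tokenizer → Embedder → Positional Encoding → Encoder); the hyperparameter values are just examples:

tokenizer = SimpleTokenizer()
embedder = Embedder(tokenizer.vocab_size, d_model=512)
pos_enc = SinusoidalPositionalEncoding(max_context_length=1024, d_model=512)
encoder = Encoder(num_encoder_layers=6, num_heads=8, d_model=512, bottleneck=2048)

tokens = tokenizer.tokenize("i love my cat")  # 13 tokens
x = embedder(tokens)                          # 13x512 first Latent Vectors
x = pos_enc(x)                                # 13x512, now position-aware
latent = encoder(x)                           # 13x512 contextualized Latent Vectors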

Decoder

Now that we know how the whole Encoder works for a Transformer, we can start to understand the Decoder part of it. This should be easy since there is not much more on the Decoder side of the Transformer than there is on the Encoder one. Let’s change the game and start with the image of the Decoder.

Transformer Decoder from [1]

If we look closely, we mostly see layers that we already saw in the Encoder, with a few exceptions. We have not yet seen the Masked Multi-Head Attention, the Multi-Head Attention with different inputs, the Outputs (shifted right) part, or the model head (Linear + Softmax). Let's explain them from the last to the first:

The model head, which consists of one Linear Layer and a Softmax, is a typical classification head. It predicts which token should come next.

The Transformer is meant to be a sequence-to-sequence model. The input to the Encoder is a text (e.g. in English) and the output of the Decoder is a second text (e.g. in French). How do we compute this output text? We do it sequentially, token after token. This is where the Outputs (shifted right) part comes in, but first, let's discuss some special tokens. Normally, the Tokenizer has a few special tokens for special parts of the text; most commonly start-of-sentence (<sos>) and end-of-sentence (<eos>) are used. The Decoder always needs to predict the next token, but what is the input at the very first step? Well, it is this special <sos> token.

Let’s have an example. We want to create a chat model, which gets an English text in and predicts another English text. The input text will go through the Encoder and is passed to the Decoder (we explain this later). So let’s assume we want to ask the model the following:
“How do you feel today ?”
We would send the following through the Encoder:
“How do you feel today ?<eos>”

Now what about the Decoder? We will need to get our response now from the Transformer itself right? The Decoder receives the input from the Encoder, and additionally, we will pass in the first token:
“<sos>”

The Decoder now would do its computation and predict the next token (in this example one token is one word and a space):
“ I”

So now that we got the first token from the Decoder we take that, add it to the input sentence of the Decoder, and compute the next possible token:
Decoder(“<sos> I”) -output-> “I am”

The Decoder predicted the second token “ am”. Now let’s continue our predictions:
Decoder(“<sos> I am”) -output-> “I am fine”
Decoder(“<sos> I am fine”) -output-> “I am fine and”
Decoder(“<sos> I am fine and”) -output-> “I am fine and you”
Decoder(“<sos> I am fine and you”) -output-> “I am fine and you ?”
Decoder(“<sos> I am fine and you ?”) -output-> “I am fine and you ? <eos>”

The Decoder predicted the special token <eos> which means that the sentence is finished. So what did we pass to our Decoder?
“<sos> I am fine and you ?”
And the output was:
“I am fine and you ? <eos>”

As you can see, the input to the Decoder is just the output of the Decoder, shifted right.
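
In pseudo code, this token-by-token loop could look like the following greedy decoding sketch. It assumes the Transformer class we assemble at the end of this article (with its encode, decode, and linear parts) and hypothetical SOS_ID/EOS_ID special-token ids:

def generate(model, input_tokens, max_length=100):
    # encode the input sentence once
    encoder_embedding = model.encode(input_tokens)
    output_tokens = [SOS_ID]
    for _ in range(max_length):
        decoder_embedding = model.decode(output_tokens, encoder_embedding)
        probs = softmax(model.linear(decoder_embedding))
        next_token = argmax(probs[-1])   # prediction for the last position
        output_tokens.append(next_token)
        if next_token == EOS_ID:         # the model says the sentence is finished
            break
    return output_tokens[1:]             # drop <sos>: the "shifted left" output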

We already mentioned that the Decoder gets the output from the Encoder, but what does that look like? Below, the whole Transformer architecture is shown, and you can see that the output of the Encoder is passed to the Decoder and used as input to the Multi-Head Attention layer (not the Masked Multi-Head Attention).

Transformer Model [1]

You can see that two inputs to the Multi-Head Attention layer come from the Encoder, but which ones are they? The Query, the Key, or the Value? If you want to figure this out on your own, stop reading now and try to remember how Attention works and what we want as an output.

So let's think this through. Let's again imagine a translation task (e.g. English to French). The Encoder already extracts all the information we have in our sentence. The Latent Vectors the Encoder returns should hold the information about the sentence without being bound to one language. So this is all the information we have stored somewhere. Now, if we think back to our webshop, we said that all the data stored in the database are the Keys. So for sure, we need to pass the output of the Encoder as the Keys of the Multi-Head Attention layer. And what else? We remember that we said that the Values are also our whole dataset, possibly containing even more information. In this case it is the same information, and very good information at that, so the Encoder output is passed as the Values as well.

So this means that the Multi-Head Attention layer in the Decoder takes the Key and Value from the Encoder and the Query from the previous layer in the Decoder. This also means that the output of this Multi-Head Attention is already a very information-packed Latent Vector, since it uses the Latent Vectors from the Encoder as the Value (ergo as the weighted output of the Attention).

In addition, one thing that needs to be understood is that the Decoder layer is also stacked N times in sequence, and the output of the Encoder is passed to every Decoder layer.

To be honest, this is all you need to know if you only want to know how a Transformer runs. However, you are probably sitting on the edge of your chair while reading this, because you know there is one more thing we did not talk about, even though we hinted at it from the start: the Masked Multi-Head Attention. Let's recall the Scaled Dot-Product Attention:

Scaled Dot-Product Attention [1]

We remember that there is this masking operation. Well, let’s see how we can train this Model. We have an input and an output text. In the case of the chatbot it would be:
Input: “How do you feel today ?”
Output: “I am fine and you ?”

We remember the special tokens now:
Input: “How do you feel today ?<eos>”
Output: “I am fine and you ?<eos>”

We also have the shifted right input to the Decoder which in our case is:
Input: “How do you feel today ?<eos>”
sr-Output: “<sos> I am fine and you ?”
Output: “I am fine and you ?<eos>”

Here the Transformer has one trick up its sleeve. It does not only train on this whole sentence; it can train on every prefix of the output at the same time. These are the combinations for this sentence:
“<sos>” — “I”
“<sos> I” — “I am”
“<sos> I am” — “I am fine”
“<sos> I am fine” — “I am fine and”
“<sos> I am fine and” — “I am fine and you”
“<sos> I am fine and you” — “I am fine and you ?”
“<sos> I am fine and you ?” — “I am fine and you ?<eos>”

But how can it do this? Quite easy, to be honest. In the similarity matrix, which we get when we matrix-multiply our Query with our (transposed) Key matrix, we force the probability for all the tokens a position is not allowed to see yet to 0. And since it is set to 0, it will also not pass any derivative information and will not update the weights during backpropagation. This is what the authors of [1] mean by masking: it prevents the parts of the model that should not see future tokens from seeing them.

To set the probability to 0, the corresponding values are simply set to -Inf before computing the softmax over the similarity matrix. This way the probabilities still sum up to 1 over the remaining values but are 0 where we want them to be 0.

Let’s add this masking to our ScaledDotProduct class:

class ScaledDotProductAttention(nn.Module):

    def __init__(self, use_mask=False):
        self.use_mask = use_mask

    def forward(self, Q, K, V):
        sim = (Q @ K.T) / sqrt(K.shape[-1])
        if self.use_mask:
            # upper-triangular mask: position i must not see positions > i
            mask = triu(ones(Q.shape[0], K.shape[0]), diagonal=1).bool()
            sim = sim.masked_fill(mask, float("-inf"))
        sim = F.softmax(sim)
        return sim @ V

Here we assume that we also added this use_mask parameter to the SingleHeadAttention and the MultiHeadAttention classes, so that they can create masked ScaledDotProductAttention objects.
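
For completeness, here is a sketch of how that use_mask flag could be threaded through the two classes from before (it is only forwarded down to the ScaledDotProductAttention):

class SingleHeadAttention(nn.Module):

    def __init__(self, d_k, d_v, use_mask=False):
        self.linear_q = nn.Linear(d_k, d_k)
        self.linear_k = nn.Linear(d_k, d_k)
        self.linear_v = nn.Linear(d_v, d_v)
        self.attention = ScaledDotProductAttention(use_mask=use_mask)

    def forward(self, Q, K, V):
        return self.attention(self.linear_q(Q),
                              self.linear_k(K),
                              self.linear_v(V))

class MultiHeadAttention(nn.Module):

    def __init__(self, d_k, d_v, num_heads, use_mask=False):
        self.heads = []
        for _ in range(num_heads):
            self.heads.append(SingleHeadAttention(d_k, d_v, use_mask=use_mask))
        self.linear = nn.Linear(d_v * num_heads, d_v)

    def forward(self, Q, K, V):
        heads_out = []
        for head in self.heads:
            heads_out.append(head(Q, K, V))
        return self.linear(concatenate(heads_out, dim=-1))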

Since we now have this masking information we can create our Decoder layer and Decoder class too:

class DecoderLayer(nn.Module):

    def __init__(self, num_heads, d_model, bottleneck):
        self.masked_mha = MultiHeadAttention(d_model, d_model,
                                             num_heads, use_mask=True)
        self.norm_masked_mha = nn.LayerNorm(d_model)
        self.mha = MultiHeadAttention(d_model, d_model, num_heads, use_mask=False)
        self.norm_mha = nn.LayerNorm(d_model)
        self.ff = FeedForward(d_model, bottleneck)
        self.norm_ff = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output):
        after_masked_attention = self.masked_mha(x, x, x)
        x = self.norm_masked_mha(x + after_masked_attention)
        # Query comes from the Decoder, Key and Value from the Encoder
        after_attention = self.mha(Q=x, K=encoder_output, V=encoder_output)
        x = self.norm_mha(x + after_attention)
        after_ff = self.ff(x)
        return self.norm_ff(x + after_ff)

class Decoder(nn.Module):

    def __init__(self, num_decoder_layers, num_heads, d_model, bottleneck):
        self.layers = []
        for _ in range(num_decoder_layers):
            self.layers.append(DecoderLayer(num_heads, d_model, bottleneck))

    def forward(self, x, encoder_output):
        for layer in self.layers:
            x = layer(x, encoder_output)
        return x

Transformer

Now that we have all our puzzle pieces we can put it all together to create the Transformer model:

class Transformer(nn.Module):

    def __init__(self,
                 num_encoder_layers,
                 num_decoder_layers,
                 num_heads,
                 d_model,
                 bottleneck,
                 max_context_length,
                 vocab_size):
        self.pos_enc = SinusoidalPositionalEncoding(max_context_length, d_model)
        self.encoder = Encoder(num_encoder_layers, num_heads, d_model, bottleneck)
        self.decoder = Decoder(num_decoder_layers, num_heads, d_model, bottleneck)
        self.linear = nn.Linear(d_model, vocab_size)
        self.token_embedding = Embedder(vocab_size, d_model)

    def encode(self, tokens):
        x = self.token_embedding(tokens)
        encoder_embedding = self.pos_enc(x)
        return self.encoder(encoder_embedding)

    def decode(self, tokens, encoder_embedding):
        x = self.token_embedding(tokens)
        decoder_embedding = self.pos_enc(x)
        return self.decoder(decoder_embedding, encoder_embedding)

    def forward(self, x, y):
        encoder_embedding = self.encode(x)
        decoder_embedding = self.decode(y, encoder_embedding)
        return softmax(self.linear(decoder_embedding))
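
Finally, a short usage sketch. The hyperparameters below are the base configuration from [1] (6 layers, 8 heads, d_model = 512, bottleneck = 2048); the tokenizer is our toy SimpleTokenizer, which in a real setup would also need the <sos>/<eos> special tokens:

tokenizer = SimpleTokenizer()
model = Transformer(num_encoder_layers=6,
                    num_decoder_layers=6,
                    num_heads=8,
                    d_model=512,
                    bottleneck=2048,
                    max_context_length=1024,
                    vocab_size=tokenizer.vocab_size)

src = tokenizer.tokenize("how do you feel today?")  # Encoder input
tgt = tokenizer.tokenize("i am fine and you?")      # shifted-right Decoder input

probs = model(src, tgt)  # len(tgt) x vocab_size next-token probabilities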

That's it? I have to say, it looks straightforward now, right? In my opinion, it also is. I feel that I have passed the Valley of Despair on the Dunning-Kruger curve and am starting to fully understand how this architecture works. I will say, though, that it took me quite some time, and I would suggest that everyone who is still deep in the Valley re-read this or any other article until the Transformer starts to make sense.

Conclusion

With this article, we have just scratched the surface of what there is to know by understanding the most popular model architecture of our time. Many different domains now build upon the Transformer, since it showed how to outperform RNNs, ConvNets in 2D and 3D, …
This is the beauty of this architecture, and it will probably never cease to amaze me!

References

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
