Understanding the building blocks of transformers

Arnab · Published in Analytics Vidhya · Aug 17, 2021

Transformers have become ubiquitous as a choice of architecture for NLP problems. This article will go through some of the basic background needed to start understanding how transformers work from scratch.

The transformer architecture proposed in “Attention is all you need”

Previous architectures:

To understand why someone would come up with such an idea, we can revisit some of the architectures used prior to transformers.

Encoder-Decoder RNN

In 2014, the paper titled "Sequence to Sequence Learning with Neural Networks" introduced the "encoder-decoder" structure, in which an RNN encodes a sequence (in English, for example) into a fixed-length vector representation. A decoder RNN then takes this fixed vector representation and generates the French translation word by word. This was useful because both sequences can be of variable length, as is often the case in language translation.

An example of an encoder-decoder structure for NMT

In the example above, we can see how "RNN's really rock" is translated into French. The last "hidden state" acts as the fixed encoding for the entire context of the sentence. The decoder takes this context along with a start token and predicts a distribution over words for timestep 0. The predicted word ("Les" in the figure above) is then fed back in as the input for the next timestep.
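To make the flow concrete, here is a toy NumPy sketch of the encoder-decoder idea with vanilla RNN cells (all weight names and sizes are illustrative, not from the paper):

import numpy as np

D = 16                                   # hidden / embedding size (illustrative)
V = 10                                   # toy target-language vocabulary size
rng = np.random.default_rng(0)

W_xh, W_hh = rng.normal(size=(D, D)), rng.normal(size=(D, D))   # encoder RNN weights
W_yh, W_ss = rng.normal(size=(D, D)), rng.normal(size=(D, D))   # decoder RNN weights
W_out = rng.normal(size=(D, V))          # decoder state -> vocabulary logits
E_tgt = rng.normal(size=(V, D))          # embeddings of target-language words

def encode(xs):
    # run a vanilla RNN over the source embeddings; the last hidden state is the fixed context
    h = np.zeros(D)
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh)
    return h

def decode(context, start_id, steps):
    # start from the context and feed each predicted word back in as the next input
    s, y_id, out = context, start_id, []
    for _ in range(steps):
        s = np.tanh(E_tgt[y_id] @ W_yh + s @ W_ss)
        y_id = int(np.argmax(s @ W_out))  # greedy choice of the next word
        out.append(y_id)
    return out

source = [rng.normal(size=D) for _ in range(3)]   # e.g. "RNN's really rock" as 3 embeddings
translation = decode(encode(source), start_id=0, steps=4)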

Already, we can think of a few tweaks that could improve on this.

Why should we represent the entire input sentence as one fixed representation? Surely, if there's a really long sentence, a fixed vector won't do justice to the entire context. Here's where RNNs with attention came in, with the paper "Neural Machine Translation by Jointly Learning to Align and Translate".

Encoder-Decoder RNNs with Attention:

Instead of using a fixed context representation, here we calculate a context vector at every timestep by searching through the input sentence. So effectively, given a hidden state of the decoder, the model decides which part of the input sequence is important for it to translate (or generate) the current word.

Encoder-decoder with attention

Let's try to figure out what's going on in the figure above. Our input sequence is given by the words (X_1, X_2, ..., X_T), each of which is fed through a bi-directional RNN whose hidden states are h_1, h_2, .... Without attention, we were taking the final hidden state h_T as our context vector. Here, instead, we take a linear combination of all the hidden states weighted by some weights alpha, combine it with the previous decoder state s_(t-1) to compute the new decoder state s_t, and then predict the translated word Y_t for the current timestep.

The steps of the attention model for a single prediction are as follows:

  1. Encoder & decoder: The encoder stays the same: given the input at timestep t, the new hidden state is computed as a function of the input and the previous hidden state. The decoder used to have a single fixed context vector c, but now at every timestep i of the decoder we use a different context vector c_i.
Decoder: s_i = f(s_(i-1), y_(i-1), c_i)
Encoder: h_t = f(x_t, h_(t-1))

2. Similarity Scores: How do we know how relevant the various hidden states are to predict the current word in the decoder? We compute an alignment score between all the hidden states and the previous decoder state.

Here e_ij is a similarity score between the previous decoder state s_(i-1) and the encoder hidden state h_j, i.e. e_ij = F_att(s_(i-1), h_j). F_att can be an MLP or even a simple dot product that determines the score. The idea is that the higher the score, the more that hidden state (and hence that part of the input) contributes to the current translated word.

3. Attention weights (convert to probabilities): We simply run the similarity scores through a softmax to convert them into a probability distribution. These are our weights alpha from the figure above.

4. Compute context vector: Once we obtain the probability associated with each hidden state of the encoder, we compute the context vector as the weighted sum c_i = sum_j alpha_ij h_j.

5. Predict the translated word using the context vector c_i, the previously predicted word Y_(i-1), and the previous decoder hidden state s_(i-1). These steps are collected as equations below.
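Collecting steps 2 through 5 as equations, in the notation used above (a sketch following the Bahdanau et al. formulation, with F_att the learned alignment function and g the output layer):

e_{ij} = F_{att}(s_{i-1}, h_j)
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}
c_i = \sum_j \alpha_{ij} h_j
s_i = f(s_{i-1}, y_{i-1}, c_i), \qquad p(y_i) = g(y_{i-1}, s_i, c_i)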

Example: English to French
Source sentence: "I am eating pizza"
Translated sentence: "je mange des pizzas"
The intuition is that while predicting the word "mange", the alignment score with the input word "eating" would be the highest.

The most astonishing part is that this is not a supervised process. We are not explicitly telling the model to attend to a particular part of the sequence. Since the entire function is differentiable, during back-propagation the model automatically learns which part of the sequence is the best to attend to.

Disadvantages of using RNNs:

Somehow we want to get away from sequential computation. With RNNs, each hidden state is dependent on the previous one. With modern GPUs, we would like to process the entire input sequence in one go, with one giant matrix multiplication. Additionally, we must make sure that we can handle longer sequences as well.

Generalization #1 — Single query Vector

  1. We have input vectors X (Shape: Nx, Dq) which can be thought of as the hidden states of the encoders in the previous example.
  2. We introduce the query vector q (Shape: Dq) which can be thought of as analogous to the previous decoder state that we were using to compute the similarity scores. Effectively, this can be seen as our query to search for the most similar input.
  3. As our similarity function, we use a scaled dot product, where we take the dot product of the query vector with each input and scale it by the square root of the dimension of the query vector. The score for the i-th input in X is e_i = (q · X_i) / sqrt(Dq). A short sketch of this computation follows the side note below.
Side note: Why do we use the square root?
A very large dot product would produce a softmax output close to 1 for that input and very low probabilities for all the others, which leads to very small gradients during back-propagation. Scaling by the square root of the query vector dimension keeps the scores in a reasonable range and prevents this.
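As a minimal NumPy sketch of this single-query case (variable names are my own, chosen to mirror the notation above):

import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

Nx, Dq = 4, 8                          # 4 input vectors of dimension 8
X = np.random.randn(Nx, Dq)            # input vectors (e.g. encoder hidden states)
q = np.random.randn(Dq)                # a single query vector

e = X @ q / np.sqrt(Dq)                # scaled dot-product scores, shape (Nx,)
alpha = softmax(e)                     # attention weights, sum to 1
y = alpha @ X                          # output: weighted combination of the inputs, shape (Dq,)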

Generalization #2 Multiple Query vectors & the attention Layer

  1. Instead of having a single query vector q, we can have Nq of them and compute the same similarity scores with one matrix multiplication. So now we have Q (Shape: Nq, Dq)
  2. Input Vectors X (Shape: Nx, Dq)

Computations:

  1. Similarities: We compute all the similarities with one matrix multiplication: E = Q X^T / sqrt(Dq), giving a matrix of shape (Nq, Nx).

Each element of this matrix is given by E_ij = (Q_i · X_j) / sqrt(Dq), which is essentially the scaled dot product between the i-th query vector and the j-th input vector.

2. Attention Weights: Just as we were doing in the previous examples, we run our matrix E through a softmax to get a probability distribution over all of our inputs for each query vector.

A = Softmax(E, dim=1)

For 3 query vectors and 3 input vectors: the 3x3 similarity matrix, normalized so that the attention weights for each query sum to 1.

3. Output Vectors: We define the output vectors as linear combinations of the inputs weighted by the attention weights, analogous to how we computed the context vector in the previous example: Y = A X, of shape (Nq, Dq).

Each output is given by Y_i = sum_j A_ij X_j, i.e. a combination of the inputs weighted by the attention weights for query i.

Now we're at a point where we almost have a self-sufficient layer, with inputs (X1, X2, ...) and outputs (Y1, Y2, ...). A short sketch of this computation is given below.
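Extending the earlier sketch to multiple queries, a minimal NumPy version of generalization #2 (again, names are illustrative):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Nx, Nq, Dq = 5, 3, 8
X = np.random.randn(Nx, Dq)            # input vectors
Q = np.random.randn(Nq, Dq)            # query vectors

E = Q @ X.T / np.sqrt(Dq)              # similarity matrix, shape (Nq, Nx)
A = softmax(E, axis=1)                 # attention weights, one distribution per query
Y = A @ X                              # outputs, shape (Nq, Dq)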

Generalization #3 Key and Value Matrix

In the previous generalization, we were using our input vectors twice, once to calculate our similarities and then again to calculate our output.

To separate these two tasks, we create two learnable matrices:

Key Matrix & Value Matrix:

Key matrix: W_K (Shape: Dx, Dq)
Value matrix: W_V (Shape: Dx, Dv)

The idea is to transform our inputs into two different spaces: one used to compute similarities with the queries (the keys), and one used to retrieve the values that get combined into the output.

All Computations involved:

Key vectors: K = X W_K (Shape: Nx, Dq)
Value vectors: V = X W_V (Shape: Nx, Dv)

Notice how we've used the input twice, once to calculate the key vectors and once to calculate the value vectors. Now we compute our similarities with the keys and the output with the values.

Similarity matrix between queries and keys: E = Q K^T / sqrt(Dq) (Shape: Nq, Nx)
Attention weights: A = Softmax(E, dim=1) (Shape: Nq, Nx)
Outputs as a linear combination of the values: Y = A V (Shape: Nq, Dv)

That's it: this is our "attention layer", which we can use to compare our queries with our inputs! A short sketch is given below.
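A minimal NumPy sketch of generalization #3, with the learnable key and value projections (all names are illustrative):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Nx, Nq, Dx, Dq, Dv = 5, 3, 16, 8, 8
X = np.random.randn(Nx, Dx)            # input vectors
Q = np.random.randn(Nq, Dq)            # query vectors
W_K = np.random.randn(Dx, Dq)          # learnable key matrix
W_V = np.random.randn(Dx, Dv)          # learnable value matrix

K = X @ W_K                            # key vectors, shape (Nx, Dq)
V = X @ W_V                            # value vectors, shape (Nx, Dv)
E = Q @ K.T / np.sqrt(Dq)              # similarities between queries and keys, (Nq, Nx)
A = softmax(E, axis=1)                 # attention weights
Y = A @ V                              # outputs, shape (Nq, Dv)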

Generalization #4 Self-attention

We're slowly getting to the main building block of the transformer. In the previous generalization, we had separate query vectors. But what if, within a single sequence, we want to figure out what each word is referring to? In other words, given a word, we would like to compute attention weights over the rest of the sequence. For example, take the sentence "I arrived at the bank after crossing the river". Does "bank" denote a riverbank or an actual bank? To represent "bank", we need context from the entire sentence. The self-attention mechanism does exactly this, and it does it in parallel, unlike RNNs. This is the main workhorse of the transformer: instead of recurrence or convolutions, we use self-attention to model the dependencies between inputs and outputs.

The transformer architecture uses self-attention in 3 distinct ways. Let’s look at the example that was given in the demonstration for the paper. The input sentence is “The animal didn’t cross the street because it was too tired”:

  1. Input-Input: All the words in the input attend to each other.

2. Input-Output: As in our RNN encoder-decoder architecture with attention, during translation we want to know which words in the input sequence correspond to each word of the translation.

3. Output-Output: Instead of the connections, I wanted to show the weights in the corresponding diagram. Do you notice anything different?

In the translated output, it's as if the network never looks ahead in the translated sequence. This is logical, since during translation you don't want the network to attend over words that it hasn't translated yet. In the example above, we see active weights only for the words that came before "die". The nifty trick used here is to set the masked entries of the similarity matrix to a large negative number, so that after the softmax those probabilities go to 0. This is referred to as masked self-attention. A small sketch of the masking is shown below.
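As a small NumPy sketch of that masking trick, using a causal mask (illustrative code, not the paper's implementation):

import numpy as np

N = 4                                              # sequence length
E = np.random.randn(N, N)                          # similarity scores between all pairs of positions
mask = np.triu(np.ones((N, N), dtype=bool), k=1)   # True above the diagonal = "future" positions
E_masked = np.where(mask, -1e9, E)                 # push future scores to a large negative number

# after the softmax, the masked positions get (almost) zero probability
A = np.exp(E_masked - E_masked.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)               # each row attends only to itself and earlier positions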

Computations for self-attention:

  1. Input vectors X (Shape: Nx, Dx)
  2. Key matrix: W_K (Shape: Dx, Dq)
  3. Value matrix: W_V (Shape: Dx, Dv)
  4. Query matrix: W_Q (Shape: Dx, Dq): the only new learnable matrix, which converts our inputs into queries
Query vectors: Q = X W_Q (Shape: Nx, Dq)

That's the only difference: we compute the keys, values, similarities, and outputs in exactly the same way as in generalization #3. A short sketch is given below.
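Putting it together, a minimal NumPy sketch of a single self-attention head (illustrative names):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    Dq = W_Q.shape[1]
    Q = X @ W_Q                        # the queries now come from the input itself
    K = X @ W_K
    V = X @ W_V
    E = Q @ K.T / np.sqrt(Dq)          # similarities between every pair of positions
    A = softmax(E, axis=1)             # attention weights
    return A @ V                       # one output vector per input position

Nx, Dx, Dq, Dv = 6, 16, 8, 8
X = np.random.randn(Nx, Dx)
W_Q, W_K, W_V = (np.random.randn(Dx, d) for d in (Dq, Dq, Dv))
Y = self_attention(X, W_Q, W_K, W_V)   # shape (Nx, Dv)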

Self-attention is permutation equivariant:

This means that any permutation of the inputs produces exactly the same outputs, just permuted in the same way. In other words, if you shuffled the order of the words in the sentence, the model wouldn't know the difference. This is why the authors add an extra positional encoding to the input.
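A quick numerical check of this property, reusing the self_attention sketch above:

perm = np.random.permutation(Nx)          # a random reordering of the input positions
Y_perm = self_attention(X[perm], W_Q, W_K, W_V)

# the outputs are the same vectors, just reordered in the same way
print(np.allclose(Y_perm, Y[perm]))       # True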

Generalization #5 Multi-Headed Self-attention

Multi-Headed self-attention as described in the paper

Instead of having one single self-attention block, we repeat this computation 8 times, each with a different set of key, value, and query matrices to transform our input. We also lower the dimension of each head to keep the computation comparable to that of a single attention head. For instance, if our model dimension is 512 and we split it into 8 heads, we use 64 as the dimension for each head's key, value, and query matrices. Finally, when we concatenate all 8 heads we get an output of dimensionality (Nx, 64x8). But our input is of shape (Nx, 512), and the output should be the same size, so we need another matrix W_O of shape (64x8, 512) to transform the concatenated output of the attention heads back to (Nx, 512).

Example: We have a sentence that needs to be translated: "The boy plays on the field"
We assume an embedding dimension of 512 for each of the 6 words, so our input X is of size (6, 512).
Now, let's assume 8 attention heads and a query dimension of 64 per head. Each of the key, query, and value matrices is then of size (512, 64).
After applying the self-attention mechanism, each of the 8 heads gives us an output of size (6, 64).
Concatenating these 8 matrices gives one matrix of size (6, 64*8), which is finally transformed by the W_O matrix back to the same size as our input, (6, 512).
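A rough NumPy sketch of this multi-head computation (a simplified, loop-based version; the variable names and shapes are my own, not the paper's):

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

N, D_model, H = 6, 512, 8
D_head = D_model // H                            # 64 dimensions per head

X = np.random.randn(N, D_model)                  # one 512-d embedding per word
W_Q = np.random.randn(H, D_model, D_head)        # one set of projections per head
W_K = np.random.randn(H, D_model, D_head)
W_V = np.random.randn(H, D_model, D_head)
W_O = np.random.randn(H * D_head, D_model)       # final output projection

heads = []
for h in range(H):
    Q, K, V = X @ W_Q[h], X @ W_K[h], X @ W_V[h]      # each of shape (N, 64)
    A = softmax(Q @ K.T / np.sqrt(D_head), axis=1)    # (N, N) attention weights
    heads.append(A @ V)                               # (N, 64) output of this head

Y = np.concatenate(heads, axis=1) @ W_O               # (N, 512), same shape as the input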

Other layers

After this self-attention layer, there is a layer normalization followed by a fully connected network, with residual connections between the inputs and outputs of all of these layers. The same block of attention, normalization, and feed-forward network is repeated, and it is used in both the encoder and the decoder. I intend to dig deeper into these in another article with a practical coded example; a rough outline of the ordering is sketched below.
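As a rough outline of how one block stacks these pieces in the paper's "Add & Norm" ordering, reusing the self_attention sketch from earlier (the layer norm and feed-forward network here are bare-bones stand-ins, not the full implementation):

import numpy as np

D = 16
X = np.random.randn(6, D)                               # 6 positions, model dimension 16
Wq, Wk, Wv = (np.random.randn(D, D) for _ in range(3))  # keep the attention output the same size as the input
W1, W2 = np.random.randn(D, 4 * D), np.random.randn(4 * D, D)

def layer_norm(x, eps=1e-5):
    # normalize each position's vector to zero mean and unit variance
    mu, sigma = x.mean(-1, keepdims=True), x.std(-1, keepdims=True)
    return (x - mu) / (sigma + eps)

# residual connection around self-attention, then layer normalization
X = layer_norm(X + self_attention(X, Wq, Wk, Wv))
# residual connection around the position-wise feed-forward network, then layer normalization
X = layer_norm(X + np.maximum(0, X @ W1) @ W2)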
