12. Attention Mechanisms — Multi-Head attention

Smita Sasindran
Feb 25, 2021


<Images, explanation and code taken from END course: https://theschoolof.ai/ and from https://github.com/bentrevett/pytorch-seq2seq>

When we are dealing with images, during the course of convolutions we split the image into multiple channels. Each channel then focuses on extracting certain parts of information from the images. As we go across the convolutional blocks, we try to extract edges and gradients, textures and patterns, parts of objects and then finally objects, and at each layer, different channels store different pieces of information.

While working with text, the contextual embeddings are stored as one single channel, and so when we add fully connected layers after this embedding layer, they work on the whole embedding layer data.

In order to extract more relevant information, we can split the (embedding layer and) fully connected layers into blocks of data. Say we have two fully connected layers, FC1 with 1024 inputs and FC2 with 512 inputs. We can split each of them into 4 smaller blocks: blocks of 256 inputs for FC1 and blocks of 128 inputs for FC2. Now when we connect FC1 and FC2, we consider pairs of blocks between these two layers, instead of connecting all inputs of FC1 to all inputs of FC2. This allows each block to learn/focus on a different context, and can be considered analogous to how images are split into channels.

Each of these pairs of blocks now has its own fully connected, parallel path between FC1 and FC2. For simplicity, we will just refer to these paths as channels going forward.
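
As a rough illustration of this idea (the sizes are the ones from the example above; this is only meant to show the block/channel analogy, not code from the actual model):

```python
import torch
import torch.nn as nn

# One big path: every FC1 feature is connected to every FC2 feature
full_path = nn.Linear(1024, 512)

# Split into 4 parallel "channels": each block of 256 FC1 features
# is only connected to its own block of 128 FC2 features
blocks = nn.ModuleList([nn.Linear(256, 128) for _ in range(4)])

x = torch.randn(8, 1024)                     # a batch of 8 inputs
chunks = x.chunk(4, dim=-1)                  # four chunks of 256 features each
out = torch.cat([blk(c) for blk, c in zip(blocks, chunks)], dim=-1)
print(out.shape)                             # torch.Size([8, 512])
```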

Query, Keys and Values:

We can understand these terms better by looking at how search engines work. When we search for a video on YouTube, the search words are our query. The search engine converts the entire query into a vector. It then searches within its titles to find a match. The titles are stored in a database somewhere and have been vectorized in the same manner as the search query. When we find a key vector that matches the query vector, we retrieve the video corresponding to that key from the database and return it to the user. This video is the value.

Creating Contextualized vectors

Assume we have an embedding layer of size 768. Our current input is a sentence like ‘Walk to the bank’. Here, the word ‘bank’ can either mean the financial institution, or it can refer to the river bank.

For self attention, we take the embedding representation of these input words, so we get four vectors - one for each word. We create three copies of this set of vectors, which are known as key, query and value.

We now take the scalar (dot) product between the word embedding vectors from the query set and the key set. This gives us a correlation score between each pair of words from the query and the key, arranged as a matrix.

With the scalar product, we do not have any control over the amplitude of the resulting values. If these values grow too large, back-propagation suffers (the softmax saturates and the gradients become very small). So we divide the scalar products by the square root of the size of the vector. This is the scaling factor the authors of the ‘Attention Is All You Need’ paper chose to keep the amplitude of the products in check.

We then pass the scaled correlation matrix through a softmax layer. This amplifies the differences between the values, ensures every value is between 0 and 1, and makes each row sum to 1.


Reason for the above operations:

“walk by river bank” — when this sentence is converted to embeddings, the word ‘bank’ will contain information both for the place (i.e. the river bank) and for money/financial institutions. We need a way to suppress the monetary context in the word embedding for ‘bank’, and to strengthen the place-related context.

To do this, we take our word embedding values (our copy of input, or query vectors), and multiply them with the softmax correlation matrix that we computed before. When we do this, we get a new set of contextualized word embeddings.

Input embeddings: 4 x 768 (four words, each with an embedding of size 768). The correlation matrix is the product of the 4 x 768 query matrix and the transposed 768 x 4 key matrix.
We get a 4 x 4 correlation matrix which, after scaling and softmax, is multiplied with the input embeddings to get the final contextualized word embeddings. In this new set of word embeddings, each vector is more contextualized, and the influence of unrelated words (e.g. ‘by’) has been reduced. Since we are not using RNNs here, and the whole sentence is sent in as one input, this contextualized embedding tells the model which words to focus on.
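
A minimal sketch of this computation, with random numbers standing in for the real word embeddings and the shapes from the example above:

```python
import math
import torch
import torch.nn.functional as F

embeddings = torch.randn(4, 768)                 # 4 words, embedding size 768
query, key, value = embeddings, embeddings, embeddings   # three copies of the input

scores = query @ key.T                           # 4 x 4 correlation (dot-product) matrix
scores = scores / math.sqrt(query.size(-1))      # scale by the square root of the vector size
weights = F.softmax(scores, dim=-1)              # each row now sums to 1

contextualized = weights @ value                 # 4 x 768 contextualized word embeddings
print(weights.shape, contextualized.shape)       # torch.Size([4, 4]) torch.Size([4, 768])
```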

Reducing input size using FC layers

If our embedding layer has 768 inputs, we don’t need to work with the whole set of inputs to create the contextualized vectors. Instead, we can feed the 768 long vector to a neural network, and return a smaller-sized vector which we can then use for further operations.


For the Query, Key and Value vectors, we now have three separate fully connected layers. Each fully connected layer takes our 768-long input embedding and converts it into a smaller dimension. Since the number of parameters is reduced, the following matrix multiplications also become faster. We keep three separate FC layers instead of a single shared one because this gives better performance: each of the three FCs learns a slightly different representation of the same input embedding.
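
A minimal sketch of these three projections; the reduced size of 192 is an arbitrary choice just for illustration:

```python
import torch
import torch.nn as nn

hid_dim, reduced_dim = 768, 192              # reduced_dim is an arbitrary choice here

fc_q = nn.Linear(hid_dim, reduced_dim)       # separate FC layers for Query, Key and Value
fc_k = nn.Linear(hid_dim, reduced_dim)
fc_v = nn.Linear(hid_dim, reduced_dim)

x = torch.randn(4, hid_dim)                  # the four word embeddings from before
Q, K, V = fc_q(x), fc_k(x), fc_v(x)          # each FC learns its own view of the same input
print(Q.shape)                               # torch.Size([4, 192])
```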

Multi-head Attention Network:

In the above section, we had the following setup:

The input sentence was represented as a set of word embeddings of length 768 each. This input was duplicated as the Query, Key and Value vectors. These Query, Key and Value inputs were passed through their own fully connected layers to reduce the parameters. The reduced Query and Key inputs were used to create a correlation matrix, which was scaled down and passed through a softmax to get a weighted correlation matrix. The reduced Value vector was multiplied with this correlation matrix to get our final contextualized embedding vector, which can be fed to the decoder.


We can further enhance this contextualized vector representation by also splitting the single input vector into multiple smaller blocks. Say we split our 768-sized embedding vectors into 4 blocks. Each of these four blocks becomes a separate input block, and the operations described above (Query, Key, Value, FCs, correlation matrix) are repeated for each input block. We now get n chunks (here n = 4) of contextualized vectors, each representing some concept relating to that word.

The embeddings we use in the embedding layer are learned embeddings, so they carry meaningful information/concepts for each word, and different concepts live in different positions of the embedding. After we split the input into blocks, each block may contain some concept. When we run the attention mechanism on top of this, the final blocks of the contextualized vector have more fine-grained control over the concept that they represent with respect to the other words present in the input.

This is called the Multi-Head Attention model: the input has been split into multiple heads, and we run the attention mechanism separately on each of these heads.

Code

The encoder-decoder network with multihead attention model is represented below:

The encoder is on the left, and the decoder on the right. As explained in the previous section, the multi-head attention network takes three inputs: Query, Key and Value. For the encoder, all three are the same as the input embedding, with additional positional embedding information.

The decoder has two MultiHead attention networks.

Encoder

The complete encoder model works as follows.

The inputs are passed to the embedding layer to get the token embedding vectors. These are then added to the positional embedding vectors to give the final input vectors for the Multi-Head attention model.

The output of the Multi-head attention block is combined with its input through a residual connection, and the result is normalized. This is then sent to a (position-wise) fully connected layer.

The code is split into:

  1. Encoder class, which readies the input and positional vectors,
  2. MultiHeadAttentionLayer class, which contains the attention layers
  3. EncoderLayer class which encapsulates the multihead layers and returns the final encoder output

Encoder Class
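
The original post showed the code as screenshots. Below is a sketch of the Encoder class, adapted from the pytorch-seq2seq tutorial linked at the top (it assumes the EncoderLayer class defined further down; the line numbers in the notes that follow refer to the original listing, so they may not match this sketch line for line):

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hid_dim, n_layers, n_heads, pf_dim,
                 dropout, device, max_length=100):
        super().__init__()
        self.device = device
        self.tok_embedding = nn.Embedding(input_dim, hid_dim)    # input (token) embeddings
        self.pos_embedding = nn.Embedding(max_length, hid_dim)   # learned positional embeddings
        self.layers = nn.ModuleList([EncoderLayer(hid_dim, n_heads, pf_dim, dropout, device)
                                     for _ in range(n_layers)])  # n encoder layers
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)

    def forward(self, src, src_mask):
        # src = [batch size, src len], src_mask = [batch size, 1, 1, src len]
        batch_size, src_len = src.shape[0], src.shape[1]
        # positions 0 .. src_len-1, repeated for every sentence in the batch
        pos = torch.arange(0, src_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        # scale the token embeddings and add the positional embeddings
        src = self.dropout((self.tok_embedding(src) * self.scale) + self.pos_embedding(pos))
        # output of each layer becomes input to the next
        for layer in self.layers:
            src = layer(src, src_mask)
        return src  # [batch size, src len, hid dim]
```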

Encoder Code Explanation:

  • Line #15 and #16: we create the input and positional embedding vectors, which are added in the forward function
  • Line #18: n layers of the Multi-head attention model (the encoder layers) are created here
  • Line #27: the scale factor is the square root of the embedding vector size; in the forward function, the token embeddings are multiplied by this factor before the positional embeddings are added (the scaling of the correlation matrix inside the attention layer uses its own factor, the square root of the per-head dimension)
  • Line #34: the batch dimension comes first; this follows from how the dataset/iterator was created
  • Line #37: Create the positional vector which will be passed to positional embedding layer
  • Line #41: We scale the token embedding vector with the scaling factor, and then add it with the positional embedding vector
  • Line #45: Loop over all the multi-head layers. Output of previous layer becomes input to the next layer
  • Line #46: Masking inputs. The source mask prevents the padding tokens from being attended to in the multi-head attention network, since they hold no information. The source mask follows the shape of the source and is 1 where there is no pad token and 0 where the pad token is present

Multi-Head Attention Layer Code

Each multi-head attention layer contains the following blocks: the Query, Key and Value fully connected layers, the scaled dot-product attention (matrix multiply, scale, mask, softmax), and a final output fully connected layer.
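
A sketch of the attention layer itself, again adapted from the tutorial (same imports as the Encoder sketch above):

```python
class MultiHeadAttentionLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, dropout, device):
        super().__init__()
        assert hid_dim % n_heads == 0
        self.hid_dim = hid_dim
        self.n_heads = n_heads
        self.heads_dim = hid_dim // n_heads            # dimension of each head
        self.fc_q = nn.Linear(hid_dim, hid_dim)        # FC layers for Query, Key and Value
        self.fc_k = nn.Linear(hid_dim, hid_dim)
        self.fc_v = nn.Linear(hid_dim, hid_dim)
        self.fc_o = nn.Linear(hid_dim, hid_dim)        # output FC layer
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([self.heads_dim])).to(device)

    def forward(self, query, key, value, mask=None):
        batch_size = query.shape[0]
        Q, K, V = self.fc_q(query), self.fc_k(key), self.fc_v(value)
        # split into n_heads heads of heads_dim each: [batch, n_heads, len, heads_dim]
        Q = Q.view(batch_size, -1, self.n_heads, self.heads_dim).permute(0, 2, 1, 3)
        K = K.view(batch_size, -1, self.n_heads, self.heads_dim).permute(0, 2, 1, 3)
        V = V.view(batch_size, -1, self.n_heads, self.heads_dim).permute(0, 2, 1, 3)
        # scaled dot product: [batch, n_heads, query len, key len]
        energy = torch.matmul(Q, K.permute(0, 1, 3, 2)) / self.scale
        if mask is not None:
            energy = energy.masked_fill(mask == 0, -1e10)   # hide pad (and future) tokens
        attention = torch.softmax(energy, dim=-1)
        x = torch.matmul(self.dropout(attention), V)        # weighted sum of the values
        # bring the heads back together: [batch, query len, hid dim]
        x = x.permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.hid_dim)
        return self.fc_o(x), attention
```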

MultiHeadAttentionLayer Code Explanation:

  • Line #9: heads_dim is the dimension of each head after we split the input into n_heads heads (hid_dim / n_heads)
  • Line #11, 12, 13: Define the FC layers for Query, Key and Value
  • Line #29, 30, 31: We pass the q, k and v inputs through their fully connected layers to get the transformed Q, K and V outputs. These are still of size [batch_size, src_len, hid_dim]
  • Line #37, 38, 39: We split the Q, K and V inputs into n heads

Q.view(batch_size, -1, self.n_heads, self.heads_dim) reshapes the batch of Q inputs: the -1 lets PyTorch infer the sequence-length dimension, and the hid_dim dimension is split into n_heads blocks of size heads_dim each. heads_dim is nothing but hid_dim/n_heads, the dimension left after the split

permute(0, 2, 1, 3) — 0 is the batch dimension. We swap the sequence-length and heads dimensions so that the shape becomes [batch_size, n_heads, src_len, heads_dim], which lets the following matrix multiplications run per head

  • Line #45: Multiply Query and Key (needs permutation), and then scale the product
  • Line #49: If a padding token is present, the mask is 0 at that position. Wherever the mask is 0, we fill the corresponding energy value with a very large negative number (e.g. -1e10), so that after the softmax its attention weight is practically 0
  • Line #52: Pass the correlation matrix to Softmax layer
  • Line #56 : multiply this attention vector to the Value vector to get the contextualized vector
  • Line #60: The dimension of x still contains n_heads, but the final output needs all heads concatenated back together. We do this permutation so that we can concatenate correctly (.contiguous() lays the permuted data out in contiguous memory, which is required before .view() can reshape it)
  • Line #64: Concatenate back to the hid_dim vector, so that the contextualized vector is same as the size of the input vector
  • Line #68: Send this input to another fully connected layer, and return the attention matrix as well for visualization

Encoder Layer Class

This contains the attention layer block, the residual connections, the FC layer and the normalization layers.
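
Roughly, the layer looks like this (a sketch adapted from the tutorial; it uses the MultiHeadAttentionLayer above and the PositionwiseFeedforwardLayer from the next section):

```python
class EncoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout, device):
        super().__init__()
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, pf_dim, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src, src_mask):
        # self attention: Query, Key and Value are all the source
        _src, _ = self.self_attention(src, src, src, src_mask)
        # residual connection followed by layer normalization
        src = self.self_attn_layer_norm(src + self.dropout(_src))
        # pointwise feed-forward network
        _src = self.positionwise_feedforward(src)
        # second residual connection and layer normalization
        src = self.ff_layer_norm(src + self.dropout(_src))
        return src
```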

EncoderLayer code explanation

  • Line #24: The inputs to the multi-head attention layer are source, source, and source. We also pass the source mask so that we can ignore pad tokens
  • Line #27: Residual connection, adds original source to the output of the attention model. This is then passed to a normalization layer (self_attn_layer_norm)
  • Line #32: Pass the output through a pointwise feed forward network
  • Line #35: Residual connection, adds attention output and pointwise feed-forward network output, and passes it through a normalization layer

Pointwise FeedForward Layer

  • The input is transformed from the hidden dimension (hid_dim) to some larger dimension (pf_dim), passed through a ReLU activation and dropout, and then projected back down to hid_dim
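
A sketch of this layer, adapted from the tutorial:

```python
class PositionwiseFeedforwardLayer(nn.Module):
    def __init__(self, hid_dim, pf_dim, dropout):
        super().__init__()
        self.fc_1 = nn.Linear(hid_dim, pf_dim)   # expand to the larger pf_dim
        self.fc_2 = nn.Linear(pf_dim, hid_dim)   # project back down to hid_dim
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = self.dropout(torch.relu(self.fc_1(x)))
        return self.fc_2(x)
```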

Decoder

The Decoder is similar to the Encoder, except that it has two multi-head attention layers.

The first, masked multi-head attention layer is slightly different from the encoder's multi-head attention: its mask hides the future target tokens as well, not just the padding.

There are no RNNs in this model, so there is no recurrence. Therefore, instead of sending one word at a time, we send in all the words of the ground truth (i.e. the target sentence) at once. But when we are making the prediction for the next word, we must not look at the ground-truth value of that word; we should only look at the words seen so far. Therefore the future words need to be masked.
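
For example, for a four-word target, the "no peeking ahead" part of this mask can be built with torch.tril; this is only a minimal sketch, and the full target mask in the model also accounts for padding (see the Seq2Seq class at the end):

```python
import torch

trg_len = 4
# lower-triangular matrix: position i may only attend to positions 0 .. i
sub_mask = torch.tril(torch.ones((trg_len, trg_len))).bool()
print(sub_mask)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```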

Decoder Class

This is very similar to the Encoder class; the main differences are that each decoder layer also receives the encoder output (plus both masks), and that the decoder ends with a linear layer projecting to the output vocabulary.
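
A sketch of the Decoder class, adapted from the tutorial (it assumes the DecoderLayer defined in the next section):

```python
class Decoder(nn.Module):
    def __init__(self, output_dim, hid_dim, n_layers, n_heads, pf_dim,
                 dropout, device, max_length=100):
        super().__init__()
        self.device = device
        self.tok_embedding = nn.Embedding(output_dim, hid_dim)
        self.pos_embedding = nn.Embedding(max_length, hid_dim)
        self.layers = nn.ModuleList([DecoderLayer(hid_dim, n_heads, pf_dim, dropout, device)
                                     for _ in range(n_layers)])
        self.fc_out = nn.Linear(hid_dim, output_dim)   # projects to the output vocabulary
        self.dropout = nn.Dropout(dropout)
        self.scale = torch.sqrt(torch.FloatTensor([hid_dim])).to(device)

    def forward(self, trg, enc_src, trg_mask, src_mask):
        batch_size, trg_len = trg.shape[0], trg.shape[1]
        pos = torch.arange(0, trg_len).unsqueeze(0).repeat(batch_size, 1).to(self.device)
        trg = self.dropout((self.tok_embedding(trg) * self.scale) + self.pos_embedding(pos))
        for layer in self.layers:
            # each layer also receives the encoder output and both masks
            trg, attention = layer(trg, enc_src, trg_mask, src_mask)
        # output = [batch size, trg len, output dim]
        return self.fc_out(trg), attention
```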

DecoderLayer class
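
A sketch of the DecoderLayer, adapted from the tutorial; the line numbers in the notes below again refer to the original listing:

```python
class DecoderLayer(nn.Module):
    def __init__(self, hid_dim, n_heads, pf_dim, dropout, device):
        super().__init__()
        self.self_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.enc_attn_layer_norm = nn.LayerNorm(hid_dim)
        self.ff_layer_norm = nn.LayerNorm(hid_dim)
        self.self_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.encoder_attention = MultiHeadAttentionLayer(hid_dim, n_heads, dropout, device)
        self.positionwise_feedforward = PositionwiseFeedforwardLayer(hid_dim, pf_dim, dropout)
        self.dropout = nn.Dropout(dropout)

    def forward(self, trg, enc_src, trg_mask, src_mask):
        # masked self attention over the target
        _trg, _ = self.self_attention(trg, trg, trg, trg_mask)
        trg = self.self_attn_layer_norm(trg + self.dropout(_trg))
        # encoder attention: Query from the target, Key and Value from the encoder output
        _trg, attention = self.encoder_attention(trg, enc_src, enc_src, src_mask)
        trg = self.enc_attn_layer_norm(trg + self.dropout(_trg))
        # pointwise feed-forward, residual connection and layer norm
        _trg = self.positionwise_feedforward(trg)
        trg = self.ff_layer_norm(trg + self.dropout(_trg))
        return trg, attention
```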

  • Line #28: The inputs to the first attention layer are target, target, target (Q, K, V), along with the target mask
  • Line #36: The inputs to the second attention layer are the decoder target as the Query, and the encoder output as both the Key and the Value,
    i.e. the Query now comes from the target while the Keys and Values come from the encoded source
  • This encoder attention lets every target position look over the whole encoded source sentence, so the decoder can decide which source words matter most when predicting the next target word

Sequence 2 Sequence Model
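
To tie everything together, below is a sketch of the Seq2Seq wrapper from the same tutorial: it builds the source mask (1 where the token is not a pad) and the target mask (pad mask combined with the lower-triangular mask), encodes the source, and then decodes the target.

```python
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, src_pad_idx, trg_pad_idx, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_pad_idx = src_pad_idx
        self.trg_pad_idx = trg_pad_idx
        self.device = device

    def make_src_mask(self, src):
        # 1 where the token is not <pad>, 0 where it is: [batch size, 1, 1, src len]
        return (src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)

    def make_trg_mask(self, trg):
        # pad mask combined with the lower-triangular "no peeking ahead" mask
        trg_pad_mask = (trg != self.trg_pad_idx).unsqueeze(1).unsqueeze(2)
        trg_len = trg.shape[1]
        trg_sub_mask = torch.tril(torch.ones((trg_len, trg_len), device=self.device)).bool()
        return trg_pad_mask & trg_sub_mask    # [batch size, 1, trg len, trg len]

    def forward(self, src, trg):
        src_mask = self.make_src_mask(src)
        trg_mask = self.make_trg_mask(trg)
        enc_src = self.encoder(src, src_mask)                        # encode the source
        output, attention = self.decoder(trg, enc_src, trg_mask, src_mask)
        return output, attention
```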
