Transformer: The Self-Attention Mechanism

Sudipto Baul
Machine Intelligence and Deep Learning
11 min read · May 2, 2022


This post gives a brief overview of the popular self-attention mechanism introduced in the ‘Attention is All You Need’ paper, which has become a de facto standard for machine learning tasks nowadays.

A video presentation is available at this link.

The implementation can be found at this link.

Blog by: Zubaidah Al-Mashhadani, Sudipto Baul

Intuition of the paper:

The Transformer proposes a simple network architecture based on the attention mechanism. Thanks to parallelization, it produces translations of superior quality while taking significantly less time than sequential models such as recurrent neural networks. Moreover, the paper shows that the Transformer generalizes well to other tasks, whether the training data is large or sparse.

So, what is the transformer?

If we think of the Transformer in the simplest way, it is basically a black box that takes the input we want to translate and produces the translated output. Looking inside this black box, we see that it consists of an encoder component and a decoder component.

The encoding component is a stack of six encoders, and the decoding component is likewise a stack of six decoders. The encoders all share an identical structure: every encoder has two sublayers, a self-attention layer and a feed-forward neural network. The input first goes through self-attention and then through the feed-forward layer. The decoder has the same two layers plus a third one that sits between them, called encoder-decoder attention, which helps the decoder focus on relevant parts of the input sentence.

Let’s take a step back to see how exactly the model turns an input sentence into the desired translation. In Natural Language Processing (NLP), words are generally treated as distinct inputs that relate to one another through their meanings. Since words cannot be passed to a model or neural network without being encoded in numerical form, we transform each word into a vector using an embedding algorithm. There are multiple approaches to generating word embeddings, which mainly fall into two categories: probabilistic approaches and count-based approaches. The choice of word embedding is a significant preprocessing step when performing an NLP task. Some widely used embedding methods are word2vec, GloVe, and contextual models such as BERT.

The embedding only happens in the bottom-most encoder; every encoder receives a list of vectors, each of size 512. After the words of the input sentence are embedded, each vector flows through the two sublayers of the encoder. Here we can observe a very important property of the Transformer: it works in parallel. In other words, the word at each position flows through its own path in the encoder. In an RNN such as an LSTM, by contrast, data flows sequentially, which takes more time; the Transformer is much faster because of this parallelism. Figure 1 shows the architecture of the Transformer.

Figure 1. Transformer Architecture

The Encoder:

The encoder maps an input sequence of symbol representations (x1, x2, …, xn) to a sequence of continuous representations Z = (z1, …, zn); given Z, the decoder generates an output sequence (y1, …, ym) of symbols one element at a time. The Transformer implements this using stacked self-attention and point-wise, fully connected layers for both the encoder and the decoder. As explained earlier, the encoder receives a list of vectors as input. It processes this list by passing it to the self-attention layer, then to a feed-forward neural network, and sends the output on to the following encoder.

We have mentioned self-attention multiple times so far, but what do we mean by it? To understand this, let’s first explore the idea of attention. As humans, we use visual attention to focus on important features when looking at a picture. In the cat image below, for example, the features highlighted by the red squares let us infer what the image shows, while we pay little attention to background regions such as those in the purple rectangles. If we cover those important features, we will probably not be able to make a correct inference.

In a similar way, we can describe the relationship between words in a given sentence. As illustrated below, when we see “watching” we expect to encounter a word like “movie”, “show”, or “play” very soon. Meanwhile, the word “French” describes the show but is not directly related to “watching”. Self-attention thus allows the model to look at other positions in the input sequence while processing each word, which leads to a better encoding.

Self-Attention:

Now that we have discussed the importance and role of self-attention in Transformers, it is time to explore how it is calculated and implemented:

1- Create three vectors for each word by multiplying each token embedding (1x512) by three trained matrices, the Query, Key, and Value matrices, each of size (512x64). This gives three vectors per word, each of size (1x64).

2- Calculate the score (weight) by taking the dot product of the query vector with the key vectors of the other words (tokens). The score determines how much focus to place on other parts of the input as we encode the word at a specific position.

3- Now divide the score by the square root of the dimension of the key vector, which equals 8 in this case (√64).

4- Apply a SoftMax operation to normalize the scores between 0 and 1 so that all the weights add up to 1.

5- Next, multiply each value vector by its score.

6- Finally, sum up the weighted value vectors to obtain the encoding of the token at this position.

7- Repeat this for all words; we end up with an attention map that fully encodes the words using attention (a minimal code sketch follows the list).
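To make these steps concrete, here is a minimal sketch of a single attention head in PyTorch. The dimensions (512 for the embeddings, 64 per head, giving the scaling factor of 8) follow the paper; the random weight matrices stand in for matrices that would be learned during training.

```python
import torch
import torch.nn.functional as F

d_model, d_k = 512, 64
seq_len = 4                                  # e.g. a four-word input sentence

x = torch.randn(seq_len, d_model)            # packed word embeddings
W_q = torch.randn(d_model, d_k)              # learned in practice, random here
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

# Step 1: query, key, and value vectors for every word
q, k, v = x @ W_q, x @ W_k, x @ W_v          # each of shape (seq_len, d_k)

# Steps 2-3: dot-product scores, scaled by sqrt(d_k) = 8
scores = q @ k.T / d_k ** 0.5                # (seq_len, seq_len)

# Step 4: softmax so that each row of weights sums to 1
weights = F.softmax(scores, dim=-1)

# Steps 5-7: weight the value vectors and sum them for every position
z = weights @ v                              # (seq_len, d_k) encoded tokens
```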

Matrix calculation of Self-Attention:

We start by calculating the Query, Key, and Value matrices. These are obtained by multiplying the matrix of packed embeddings, X, by the weight matrices (WQ, WK, WV) as shown below:

Figure 2. Matrix Calculations of Self-Attention

Here, each row in the matrix X corresponds to a word in the input sentence.

Next, obtaining the scores by multiplying the Query matrix by the Key matrix, dividing by the square root of the key dimension, applying the SoftMax operation, and multiplying the result by the Value matrix can all be done in one step, as follows:
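In one compact formula, this single-step matrix computation is the scaled dot-product attention defined in the paper [1]:

Attention(Q, K, V) = SoftMax(Q·Kᵀ / √d_k)·V

where d_k = 64 is the dimension of the key vectors.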

Multi-Headed Attention:

The self-attention layer is refined further by the addition of “multi-headed” attention. This improves the performance of the attention layer by expanding the model’s ability to focus on different positions while encoding, leading to better predictions. It also gives the attention layer multiple “representation subspaces”: with multi-headed attention we have multiple sets of Query, Key, and Value weight matrices instead of one. The Transformer uses eight attention heads, which means eight sets of Q, K, V matrices and, eventually, eight Z matrices, with the attention calculated separately in each of the eight heads.

This raises a challenge! The feed-forward layer is not expecting eight matrices; it expects a single matrix. To overcome this, we first concatenate the Z matrices from all the attention heads and then multiply by a weight matrix WO that is trained jointly with the model. The result is a single Z matrix that captures the information from all the attention heads, and this matrix is passed to the feed-forward layer.
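The following is a hedged sketch of those eight heads, assuming the paper’s dimensions (d_model = 512, so each of the 8 heads works in 64 dimensions); in a real model all of the weight matrices below would be learned rather than random.

```python
import torch
import torch.nn.functional as F

d_model, n_heads = 512, 8
d_k = d_model // n_heads                     # 64 dimensions per head
seq_len = 4

x = torch.randn(seq_len, d_model)            # packed word embeddings
W_q = [torch.randn(d_model, d_k) for _ in range(n_heads)]
W_k = [torch.randn(d_model, d_k) for _ in range(n_heads)]
W_v = [torch.randn(d_model, d_k) for _ in range(n_heads)]
W_o = torch.randn(n_heads * d_k, d_model)    # mixes the concatenated heads

heads = []
for i in range(n_heads):
    q, k, v = x @ W_q[i], x @ W_k[i], x @ W_v[i]
    attn = F.softmax(q @ k.T / d_k ** 0.5, dim=-1)
    heads.append(attn @ v)                   # one Z matrix per head

z = torch.cat(heads, dim=-1) @ W_o           # single (seq_len, d_model) matrix
```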

Positional Encoding:

As we discussed earlier, each word is represented by a vector produced by an embedding algorithm. Since the Transformer works in parallel, it needs a way to keep track of each word’s position within the sentence. It does this by adding a vector to each input embedding. These vectors follow a pattern that the model learns, which helps it determine the position of each word, or the distance between different words in the sentence. The idea is that these vectors provide meaningful distances between the embedding vectors once they are projected into the Query, Key, and Value vectors during the dot-product attention. This is called positional encoding. For example, if the embedding has a dimensionality of 4, the positional encoding looks as follows:

Figure 3. Positional Encoding

It’s important to note that the positional encodings have the same dimensions as the embeddings so that both can be summed as shown in the previous example.

Sine and Cosine functions of different frequencies are used:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Where pos is the position and i is the dimension index, so that each dimension of the positional encoding corresponds to a sinusoid.
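A short sketch of this sinusoidal encoding, assuming a maximum sentence length of 100 positions and the paper’s embedding size of 512; even dimensions use sine, odd dimensions use cosine.

```python
import torch

def positional_encoding(max_len=100, d_model=512):
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float)            # even dimensions
    angles = pos / 10000 ** (i / d_model)                         # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)                               # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angles)                               # PE(pos, 2i+1)
    return pe                                                     # added to the embeddings

print(positional_encoding().shape)   # torch.Size([100, 512])
```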

Each sublayer in each encoder, both the self-attention layer and the FFNN, has a residual connection around it, followed by a layer-normalization step.
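A minimal sketch of this “add & norm” wrapper, assuming the paper’s d_model = 512 and feed-forward inner size of 2048; the names add_and_norm and ffn are illustrative.

```python
import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))

def add_and_norm(x, sublayer):
    # residual connection around the sublayer, followed by layer normalization
    return layer_norm(x + sublayer(x))

x = torch.randn(4, d_model)                  # four token representations
out = add_and_norm(x, ffn)                   # same shape as x: (4, 512)
```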

To put everything we have learnt about the encoder into one big picture, we can view it as follows:

Figure 4. Encoder Architecture

The Decoder:

We saw that the encoder takes the input sequence and processes it. The output of the top encoder is transformed into a set of attention vectors K and V. These are used by each decoder in its “encoder-decoder attention” layer, which helps the decoder focus on suitable positions in the input sentence. The self-attention layers in the decoder are slightly different from those in the encoder: in the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is achieved by masking the future positions before the SoftMax step in the self-attention calculation. Moreover, the “encoder-decoder attention” layer works like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
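A minimal sketch of how future positions can be masked for a single decoder self-attention head: positions above the diagonal get a score of minus infinity before the SoftMax, so they receive zero attention weight.

```python
import torch
import torch.nn.functional as F

seq_len, d_k = 4, 64
q = torch.randn(seq_len, d_k)                # queries for the output so far
k = torch.randn(seq_len, d_k)

scores = q @ k.T / d_k ** 0.5
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float('-inf'))   # hide future positions
weights = F.softmax(scores, dim=-1)                # each row still sums to 1
```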

Final Linear and SoftMax Layer:

The output of the decoder stack is a vector of floats that we need to turn into a word. This is the role of the final Linear layer, which is followed by a SoftMax layer. The Linear layer is a fully connected neural network that projects the vector produced by the stack of decoders into a much larger vector called the logits vector, where each cell corresponds to the score of a unique word. The SoftMax layer then turns those scores into probabilities that add up to 1.0. Finally, the cell with the highest probability is chosen, and the word associated with it is the output for this time step.
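A sketch of this final step, assuming a vocabulary of about 37,000 tokens as in the paper’s shared EN-DE BPE vocabulary; the name generator is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 37000
generator = nn.Linear(d_model, vocab_size)   # the final fully connected layer

decoder_output = torch.randn(1, d_model)     # decoder output for one time step
logits = generator(decoder_output)           # (1, vocab_size) word scores
probs = F.softmax(logits, dim=-1)            # probabilities summing to 1.0
next_word_id = probs.argmax(dim=-1)          # index of the chosen word
```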

Experimentation:

The original model in the paper [1] was trained on the standard WMT 2014 English-German dataset consisting of 4.5 million sentence pairs and on the WMT 2014 English-French dataset consisting of 36M sentences. Sentence pairs were batched together by approximate sequence length. The Adam optimizer was used with a variable learning rate. Dropout was applied to the output of each sub-layer before it was added to the sub-layer input and normalized, and also to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. Label smoothing was applied as well.
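The “variable learning rate” is the warm-up schedule described in [1]: the rate grows linearly for the first warmup_steps training steps and then decays with the inverse square root of the step number. A small sketch:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    # linear warm-up followed by inverse-square-root decay, as in [1]
    step = max(step, 1)                      # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(4000))   # peak of the schedule, roughly 7e-4
```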

Modification for the project:

For this course project, we applied the model to the standard WMT 2014 English-German dataset only, following the code of [2]. Parts of the training process were also modified: the positional encoding was learned instead of static, a static learning rate was used, and no label smoothing was employed.
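A hedged sketch of the learned positional encoding used in this modification: instead of a fixed sinusoid, position indices are looked up in a trainable embedding table and added to the token embeddings. The variable names and the maximum length of 100 are illustrative.

```python
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 37000, 100
tok_embedding = nn.Embedding(vocab_size, d_model)
pos_embedding = nn.Embedding(max_len, d_model)          # learned, not sinusoidal

tokens = torch.randint(0, vocab_size, (1, 10))          # a batch of one 10-token sentence
positions = torch.arange(10).unsqueeze(0)               # positions 0, 1, ..., 9
x = tok_embedding(tokens) + pos_embedding(positions)    # (1, 10, 512)
```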

Results:

Table 1 below presents different variations of the Transformer’s hyperparameters, along with the perplexity (PPL), BLEU score, and number of trainable parameters for each version on the EN-DE dataset. At the top is the base model, with the lowest hyperparameter values and hence the fewest parameters. At the bottom is the big version of the Transformer, which has the highest number of parameters as a consequence of larger hyperparameter values and achieves the best BLEU and PPL scores.

Table 1. Variation of the transformer model [1]

A comparison of the Transformer’s results with models introduced in previous works is given in Table 2 below. The BLEU score is reported for the English-German (EN-DE) and English-French (EN-FR) tasks, and the training cost for each task is given in terms of floating point operations (FLOPs).

Table 2. Comparison of transformer with other models [1]

Modified model’s results:

We achieved a BLEU score of 35.38 and a perplexity (PPL) of 5.238 with the modified version of the model on the EN-DE translation task, compared to the 26.4 BLEU and 4.33 PPL of the original model. The better BLEU score of the modified version is probably due to the use of a learned positional embedding instead of a static one.

Working of the attention mechanism:

A translation of a sentence from German (src) to English (predicted trg) using the modified model is shown in Figure 5 below. The true translation of the sentence (trg) is also given for comparison. The model translates the sentence almost perfectly.

Figure 5. Translation of a sentence using transformer

To understand the attention mechanism in greater depth, the attention weights of each head were extracted from the model for the above translation. The attention matrices formed by these weights over the translation of each word (EN-DE), for the eight heads used in the model, are given in Figure 6 (lighter color means higher value). The attention values are mostly higher along the diagonal, showing that the model attends to the corresponding word most of the time. However, some lightly shaded off-diagonal areas represent attention given to neighboring words during the translation of a particular word. The attention also differs across heads, showing that each head picks up importance from a different perspective during translation. For example, the true translation of the German word ‘ein’ is ‘a’; in the figure, some heads attend to ‘a’ while translating ‘ein’ and others do not.
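A sketch of how attention maps like those in Figure 6 can be drawn, assuming the extracted weights are held in an array of shape (n_heads, trg_len, src_len) and that src_tokens and trg_tokens are the corresponding word lists; the random array and placeholder tokens below are only stand-ins.

```python
import numpy as np
import matplotlib.pyplot as plt

n_heads, trg_len, src_len = 8, 6, 6
attention = np.random.rand(n_heads, trg_len, src_len)   # placeholder weights
src_tokens = ['<sos>'] + [f'src{i}' for i in range(4)] + ['<eos>']
trg_tokens = ['<sos>'] + [f'trg{i}' for i in range(4)] + ['<eos>']

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
for head, ax in enumerate(axes.flat):
    ax.matshow(attention[head], cmap='viridis')          # lighter = higher weight
    ax.set_xticks(range(src_len))
    ax.set_xticklabels(src_tokens, rotation=90)
    ax.set_yticks(range(trg_len))
    ax.set_yticklabels(trg_tokens)
    ax.set_title(f'head {head}')
plt.tight_layout()
plt.show()
```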

Figure 6. Attention matrices formed by the heads for a translation

Conclusion:

The Transformer is the first sequence transduction model based entirely on attention, replacing the recurrent layers with multi-headed self-attention. For translation tasks, it can be trained significantly faster than recurrent or convolutional architectures, and it outperforms all previously reported ensembles, achieving a new state-of-the-art result on the WMT 2014 English-German translation task. The model can be extended to problems involving other input/output modalities such as image, audio, and video, and local, restricted attention mechanisms can also be investigated.

References:

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).

[2] https://github.com/bentrevett/pytorch-seq2seq/attention
