Evolving Self-Attention: Positional Encoding, Multi-Head, and Masked Attention — Transformers, Part 3

Luv Verma
11 min read · May 13, 2023

--

In Part 2 of this series, we took a deep dive into naive self-attention, working through its equations and visualizations and identifying its issues. Our goal was to explore how we could shift from traditional Recurrent Neural Network (RNN) based Seq2Seq models to a new paradigm that leverages attention layers for generation tasks typically handled by RNNs.

(link to part 1)

(link to part 2)

Continuing our journey in Part 3, we will address the issues of naive self-attention using concepts such as Positional Encodings, Multi-Head Attention, the addition of non-linearities, and Masked Multi-Head Attention. This will equip us with most of the fundamental building blocks needed to understand the complete transformer architecture.

Let’s Start!!

Figure 1: Naive-self attention layer (from part 2 of the series)

Looking at the naive self-attention issues one by one

Problem 1: Solved by Positional Encoding

  • Self-attention has no notion of proximity in time (for example, x1, x2, and x3 are processed entirely in parallel, without any regard for their order). If we switched their order, the self-attention layer would not notice.
  • For example, take the sentence: ‘I love dancing with you’. Naive self-attention sees the sentence as a bag of words, so ‘I love dancing with you’ is the same as ‘dancing with love I you’. However, the second sentence doesn’t make any sense.
  • Most permutations of words are nonsense, and some change the meaning; in natural language processing, word order matters. Therefore, the words cannot just be supplied in arbitrary order; we need to keep track of their positions.

How do we preserve the ordering of the words/tokens?

  • In transformers, the notion of position is added to the input (x_t) so that all the x’s in the self-attention layer know their relative positions. In short, the model knows which word is where. Looking at Figure 1, we can simply add the notion of index/time, represented as t in equation 1.
  • This ‘t’ gives information about the order of the tokens in the sequence and allows self-attention to make use of it. In terms of maths, if x_t was the earlier input, we now append ‘t’ to it and treat the result as a new input to the system, denoted x_t hat (equation 2). This x_t hat is a (naive) positional encoding: we have simply appended ‘t’ to the previous input (a sketch of this is given below).
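Equations 1 and 2 are shown as images in the original post; based on the description above, the naive positional encoding plausibly looks like this (the time index is simply appended to the input vector):

```latex
% Naive positional encoding: append the absolute time index t to the input x_t
\hat{x}_t = \begin{bmatrix} x_t \\ t \end{bmatrix}
```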

But this absolute positioning (t) is not that useful (in fact, a bad idea)!!

For example:

  • sentence 1: ‘I love dancing with you’
  • sentence 2: ‘With you I love dancing’

In both sentences, the meaning is the same, but the word order (and thus, the absolute positions of the words) has changed. If we encode only the absolute position ‘t’ as a positional encoding, the model would struggle to understand that the sentences have the same meaning, as it would not be able to capture the relationships between words in different parts of the sequence.

Thus, let’s use a better positional encoding, one that pays more attention to relative positions than to absolute positions.

In short, we want to represent the position in a way that tokens/words with similar relative positions have similar positional encodings.

For example:

  • sentence 1: ‘I love dancing with you’
  • sentence 2: ‘With you, I love dancing’

In both sentences, the meaning is the same, and the relative positions of the phrases “I love” and “dancing with you” are preserved, although the word order (and thus, the absolute positions of the words) has changed.

In transformers, this is achieved by encoding frequencies of the time step instead of the actual time step. In the original paper, the positional encoding is constructed as follows:

  • The positional encoding (represented by the vector p_t, equation 3) has the same length as the input x_t.
  • Every entry in the positional encoding vector p_t is an alternating sine or cosine function (equation 3).
  • The frequency term (equation 4) contains ‘t’, the absolute position of the token in the sequence; ‘d’, the dimension index within the positional encoding vector; and ‘D’, the total number of dimensions of the positional encoding vector. D is 512 in the original paper.
  • The term (10000^(2d/2D)) acts as a frequency term because it scales the input position ‘t’ differently for each dimension ‘d’, resulting in a unique sinusoidal encoding for each position ‘t’ in the sequence. As ‘d’ increases from 0 to D-1, the divisor (10000^(2d/2D)) grows from 1 towards 10000, so the corresponding frequency 1/(10000^(2d/2D)) shrinks from 1 to a very small value, yielding a set of sinusoidal functions with different frequencies (a reconstructed form of equations 3 and 4 is sketched just below).
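Equations 3 and 4 are shown as images in the original post; the standard sinusoidal form from the original transformer paper, which the description above matches (here i indexes dimension pairs), is:

```latex
% Sinusoidal positional encoding (standard form from "Attention Is All You Need",
% used here as a reconstruction of equations 3-4); t is the position,
% i indexes dimension pairs, D is the total dimension of p_t.
p_{t,\,2i}   = \sin\!\left(\frac{t}{10000^{\,2i/D}}\right), \qquad
p_{t,\,2i+1} = \cos\!\left(\frac{t}{10000^{\,2i/D}}\right),
\qquad i = 0, 1, \ldots, \tfrac{D}{2} - 1
```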

This positional encoding works pretty well, as shown in the original transformer paper, and captures relative position information that the absolute encoding (‘t’) could not.
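As a concrete illustration, here is a minimal NumPy sketch of this sinusoidal positional encoding (the function name and shapes are my own choices, not code from the original post):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, D: int) -> np.ndarray:
    """Return a (seq_len, D) matrix of sinusoidal positional encodings."""
    t = np.arange(seq_len)[:, None]      # positions t = 0 .. seq_len-1
    d = np.arange(0, D, 2)[None, :]      # even dimension indices
    freq = 1.0 / (10000 ** (d / D))      # one frequency per dimension pair
    pe = np.zeros((seq_len, D))
    pe[:, 0::2] = np.sin(t * freq)       # even dimensions: sine
    pe[:, 1::2] = np.cos(t * freq)       # odd dimensions: cosine
    return pe

# Example: encodings for a 10-token sequence with D = 512 (as in the paper)
pe = sinusoidal_positional_encoding(10, 512)
print(pe.shape)  # (10, 512)
```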

Incorporating positional encoding in self-attention models

The simple choice is to concatenate them (equation 5), just like we did in equation 2.

However, this is not what is done in the transformer model. Instead, an embedding of the input x_t is created, and the positional encoding is added to it (equation 6):

Equation 6 is just one way to combine x_t and p_t, and it works pretty well according to the transformer paper. The embedding of x_t can be some learned function, for example a few fully connected layers with linearities and non-linearities. A small sketch of this addition is given below.
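For illustration, here is a minimal sketch of equation 6, where a learned embedding of each input token is added element-wise to its positional encoding (the lookup-table embedding and token ids are hypothetical, and it assumes the sinusoidal_positional_encoding function from the sketch above is in scope):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, D, seq_len = 1000, 512, 10
embedding_table = rng.normal(size=(vocab_size, D))      # learned in practice

token_ids = rng.integers(0, vocab_size, size=seq_len)   # toy input sequence

x_emb = embedding_table[token_ids]                      # (seq_len, D) embeddings
p = sinusoidal_positional_encoding(seq_len, D)          # (seq_len, D) encodings

x_hat = x_emb + p   # equation 6: add, rather than concatenate, the encoding
print(x_hat.shape)  # (10, 512)
```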

Thus, using the above concept of sinusoidal positional encoding, we can address self-attention’s inability to keep track of the relative and absolute positions of words/tokens.

Problem 2: Solved by Multi-Head Attention

Each self-attention layer had 1 key, value, and query for each time step (Figure 1). But why restrict ourselves to only 1 key, query, and value per time step? Keys, queries, and values can be thought of as filters in convolutional networks: just as we use many filters per layer there, we can use many key, query, and value triplets here.

For example, consider the sentence: “The cat chased the mouse and it ran away.”

In this sentence, there are multiple pieces of information to process, such as the subject (“the cat”), the verb (“chased”), the object (“the mouse”), and the pronoun (“it”). If we were to use just one key, query, and value pair, the self-attention mechanism might struggle to capture all these different relationships simultaneously.

By using multiple key, query, and value pairs, we can have a more expressive representation for each word. Each set of key, query, and value pairs can be thought of as “filters” that capture different aspects of the sentence. For example:

  • One key, query, and value pair might be specialized in capturing subject-related information, allowing the model to learn the relationship between “the cat” and “it” in the sentence.
  • Another key, query, and value pair might be specialized in capturing verb-related information, focusing on the relationship between “chased” and “ran away.”
  • Yet another key, query, and value pair might be specialized in capturing object-related information, connecting “the mouse” with “it.”

By having multiple key, query, and value pairs, the self-attention mechanism becomes more powerful and flexible, allowing the model to attend to various types of information and relationships within the input sequence. This is similar to using multiple filters in convolutional networks to capture different features from input images.

Going into the mathematical details (the set of equations referred to as equation 7, explained in detail in part 2):

Intuitively, from equation 7, we can say that if only one key and one query are associated with each word, the information learned between the query and the key would be of only one type (say, either a subject or a verb relationship), since the individual matrices Wq and Wk may not encode information about many relationships.

Thus, the idea is to use Multi-head attention

  • Instead of outputting 1 key, query, or value, we can output multiple (Figure 2).

Figure 2: Going from single-head self-attention to multi-head self-attention.

Expanding on Figure 2 further, we will have an attention score for each self-attention head (Figure 3).

Figure 3: Visualization of attention score for the multi-heads for query at time-step ‘l’ = 2 (time-steps for keys are denoted by ‘t’ and for queries are denoted by ‘l’)

In Figure 3, we have three heads, and the attention scores are shown for the query at position 2.

Modifying the first line of equation 7 to incorporate multiple heads, we get equation 8:

To revise: ‘l’ is the position at which we are computing the attention (position 2 in the figure above), ‘t’ is the position whose key the query interacts with (t = 1, 2 and 3 in the figure above), and ‘i’ is the head index (we now have 3 heads above).

Similarly, we can do softmax (equation 9)

Similarly, we can calculate attention for each head (equation 10):

To get the full attention vector for a query ‘l’, we simply concatenate the attention outputs over all heads (equation 11); ‘l’ is 2 in the example of Figure 3. A reconstructed form of equations 8 to 11 is sketched below.
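Equations 8 to 11 are shown as images in the original post; with query position l, key position t, and head index i (and H heads in total), they plausibly read as follows:

```latex
% Per-head scores, softmax weights, per-head attention, and concatenation
% (a reconstruction of equations 8-11 in the notation of this post):
e^{(i)}_{l,t}      = q^{(i)}_l \cdot k^{(i)}_t                                        % eq. 8
\alpha^{(i)}_{l,t} = \frac{\exp\big(e^{(i)}_{l,t}\big)}{\sum_{t'} \exp\big(e^{(i)}_{l,t'}\big)}  % eq. 9
a^{(i)}_l          = \sum_t \alpha^{(i)}_{l,t}\, v^{(i)}_t                            % eq. 10
a_l                = \big[\, a^{(1)}_l ;\; a^{(2)}_l ;\; \ldots ;\; a^{(H)}_l \,\big] % eq. 11
```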

Now one head might become responsible for verbs, another for adjectives, another for subjects, and so on, which is a lot more powerful than using just 1 head. The original transformer paper uses 8 heads. A minimal implementation sketch is given below.
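To make this concrete, here is a minimal NumPy sketch of multi-head self-attention over a single sequence (the function name, shapes, random projection matrices, and the usual 1/sqrt(d) scaling are illustrative assumptions, not code from the original post):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, Wq, Wk, Wv):
    """X: (T, D) inputs; Wq/Wk/Wv: (H, D, d_head) per-head projection matrices."""
    H, D, d_head = Wq.shape
    heads = []
    for i in range(H):
        Q = X @ Wq[i]                           # (T, d_head) queries for head i
        K = X @ Wk[i]                           # (T, d_head) keys for head i
        V = X @ Wv[i]                           # (T, d_head) values for head i
        scores = Q @ K.T / np.sqrt(d_head)      # eq. 8 (plus the usual scaling)
        alpha = softmax(scores, axis=-1)        # eq. 9: attention weights
        heads.append(alpha @ V)                 # eq. 10: per-head attention
    return np.concatenate(heads, axis=-1)       # eq. 11: concatenate all heads

# Toy example: 5 tokens, model dimension 512, 8 heads of size 64
rng = np.random.default_rng(0)
T, D, H = 5, 512, 8
X = rng.normal(size=(T, D))
Wq, Wk, Wv = (rng.normal(size=(H, D, D // H)) * 0.02 for _ in range(3))
out = multi_head_self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 512)
```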

Problem 3: Solved by adding non-linearities

Till now, attention has been obtained as a linear combination of the values (equation 8 of the previous blog, re-written here as equation 12 just for reference).

So, from equation 12, the attention a(l) is linear in the values v(t), and the values v(t) are linear in h(t) (given by the set of equations 13):

The term alpha(l,t) is non-linear and the weight matrix of values (Wv) is a linear transformation.

So every self-attention layer (denoted a_l; since we look at only one layer, no layer index is shown) is a linear transformation of the previous layer, albeit with non-linear weights alpha(l,t). Intuitively, even though self-attention is very good at fetching information from other points in time, the problem is that linear operations are not very powerful at performing complex computations: they can only capture linear relationships between variables, which is limiting for real-world problems that often involve non-linear relationships. (A reconstruction of equations 12 and 13 is given below.)
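Equations 12 and 13 are shown as images in the original post; a plausible reconstruction, with alpha(l,t) being the softmax weights and W_v the learned value projection (the set of equations 13 likely also defines the keys and queries analogously), is:

```latex
% Attention as a linear combination of values (reconstruction of eq. 12),
% with values computed as a linear map of h_t (eq. 13):
a_l = \sum_t \alpha_{l,t}\, v_t              % eq. 12
v_t = W_v\, h_t, \quad k_t = W_k\, h_t, \quad q_t = W_q\, h_t   % eq. 13
```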

So, let’s add some non-linearity into the process!!

  • A very simple way: alternate the self-attention layers with some kind of learned non-linear function. The non-linearity is shown in Figure 4 and is applied independently at every time step (t) (equation 14). This type of neural network, applied at every position after every self-attention layer, is called a position-wise non-linear network or a position-wise feed-forward network (a small sketch follows Figure 4):

Figure 4: Visualizing non-linearity over the output of a single head-self-attention (multi-heads not used for simplicity)
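As an illustration, here is a minimal NumPy sketch of such a position-wise feed-forward network applied to every time step of the attention output (the two-layer shape with a ReLU follows the original transformer paper; the function name and random weights are placeholders of my own):

```python
import numpy as np

def position_wise_ffn(A, W1, b1, W2, b2):
    """Apply the same two-layer non-linear network independently to every time step.

    A: (T, D) attention outputs; W1: (D, D_ff); W2: (D_ff, D).
    """
    hidden = np.maximum(0.0, A @ W1 + b1)   # ReLU non-linearity, per position
    return hidden @ W2 + b2                  # project back to the model dimension

# Toy example: 5 positions, model dim 512, inner dim 2048 (as in the paper)
rng = np.random.default_rng(0)
T, D, D_ff = 5, 512, 2048
A = rng.normal(size=(T, D))                  # pretend this is the attention output
W1, b1 = rng.normal(size=(D, D_ff)) * 0.02, np.zeros(D_ff)
W2, b2 = rng.normal(size=(D_ff, D)) * 0.02, np.zeros(D)
print(position_wise_ffn(A, W1, b1, W2, b2).shape)  # (5, 512)
```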

Problem 4: Solved by Masked Attention

  • Self-attention does not distinguish between the past and the future: every position can look at information from both directions (remember the queries interacting with all the keys?).
  • This is a problem if self-attention is used for the generation/decoding process, because during generation we would like time step ‘t’ to be able to look only at information from time steps up to ‘t’, and to use that information to generate the token at time ‘t+1’.
  • However, as Figure 5 shows, this creates a circular dependency. During generation, the inputs at steps 2 and 3 would be based on the output of step 1, yet the output of step 1 already depends on the inputs at steps 2 and 3. Since those future inputs are not available yet, we cannot use naive self-attention for generation/decoding.

Figure 5: self-attention at step 1 can look at the value at steps 2 and 3, which is based on the inputs at steps 2 and 3 (leading our way to masked attention)

Solving the problem using Masked Attention!!

  • A simple fix: allow self-attention to look into the past, but block it from looking into the future, i.e. delete the connections to future time steps.
  • Mathematically (equation 15), set the attention score to negative infinity whenever the query time step is less than the key time step (i.e. the key lies in the future); a reconstruction of equation 15 is sketched below.
  • Sending this to the softmax (equation 9) then gives an attention weight of exactly zero for every key whose time step is greater than the query’s.
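Equation 15 is shown as an image in the original post; in the notation used above it plausibly reads:

```latex
% Masked attention scores (a reconstruction of equation 15):
% keys in the future of the query (t > l) are sent to minus infinity
e_{l,t} =
\begin{cases}
  q_l \cdot k_t & \text{if } t \le l \\
  -\infty       & \text{if } t > l
\end{cases}
```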

Now, the self-attention model can be used to decode/generate future sequences!!

In practice (in code), equation 16 is followed instead, to avoid dealing with infinities:
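A common way to implement this is to overwrite the scores at masked positions with a very large negative number (standing in for minus infinity) before the softmax. Here is a minimal single-head sketch under that assumption (the function name and the -1e9 constant are my own choices, not from the original post):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_self_attention(Q, K, V):
    """Single-head masked self-attention: position l only attends to t <= l."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (T, T) raw scores
    future = np.triu(np.ones((T, T), dtype=bool), 1)   # True where key t > query l
    scores = np.where(future, -1e9, scores)            # large negative ~ minus infinity
    alpha = softmax(scores, axis=-1)                   # future positions get ~0 weight
    return alpha @ V

# Toy check: 4 positions, 8 dimensions
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
print(masked_self_attention(Q, K, V).shape)  # (4, 8)
```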

Thus, in this blog, we did the following 4 things:

  1. Used positional encodings (on the inputs) to make the model aware of the relative positions of tokens
  2. Used multi-head attention
  3. Alternated self-attention “layers” with non-linear position-wise feed-forward networks
  4. Used masked attention so that the model can be used for generation tasks

That’s all for part 3!!

Having discussed the basic building blocks, in the fourth part of the series we will put them together and walk through the complete transformer architecture.

(link to part 1)

(link to part 2)

(link to part 4)

If you like, please read and clap!!

LinkedIn
