What is Relative Positional Encoding

How does it work and differ from absolute positional encoding?

Ngieng Kianyew
11 min read · Jan 7, 2024

Relative positional encoding is another positional encoding method used in NLP to provide the transformer with information about the positions of the words in the input sequence.

We know that a Transformer model without positional encoding cannot differentiate between the same word appearing at position 1 or at position 2.

There are three parts to this article:

  1. What is the difference between Absolute Positional Encoding and Relative Positional Encoding
  2. What makes Relative Positional Encoding “relative”
  3. Implementation in Pytorch

(Part 1) What is the difference between Absolute Positional Encoding and Relative Positional Encoding

Differences in the information it represents

  • The difference between absolute positional encoding and relative positional encoding is that relative positional encoding takes the relationship between pairwise positions into account, while absolute positional encoding does not.
  • For example, in absolute positional encoding, we create an embedding for each position without caring about other positions. If we have positions 1, 2, and 3, we can randomly initialize the embedding of each position to have a hidden dimension of size 3: position 1 = [0.1, 0.2, 0.3], position 2 = [0.5, 0.1, 0.3], position 3 = [0.4, 0.2, 0.1]. Because we randomly initialize the positional encodings, we do not care about how one position relates to another position.
  • In relative positional encoding, we do not simply randomly initialize an embedding for each position. Instead, we create a pairwise embedding matrix of size (T, 2*T - 1), where the row index represents the word we are interested in, and the column index represents the positional distance from that word (indicated by the row index) to all the other words (indicated by the column index) that can come before and after it. (A small sketch contrasting the two approaches appears after this list.)
  • For example, if we have the sentence [‘this’, ‘is’, ‘awesome’], then relative positional encoding matrix[row-index=0, :] tells us the positional distance from the word at position 1 (‘this’) to all the possible words that can come before and after it, i.e.:
    [2 words before ‘this’,
    1 word before ‘this’,
    0 words before ‘this’,
    1 word after ‘this’,
    2 words after ‘this’]
  • From the above, you can see why the relative positional encoding matrix has shape (T, 2*T - 1). This is what is meant by ‘pairwise’.
  • You can read this excellent post for a more detailed analysis of how it works
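To make the contrast concrete, here is a minimal sketch (my own illustration, not from the original post) of the two embedding tables: absolute positional encoding stores one learnable vector per position, while relative positional encoding stores one learnable vector per possible pairwise distance, looked up by distance + (T - 1).

import torch
import torch.nn as nn

T, D = 5, 11  # example sequence length and embedding dimension

# Absolute: one embedding per position 0..T-1, looked up by position alone
abs_pos_emb = nn.Embedding(T, D)
print(abs_pos_emb(torch.arange(T)).shape)  # torch.Size([5, 11])

# Relative: one embedding per possible distance -(T-1)..+(T-1),
# looked up by the distance between a pair of positions
rel_pos_emb = nn.Embedding(2 * T - 1, D)
distance = torch.tensor(3 - 1)                # position 3 relative to position 1
print(rel_pos_emb(distance + (T - 1)).shape)  # torch.Size([11])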

Difference in how you implement it

  • As opposed to adding the positional information element-wise directly to the token embeddings, relative positional information is added to keys and values on the fly during attention calculation.
  • It turns out that we can just add relative positional information on the fly to the attention matrix as introduced in Huang et al., also known as the music transformer paper. (This will be what I will use for the implementation below)

https://jaketae.github.io/study/relative-positional-encoding/
In Huang et al., also known as the Music Transformer paper, the authors pointed out that calculating relative positional encodings as introduced in Shaw et al. requires O(L²D) memory due to the introduction of an additional relative positional encoding matrix.

  • If you have a keen eye, you will be asking: “Isn’t the pairwise positional encoding of shape (T, 2*T-1)? How do we add it to the attention matrix, which is of shape (B, T, T)?” I will show this in the “Implementation in Pytorch” section below.
  • I think you can already guess that the implementation of absolute positional encoding is very trivial; the implementation of relative positional encoding is still fairly simple, but not quite as trivial.

(Part 2) What makes Relative Positional Encoding “relative”

  • To reiterate, the difference between absolute positional encoding and relative positional encoding is that relative positional encoding takes into account the pairwise distance between each position and every other position
  • Relative Positional Encoding is ‘relative’ because now the positional encoding takes into account the relative distance from the position of a word to all the positions of other words.

Example:

  • Below is the pair-wise distance matrix provided by the author that depicts the relative distance of word_i to other words from word_i-4 to word_i+4
Image by author in the blog post

Using the image above as a reference:

  • Let us see what happens when we fix the position of a word A, and then change the position of a word B.
  • Suppose you have word A at position 1 and word B at position 3. In that case, the positional encoding is based on the difference in the positions of word A and word B: because word B is 2 positions ahead of word A, the positional encoding between word A and word B will be at index 6!
  • Suppose you have word A at position 1 and word B at position 5. Because word B is now 4 positions ahead of word A, the positional encoding between word A and word B will be at index 8!
  • Notice how your positional embedding is now based on the relative distance of word B to word A! (A quick check of this index arithmetic follows this list.)
  • This is different from absolute positional encoding, where we do not care how far apart word B is from word A; in absolute positional encoding, we just add the embedding that represents the position.
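Below is a quick check of that index arithmetic, assuming the 9-entry layout from the image above (distances -4 to +4 for T = 5, so index = distance + (T - 1)); the positions are the hypothetical 1-indexed positions used in the bullets:

T = 5  # sequence length used in the example above

def rel_index(pos_a, pos_b):
    # index into the 2*T - 1 table: distance of word B relative to word A, shifted by T - 1
    return (pos_b - pos_a) + (T - 1)

print(rel_index(1, 3))  # 6 -> word B is 2 positions ahead of word A
print(rel_index(1, 5))  # 8 -> word B is 4 positions ahead of word A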

(Part 3) Implementation in Pytorch

I got this implementation from the following excellent blog post

  1. We know that we need some way to create the relative positional encoding matrix that we see:
Image by author in the blog post

2. We know that for this to happen, the relative positional encoding matrix must be of shape (T, 2*T - 1), where T is the number of time steps.

3. We also know what we want the relative positional encoding for each word to be.
For example, for the sentence [‘this’, ‘is’, ‘very’, ‘awesome’, ‘too’]:

→ The relative position encoding for the word ‘this’ should be:
[index_4, index_5, index_6, index_7, index_8].

→ The relative positional encoding for the word ‘is’ should be:
[index_3, index_4, index_5, index_6, index_7]

→ The relative positional encoding for the word ‘very’ should be:
[index_2, index_3, index_4, index_5, index_6]

→ The relative positional encoding for the word ‘awesome’ should be:
[index_1, index_2, index_3, index_4, index_5]

→ The relative positional encoding for the word ‘too’ should be:
[index_0, index_1, index_2, index_3, index_4]

(If this is unclear, please refer to the blog post in the image above. A short snippet after this list shows how these index patterns can be generated directly.)
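Just to make this target concrete, here is a small snippet (my own illustration) that generates exactly these index patterns: row i should pick columns (T - 1 - i) through (2*T - 2 - i) of the (T, 2*T - 1) matrix.

import torch

T = 5
# row i contains the column indices [T-1-i, ..., 2*T-2-i]
target_indices = torch.arange(T - 1, -1, -1)[:, None] + torch.arange(T)[None, :]
print(target_indices)
# tensor([[4, 5, 6, 7, 8],
#         [3, 4, 5, 6, 7],
#         [2, 3, 4, 5, 6],
#         [1, 2, 3, 4, 5],
#         [0, 1, 2, 3, 4]])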

Now the question is:

  1. How do we get the relative positional encoding matrix?
  2. How do we get the relative positional encoding for each word that we want?

(1) How do we get the relative positional encoding matrix?

  1. We randomly initialize a positional embedding tensor of shape (2*T - 1, D), where D is the same embedding dimension as our Query vector of shape (B, T, D).
  2. We matrix-multiply the Query vector of shape (B, T, D) with the transpose of the positional embedding tensor, of shape (D, 2*T - 1), resulting in the relative positional encoding matrix of shape (B, T, 2*T - 1), which is what we wanted!
import torch

batch_size = 2
time_steps = 5
embedding_dim = 11
positional_embedding = torch.nn.Parameter(torch.randn(2 * time_steps - 1, embedding_dim))
print(positional_embedding.shape) # torch.Size([9, 11])
assert positional_embedding.shape == (2 * time_steps - 1, embedding_dim)

# X can be the Query or Key vectors of shape (B, T, D)
X = torch.randn(batch_size, time_steps, embedding_dim)
relative_positional_encoding = torch.matmul(X, positional_embedding.transpose(0, 1))
print(relative_positional_encoding.shape) # torch.Size([2, 5, 9])
assert relative_positional_encoding.shape == (batch_size, time_steps, 2 * time_steps - 1)

# 5 words
relative_positional_encoding[0] # (check single example)
# tensor([[ 0.5887, -4.4982, -2.5649, 4.4121, 2.2272, -0.7798, 3.9913, -4.8487,
# 1.6224],
# [ 2.1521, 1.9125, -5.2178, 2.5203, -4.3512, 1.1188, -3.2300, -0.0258,
# 3.9102],
# [-0.8744, 2.0198, -0.4761, -0.5517, 2.2594, 2.7325, 1.1605, -2.2300,
# -3.9419],
# [ 4.2827, -1.2526, -1.9999, -0.7072, 0.1548, 0.0283, -4.7288, -3.0938,
# -3.3099],
# [ 0.8030, -3.0945, -1.6351, -2.0284, 9.4586, 8.5905, 0.2924, 6.9596,
# 5.3160]], grad_fn=<SelectBackward0>)

3. Notice that here, we are not explicitly calculating the relative distance of a word to other words. We are leaving it to the neural network to learn it from the weights of the positional embedding (positional_embedding = torch.nn.Parameter(torch.randn(2 * time_steps - 1, embedding_dim))).

  • This is quite intuitive to me:
    → Since we are doing a matrix multiplication of shape (B, T, D) with shape (D, 2*T - 1), this means that for each time step, we are calculating the dot product with all 2*T - 1 possible relative positions.
    → We already know that the dot product is a measure of similarity, so this matrix multiplication is scoring how strongly each position relates to every other relative position!
    → Therefore, instead of explicitly calculating the relative distance of a word to other words, we are giving the neural network the flexibility to decide what relative distances work best for our dataset!

(2) How do we get the relative positional encoding for each word that we want?

This involves some tensor manipulation magic!

This is the relative positional encoding matrix that we got from step (1) above:

Relative positional encoding matrix from step (1)

Let’s review what we want:

For the sentence [‘this’, ‘is’, ‘very’, ‘awesome’, ‘too’], the desired relative positional encodings are:

→ ‘this’: [index_4, index_5, index_6, index_7, index_8]
→ ‘is’: [index_3, index_4, index_5, index_6, index_7]
→ ‘very’: [index_2, index_3, index_4, index_5, index_6]
→ ‘awesome’: [index_1, index_2, index_3, index_4, index_5]
→ ‘too’: [index_0, index_1, index_2, index_3, index_4]

Therefore, these are the positional encodings that we want for each word:

relative position encoding according to relative distance of one word to other words

Notice that this is just a sliding window of size 5: using negative indexing from the end of each row, the window for the first word spans index -5 to index -1, the window for the second word spans index -6 to index -2, and so on.

You could certainly do this iteratively over the starting indices, but to do it efficiently with tensor operations instead, there are 5 steps:

  1. Pad a column of zeros on the right-hand side of the original relative positional encoding of shape (B, T, 2*T - 1) so that it becomes (B, T, 2*T).
    The reason we add this padding is to create the step-wise pattern that you will see later on.
Image from blog
Padding on the right-hand side of the relative positional encoding matrix. (starting index for the last dimension is 5 to nicely show that a column of zero is added)

2. Flatten the padded relative positional encoding from (B, T, 2*T) to (B, 2*T**2). We flatten it because we are about to add additional padding so that we can reshape the matrix to shape (B, T+1, 2*T - 1)

Image from blog
Flattening of the relative positional encoding to add additional padding for reshaping later on

3. Add an additional T - 1 zeros of padding so that we can reshape the matrix to shape (B, T+1, 2*T - 1)

Image from blog
Adding an additional time_steps - 1 padding so that we can reshape the matrix from shape (B, 2 * T**2 + T - 1) to shape (B, T+1, 2*T - 1)

4. Reshape the matrix from (B, 2 * T**2 + T-1) to (B, T+1, 2*T-1).
(Note that we will not be able to reshape this matrix if we do not do the steps 1 to 3!). The pink rectangles are the step-wise pattern created by the paddings.

Image from blog
Reshaped the matrix from (B, 2 * T**2 + T-1) to (B, T+1, 2*T-1).

5. Then we extract the upper right-hand rectangle of shape (T, T)

Extracting the desired relative positional encodings desired for each word

We can compare this to what we initially said we wanted, and see that the 5 steps above give us exactly what we wanted! (A code sketch of these 5 steps follows the image below.)

relative position encoding according to relative distance of one word to other words
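Here is a minimal sketch of the 5 steps consolidated into a single function (my own packaging of the procedure above, assuming the relative_positional_encoding tensor of shape (B, T, 2*T - 1) from step (1)):

import torch
import torch.nn.functional as F

def skew(rel_pos_enc: torch.Tensor) -> torch.Tensor:
    B, T, L = rel_pos_enc.shape              # L == 2*T - 1
    # (1) pad one zero column on the right -> (B, T, 2*T)
    x = F.pad(rel_pos_enc, (0, 1))
    # (2) flatten the last two dimensions -> (B, 2*T**2)
    x = x.reshape(B, 2 * T * T)
    # (3) pad T - 1 zeros at the end -> (B, 2*T**2 + T - 1)
    x = F.pad(x, (0, T - 1))
    # (4) reshape -> (B, T+1, 2*T - 1)
    x = x.reshape(B, T + 1, 2 * T - 1)
    # (5) keep the first T rows and the last T columns: the upper right-hand (T, T) block
    return x[:, :T, T - 1:]

print(skew(relative_positional_encoding).shape)  # torch.Size([2, 5, 5])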

How to use Relative Positional Encoding?

  • So now our relative positional encoding embedding is of shape (B, T, T) and our attention matrix is also (B, T, T).
  • We include the relative positional information by just summing them up element-wise!
Incorporating relation position information to the attention matrix
  • This will then be multiplied with the Value vector, and our Transformer now produces attention outputs that take the relative positions of the words into account. (A self-contained sketch putting everything together follows the snippet below.)
# https://github.com/lucidrains/bottleneck-transformer-pytorch/blob/main/bottleneck_transformer_pytorch/bottleneck_transformer_pytorch.py
# (this attention operates on a 2D feature map; the "sequence" is the flattened H*W grid)
def forward(self, fmap):
    """
    fmap is shape (B, C, H, W)
    """
    heads, b, c, h, w = self.heads, *fmap.shape

    q, k, v = self.to_qkv(fmap).chunk(3, dim = 1)
    q, k, v = map(lambda t: rearrange(t, 'b (h d) x y -> b h (x y) d', h = heads), (q, k, v))

    q = q * self.scale

    sim = einsum('b h i d, b h j d -> b h i j', q, k)
    #### Applying the relative positional embedding here! ####
    # self.pos_emb(q) is the output from what we did in steps (1) and (2)
    sim = sim + self.pos_emb(q)
    ############################################################
    attn = sim.softmax(dim = -1)

    out = einsum('b h i j, b h j d -> b h i d', attn, v)
    out = rearrange(out, 'b h (x y) d -> b (h d) x y', x = h, y = w)
    return out
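And to tie everything together, here is a self-contained sketch (my own, reusing the positional_embedding idea from step (1) and the skew function defined above) of single-head attention where the relative positional bias is simply added to the attention matrix:

import torch

B, T, D = 2, 5, 11
q = torch.randn(B, T, D)
k = torch.randn(B, T, D)
v = torch.randn(B, T, D)

# learnable table of the 2*T - 1 relative positions, as in step (1)
positional_embedding = torch.nn.Parameter(torch.randn(2 * T - 1, D))

# (B, T, 2*T - 1) -> skew -> (B, T, T), as in step (2)
rel = skew(torch.matmul(q, positional_embedding.transpose(0, 1)))

# scaled dot-product attention plus the relative positional bias
sim = torch.matmul(q, k.transpose(1, 2)) / D ** 0.5 + rel
attn = sim.softmax(dim=-1)
out = torch.matmul(attn, v)
print(out.shape)  # torch.Size([2, 5, 11])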

Conclusion

  1. Relative Positional Encoding is different from Absolute Positional Encoding because the embedding is based on relative distance, as opposed to embeddings created independently for each position
  2. Relative Positional Encoding is applied on the fly during attention (to the Key and Value vectors in Shaw et al., or more efficiently by addition to the attention matrix), while Absolute Positional Encoding is added to the input token embeddings
  3. We create the original relative positional encoding by matrix multiplication of the Query vector of shape (B, T, D) with the transpose of a randomly initialized tensor of shape (2*T - 1, D), giving an output of shape (B, T, 2*T - 1)
  4. The final Relative Positional Encoding matrix can be created efficiently via tensor manipulation in 5 steps: (1) adding a zero padding column on the dimension of size 2*T - 1, (2) flattening it, (3) adding T - 1 padding, (4) reshaping the matrix from (B, 2*T**2 + T - 1) to (B, T+1, 2*T - 1), (5) taking the upper right-hand square block of shape (T, T)
  5. The final relative positional encoding matrix is added to the attention matrix
