Explaining the need for positional encodings in Transformers

Why it works from an intuitive perspective

Ngieng Kianyew
7 min read · Jan 6, 2024

Positional encodings are widely used in NLP because the position of a word in a sentence influences the sentence's semantics.

I have always had an intuitive understanding of the need for positional encodings, but in this article I would like to dive deep into how exactly they influence the attention output of the Transformer, focusing mainly on absolute positional encodings. (There are other kinds of positional encodings, which I may cover in future articles.)


In this article, there will be three parts:

  1. Why we need positional encodings
  2. Why absolute positional encoding works
  3. Implementation of absolute positional encoding in PyTorch

Why do we need to add positional encoding?

  • We need to add positional encodings because the attention mechanism in the Transformer does not take into account the position of a word in the sequence

A quick primer on attention:

Recall that the Query, Key, and Value tensors of shape (B, T, D), where B is the batch_size, T is the time_step, and D is the hidden_dimension, are created from a linear transformation of the input token_embeddings of shape (B, T, D)

The attention matrix of shape (B, T, T) is given by the matrix multiplication of the Query of shape (B, T, D) with Key.transpose(1, 2) of shape (B, D, T)

The attention matrix of shape (B, T, T) is then matrix multiplied by the Value of shape (B, T, D), giving us an output of shape (B, T, D)

Suppose we have 3 words in the sequence [this, is, awesome] and batch_size=1. Then attention_matrix[0][0] has shape (T=3,), and it tells us how similar the word this is to each of [this, is, awesome].
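
As a rough sketch of these shapes in PyTorch (the sizes and random inputs below are purely illustrative):

import torch
import torch.nn as nn

B, T, D = 1, 3, 8                                # batch_size, time_steps, hidden_dimension (illustrative)
token_embeddings = torch.randn(B, T, D)

# Linear transformations create Query, Key and Value from the token embeddings
w_q, w_k, w_v = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)
query, key, value = w_q(token_embeddings), w_k(token_embeddings), w_v(token_embeddings)

# Attention matrix: (B, T, D) @ (B, D, T) -> (B, T, T)
# (scaling and softmax are not mentioned in the text above; included here for completeness)
attention = torch.softmax(query @ key.transpose(1, 2) / D ** 0.5, dim=-1)

# Attention output: (B, T, T) @ (B, T, D) -> (B, T, D)
output = attention @ value
print(attention.shape, output.shape)             # torch.Size([1, 3, 3]) torch.Size([1, 3, 8])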

An example:

  1. Given that we have 3 words in the sequence: [this, is, awesome].
  2. Suppose the token embedding of dimension=1 for these three words are: [this=3, is=2, awesome=1]
  3. Assume that we already calculated the attention matrix of shape (T, T)
    (since this is only a single example, batch_size=1, we can omit ‘B’ from the shape)
  4. Suppose the attention scores of the word this against all the words are: [this=4, is=5, awesome=6]
  5. Let us see what the self-attention output for the word this is in the two sentences [this, is, awesome] and [is, this, awesome]
  6. In the first sentence, this is in position 1 ([this, is, awesome]):
    : Attention output for this = [this=3] * [this=4, is=5, awesome=6]
    = [3*4, 3*5, 3*6] = [12, 15, 18]
  7. In the second sentence, this is in position 2 ([is, this, awesome]):
    : Attention output for this = [this=3] * [this=4, is=5, awesome=6]
    = [3*4, 3*5, 3*6] = [12, 15, 18]
  8. We can see that whether the word this comes first or last, the attention output for this is exactly the same; its position does not matter.
  9. This is not desirable, since we know that the meaning of a sentence depends on the positions of its words, which is where positional encodings come in. (The sketch below shows this position-blindness in code.)
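
A minimal sketch of this position-blindness, assuming a bare single-head attention layer with no positional encodings: swapping the first two tokens simply swaps the corresponding rows of the output, so the output for a given word does not depend on where it sits.

import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D = 1, 3, 8
w_q, w_k, w_v = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)

def attention(x):
    query, key, value = w_q(x), w_k(x), w_v(x)
    attn = torch.softmax(query @ key.transpose(1, 2) / D ** 0.5, dim=-1)
    return attn @ value

tokens = torch.randn(B, T, D)          # embeddings for [this, is, awesome]
swapped = tokens[:, [1, 0, 2], :]      # embeddings for [is, this, awesome]

out, out_swapped = attention(tokens), attention(swapped)

# The output row for "this" is identical in both orderings: its position is ignored
print(torch.allclose(out[:, 0], out_swapped[:, 1], atol=1e-6))  # True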

Absolute Positional Encoding

What are Positional Encodings?

  • In layman’s terms, positional encodings are just an embedding matrix that represents each position (or time step) in a sequence with some chosen hidden dimension
  • Honestly speaking, this is no different from a categorical variable whose unique values are the integers from 0 to the maximum length, for which we create an embedding matrix.
  • I tend to think of positional encodings as a categorical variable; there are just specific techniques researchers use to create the embedding so that the model captures more positional information

What makes it “absolute”

  • We describe a positional encoding as ‘absolute’ when the encoding is independent of the positions of other tokens in the sequence. This means we do not care about the relationship between one time step and another.

How does it include positional information?

  • The absolute positional encodings are just added element-wise to the token embeddings, before those embeddings are used to create the Query, Key, and Value vectors (see the short sketch below).
    (If you are wondering why we use addition and not concatenation, the consensus from the blogs I read is that there isn’t a definite answer. You can read the FAQ in this blog for some hypotheses on why it works.)
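
A minimal sketch of where the addition happens (shapes and the random positional encodings here are illustrative; how the encodings are actually built comes later):

import torch
import torch.nn as nn

B, T, D = 1, 3, 8
token_embeddings = torch.randn(B, T, D)      # embeddings for [this, is, awesome]
positional_encodings = torch.randn(T, D)     # one vector per absolute position

# Element-wise addition, broadcast over the batch, done before the Q/K/V projections
x = token_embeddings + positional_encodings  # still (B, T, D)

w_q, w_k, w_v = nn.Linear(D, D), nn.Linear(D, D), nn.Linear(D, D)
query, key, value = w_q(x), w_k(x), w_v(x)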

An Example:

  1. Given that we have 3 words in the sequence: [this, is, awesome].
  2. The token embedding of dimension=1 for these three words are: [this=3, is=2, awesome=1]
  3. Assume that we already calculated the attention matrix of shape (T, T)
    (since this is only a single example, batch_size=1, we can omit ‘B’ from the shape)
  4. Suppose, as before, that the attention scores of the word this against all the words are: [this=4, is=5, awesome=6]
  5. Now if we add an absolute position encoding: [pos1=0.2, pos2=0.5, pos3=0.7] (where position 1 = 0.2, position 2 = 0.5, position 3 = 0.7)
  6. Let us see what happens when we add the absolute positional encoding to the embedding representation for the word this in the given two sentences [this, is, awesome] and [is, this, awesome].
  7. In the first sentence, this is in position 1 ([this, is, awesome]):
    : [this=3, is=2, awesome=1] + [pos1=0.2, pos2=0.5, pos3=0.7]
    = [this=3+0.2, is=2+0.5, awesome=1+0.7]
    → Embedding(this) = 3.2
  8. In the second sentence, this is in position 2 ([is, this, awesome]):
    : [is=2, this=3, awesome=1] + [pos1=0.2, pos2=0.5, pos3=0.7]
    = [is=2+0.2, this=3+0.5, awesome=1+0.7]
    → Embedding(this) = 3.5
  9. Let us see what the self-attention output for the word this is in the two sentences [this, is, awesome] and [is, this, awesome]
  10. In the first sentence, this is in position 1 ([this, is, awesome]). We also know that the token embedding + positional embedding for this is 3.2, therefore the attention output for this = [this=3.2] * [this=4, is=5, awesome=6] = [3.2*4, 3.2*5, 3.2*6] = [12.8, 16.0, 19.2]
  11. In the second sentence, this is in position 2: ([is, this, awesome]).
    We also know that the token embedding + positional embedding for this is 3.5, therefore the attention output for this = [this=3.5] * [this=4, is=5, awesome=6] = [3.5*4, 3.5*5, 3.5*6] = [14.0, 17.5, 21.0]
  12. In contrast to the case without absolute positional encoding, the attention output for the word this is now different and depends on the position of this:
    if this is at position 1, the attention output is [12.8, 16.0, 19.2];
    if this is at position 2, the attention output is [14.0, 17.5, 21.0].
  13. We have now encoded positional information into the attention mechanism, simply by adding numbers that indicate the position of each word
  14. Now that the attention output of the word this depends on the position of this in the sentence, the Transformer can more easily differentiate between the same word at different positions, hopefully leading to better results! (The short sketch below reproduces these numbers.)
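
The arithmetic from steps 7 to 11 can be reproduced directly; the numbers below are just the toy values from this example, not real embeddings:

import torch

attn_scores = torch.tensor([4.0, 5.0, 6.0])   # attention scores of "this" against [this, is, awesome]
pos_enc = torch.tensor([0.2, 0.5, 0.7])       # absolute positional encodings for positions 1, 2, 3

# Sentence 1: [this, is, awesome] -> "this" is at position 1
emb_this = 3 + pos_enc[0]                     # 3.2
print(emb_this * attn_scores)                 # tensor([12.8000, 16.0000, 19.2000])

# Sentence 2: [is, this, awesome] -> "this" is at position 2
emb_this = 3 + pos_enc[1]                     # 3.5
print(emb_this * attn_scores)                 # tensor([14.0000, 17.5000, 21.0000])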

Note that this example shows we just need a different value for each position. For example, if the positional encoding were the same at positions 2 and 10, then the attention output at positions 2 and 10 would be the same, which is not what we want.
If you are like me, you will most probably wonder why we don't just add the position indices themselves, i.e. torch.arange(0, max_sequence_length). This doesn't work because the values can get very large. For example, if your max_sequence_length is 1024, then you are adding 1024 to something very small, since token embeddings are initialized with mean = 0.
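
A quick check of that scale mismatch (assuming roughly unit-scale token embeddings, which is what a standard normal initialization gives):

import torch

token_embeddings = torch.randn(1024, 64)       # roughly unit scale, mean 0
raw_positions = torch.arange(0, 1024).float()  # 0, 1, ..., 1023

print(token_embeddings.abs().mean())           # around 0.8
print(raw_positions.max())                     # tensor(1023.), which would drown out the embeddings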

Implementation in PyTorch

There are a few methods to do it:

  • Absolute positional encoding can simply be a learnable embedding matrix, nn.Embedding(sequence_length, hidden_dim), where each time step is randomly initialized and the neural network learns the best representation for it (sketched below)
  • But the more common way, from the original “Attention is all you need” paper, is to use sinusoidal functions.
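
Before looking at the sinusoidal version, here is a minimal sketch of the learnable variant from the first bullet (the class name and defaults are my own, for illustration):

import torch
import torch.nn as nn

class LearnedPositionalEncoding(nn.Module):
    "Learnable absolute positional encoding: one trainable vector per position."

    def __init__(self, d_model, max_len=5000):
        super().__init__()
        self.pe = nn.Embedding(max_len, d_model)

    def forward(self, x):
        # x: token embeddings of shape (B, T, D)
        positions = torch.arange(x.size(1), device=x.device)   # 0, 1, ..., T-1
        return x + self.pe(positions)                          # broadcast add over the batch

It is used the same way as the sinusoidal module further below: pass in the token embeddings and it returns them with the positional vectors added.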

Intuition for sine and cosine functions

I highly recommend reading the comment below, made by one user, which gives a very convincing intuitive explanation of why this works:

Jiaxuan Wang (https://kazemnejad.com/blog/transformer_architecture_positional_encoding/)

I think a more intuitive explanation of positional embedding is to think about it as a clock (as cos and sin are just concepts from the unit circle).

Every two dimensions of the positional embedding just specify one of the clock’s hands (the hour hand, the minute hand, the second hand, for example).

Then moving from one position to the next position is just rotating those hands at different frequencies. Thus, without formal proof, it immediately tells you why a rotation matrix exists.
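
The implementation below follows the sinusoidal formulation: each pair of dimensions gets a sine and a cosine at a different frequency.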

import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout=0, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)        # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) *
                             -(math.log(10000.0) / d_model))    # (d_model / 2,)
        pe[:, 0::2] = torch.sin(position * div_term)            # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)            # odd dimensions
        pe = pe.unsqueeze(0)                                    # (1, max_len, d_model)
        self.register_buffer('pe', pe)                          # stored, but not a learnable parameter

    def forward(self, x):
        # x: token embeddings of shape (B, T, D); add the encodings for the first T positions
        x = x + self.pe[:, :x.size(1)]
        return self.dropout(x)


x = torch.randn(2, 3, 4)                # token embeddings (B, T, D)
pe = PositionalEncoding(4, max_len=3)   # d_model = D = 4
pe(x)
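
Here pe(x) returns a tensor of shape (2, 3, 4): the token embeddings with the sinusoidal encodings for the first three positions added (the dropout is a no-op since p=0).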

Conclusion

  • Positional encodings are just an embedding matrix that represents each position or time step in a sequence with some chosen hidden dimensions
  • We need positional encodings because the attention mechanism in the Transformer does not take into account the position of a word in the sequence, as shown in the example, leading to the same word at different positions having the same attention output
  • Absolute positional encoding just involves assigning a specific value at each time step or position and then adding them to the token embedding. This works because having different values at different positions will change the original token embedding based on the position of the word. i.e. Embedding(word) is now a function of its position.
  • Absolute Positional Encoding works by giving each position a different value.
