Transformers: Attention is all you need — Positional Encoding

Shravan Kumar
6 min read · Nov 9, 2023

Please refer to the blogs below before reading this:

Introduction to Transformer Architecture

Transformers: Attention is all you need — Overview on Self-attention

Transformers: Attention is all you need — Overview on Multi-headed attention

Transformers: Attention is all you need — Teacher Forcing and Masked attention

Transformers: Attention is all you need — Zooming into Decoder Layer

What is the need for Positional Encoding?

Positional Encoding:

In RNN-based sequence-to-sequence models, the position of words in a sentence was encoded in the hidden states: because the recurrence processes tokens one after another, some positional information comes for free from the notion of a previous and a next token.

In the transformer architecture, no such information is available to either the encoder or the decoder. Moreover, the output of self-attention is permutation-invariant, so it is necessary to encode the positional information explicitly. Let's take an example: each input token gets a new representation by computing attention weights over all the other tokens and taking the weighted sum over all the timesteps.

Now the question is: if we interchange the positions of two words in the input, for example swap 'movie' and 'transformers', does the new representation change? The representation z1 does not change, because it comes from a summation, so exchanging the order of the words leaves the final representation the same. In other words, the model has no notion of position; it does not know whether the word 'movie' is closer to or farther away from 'I'. So we need to know the position (or relative position) of each word with respect to the other words, i.e., the order in which the sentence is processed, to get the desired output for the sequence.
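To make this concrete, here is a minimal sketch (not from the original post) using NumPy and a toy single-head self-attention with identity query/key/value projections, an assumption made purely for brevity. It checks that the representation computed for 'movie' is identical whether or not 'movie' and 'transformers' are swapped.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                           # toy embedding size
emb = {w: rng.normal(size=d) for w in ["I", "watched", "the", "movie", "transformers"]}

def self_attention(X):
    # toy single-head self-attention: Q = K = V = X (identity projections)
    scores = X @ X.T / np.sqrt(X.shape[1])      # scaled dot-product scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
    return weights @ X                          # each output is a weighted sum over all tokens

original = ["I", "watched", "the", "movie", "transformers"]
swapped  = ["I", "watched", "the", "transformers", "movie"]   # two words interchanged

Z_orig = self_attention(np.stack([emb[w] for w in original]))
Z_swap = self_attention(np.stack([emb[w] for w in swapped]))

# the representation of 'movie' is unchanged by the swap: self-attention is blind to order
print(np.allclose(Z_orig[original.index("movie")], Z_swap[swapped.index("movie")]))  # True
```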

How do we embed positional information into the word embeddings of size 512? How do we add a positional encoding to each input word so that the model can make use of word order?

How do we fill in the elements of the positional vector p0?

  • Could it be a constant vector (i.e., for pj, all elements are set to the constant position value j)?

The challenge with this is that these numbers are much larger than the values we typically initialize parameters to. If we add very large numbers to our word embeddings, the original information of the word gets drowned out and the position term dominates. And since the position value grows linearly with j while the embeddings do not grow at that scale, the problem only gets worse for longer sentences. Hence this solution will not work.
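As a rough numeric illustration (a sketch, assuming word embeddings initialised at a typical scale of about 0.02), compare the norm of a word embedding with the norm of a constant positional vector for position j = 50:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512
word = rng.normal(scale=0.02, size=d_model)   # word embedding at a typical initialisation scale
j = 50
p_j = np.full(d_model, float(j))              # constant positional vector: every element equals j

print(np.linalg.norm(word))                   # ~0.45
print(np.linalg.norm(p_j))                    # ~1131: the position term completely dominates word + p_j
```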

  • Can we use a one-hot encoding for the position j, j = 0, 1, 2, 3, ……, T?

Let's fix the maximum sequence length at, say, 256 or 512. For every position we have a one-hot vector: for position 1, all elements are 0 except a 1 at that location. This produces a very sparse matrix, which is not an attractive way to solve the problem. The bigger problem is that the Euclidean distance between the one-hot encodings of any two positions, whether the 1st and 2nd word or the 1st and 100th, is always SQRT(2). So this does not capture the idea of position: ideally we expect the distance between the positional embeddings of the 2nd and 3rd words to be smaller than the distance between those of the 2nd and 10th words.
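A quick sketch of the SQRT(2) observation: with one-hot position vectors, every pair of distinct positions is exactly the same distance apart, so "near" and "far" positions are indistinguishable.

```python
import numpy as np

T = 256                                            # maximum sequence length
one_hot = np.eye(T)                                # row j is the one-hot encoding of position j

print(np.linalg.norm(one_hot[1] - one_hot[2]))     # 1.4142... (adjacent positions)
print(np.linalg.norm(one_hot[1] - one_hot[200]))   # 1.4142... (far-apart positions): same distance
```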

  • Or learn embeddings for all possible positions?

Let us initialize the position embeddings randomly, just as we initialize any other parameter: one embedding for position 0, one for position 1, and so on, all learned along with the rest of the model. The challenge is that this is not suitable when the sentence length is dynamic. Suppose all sentences in the training data for a translation task are shorter than 40 tokens, but at test time we see sentences of length 44 and beyond; this will not work, because positions beyond 40 were never seen, and hence never learned, during training. So purely random initialization will not work, but initializing the embeddings with suitable values and then training from scratch might (we will look into this later).
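A small PyTorch sketch of this idea (the 40/44 lengths follow the example above): a learned position table simply has no rows for positions beyond the maximum length seen at training time.

```python
import torch
import torch.nn as nn

MAX_LEN = 40                                    # longest sentence length covered during training
pos_embedding = nn.Embedding(MAX_LEN, 512)      # one learnable 512-d vector per position, trained like any other parameter

train_positions = torch.arange(38)              # a 38-token training sentence: fine
print(pos_embedding(train_positions).shape)     # torch.Size([38, 512])

test_positions = torch.arange(44)               # a 44-token test sentence
# pos_embedding(test_positions) would raise an IndexError: positions 40..43 have no learned embedding
```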

So far, the original word embeddings do not contain any position information: if 'I' appears as the first word and also as the 10th word, the transformer receives the same input for both. What we want instead is to distinguish 'I' at position 1 from 'I' at position 10 by explicitly encoding the index of the word's location. The proposal is therefore to take the embedding of 'I' and add to it another 512-dimensional vector that carries its position.

So can we come up with an encoding in which nearby positions have a smaller distance and farther-away positions have a larger distance? Yes we can.

Let us discuss the sinusoidal encoding function:

Hypothesis: Embed a unique pattern of features for each position j and the model will learn to attend by the relative position.

How do we generate the features?

Let's take an example with a maximum sequence length of 256 and a vector size of 512 dimensions. The goal is to generate a pattern over these embeddings in which nearby positions look closer to each other than positions that are farther away.

j -> the position for which we want the encoding (it ranges up to the maximum sequence length, 256 in this example; in general it is T)

i -> the index of each cell in the 512-dimensional vector

d_model -> the dimension of the transformer model = 512; it is the size of the output produced by each transformer block

For a fixed position j, we use the sin() function if i is even and cos() if i is odd. Concretely, following the original paper: PE(j, i) = sin(j / 10000^(i / d_model)) for even i, and PE(j, i) = cos(j / 10000^((i - 1) / d_model)) for odd i. Let's evaluate the function PE(j, i) for j = 0, 1, 2, …, 7 and i = 0, 1, 2, …, 63.
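A sketch of this sinusoidal encoding in NumPy (d_model assumed even), with sines in the even columns and cosines in the odd columns:

```python
import numpy as np

def positional_encoding(T, d_model):
    """Sinusoidal positional encoding (d_model assumed even).

    PE(j, i) = sin(j / 10000**(i / d_model))        for even i
    PE(j, i) = cos(j / 10000**((i - 1) / d_model))  for odd  i
    """
    PE = np.zeros((T, d_model))
    j = np.arange(T)[:, None]                              # positions 0 .. T-1, as a column
    denom = 10000 ** (np.arange(0, d_model, 2) / d_model)  # one wavelength per sin/cos pair
    PE[:, 0::2] = np.sin(j / denom)                        # even columns: sine
    PE[:, 1::2] = np.cos(j / denom)                        # odd columns:  cosine
    return PE

# the toy setting from the text: j = 0,...,7 and i = 0,...,63
PE = positional_encoding(T=8, d_model=64)
print(PE.shape)        # (8, 64)
print(PE[0, :4])       # position 0: [sin(0), cos(0), sin(0), cos(0)] = [0, 1, 0, 1]
```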

This is how the heatmap of the above function looks.

Let's focus on row 0 and row 7, or on row 0 and row 1. Based on the Euclidean distance, rows 0 and 1 are clearly more similar than rows 0 and 7. Let us look at the distance matrix in the chart below.

The interesting observation is that, in every row, the distance increases as we move to the left and right of the zero on the diagonal, and it is symmetric about that position.

If we look graphically at position 0 and how the distance varies from the 0th position embedding to the 2nd position embedding and so on, and similarly pick some arbitrary position, say the 20th, and look to its left and right, we can see that the distances follow the pattern mentioned above.
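The same pattern can be checked numerically (a sketch reusing the sinusoidal scheme defined above): picking position 20 as the reference, the Euclidean distance grows as we move away from it on either side and depends only on the offset, so it is symmetric.

```python
import numpy as np

def positional_encoding(T, d_model):
    # same sinusoidal scheme as in the sketch above
    j = np.arange(T)[:, None]
    denom = 10000 ** (np.arange(0, d_model, 2) / d_model)
    PE = np.zeros((T, d_model))
    PE[:, 0::2], PE[:, 1::2] = np.sin(j / denom), np.cos(j / denom)
    return PE

PE = positional_encoding(T=40, d_model=64)
j = 20                                         # reference position
dist = np.linalg.norm(PE - PE[j], axis=1)      # Euclidean distance from position 20 to every position
for k in (16, 18, 19, 20, 21, 22, 24):
    print(k, round(float(dist[k]), 3))
# distances grow as we move away from position 20 and are symmetric on both sides
```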

At every even-indexed column (i = 0, 2, 4, …, 510), we have a sine function whose frequency decreases (i.e., wavelength increases) as i increases.

Similarly, at every odd-indexed column (i = 1, 3, 5, …, 511), we have a cosine function whose frequency decreases (i.e., wavelength increases) as i increases.

Please do clap 👏 or comment if you find it helpful ❤️🙏

References:

Introduction to Large Language Models — Instructor: Mitesh M. Khapra

