Positional Encodings: let us revise together

Rrohan.Arrora
AI n U
6 min read · Aug 28, 2024


What you think is important is actually surprising…

Photo by Mark König on Unsplash

This blog aims to give you an insight into what positional encodings are and how they are calculated.

Before we dive into the above concepts, let's review the basics.

What is self-attention?

Ans: Embeddings are always a good choice, but they capture only an “average meaning.” A word’s embedding behaves the same across different sentences with respect to semantic meaning, irrespective of the context of the sentence.

e.g., “an apple a day keeps the doctor away” and “Apple is a multinational company.”

Here, the embedding for “apple” captures the average meaning rather than the meaning in the particular sentence in which it is used.

All in all, we need to capture the context of the sentence, and self-attention helps with that. Self-attention not only captures the context of the sentence and produces contextual embeddings, but the whole process of calculating these contextual embeddings also happens in parallel.

I hope you already know what happens inside the Self-attention block.

But self-attention is not the whole story. It may help us capture the contextual meaning of the sentence, but which word comes in what order, at what position? Self-attention does not tell us that.

e.g., “the tiger killed the king” or “the king killed the tiger.”

For both of the above sentences, self-attention will capture the same context (even though the two sentences mean exactly the opposite), and the whole process will happen in parallel.
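
To see this concretely, here is a minimal NumPy sketch of a single self-attention head (the embeddings and weight matrices are random, made-up values): permuting the input words only permutes the output rows, so the attention output itself carries no notion of word order.

```python
import numpy as np

np.random.seed(0)

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: softmax(QK^T / sqrt(d)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

d = 4                                         # toy embedding size
X = np.random.randn(5, d)                     # 5 "words", no position info
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))

perm = np.random.permutation(5)               # shuffle the word order
out_original = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# Shuffling the inputs only shuffles the outputs the same way:
print(np.allclose(out_original[perm], out_shuffled))   # True
```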

And here is where POSITIONAL ENCODING comes into play.

Positional Encodings

Let us start with a scenario. You have to line students up for the assembly, and you have to ensure that Rohan is always placed after Maria, Maria is always placed after John, and there are a total of n students. The easiest way to ensure this is to give the students position numbers, which are easy to remember.

In the same way, we can add a number corresponding to the position of the word to its embedding before giving it to the self-attention block, and that resolves our problem.
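
A tiny sketch of this naive idea (the embeddings are made-up numbers):

```python
import numpy as np

# Toy 4-dimensional embeddings for a 5-word sentence (made-up values).
embeddings = np.random.randn(5, 4)

# Naive positional information: add the raw index 0, 1, 2, ... of each word.
positions = np.arange(5).reshape(-1, 1)      # [[0], [1], [2], [3], [4]]
with_positions = embeddings + positions      # broadcast across dimensions

print(with_positions[4])   # the last word now carries a +4 offset in every dimension
```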

Wait, wait, wait,

It is not that easy. It might be a decent solution, but it is not the best one, as it has a few problems.

Problem 1: Imagine a document consisting of thousands of words. The position numbers are unbounded, so the last word gets a very large value, and neural networks hate big numbers.

Solution: I heard you, and you suggested normalising the positions. Fine, let us normalise the positions.

Let us take an example and understand this. (I know it is simple, but hold on with me 😅)

There are 2 sentences.

“I love machine learning”: for this sentence, the normalised value for “machine” (the 3rd of 4 words) is 3/4.

“I hate machines”: for this sentence, the normalised value for “machines” (the 3rd of 3 words) is 3/3, which is 1.

Now, for the same word position, the normalised value is different in each sentence. This creates a discrepancy in the training data, and the NN hates that (😿).
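
The same arithmetic in a couple of lines, just to make the discrepancy explicit:

```python
def normalised_positions(sentence):
    words = sentence.split()
    return {word: (i + 1) / len(words) for i, word in enumerate(words)}

print(normalised_positions("I love machine learning")["machine"])   # 0.75
print(normalised_positions("I hate machines")["machines"])          # 1.0
# Both words sit at position 3, yet their normalised values differ.
```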

Problem 2: Discrete numbers are used as positions, and the NN suffers from numerical stability and gradient-flow problems.

Problem 3: Relative position is not captured. How far the word “tiger” is from “king” is not captured; only the absolute position is.

Solution:

We want a solution that is

  1. Bounded
  2. Continuous
  3. Able to capture relative position

And periodic functions like sine are the solution.

The sine function as a solution

The sine curve definitely solves the previous three problems:
1. It is bounded (values always lie between −1 and 1).
2. It is continuous.
3. It captures relative position.

Sine function

💡 Do you think we have achieved the solution?
The positional encoding for every word should be unique, but the sine curve is periodic, and its values repeat after a certain interval. That would lead to an even bigger problem: more than one word could end up with the same positional encoding. So even though sine solves the previous three problems, it introduces a new one.

The easiest solution is to use more than one trigonometric function, and to represent each word by a vector. The catch here is that the frequency of every sine/cosine function added is lower than the previous one, as you can see in the figure below.

Sine/cosine curves with decreasing frequencies

Tiger: Vector[Sin(pos)=X, Cos(pos)=Y, Sin(pos/2)=Z, Cos(pos/2)=W]

Now the chances of two words having the same positional encoding are reduced. If the document is huge, introduce more periodic functions, which further reduces the chance of any two words sharing the same positional encoding.
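
A quick sketch of this hand-rolled four-dimensional encoding, using the pos and pos/2 frequencies from the example above (this is not yet the exact formula from the paper):

```python
import numpy as np

def toy_encoding(pos):
    # Two sine/cosine pairs; the second pair runs at half the frequency.
    return np.array([np.sin(pos), np.cos(pos),
                     np.sin(pos / 2), np.cos(pos / 2)])

for pos in range(8):
    print(pos, np.round(toy_encoding(pos), 3))
# Each extra, lower-frequency pair makes it less likely that two positions
# end up with the same (or nearly the same) vector.
```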

Now, as per the research paper “Attention Is All You Need”:

  1. The positional encoding vector is the same size as the embedding vector. In the original paper, 512-dimensional embeddings are used; therefore, 256 unique pairs of sine-cosine functions are used.
  2. We add the embedding vector and the positional encoding vector (rather than concatenating them) and then pass the result to the self-attention block.

We still need to discuss the frequency of the sine-cosine functions, and the paper gives the two functions below to decide it.
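
PE(pos, 2i) = Sin(pos / 10000^(2i/dmodel))
PE(pos, 2i+1) = Cos(pos / 10000^(2i/dmodel))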

I do not know how this formula was derived either, but we can apply it and understand how it works.

pos: refers to the position of the word, which starts from 0.
dmodel: the dimensionality of the embeddings.
i: ranges from 0 to dmodel/2 − 1.

Let us take a model with 6-dimensional embeddings; therefore, the positional encoding vector for TIGER will also be 6-dimensional.

for i=0,
PE(0, 0) = Sin(0/10000⁰) = 0
PE(0, 1) = Cos(0/10000⁰) = 1

for i=1,
PE(0, 2) = Sin(0/10000^(2/6)) = Sin(0/10000^(1/3)) = 0
PE(0, 3) = Cos(0/10000^(2/6)) = Cos(0/10000^(1/3)) = 1

for i=2,
PE(0, 4) = Sin(0/10000^(4/6)) = Sin(0/10000^(2/3)) = 0
PE(0, 5) = Cos(0/10000^(4/6)) = Cos(0/10000^(2/3)) = 1

So, this way, we can calculate the positional encoding for every word, and the whole process happens in parallel.

✈️: The positional encoding of the first word (pos = 0) is always [0, 1, 0, 1, 0, 1, ...]. Here it is 6-dimensional: [0, 1, 0, 1, 0, 1].
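
Putting the formula into code, here is a small NumPy sketch (the function name and shapes are my own choice) that builds the positional-encoding matrix for a whole sequence at once and reproduces the row above:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]            # word positions 0 .. seq_len-1
    i = np.arange(d_model // 2)[None, :]         # pair index 0 .. d_model/2 - 1
    angles = pos / np.power(10000, 2 * i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=6)
print(np.round(pe[0], 3))   # first word:  [0. 1. 0. 1. 0. 1.]
print(np.round(pe[1], 3))   # second word: same formula evaluated at pos = 1
```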

Now, let the embedding for Tiger be [0.2, 0.4, 0.4, 0.5, 0.01, 0.09].

The final vector would be [0.2, 0.4, 0.4, 0.5, 0.01, 0.09] + [0, 1, 0, 1, 0, 1] = [0.2, 1.4, 0.4, 1.5, 0.01, 1.09].

This final vector is a blend of the embedding vector and the positional encoding vector, and it is what gets fed to the self-attention block.
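
In code, with the same toy numbers:

```python
import numpy as np

tiger_embedding = np.array([0.2, 0.4, 0.4, 0.5, 0.01, 0.09])
tiger_position = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])   # pos = 0, dmodel = 6

tiger_input = tiger_embedding + tiger_position   # element-wise add, not concatenate
print(tiger_input)   # [0.2  1.4  0.4  1.5  0.01 1.09]
```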

I hope I was able to structure this in a way that helps you. Do not forget to give a clap.

References:

  1. Attention Is All You Need (Vaswani et al., 2017)
  2. Timo Denk’s Blog
  3. CampusX
