Decoding Transformers: Inside Positional Encoding

Himanshu Kale
5 min read · Jun 13, 2024


Hey Folks! In our last blog we discussed how the Self Attention block actually works and its significance in the Transformer architecture. In this blog we continue our series “Decoding Transformers” with a rarely discussed part: “Positional Encoding”. So fasten your seat belts and let’s start !!

Photo by Mark König on Unsplash

The Why ?
We have seen that Self Attention blocks convert our static word embeddings into dynamic, contextual ones. Such a powerful block, isn’t it ?? But we also know that it has a drawback: it fails to retain the sequential information. In NLP related tasks, sequential information plays a very vital role in understanding things. When a Self Attention block processes sentences like:
1. Aryan killed Lion &
2. Lion killed Aryan
it considers both of them the same. This issue needs to be addressed !! So if we can also send the sequential information to the Self Attention block, we can solve our problem :)
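
To see this concretely, here is a minimal sketch: a toy numpy self attention with no learned weights, and random vectors standing in for the embeddings of “Aryan”, “killed” and “Lion”. Reversing the word order only reorders the outputs; the contextual vectors themselves stay exactly the same.

import numpy as np

def toy_self_attention(X):
    # Plain dot-product attention with no learned weights, just to show the effect
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ X

np.random.seed(0)
X = np.random.rand(3, 4)                     # stand-in embeddings for "Aryan killed Lion"
out_forward = toy_self_attention(X)
out_reversed = toy_self_attention(X[::-1])   # "Lion killed Aryan"

# Same contextual vectors, only their order flips: the block cannot tell
# which of the two sentences it was given.
print(np.allclose(out_forward, out_reversed[::-1]))   # True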

The How ?
The first thing I can think of is annotating each word in the sentence with a unique number like 1, 2, 3, 4 … and if we concatenate this position of the word with the word embedding, we get a new vector for the Self Attention block (a tiny sketch of this naive idea follows the list below). Unfortunately this approach has 3 major drawbacks:

  • The numbers are unbounded, so if the text has 1 million words the annotation goes up to 1 million, and feeding such large numbers into the training process can make back-propagation a headache !!
  • Neural Networks prefer smooth transitions, and these values are discrete.
  • Most importantly, we can’t capture relative positioning with this approach.
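
Just to make that naive idea concrete, here is a tiny sketch of it (the embedding values below are invented purely for illustration). Notice how the appended position value completely dwarfs the embedding once the text gets long:

import numpy as np

word_embedding = np.array([0.2, -0.1, 0.5])   # made-up 3-dim word embedding
position = 1_000_000                          # the word is the millionth one in a huge text

naive_vector = np.concatenate([word_embedding, [position]])
print(naive_vector)   # the unbounded position value dominates the whole vector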

So, finally we know what we need !! A mathematical function that is bounded, continuous and periodic.
The best function I can remember is the trigonometric Sine function, which has all these properties.

Now, if we use a sinusoidal function of a certain frequency to represent our positional information along with the embedding vector, every position in the sequence can be mapped to the corresponding value of the sine function.

This has solved most of our problems, as the Sine function is bounded in [-1, 1], continuous and captures relative positioning. But if you take a close look you might come across a huge problem: for long textual data or sequences, two words can end up with the same value, because the function is periodic.
But this problem can be addressed easily by using multiple functions. How ?

Instead of taking only sin(x) to represent the positional information, we can take one more function, cos(x), which gives us a two-dimensional representation of our position.
Previously, where the n'th position was represented by a single value [0.84], it will now be represented by a 2-dimensional array [0.84, 0.5]. But has this solved our problem? There is still some possibility that somewhere down the line the 2-dimensional array would repeat exactly, and in that case we can extend the same idea and use even more functions to represent the positions.

So when we want to send the embedding into the Self Attention block we also want to pass the positional embedding, but what must its dimension be??
It must be equal to the dimension of the word embedding.

Now there are two ways we can combine them: Concatenation and Addition. If Concatenation is chosen, the dimensionality would increase, hence increasing the number of parameters in the model, and the training time would also grow. So we know what we go with: “Addition”.
We take a word embedding, add it to a positional embedding of the same dimension, and get an embedding for the Self Attention block to process.
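
As a minimal sketch (the word embedding values below are made up; the positional values are the position-0 encoding we derive later), the combination is just an element-wise sum:

import numpy as np

word_embedding = np.array([0.2, -0.1, 0.5, 0.7, 0.0, -0.3])      # made-up 6-dim word vector
positional_embedding = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # encoding of position 0 (derived below)

attention_input = word_embedding + positional_embedding   # still 6 dimensions, no extra parameters
print(attention_input)   # -> 0.2, 0.9, 0.5, 1.7, 0.0, 0.7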

The only question that remains is: how should this positional embedding be calculated ?

The answer is very simple: just keep adding sine-cosine pairs till we get our desired dimension. If our word embedding has 6 dimensions then our positional embedding can be something like,
[sin(x), cos(x), sin(x/2), cos(x/2), sin(x/4), cos(x/4)].
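
For instance, a quick sketch of this 6-dimensional idea (using the illustrative frequencies 1, 1/2 and 1/4 from above, not yet the paper's exact ones):

import math

def toy_position_vector(pos):
    # One sin/cos pair per frequency; the frequencies 1, 1/2, 1/4 are only illustrative
    vec = []
    for freq in (1, 1/2, 1/4):
        vec.extend([math.sin(pos * freq), math.cos(pos * freq)])
    return vec

print([round(v, 2) for v in toy_position_vector(1)])
# position 1 -> [0.84, 0.54, 0.48, 0.88, 0.25, 0.97]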

Choosing the frequencies wisely is the important part !! In the original “Attention Is All You Need” paper they are given by,

PE(pos, 2i)   = sin( pos / 10000^(2i / dmodel) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / dmodel) )

where,
pos : position of the word in the sequence
dmodel : dimension of the embedding
i : ranges from 0 to (dmodel/2) - 1

Let’s try to see how these embeddings in 6 dimensions would look using the above formula. Let us consider a sequence of two words, “River Bank”, and try to find its positional encoding,

import math
import numpy as np
import pandas as pd

def position_cal(pos, i, dmodel):
    # One sin/cos pair for position `pos` at frequency index `i`
    x1 = math.sin(pos / 10000 ** (2 * i / dmodel))
    x2 = math.cos(pos / 10000 ** (2 * i / dmodel))
    return x1, x2

def get_position_encoding(pos, dmodel):
    # Build the dmodel-dimensional positional encoding for one position
    d = []
    alpha = dmodel / 2                  # number of sin/cos pairs
    for i in range(int(alpha)):
        x1, x2 = position_cal(pos, i, dmodel)
        d.extend([x1, x2])
    return d

matrix = []
num_words = 2                           # "River" and "Bank"

for pos in range(num_words):
    embed = get_position_encoding(pos, 6)   # 6-dimensional encoding
    matrix.append(embed)

matrix = np.array(matrix)

pd.options.display.float_format = '{:.4f}'.format
df = pd.DataFrame(matrix, index=['River', 'Bank'],
                  columns=['PE1', 'PE2', 'PE3', 'PE4', 'PE5', 'PE6'])
print(df)
Output of the code: the 2 × 6 positional encoding matrix for “River” (position 0) and “Bank” (position 1)

Seems quite interesting !! Let’s extend this: a real text may have many words and a higher embedding dimension for each word. There must be some pattern in there to be captured, so let’s go for it too !!

Consider that our text sequence has 50 words and our embedding dimension is 128, so our positional encoding matrix would look something like this,
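
The heatmap itself isn’t reproduced here, but you can regenerate a similar picture with a sketch like the one below (it reuses the get_position_encoding function defined above and assumes matplotlib is installed):

import numpy as np
import matplotlib.pyplot as plt

seq_len, dmodel = 50, 128
pe_matrix = np.array([get_position_encoding(pos, dmodel) for pos in range(seq_len)])

plt.figure(figsize=(10, 4))
plt.imshow(pe_matrix, aspect='auto', cmap='viridis')   # one row per position, one column per dimension
plt.xlabel('Embedding dimension')
plt.ylabel('Position in the sequence')
plt.colorbar(label='Encoding value')
plt.title('Positional encodings for 50 positions, 128 dimensions')
plt.show()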

Somewhere after the 60th dimension all the positional embeddings look similar, which means the initial dimensions are mainly what create the uniqueness in the positional embedding. Even if we go on to increase the dimension, and even for longer sequences, this uniqueness still holds.

And there we have it, folks! That’s a wrap on positional encoding, until next time. Check out my other blogs in this series on Decoding Transformers to keep the learning party going! Thank You !!
