Dong Won (Don) Lee

How do we capture position and order of data?

Key Intuition Behind Positional Encodings

Introduction:

Understanding position and order is crucial in many tasks that involve sequences. Positional encodings play a crucial role in the widely known Transformer model (Vaswani, et al. 2017) because the architecture does not naturally capture the order of its input. The positional encoding step allows the model to recognize which part of the sequence a given input belongs to.

My intent in writing this post is to help students and practitioners like myself gain a stronger grasp of the intuition behind the formulation of the Transformer’s positional embeddings. Hopefully, a thorough understanding can help us make tweaks and adjustments for each use case and push the boundaries of research.

What are Positional Embeddings and where are they used?

Transformer Model (Vaswani, et al. 2017)

At a high level, the positional embedding is a tensor of values in which each row represents the position of a word in the sequence; it is added to the input embeddings to produce final embeddings that carry order information.

As shown in the model figure above, positional embeddings are added to the inputs of both the encoder and the decoder, because the structure of the Transformer does not take the order of the input sequence into account. We need to apply the positional encoding before the decoder as well, since the decoder also consumes a sequence of word embeddings that, on its own, carries no information about the position of its elements.

Formulation:

In this section, we will assume that the task is language modelling, where the input and the output are both sequences of words.

Given a sequence of words, we process it into word embeddings Zʷ: N x hʷ, where N represents the number of words in the sampled sequence and hʷ represents the embedding size. Then, pos ∈ [0, N-1] is the position of a word in the sequence and i ∈ [0, hʷ-1] is the index that spans the dimensions of the word embedding.

To reiterate:

Given: word embeddings Zʷ: N x hʷ

  • N: Number of words in the sequence
  • hʷ: Dimension size of word embedding
  • pos: position of the current word in the sequence in [0, N-1]
  • i: dimensional index within the word embedding, in [0, hʷ-1]

Thus, the formula for the positional embedding is:
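Following Vaswani et al. (2017), with 2i and 2i+1 denoting the even and odd dimensional indices of the embedding:

$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/h^w}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/h^w}}\right)
$$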

From this formulation, it is easy to see that the frequency of the sine and cosine functions is determined by the dimensional index (hʷ is fixed).
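Concretely, for a fixed hʷ, the angular frequency attached to the dimension pair (2i, 2i+1) is

$$
\omega_i = \frac{1}{10000^{2i/h^w}},
$$

so the wavelengths form a geometric progression from 2π at the lowest indices to roughly 10000 · 2π at the highest.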

Once we calculate the positional encoding, we simply add it to the word embedding (via standard element-wise addition), as shown below:

For example, let’s assume hʷ = 4 and we want to calculate the final embedding with positional embedding added:
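As a minimal NumPy sketch of this calculation (assuming a 3-word sequence, N = 3, and a dummy word embedding of all ones standing in for the real one):

import numpy as np

N, h_w = 3, 4                                   # 3-word sequence, embedding size hʷ = 4

# Positional encoding: sine on even indices, cosine on odd indices
pe = np.zeros((N, h_w))
pos = np.arange(N)[:, None]                     # shape (N, 1)
div = 10000.0 ** (np.arange(0, h_w, 2) / h_w)   # 10000^(2i / hʷ) = [1, 100]
pe[:, 0::2] = np.sin(pos / div)
pe[:, 1::2] = np.cos(pos / div)

# Dummy word embedding (all ones), standing in for the real learned embedding
z_word = np.ones((N, h_w))

# Element-wise addition gives the final, order-aware embedding
z_final = z_word + pe
print(np.round(z_final, 3))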

It is important to note that the dimensions of Zʷ and PE are identical. This can easily be seen, since pos spans N and i spans the embedding dimension hʷ.

Key Intuition:

In order to capture positional information, each element of the positional embedding varies with both the word’s position and the element’s index within the word embedding dimension (in this case, in [0, 299], i.e. hʷ = 300). This is achieved through the varying frequencies mentioned above.

To further illustrate this notion:

Let us look at the positional embedding values per dimensional index.

“pos” vs “i”

As shown above, the positional encoding for each dimensional index demonstrates a noticeable sinusoidal pattern. Furthermore, the values at higher indices become nearly constant, which is evident in the formulation (as i grows, sin(pos/(a^i)) approaches 0 and cos(pos/(a^i)) approaches 1, where a is a constant). The figures above illustrate that the positional encoding values, viewed with respect to the dimensional index of the word embedding, exhibit a pattern. By itself, this captures little to no information. However, this attribute is essential to creating a pattern in the positional embeddings with respect to the position of each word, which is exactly what we want.

“i” vs “pos”

The figure above shows the positional embedding values across all indices of the word embedding dimension, for each given position of a word in the sequence. These patterns are the distinguishing characteristic of positional embeddings, and ultimately what “encodes” positional information. Each row exhibits a unique but recognizable pattern, which changes according to the position (indicated on the y-axis), hence capturing information about the position of each word.

Resulting Word Embedding After Adding Positional Embedding

For illustration purposes, the plot above shows the tensor that results from adding the positional embedding to a dummy word embedding, a random tensor of matching dimensions with elements ranging between 0 and 1.

In the implementation of the original Attention Is All You Need (Vaswani, et al. 2017) paper, the positional embedding is added to the input word embedding. The figure above visualizes the result of this additive process. We can see that an inherent pattern of the positional encoding, which captures the positional information, persists. The resulting word embedding at each position (a row in the plot above) carries a unique pattern of values across its embedding space, produced by the varying frequencies, which change with the dimensional index (i).

This is analogous to telecommunications and signal processing, where frequency modulation (FM) is used to encode information in a carrier wave by varying the frequency of the wave.
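To reproduce this kind of heatmap, here is a rough matplotlib sketch, assuming N = 50 positions and hʷ = 300 (to match the dimensional range mentioned above); the exact settings behind the figures in this post may differ:

import numpy as np
import matplotlib.pyplot as plt

N, h_w = 50, 300                                # assumed sequence length and embedding size

pe = np.zeros((N, h_w))
pos = np.arange(N)[:, None]
div = 10000.0 ** (np.arange(0, h_w, 2) / h_w)
pe[:, 0::2] = np.sin(pos / div)
pe[:, 1::2] = np.cos(pos / div)

# Dummy word embedding with values in [0, 1], as in the last figure
z_word = np.random.rand(N, h_w)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
for ax, data, title in zip(axes, [pe, z_word + pe],
                           ['Positional encoding', 'Word embedding + positional encoding']):
    im = ax.imshow(data, aspect='auto', cmap='viridis')
    ax.set_xlabel('i (dimensional index)')
    ax.set_ylabel('pos (position in sequence)')
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.show()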

PyTorch Implementation:

The current PyTorch Transformer modules (nn.Transformer, nn.TransformerEncoder, nn.TransformerDecoder, …) do not include positional encoding. To include it, you must implement it yourself, or you can use the following code, which is listed as an example on the PyTorch GitHub:
https://github.com/pytorch/examples/blob/master/word_language_model/model.py

import math

import torch
import torch.nn as nn


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Precompute the positional encodings once, for up to max_len positions
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)   # even indices: sine
        pe[:, 1::2] = torch.cos(position * div_term)   # odd indices: cosine
        pe = pe.unsqueeze(0).transpose(0, 1)           # shape: (max_len, 1, d_model)
        self.register_buffer('pe', pe)                 # saved with the module, but not a trainable parameter

    def forward(self, x):
        # x: (sequence length, batch size, d_model)
        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)
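As a quick usage sketch (the hyperparameters and the embedding layer below are illustrative stand-ins, not part of the original example), the module is typically applied right after the input embedding and before the Transformer encoder:

d_model, vocab_size = 512, 10000                     # arbitrary example sizes

embed = nn.Embedding(vocab_size, d_model)
pos_enc = PositionalEncoding(d_model, dropout=0.1)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

tokens = torch.randint(0, vocab_size, (35, 20))      # (sequence length, batch size)
x = pos_enc(embed(tokens) * math.sqrt(d_model))      # scale embeddings, then add positional encoding
out = encoder(x)                                     # shape: (35, 20, 512)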

Conclusion

Transformers are widely used due to their high performance and the intuitive ideas at the heart of their model structure. In this post, we have discussed the elegant intuition behind positional encodings.

I hope that this post is of use to ML practitioners and students like myself who are curious about every aspect of a model.

References:

[1] Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems. 2017.

[2] Takase, Sho, and Naoaki Okazaki. “Positional encoding to control output sequence length.” arXiv preprint arXiv:1904.07418 (2019).

[3] Gehring, Jonas, et al. “Convolutional sequence to sequence learning.” Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2017.

[4] https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
