Why Is Positional Encoding Important in the Transformer Architecture?
A diagram of the Transformer architecture
In this article, I don’t plan to explain the Transformer architecture in depth, as there are already several great tutorials on the topic (here, here, and here). Instead, I want to focus on one specific part of the Transformer’s architecture — positional encoding.
Why Do We Need Positional Encoding?
The position and order of words are essential parts of any language. They define the grammar and thus the actual semantics of a sentence. Recurrent Neural Networks (RNNs) inherently take the order of words into account: they parse the input sentence word by word in a sequential manner, which bakes word order into the backbone of the model.
The Transformer architecture, however, discards the recurrence mechanism in favor of multi-head self-attention. Avoiding the RNN’s recurrence results in a massive speed-up in training time, and in theory the model can capture longer dependencies within a sentence.
Because every word in a sentence flows through the Transformer’s encoder/decoder stack simultaneously, the model itself has no sense of the position or order of the words. Consequently, we still need a way to incorporate word order into the model.
How Do We Add Positional Information to the Data?
The formula for positional encoding used in the Transformer paper, “Attention Is All You Need” by Vaswani et al., is defined as follows:
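$$
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$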
Explanation
- pos is the position of the token in the sequence.
- i is the dimension index.
- d_model is the dimension of the model (the number of dimensions in the embedding vector).
If we take an example sentence, “I love cats”, the sentence first has to be tokenized and mapped to a numerical vocabulary.
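Here is a minimal sketch of that tokenization step, assuming a toy whitespace tokenizer and a vocabulary built from the sentence itself (a real model would use a learned subword tokenizer and a much larger vocabulary):

```python
# Toy tokenization sketch: split on whitespace and map each token to an
# integer id taken from a vocabulary built from this single sentence.
sentence = "I love cats"

tokens = sentence.split()                       # ["I", "love", "cats"]
vocab = {token: idx for idx, token in enumerate(tokens)}
token_ids = [vocab[token] for token in tokens]  # [0, 1, 2]

print(tokens)
print(token_ids)
```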
Output:

['I', 'love', 'cats']
[0, 1, 2]
So, our input sentence “I love cats” was tokenized into a list of tokens [“I”, “love”, “cats”] and then mapped to numerical values [0, 1, 2].
Relating our sentence to the positional encoding formula, pos represents the position of the word. For example, the position (pos) of “I” is 0, the position of “love” is 1, and the position of “cats” is 2.
Now, for the sake of simplicity, let’s assume we want our model dimension (d_model), or the dimension of our positional encoding vector, to be equal to 4. This means each position will have a corresponding 4-dimensional positional encoding vector.
For position 0 (“I”), position 1 (“love”), and position 2 (“cats”), the (pos, i) pairs that feed into the formula follow the pattern:

- Position 0: (0, 0), (0, 1), (0, 2), (0, 3)
- Position 1: (1, 0), (1, 1), (1, 2), (1, 3)
- Position 2: (2, 0), (2, 1), (2, 2), (2, 3)

Here, the first element of each pair is the position (pos), which stays the same as the word’s position in the sentence, and the second element is the dimension index (i) within d_model. When I say that even dimensions use the sine function and odd dimensions use the cosine function, I am referring to this dimension index of d_model, i.e. the second element of each pair.
So, here are the positional encoding vectors for the sentence “I love cats” with d_model = 4:
- Position 0 (“I”): [0, 1, 0, 1]
- Position 1 (“love”): [0.8415, 0.5403, 0.01, 0.99995]
- Position 2 (“cats”): [0.9093, −0.4161, 0.02, 0.9998]
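As a sanity check, here is a short sketch (using NumPy, assumed available) that reproduces these vectors directly from the formula above:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings of shape (seq_len, d_model)."""
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, np.newaxis]                # (seq_len, 1)
    div_terms = 10000.0 ** (np.arange(0, d_model, 2) / d_model)  # (d_model/2,)
    pe[:, 0::2] = np.sin(positions / div_terms)  # even dimension indices -> sine
    pe[:, 1::2] = np.cos(positions / div_terms)  # odd dimension indices -> cosine
    return pe

print(np.round(positional_encoding(seq_len=3, d_model=4), 5))
# Approximately:
# [[ 0.       1.       0.       1.     ]
#  [ 0.84147  0.5403   0.01     0.99995]
#  [ 0.9093  -0.41615  0.02     0.9998 ]]
```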
Now, the positional encoding vector for each token is added to its word embedding vector via element-wise addition. The resulting input embedding vector is what is fed into the Transformer model.
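As a rough sketch of this final step (assuming the positional_encoding helper from the snippet above and a hypothetical, randomly initialized embedding table standing in for learned embeddings):

```python
import numpy as np

np.random.seed(0)
d_model = 4
vocab_size = 3                                           # "I", "love", "cats"
embedding_table = np.random.randn(vocab_size, d_model)   # hypothetical learned embeddings

token_ids = [0, 1, 2]                                    # "I love cats"
word_embeddings = embedding_table[token_ids]             # (3, 4) word embeddings
pos_encodings = positional_encoding(seq_len=3, d_model=d_model)

input_embeddings = word_embeddings + pos_encodings       # element-wise addition
print(input_embeddings.shape)                            # (3, 4): the Transformer's input
```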
How Positional Information is Maintained
By adding the positional embeddings to the word embeddings, each input embedding now contains both the semantic meaning of the word and its positional context. Here’s how this addition helps:
- Semantic Meaning: The word embedding part of the input embedding vector retains the semantic information about the word.
- Positional Context: The positional embedding part of the input embedding vector encodes the word’s position within the sequence.
This combination allows the Transformer model to distinguish between identical words appearing in different positions and understand their roles in different contexts. The model’s self-attention mechanism can then use these enriched input embeddings to focus on relevant parts of the sequence, taking into account both the meaning and the position of each word.
References:
The Illustrated Transformer: https://jalammar.github.io/illustrated-transformer/
LLMs from Scratch: https://github.com/rasbt/LLMs-from-scratch/blob/main/ch02/01_main-chapter-code/ch02.ipynb