Positional Encoding in the Transformer Model

Sandaruwan Herath
Data Science and Machine Learning
6 min read · Apr 17, 2024

The positional encoding component of the Transformer model is vital as it adds information about the order of words in a sequence to the model’s inputs. Unlike recurrent models such as RNNs, which process tokens one after another and therefore capture order implicitly, the Transformer processes all tokens in parallel and relies on positional encodings to preserve this essential context.

Challenge of Positional Information

The Transformer model structures its inputs using vectors that combine word embeddings with positional encodings. Each input token is represented by a vector of fixed size d_model = 512, which includes both the embedded representation of the token and its positional encoding. This method contrasts with models that might use separate vectors or additional parameters to encode position, as doing so could significantly slow down training and complicate the model architecture.

Implementation of Positional Encoding

The Transformer embeds positional information by adding positional encoding vectors to the corresponding word embeddings. These positional vectors use sine and cosine functions of different frequencies to encode the absolute position of a token within a sentence.

The formulae for these functions are:

PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

where pos is the position of the token in the sequence and i indexes the dimension pair. Each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from 2π to 10000·2π. This design allows the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
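
In code, these two formulas can be implemented for all positions at once. The following is a minimal NumPy sketch (the function name and the 7-row example matrix are illustrative, not from the original paper):

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model=512):
    """Return a (max_len, d_model) matrix of sinusoidal positional encodings."""
    positions = np.arange(max_len)[:, np.newaxis]            # shape (max_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]           # even dimensions 0, 2, ..., 510
    angles = positions / np.power(10000.0, dims / d_model)   # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even indices get the sine
    pe[:, 1::2] = np.cos(angles)   # odd indices get the cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=7)   # one row per token position
print(pe.shape)                                  # (7, 512)

Each sine/cosine pair shares a frequency, which is what makes the relative-position property above hold.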

Example of Positional Encoding Application

Revisiting the sentence from the input embedding section: “Dolphins leap gracefully over waves.”

The tokenized form we considered was: [‘Dolphins’, ‘leap’, ‘grace’, ‘##fully’, ‘over’, ‘waves’, ‘.’]

We assign positional indices starting from 0:

· “Dolphins” at position 0

· “leap” at position 1

· “grace” at position 2

· “##fully” at position 3

· “over” at position 4

· “waves” at position 5

· “.” at position 6

The positional encoding for “Dolphins” (pos=0) and “waves” (pos=5) would look like:

For pos=0 (Dolphins): PE(0) = [sin(0), cos(0), sin(0), cos(0), …, sin(0), cos(0)] = [0, 1, 0, 1, …, 0, 1]

For pos=5 (waves): PE(5) = [sin(5/10000^(0/512)), cos(5/10000^(0/512)), sin(5/10000^(2/512)), cos(5/10000^(2/512)), …, sin(5/10000^(510/512)), cos(5/10000^(510/512))]
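
To make these two vectors concrete, a short sketch (assuming the same d_model = 512; the helper pe_value is purely illustrative) evaluates their first few entries:

import math

d_model = 512

def pe_value(pos, dim):
    """One entry of the sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    angle = pos / (10000 ** ((dim - dim % 2) / d_model))  # each sin/cos pair shares a frequency
    return math.sin(angle) if dim % 2 == 0 else math.cos(angle)

print([round(pe_value(0, d), 4) for d in range(4)])  # [0.0, 1.0, 0.0, 1.0] for "Dolphins"
print([round(pe_value(5, d), 4) for d in range(4)])  # first four entries of PE(5) for "waves"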

These vectors are added to the respective word embedding vectors to provide positional context, crucial for tasks involving understanding the sequence of words, such as in language translation.

Visualization and Impact

You can visualize the positional encoding values using a plot, which would typically show a sinusoidal wave that varies across the dimensions, giving a unique pattern for each position. This pattern helps the model discern the positional differences between words in a sentence, enhancing its ability to understand and generate contextually relevant text.
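
A rough sketch of such a plot, using matplotlib with the same sinusoidal formula (the figure size, colormap, and 50-position range are arbitrary choices for illustration):

import numpy as np
import matplotlib.pyplot as plt

max_len, d_model = 50, 512
positions = np.arange(max_len)[:, np.newaxis]
dims = np.arange(0, d_model, 2)[np.newaxis, :]
angles = positions / np.power(10000.0, dims / d_model)

pe = np.zeros((max_len, d_model))
pe[:, 0::2] = np.sin(angles)
pe[:, 1::2] = np.cos(angles)

plt.figure(figsize=(10, 4))
plt.imshow(pe, aspect="auto", cmap="RdBu")   # one row per position, one column per dimension
plt.xlabel("Encoding dimension")
plt.ylabel("Token position")
plt.colorbar(label="PE value")
plt.title("Sinusoidal positional encodings")
plt.show()

Each row of the heatmap is a distinct pattern, which is exactly what lets the model tell positions apart.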

Such visualizations underscore the variability and specificity of positional encoding in helping the Transformer model recognize word order — a fundamental aspect of human language that is crucial for meaningful communication.

Enhancing Word Embeddings with Positional Encodings

In the Transformer model, positional encodings provide the order information that word embeddings on their own lack, which is crucial for tasks involving the sequence of words, such as language translation or sentence structure analysis. The authors of the Transformer introduced a method to integrate positional information by directly adding the positional encoding vectors to the word embedding vectors.

Process of Combining Positional Encodings and Word Embeddings

The positional encoding vector is added to the corresponding word embedding vector to form a combined representation, which carries both semantic and positional information. This addition is performed element-wise between two vectors of the same dimension (d_model = 512).

Consider the sentence:

“Eagles soar above the clouds during migration.”

Let’s focus on the word “soar” positioned at index 1 in the tokenized sentence [‘Eagles’, ‘soar’, ‘above’, ‘the’, ‘clouds’, ‘during’, ‘migration’].

Word Embedding:

The embedding for “soar” might look like a 512-dimensional vector:

y1 = embed(‘soar’) = [0.12, -0.48, 0.85, …, -0.03]

Positional Encoding:

Using the positional encoding formula, we calculate a vector for position 1:

pe(1) = [sin(1/10000^(0/512)), cos(1/10000^(0/512)), sin(1/10000^(2/512)), …, cos(1/10000^(510/512))]

Combining Vectors:

The embedding vector y1 and positional encoding vector pe(1) are combined by element-wise addition to form the final encoded vector for “soar”:

pc(soar) = y1 + pe(1)

This results in a new vector pc(soar) that retains the semantic meaning of “soar” but also embeds its position within the sentence.

Detailed Combination Example with Python

The positional encoding addition can be expressed in the following Python snippet, which assumes d_model = 512 and position pos = 1 for “soar”, with a random vector standing in for the 512-dimensional embedding:

import math
import numpy as np

def positional_encoding(pos, d_model=512):
    """Sinusoidal positional encoding for a single position, shape (1, d_model)."""
    pe = np.zeros((1, d_model))
    for i in range(0, d_model, 2):
        # i already steps over the even dimensions, so the exponent is i/d_model
        pe[0][i] = math.sin(pos / (10000 ** (i / d_model)))
        pe[0][i + 1] = math.cos(pos / (10000 ** (i / d_model)))
    return pe

# Random vector standing in for the 512-dimensional embedding of "soar"
y1 = np.random.randn(1, 512)
pe = positional_encoding(1)
pc = y1 + pe  # element-wise addition

Once the positional encoding is added, the new vector pc(soar) might look something like this (simplified):

[0.95, -0.87, 1.82, …, 0.45]

These values now incorporate both the intrinsic semantic properties of “soar” and its positional context within the sentence, significantly enriching the data available to the model for processing.

Evaluating Changes through Cosine Similarity

To understand the impact of positional encodings, consider another word, “migration”, at position 6 in the same sentence. If we calculate the cosine similarity between pc(soar) and pc(migration), it reflects not just the semantic similarity but also the relative positional difference. Continuing the snippet above, with another random vector standing in for the “migration” embedding:

from sklearn.metrics.pairwise import cosine_similarity

y6 = np.random.randn(1, 512)                # stand-in for the embedding of "migration"
pc_migration = y6 + positional_encoding(6)  # "migration" sits at position 6
cosine_similarity(pc, pc_migration)         # similarity now reflects position as well as meaning

This similarity is typically lower than that of the raw embeddings (assuming non-adjacent positions), illustrating how positional encodings can distinguish words based on their locations in the text, despite semantic similarities.
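
The effect is easiest to see in the extreme case of the same embedding placed at two different positions. The following self-contained sketch (with a random vector standing in for the embedding) shows the cosine similarity dropping below 1.0 once the positions differ:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

d_model = 512
rng = np.random.default_rng(0)

def pe(pos):
    """Sinusoidal positional encoding for one position, shape (1, d_model)."""
    dims = np.arange(0, d_model, 2)
    angles = pos / np.power(10000.0, dims / d_model)
    out = np.zeros((1, d_model))
    out[0, 0::2] = np.sin(angles)
    out[0, 1::2] = np.cos(angles)
    return out

# The same stand-in embedding placed at positions 1 and 6
y = rng.normal(size=(1, d_model))

print(cosine_similarity(y, y))                  # exactly 1.0: identical embeddings
print(cosine_similarity(y + pe(1), y + pe(6)))  # below 1.0 once the positions differ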

Summary of Positional Encoding

The Transformer model integrates positional encodings with word embeddings to capture and exploit sequence information. This technique allows the model to bypass the limitations of recurrent architectures while remaining efficient and highly parallelizable. Because the encodings are built from sinusoids, they can, in principle, extrapolate to sequence lengths longer than those seen during training, helping the model behave consistently across varied inputs.

By embedding both the semantic meanings and the sequential order of words, the Transformer prepares its inputs meticulously for subsequent processing layers. This dual-embedding approach ensures the model comprehends not only the content (‘what’) but also the contextual placement (‘where’) of the words within the sequence. Such comprehensive understanding is pivotal for precise language interpretation and generation.

The enriched inputs then advance to the multi-head attention sublayer, where the Transformer’s sophisticated computational mechanisms further refine the data, drawing on the intricate interplay between different attention heads to enhance the model’s output accuracy and contextual relevance.

This strategic integration of positional encodings with word embeddings underscores the Transformer’s advanced capability to navigate and interpret complex linguistic data, setting a new standard for machine learning models in natural language processing tasks.

Next

The multi-head attention mechanism is a hallmark of the Transformer model’s innovative approach to handling sequential data. It enhances the model’s ability to process sequences by enabling it to attend to different parts of the sequence simultaneously. This article will delve into the architecture of the multi-head attention sublayer, its implementation in Python, and the role of post-layer normalization.

References

[1] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

[2] Rothman, D. (2024). Transformers for Natural Language Processing and Computer Vision. Packt Publishing.
