Rotary Positional Embeddings: A Detailed Look and Comprehensive Understanding

azhar
azhar labs
7 min read · Jan 11, 2024


Since the “Attention Is All You Need” paper in 2017, the Transformer architecture has been a cornerstone of Natural Language Processing (NLP). Its design, largely unchanged for years, has powered a wide array of language models. 2021, however, marked a significant evolution in this field with the introduction of Rotary Positional Embeddings (RoPE), an advancement that was swiftly adopted by leading models. In this article, we will delve into what Rotary Positional Embeddings are and how they ingeniously blend the advantages of both absolute and relative positional embeddings.

Before we proceed, let’s stay connected! Please consider following me on Medium, and don’t forget to connect with me on LinkedIn for a regular dose of data science and deep learning insights. 🚀📊🤖

The Need for Positional Embeddings in Transformers

To understand the importance of RoPE, let’s first revisit why positional embeddings are crucial. Transformer models, by their inherent design, do not consider the order of input tokens.

For instance, phrases like “the dog chases the pig” and “the pig chases the dog,” though distinct in meaning, are treated indistinguishably as they are perceived as an unordered set of tokens. To maintain the sequence information and thus the meaning, positional embeddings are integrated into the model.

How Absolute Positional Embeddings Work

In the context of a sentence, suppose we have an embedding representing a word. To encode its position, we use another vector of identical dimensionality, where each vector uniquely represents a position in the sentence. For instance, a specific vector is designated for the second word in a sentence. Thus, each sentence position gets its distinct vector. The input for the Transformer layer is then formed by summing the word embedding with its corresponding positional embedding.

There are primarily two methods to generate these embeddings:

  1. Learning from Data: Here, positional vectors are learned during training, just like other model parameters. We learn a unique vector for each position, say from 1 to 512. However, this introduces a limitation — the maximum sequence length is capped. If the model only learns up to position 512, it cannot represent sequences longer than that.
  2. Sinusoidal Functions: This method constructs a unique embedding for each position using sine and cosine functions of different frequencies, so every position in a sequence gets its own positional embedding. Empirical studies have shown that learned and sinusoidal embeddings offer comparable performance in real-world models. A minimal sketch of the sinusoidal scheme follows this list.
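To make the sinusoidal scheme concrete, here is a minimal sketch following the formulation in “Attention Is All You Need”; the function name and the base of 10000 are simply the conventional choices, not anything model-specific.

import torch

def sinusoidal_positional_embedding(max_seq_len, d_model, base=10000):
    # Even dimensions (2i) use sin, odd dimensions (2i + 1) use cos,
    # with a wavelength that grows geometrically with i.
    positions = torch.arange(max_seq_len).float().unsqueeze(1)          # (max_seq_len, 1)
    div_term = base ** (torch.arange(0, d_model, 2).float() / d_model)  # (d_model / 2,)
    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions / div_term)
    pe[:, 1::2] = torch.cos(positions / div_term)
    return pe

# The positional embedding is simply summed with the token embeddings:
# x = token_embeddings + sinusoidal_positional_embedding(seq_len, d_model)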

Limitations of Absolute Positional Embeddings

Despite their widespread use, absolute positional embeddings are not without drawbacks:

  1. Limited Sequence Length: As mentioned, if a model learns positional vectors up to a certain point, it cannot inherently represent positions beyond that limit.
  2. Independence of Positional Embeddings: Each positional embedding is independent of the others. This means that in the model’s view, the difference between positions 1 and 2 is the same as between positions 2 and 500. However, intuitively, positions 1 and 2 should be more closely related to each other than either is to position 500, which is significantly farther away. This lack of relative positioning can hinder the model’s ability to understand the nuances of language structure.

Understanding Relative Positional Embeddings

Rather than focusing on a token’s absolute position in a sentence, relative positional embeddings concentrate on the distances between pairs of tokens. This method doesn’t add a position vector to the word vector directly. Instead, it alters the attention mechanism to incorporate relative positional information.

Case Study: The T5 Model

One prominent model that utilizes relative positional embeddings is T5 (Text-to-Text Transfer Transformer). T5 introduces a nuanced way of handling positional information:

  • Bias for Positional Offsets: T5 uses a bias, a floating-point number, to represent each possible positional offset. For example, a bias B1 might represent the relative distance between any two tokens that are one position apart, regardless of their absolute positions in the sentence.
  • Integration in the Self-Attention Layer: This matrix of relative-position biases is added to the product of the query and key matrices in the self-attention layer. This ensures that tokens at the same relative distance are always represented by the same bias, regardless of their position in the sequence (a simplified sketch follows this list).
  • Scalability: A significant advantage of this method is its scalability. It can extend to arbitrarily long sequences, a clear benefit over absolute positional embeddings.
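As an illustration of the idea, the sketch below adds a relative-position bias to the attention scores. This is not T5’s exact implementation (T5 buckets offsets logarithmically and learns a separate bias table per attention head); the function name, bias_table, and num_buckets are illustrative placeholders.

import torch

def relative_bias_attention(q, k, bias_table, num_buckets=32):
    # q, k: (batch, seq_len, d_head); bias_table: (num_buckets,) learned scalars,
    # one per relative-distance bucket.
    seq_len = q.shape[1]
    scores = torch.matmul(q, k.transpose(-2, -1))            # (batch, seq_len, seq_len)
    # Relative offset between every query position i and key position j.
    rel = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]
    # Clamp offsets into a fixed number of buckets (T5 uses log-spaced buckets instead).
    buckets = rel.clamp(-num_buckets // 2, num_buckets // 2 - 1) + num_buckets // 2
    scores = scores + bias_table[buckets]                    # same bias for the same offset
    return torch.softmax(scores, dim=-1)

Because the bias depends only on the offset j - i, two tokens that are, say, three positions apart always receive the same bias no matter where they appear in the sequence.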

Challenges with Relative Positional Embeddings

Despite their theoretical appeal, relative positional embeddings pose certain practical challenges.

  1. Performance Issues: Benchmarks comparing T5’s relative embeddings with other types have shown that they can be slower, particularly for longer sequences. This is primarily due to the additional computational step in the self-attention layer, where the positional matrix is added to the query-key matrix.
  2. Complexity in Key-Value Cache Usage: Because each additional token alters the positional information for every other token, relative schemes complicate the effective use of key-value caches in Transformers. For those unfamiliar, key-value caches are crucial for efficiency and speed in Transformer inference.

Due to these engineering complexities, relative embeddings haven’t been widely adopted, especially in larger language models.

What Are Rotary Positional Embeddings (RoPE)?

RoPE represents a novel approach in encoding positional information. Traditional methods, either absolute or relative, come with their limitations. Absolute positional embeddings assign a unique vector to each position, which though straightforward, doesn’t scale well and fails to capture relative positions effectively. Relative embeddings, on the other hand, focus on the distance between tokens, enhancing the model’s understanding of token relationships but complicating the model architecture.

RoPE ingeniously combines the strengths of both. It encodes positional information in a way that allows the model to understand both the absolute position of tokens and their relative distances. This is achieved through a rotational mechanism, where each position in the sequence is represented by a rotation in the embedding space. The elegance of RoPE lies in its simplicity and efficiency, enabling models to better grasp the nuances of language syntax and semantics.

The Mechanism of Rotary Positional Embeddings

RoPE introduces a novel concept. Instead of adding a positional vector, it applies a rotation to the word vector. Imagine a two-dimensional word vector for “dog.” To encode its position in a sentence, RoPE rotates this vector. The angle of rotation (θ) is proportional to the word’s position in the sentence. For instance, the vector is rotated by θ for the first position, 2θ for the second, and so on. This approach has several benefits:

  1. Stability of Vectors: Adding tokens at the end of a sentence doesn’t affect the vectors for words at the beginning, facilitating efficient caching.
  2. Preservation of Relative Positions: If two words, say “pig” and “dog,” maintain the same relative distance in different contexts, their vectors are rotated by the same additional amount. This ensures that the angle between them, and consequently the dot product between these vectors, remains constant, as the small numerical sketch after this list illustrates.
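Here is a tiny numerical sketch of that second property, using toy 2D vectors and a made-up angle θ = 0.1: rotating a query at position m and a key at position n, the dot product comes out the same whenever n − m is the same.

import math
import torch

def rotate_2d(vec, position, theta=0.1):
    # Rotate a 2D vector by an angle proportional to its position in the sentence.
    angle = position * theta
    rotation = torch.tensor([[math.cos(angle), -math.sin(angle)],
                             [math.sin(angle),  math.cos(angle)]])
    return rotation @ vec

q = torch.tensor([1.0, 0.5])   # toy query vector, e.g. for "dog"
k = torch.tensor([0.3, 0.8])   # toy key vector, e.g. for "pig"

# The two words are 2 positions apart in both cases, at different absolute positions:
near_start = torch.dot(rotate_2d(q, 1), rotate_2d(k, 3))
near_end = torch.dot(rotate_2d(q, 5), rotate_2d(k, 7))
print(torch.isclose(near_start, near_end))  # tensor(True): the dot product depends only on the offset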

Matrix Formulation of RoPE

The technical implementation of RoPE involves rotation matrices. In a 2D case, the equation from the paper incorporates a rotation matrix that rotates a vector by an angle of Mθ, where M is the absolute position in the sentence. This rotation is applied to the query and key vectors in the self-attention mechanism of the Transformer.

For higher dimensions, the vector is split into 2D chunks, and each pair is rotated independently. This can be visualized as an n-dimensional corkscrew rotating in space.

Practically, this can be implemented efficiently in a library like PyTorch; the core rotation takes only about ten lines of code, as the module below shows.

import torch
import torch.nn as nn

class RotaryPositionalEmbedding(nn.Module):
    def __init__(self, d_model, max_seq_len, base=10000):
        super().__init__()
        # Each 2D pair of dimensions (2i, 2i + 1) rotates at its own frequency,
        # following the RoFormer formulation: theta_i = base^(-2i / d_model).
        inv_freq = 1.0 / (base ** (torch.arange(0, d_model, 2).float() / d_model))
        # Angle m * theta_i for every position m and every frequency theta_i.
        positions = torch.arange(max_seq_len).float()
        angles = torch.outer(positions, inv_freq)        # (max_seq_len, d_model / 2)
        # Cache the cos/sin tables as buffers so they move with .to(device).
        self.register_buffer("cos_table", torch.cos(angles))
        self.register_buffer("sin_table", torch.sin(angles))

    def forward(self, x):
        """
        Args:
            x: A tensor of shape (batch_size, seq_len, d_model).

        Returns:
            A tensor of shape (batch_size, seq_len, d_model) in which each 2D
            pair of features has been rotated by an angle proportional to its
            position in the sequence.
        """
        seq_len = x.shape[1]
        cos = self.cos_table[:seq_len]                   # (seq_len, d_model / 2)
        sin = self.sin_table[:seq_len]
        x1, x2 = x[..., 0::2], x[..., 1::2]              # split each pair into its two halves
        # Standard 2D rotation applied pair-wise with element-wise operations:
        # (x1, x2) -> (x1 * cos - x2 * sin, x1 * sin + x2 * cos)
        rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
        return rotated.flatten(-2)                       # interleave the pairs back into d_model

The rotation is executed through simple vector operations rather than matrix multiplication for efficiency. An important property is that words closer together are more likely to have a higher dot product, while those far apart have a lower one, reflecting their relative relevance in a given context.
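As a quick usage sketch of the module above, the rotated queries and keys feed straight into the usual attention-score computation; the shapes here are purely illustrative.

# Apply RoPE to query and key projections before computing attention scores.
rope = RotaryPositionalEmbedding(d_model=64, max_seq_len=512)
q = torch.randn(2, 10, 64)   # (batch, seq_len, d_model) query projections
k = torch.randn(2, 10, 64)   # (batch, seq_len, d_model) key projections
scores = torch.matmul(rope(q), rope(k).transpose(-2, -1)) / 64 ** 0.5
print(scores.shape)          # torch.Size([2, 10, 10])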

Experiments with models like RoBERTa and Performer, using RoPE, have shown faster training times compared to sinusoidal embeddings. This method has proven robust across various architectures and training setups.

In summary, Rotary Positional Embeddings represent a paradigm shift in Transformer architecture, offering a more robust, intuitive, and scalable way of encoding positional information. This advancement not only enhances current language models but also sets a foundation for future innovations in NLP. As we continue to unravel the complexities of language and AI, approaches like RoPE will be instrumental in building more advanced, accurate, and human-like language processing systems.
