A study on Attention mechanism

This study shows the impact, evolution, and performance of attention mechanisms. The article highlights the mathematical approach to understanding the attention mechanism and practically shows how to write one. It also compares the performance of causal, flash, and sparse attention.

Nilesh Barla
PerceptronAI
18 min read · Jul 14, 2024


Attention has become the core operation in modern deep-learning models. Today most commercial large models use the attention mechanism in their architecture, whether for vision or language. The reason attention is so widely integrated into modern deep-learning architectures is its ability to place importance on the most relevant parts of the input. If the input is a sentence, it extracts the important words to form a context. If it is an image, it focuses on the important objects within the image.

To understand the attention mechanism, let's take ourselves as an example. When we look at an object or person that seems interesting or important, our visual cortex blurs out everything around it and focuses only on that object or person in high resolution. This is attention: selecting the object of interest while neglecting the irrelevant objects around it.

Fig 1: Attention is all about focusing on an object of interest | Source: Hedi Alija

When it comes to sentences, we can take a similar approach. For instance, in the sentence “She is eating a green apple” we can establish close relationships between certain words. When we see “eating”, our intuition is to expect a food word very soon. With sentences, attention is all about finding contextual relationships.

Fig 2: Establishing a contextual relationship between the words | Source: Attention? Attention!

Mathematically speaking, we can define attention as evaluating every piece of information, or every word, to find which ones are important. Words that have a strong contextual relationship are given a high score, and words that do not are given a low one.

Why Attention?

Attention is effective when the information is sequential and it was for this reason that the attention mechanism was developed.

Sequential information, as the name suggests, contains information in a linearly ordered manner, which is also time-dependent. Examples of such types of data include text, weather forecast data, time-series data, protein sequences, and material sequences.

Such sequential information has a unique characteristic that other data types don't: an inherent temporal or spatial order that allows it to encode patterns, dependencies, and context. This temporal behavior is crucial for understanding and predicting future behavior or states, and it is also what makes modeling such data mathematically challenging.

Modeling time-dependent data is not easy, especially when the sequence is long and dense, as in documents. Early models could only capture a small range of context, i.e., they could only establish contextual relationships among a small set of words. When challenged with longer sequences, they would fail.

Recap on sequence modeling

In September 2014, Cho et al. proposed an RNN encoder-decoder architecture for sequence modeling. Recurrent neural networks (RNNs) are apt for sequence modeling because they can retain important information from previous inputs and use it to modify the current output.

Fig 3: Illustration of folding and unfolding of RNNs | Source: What are RNNs?

Essentially, the model is trained using Backpropagation Through Time (BPTT), where every input element influences the output. In this phase the architecture is unfolded, i.e., BPTT propagates gradients through every hidden state h. The hidden layer holds the weights, and those weights are shared from one step to the next. If we consider the sentence “I am learning RNNs”, every word passes through its own unfolded copy of the hidden layer, and that copy shares its weights with the next one.

During inference, the hidden layers of the entire unfolded network are encapsulated within a single recurrent structure. This is known as folding. You can think of it as a form of data compression.

RNNs introduced the great idea of storing contextual information, but it wasn't feasible because the entire sequence was compressed to the point of information loss. This is why they couldn't be effective for longer sequences.

In the same year, in December, Sutskever et al. introduced a similar architecture to handle large sequences. They utilized LSTM networks to encode and decode sequences of variable length. In this model, the encoder processes the input sequence and produces a final hidden state of fixed size. This hidden state contains combined information from the entire input sequence. The decoder then uses this hidden state to generate the output sequence. You can say that the hidden state contains the information of the entire block of text.

Fig 4: Illustration of the working of LSTM | Source: Attention? Attention!

Both RNNs and LSTMs process data one step at a time, which makes them inherently sequential and slow, especially for long sequences. The key difference is that LSTMs are designed to handle long-term dependencies: they retain and use information from earlier time steps in a sequence when making predictions or generating output at later time steps.

But they are still not capable enough.

The innovation of the LSTM was a significant step, as it opened the door to compressing information and using it to generate plausible text. However, it still lacked the ability to process text with much longer context.

In parallel, Bahdanau and his team understood the need for a longer contextual window for language models.

In 2014, they came up with a model that could focus on different parts of the input sequence rather than relying on a fixed-size context vector.

Fig 5: A general workflow of Bahdanau Attention | Source: Attention? Attention!

This was the attention mechanism where the idea was to compute attention scores between decoder and encoder states.

Attention???!!!

The attention mechanism was developed to help models (in this case, RNNs) memorize long sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder's last hidden state, as in the LSTM approach, attention excels at establishing relationships between the context vector and the entire source input. See Fig 5.

Another important thing to mention is that the weights of these connections are customizable for each output element.

Attention is all about looking at and selecting the most important information in a given text and storing it for later use.

Attention became an integral part of processing sequential information, but it was still coupled with recurrent models. While attention mechanisms showed promising results for processing sequential data, their integration with recurrent neural networks limited the ability to fully exploit parallelization for faster computation and better scalability.

Table 1: Attention mechanisms and corresponding alignment or similarity score functions | Source: Attention? Attention!

In 2017, Vaswani and his team came up with a solution to address the limitations of recurrent networks. They decided to ditch the recurrent network entirely and develop a new model called the Transformer. This model depended only on the attention mechanism, or more precisely, a self-attention mechanism.

Fig 6: An entire architecture of the Transformer model. | Source: Attention? Attention!

The self-attention mechanism allows a model to score different parts of the input sequence when processing a specific element. The scoring captures the importance of an input element relative to every other element in the sequence: for every element, the mechanism compares it with every other element and calculates attention scores. These scores indicate how much focus should be placed on other elements when encoding the current element, which enables the model to capture context and relationships within the data more effectively.

How does it work?

You can check the working of Bahdanau Attention here. The following is the working of Vaswani's attention.

Attention focuses on the most important words in a sentence to form a contextual awareness or understanding. See Fig. 2. Mathematically, it can be defined as a scoring function that computes a score, or weight, for each element in the input sequence. The score is calculated using the dot product of the query (Q) and the key (K). The dot product aims to capture the similarity between Q and K.

Here is a general workflow,

  1. Starting with calculating the dot-product attention where Q (queries) and K (keys) are the projections of the input sequence.
Eq 1: score(Q, K) = Q·Kᵀ / √d_k
# use a linear layer to produce the projections for Q, K, and V
qkv_proj = nn.Linear(n_embd, 3 * n_embd)

# pass the input through the projection and split it into Q, K, and V
q, k, v = qkv_proj(x).chunk(3, dim=-1)

# scaled dot product between queries and keys
att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))

Next, we apply a softmax function to the scores to obtain a set of weights — or values that measure the importance of each and every word.

Eq 2: A = softmax(score(Q, K))
# normalizing the scores into attention weights
a = F.softmax(att, dim=-1)

Lastly, we compute the context vector as a weighted sum of the values V.

Eq. 3: Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
Fig 7: A simple flowchart of the attention mechanism. | Source: Attention is all you need
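Putting the three steps together, here is a minimal, self-contained sketch of single-head scaled dot-product attention (the function and variable names are mine, not from the original figures):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def single_head_attention(x, n_embd):
    # x: (batch, seq_len, n_embd)
    qkv_proj = nn.Linear(n_embd, 3 * n_embd)
    q, k, v = qkv_proj(x).chunk(3, dim=-1)                # project into Q, K, V
    att = (q @ k.transpose(-2, -1)) / math.sqrt(n_embd)   # Eq 1: scaled scores
    a = F.softmax(att, dim=-1)                            # Eq 2: attention weights
    return a @ v                                          # Eq 3: weighted sum of the values

x = torch.randn(2, 5, 16)
print(single_head_attention(x, n_embd=16).shape)          # torch.Size([2, 5, 16])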

To make attention more effective, the width of the mechanism was increased by creating multiple copies of the linear layers and the scaled dot-product operation. This allows the input to be processed in parallel, and the result was termed multi-head attention.

Fig 8: A flowchart of the multi-head attention mechanism | Source: Attention is all you need

When you increase the width, you give the input more opportunities to establish strong relationships with important words. This is self-attention, where you compute attention scores between elements of the same sequence.

Visual understanding of the self-attention

Attention via the transformer architecture has now become a cornerstone of modern LLMs such as the GPT series, Gemini series, LLaMA series, Claude series, Mistral series, and many more. In these models, the attention mechanism (particularly self-attention) allows the model to weigh the importance of different words in a sentence relative to each other, enabling a deep understanding of context and semantics.

Assuming that we have the sentence “Why attention is really important in LLMs? How it helps in understanding and reasoning?”, we need to break it into tokens and sub-tokens.

Tokens are the basic units or words that a model uses for processing text. Sub-tokens are smaller units within tokens.

To get a better understanding of tokens, and of how to create them, we will use a tokenization library such as tiktoken. It was developed by OpenAI and gives you GPT-style tokens, which is quite convenient if you are developing a GPT-style transformer. To create tokens and sub-tokens for any input sentence, use the following method:

import tiktoken

model_name = "gpt-4o"
text = "Why attention is really important in LLMs? How it helps in understanding and reasoning?"

# look up the tokenizer used by the given model and encode the sentence
encoder = tiktoken.encoding_for_model(model_name)
tokens = encoder.encode(text)

Upon visualizing you will find how tiktoken tokenizes the input (see the image below).

Fig 8: Visualization of subtokens.

You will see that the tokens are essentially integers assigned to every word (or sub-word) in the sequence. Now that we know how attention scores are calculated, let's create a pseudo-attention mechanism by taking the tokenized input and applying Eq 1 to it.
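The code behind the heatmap is not shown here; a rough reconstruction of such a pseudo-attention, which maps the tiktoken ids to random (untrained) embeddings so the scores are purely illustrative, could look like this:

import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_embd = 16

# assign each token id a random embedding vector (untrained, illustrative only)
emb = torch.randn(len(tokens), n_embd)

# Eq 1: scaled dot-product scores between every pair of tokens
scores = (emb @ emb.T) / math.sqrt(n_embd)

# Eq 2: softmax turns each row of scores into attention weights
weights = F.softmax(scores, dim=-1)
print(weights.shape)  # (num_tokens, num_tokens): the matrix plotted in the heatmap

With random embeddings each token is most similar to itself, which is why the diagonal of the resulting heatmap is the brightest region.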

Fig 9: Heat map illustration of similarity scores via attention weights

The heatmap above gives us a rough idea of how words are scored based on Eq 1, but ideally it is not like that: when working with a trained neural network the heatmap will change. In this pseudo-attention we see that words attending to themselves have the highest attention scores, which is evident from the bright diagonal of the matrix. But when we use attention via a transformer, the heatmap will change; it will show which words establish strong relationships with other words based on the context.

For instance, right now we see “understanding” having its highest attention score with “understanding” itself. However, it is plausible that after training the attention via the transformer architecture, the heatmap might show “attention” having a high score with “understanding” and “reasoning”.

Understanding attention via PyTorch

A better way to understand the self-attention mechanism is to learn how to code it. While coding, keep in mind that the input sequence must be projected into three copies, corresponding to query, key, and value, which you can write as:

 nn.Linear(n_embd, 3*n_embd)

The n_embd is the dimension of the embedding space. For example, if n_embd = 512, each input vector has 512 dimensions. The embedding space is very important: a higher dimension corresponds to better learning capability, but at the cost of higher computational demands.

But we usually start our attention journey by coding the token embedding layer and positional embedding layer — embedding block.

class Embeddings(nn.Module):
    def __init__(self, vocab_size, n_embd, block_size, device):
        super().__init__()
        self.wte = nn.Embedding(vocab_size, n_embd).to(device)
        self.wpe = nn.Embedding(block_size, n_embd).to(device)

    def forward(self, x):
        b, t = x.size()
        pos = torch.arange(0, t, dtype=torch.long, device=x.device)
        T = self.wte(x)
        P = self.wpe(pos).unsqueeze(0).expand(b, t, -1)
        return T + P

The __init__ method initializes the class with the following parameters:

  • vocab_size: It determines how many unique words are present in the input data
  • n_embd: The dimension of the embedding space, indicating the size of the embedding vectors.
  • block_size: Refers to the context length or the number of words to be processed during training and inference.

Within the constructor:

  • self.wte creates a word embedding layer (nn.Embedding) that converts word indices into dense vectors of size n_embd.
  • self.wpe creates a positional embedding layer (nn.Embedding) that provides positional information to the model, also of size n_embd.

Lastly, the forward method transforms input sequences into their corresponding embeddings.

vocab_size = 65
n_embd = 128
block_size = 1024  # must be at least the sequence length we feed in
device = 'cpu'

# x: a LongTensor of token indices with shape (batch=8, seq_len=1024)
x = torch.randint(0, vocab_size, (8, 1024))

emb = Embeddings(vocab_size, n_embd, block_size, device)
e = emb(x)
e.shape

>>> torch.Size([8, 1024, 128])

When working with PyTorch, it is crucial to remember that the output shape of one module must be compatible with the input shape of the next; this ensures an error-free flow of data. Now that we have the output from the embedding block, we must feed it to the attention module.

We make sure that the input of the attention is compatible with the output of the embedding model.

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd, n_head, dropout, block_size, device):
        super().__init__()
        self.n_embd = n_embd
        self.n_head = n_head
        self.dropout = dropout
        self.block_size = block_size

        self.qkv_proj = nn.Linear(n_embd, 3 * n_embd).to(device)
        self.c_proj = nn.Linear(n_embd, n_embd).to(device)
        self.attn_dropout = nn.Dropout(dropout)
        self.resid_dropout = nn.Dropout(dropout)
        # causal mask (added here) so that each position attends only to itself and earlier positions
        self.register_buffer("mask", torch.tril(torch.ones(block_size, block_size))
                             .view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.size()
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        att = self.attn_dropout(att)
        att_y = att @ v
        y = att_y.transpose(1, 2).contiguous().view(B, T, C)

        y = self.resid_dropout(self.c_proj(y))
        return y, att
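Continuing from the embedding output e above, a quick (illustrative) shape check confirms that the two modules fit together; the hyperparameters mirror the earlier toy example and are not from the benchmark configuration:

attn = CausalSelfAttention(n_embd=128, n_head=8, dropout=0.0, block_size=1024, device='cpu')
y, att = attn(e)
print(y.shape)    # torch.Size([8, 1024, 128]), same shape as the embedding output
print(att.shape)  # torch.Size([8, 8, 1024, 1024]): one (T x T) attention map per head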

Three important things I want to point out from the attention block above:

  1. Something that I already mentioned in Code Block 4: the input must be duplicated thrice, for the query, key, and value respectively. With proj_qkv = nn.Linear(128, 3*128) we are essentially asking the projection layer to output three times the embedding size so that it can accommodate all three projections.
  2. Now we can split the projection into Q, K, and V. A simple way to split is the .chunk method. However, this method can raise errors when scaling up. An alternative is to reshape the projection using:
  • the batch size B of the output from the embedding,
  • the sequence length T, also referred to as the time dimension because sequences are time-dependent,
  • the number of attention heads n_head, i.e., how many copies of the attention mechanism you create,
  • the per-head embedding dimension d_k, which represents the dimensionality of each attention head's subspace. Every attention copy works in a space of size d_k instead of n_embd, which distributes the input for efficient parallel processing; it is calculated as n_embd // n_head.
  3. Lastly, we reorder the qkv projection before splitting it into the individual projections. This can be done with the .permute(2, 0, 3, 1, 4) method, as shown below.
B, T, C = e.size()
n_embd = 128
n_head = 8
d_k = n_embd // n_head

# project the embeddings into a single tensor that holds Q, K, and V
qkv = nn.Linear(n_embd, 3 * n_embd)(e)

qkv_p = qkv.reshape(B, T, 3, n_head, d_k)

print(qkv_p.shape)
>>> torch.Size([8, 1024, 3, 8, 16])

qkv_p = qkv_p.permute(2, 0, 3, 1, 4)

print(qkv_p.shape)
>>> torch.Size([3, 8, 8, 1024, 16])

q, k, v = qkv_p[0], qkv_p[1], qkv_p[2]

print(q.shape)
>>> torch.Size([8, 8, 1024, 16])

Now that we have the attention we can write the code for the transformer block.

class TransformerBlock(nn.Module):
    def __init__(self, attention, n_embd, n_head, dropout, block_size, device,
                 sparse: bool = False, sparsity_pattern=None):
        super().__init__()
        if sparse:
            self.attn = attention(n_embd, n_head, dropout, block_size, sparsity_pattern, device)
        else:
            self.attn = attention(n_embd, n_head, dropout, block_size, device)
        self.ff = FeedForward(n_embd, dropout)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        attn_output, attn_weights = self.attn(self.ln1(x))
        x = x + attn_output
        x = x + self.ff(self.ln2(x))
        return x, attn_weights

This block does two important things:

  1. Normalizing the input before it enters the multi-head attention block and before it enters the feed-forward block (pre-layer normalization)
  2. Providing a residual connection to preserve the original information.
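Note that the block above also relies on a FeedForward module that is not shown here. A minimal GPT-style sketch that matches the constructor call FeedForward(n_embd, dropout) could look like this (the 4x expansion factor is an assumption borrowed from the usual Transformer MLP):

class FeedForward(nn.Module):
    def __init__(self, n_embd, dropout):
        super().__init__()
        # position-wise MLP: expand, apply a non-linearity, project back, then dropout
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)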

Lastly, we have the transformer model itself, which creates copies of the transformer block and sequentially passes the data from one block to the next.

GPT-3, for instance, uses 96 transformer layers in its 175-billion-parameter configuration.

class TransformerModel(nn.Module):
    def __init__(self, attention, vocab_size, n_embd, n_head, num_layers,
                 dropout, block_size, device, sparse=False, sparsity_pattern=None):
        super().__init__()
        self.embeddings = Embeddings(vocab_size, n_embd, block_size, device)
        self.blocks = nn.Sequential(*[TransformerBlock(attention, n_embd, n_head,
                                                       dropout, block_size, device,
                                                       sparse, sparsity_pattern)
                                      for _ in range(num_layers)])
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, x):
        x = self.embeddings(x)
        attn_weights_all = []
        for block in self.blocks:
            x, attn_weights = block(x)
            attn_weights_all.append(attn_weights)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits, attn_weights_all
Fig 10. Flowchart of Transformer
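To sanity-check the wiring end to end, here is a small (illustrative) forward pass; the hyperparameters are the same toy values used earlier, not the benchmark configuration:

model = TransformerModel(CausalSelfAttention, vocab_size=65, n_embd=128, n_head=8,
                         num_layers=6, dropout=0.0, block_size=1024, device='cpu')

x = torch.randint(0, 65, (8, 1024))   # a batch of token indices
logits, attn_weights_all = model(x)
print(logits.shape)                   # torch.Size([8, 1024, 65]): one logit per vocabulary entry
print(len(attn_weights_all))          # 6: one attention map per transformer layer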

Benchmarking Attentions

Over the years, the scientific and research community has developed various attention mechanisms. In the following section, I have written three of them, Causal, Flash, and Sparse attention, and trained them on the wiki_medical_terms dataset, which can be found on Hugging Face.

The purpose of this experiment was to see how the custom GPT with different attention mechanisms performs when trained on CUDA and MPS. As a part of this startup, my aim is to create a foundational model to experiment with various ideas and approaches, and also to use it for prototyping on an M1 MacBook Air. This project also emphasizes scaling the model efficiently.

Below is the configuration I used for the custom GPT. The model I used was significantly smaller than production-scale models.

# I/O
out_dir = 'out'
eval_interval = 2000
log_interval = 1
eval_iters = 200
eval_only = False
always_save_checkpoint = True
init_from = 'scratch' # 'scratch' or 'resume' or 'gpt2*'

# Data
# data_path = 'data/openwebtext.txt' # Path to your text file
gradient_accumulation_steps = 5 * 8
batch_size = 2
block_size = 512

# Model
n_layer = 6 # Reduce the number of layers
n_head = 8 # Reduce the number of attention heads
n_embd = 512 # Reduce the embedding size
dropout = 0.0
bias = False

# AdamW optimizer
learning_rate = 6e-4
max_iters = 1000
weight_decay = 1e-1
beta1 = 0.9
beta2 = 0.95
grad_clip = 1.0

# Learning rate decay settings
decay_lr = True
warmup_iters = 200
lr_decay_iters = 1000
min_lr = 6e-5

# DDP settings
backend = 'gloo'
Fig 11. Bar graph showing validation loss

Validation loss: Fig 11 shows that the mps device has a lower evaluation or validation loss compared to its GPU counterpart. Note that this validation loss is calculated just before training starts. In essence, evaluating the model before any training allows you to establish a baseline performance metric and also ensures that the model pipeline is fully functional.

But an interesting thing to observe is that the Causal attention loss on mps is quite a bit higher than that of Flash and Sparse attention, indicating that the former is less efficient than the latter two. However, notice how similar the losses of Causal, Flash, and Sparse attention are when training on the GPU.

This indicates that training efficiency is strongly affected by the choice between an on-device and a cloud accelerator.

Fig 12. Graph showing training loss

Training Loss: M1 MacBook Air showed faster convergence and reached lower training loss in fewer steps. Nvidia T4 had slower convergence with higher final training loss. The M1 MacBook Air’s architecture might be more efficient for this specific model and dataset.

Fig 13. Memory Footprint Utilization

Memory Footprint Utilization (MFU): The M1 MacBook Air showed lower and more stable memory utilization. On the other hand, Nvidia T4 showed higher and fluctuating memory utilization. The M1 MacBook Air manages memory more efficiently, leading to stable performance.

Fig 14. Process Memory Available

Process Memory Available: Nvidia T4 has higher available memory (~25,000 MB) compared to M1 MacBook Air (~10,000 MB). This larger memory capacity could be advantageous for larger models or datasets.

Fig 15. GPU Power Usage

GPU Power Usage: the Nvidia T4 consistently shows high GPU usage (~100%), while the M1 MacBook Air shows minimal to no GPU usage. The Nvidia T4 leverages its dedicated GPU for intensive tasks, while the M1 relies more on its efficient CPU. This also implies that training on the M1 is slower compared to the dedicated GPU.

Fig 16. Dashboard showing model’s performances on mps and cuda

I found that more stable training processes tend to avoid abrupt changes in learning, leading to smoother, faster, and more reliable convergence. Because of the SoC design, the flow of data is smooth, making it efficient for certain models and datasets. Also, the communication between the different modules of the ARM-based chip makes loading the model cheaper. For instance, when I loaded the model on the M1 Air it reported a model size of 19.18M, compared to 44.63M when loaded on cuda. The M1 chip takes a distributed approach, with the CPU, GPU, and Neural Engine working together to handle different parts of the deep-learning computation.

Another thing I noticed is that although convergence is faster, the raw training speed is slower on the M1 compared to cuda. Also, when we start scaling the model we hit memory limitations and swap usage increases.

But when it comes to the model's performance, it is quite evident that Flash attention performs best on cuda because of its convergence rate.

Flash Attention is highly efficient due to its unified computation of QKV in a single linear operation. This helps in reducing the number of required matrix multiplications.

It leverages tensor rearrangement for parallel processing, allowing computations for all attention heads to be performed simultaneously on the GPU. The scaled dot-product attention mechanism stabilizes gradients, and causal masking is efficiently implemented using tensor operations. Optimized functions like softmax and dropout are utilized for calculating attention weights, and the attention output computation is parallelized across all heads and batches. The overall design ensures fast and memory-efficient processing, making Flash Attention superior in performance compared to traditional attention mechanisms.
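As a side note that is not part of the code above: recent PyTorch releases (2.0+) expose a fused scaled dot-product kernel, torch.nn.functional.scaled_dot_product_attention, which can dispatch to a flash-style implementation on supported GPUs. Assuming q, k, and v are shaped (B, n_head, T, d_k) as inside the forward method above, the manual mask/softmax/dropout path could be swapped for:

import torch.nn.functional as F

# is_causal=True applies the causal mask inside the fused kernel,
# so the explicit masked_fill, softmax, and attention-dropout lines are not needed
y = F.scaled_dot_product_attention(
    q, k, v,
    dropout_p=self.dropout if self.training else 0.0,
    is_causal=True,
)

Note that the fused kernel does not return the attention weights, so the (y, att) return signature used earlier would need to change.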

While further experimentation is required, one can definitely see that the choice of attention mechanism and training hardware plays a major role in developing a good custom GPT. Through this experiment you can also see that communication within the chip is very important. While the M1 chip shows promise for efficient deep-learning training of smaller models, its training performance may still lag behind dedicated GPU-accelerated systems, especially for large and complex models.

Conclusion

The attention mechanism has revolutionized modern deep learning models, enabling them to focus on the most relevant parts of input data, whether it be text or images. From the initial introduction of RNNs and LSTMs, which struggled with long sequences, to the development of the transformative self-attention mechanism in Transformers, attention has continually advanced the field. This evolution has culminated in powerful large language models like GPT, capable of deep contextual understanding and efficient parallel processing.

Through practical implementation and benchmarking, it becomes evident that the choice of attention mechanism and hardware significantly impacts model performance. The M1 MacBook Air, despite its efficient memory management and fast convergence for smaller models, still falls short in training speed and capacity compared to GPU-accelerated systems like the Nvidia T4. Flash Attention, with its superior efficiency and parallel processing capabilities, emerges as a standout mechanism for high-performance training on GPUs.

In summary, the ongoing advancements in attention mechanisms and hardware optimizations are crucial for the future of custom GPT development. As research progresses, these innovations will continue to shape the landscape of deep learning, offering new possibilities for more efficient and powerful AI models.

Cited as

@article{barla2024attention,
  title   = "A study on Attention mechanism",
  author  = "Barla, Nilesh",
  journal = "perceptronai.in",
  year    = "2024",
  url     = "www.perceptronai.in/a-study-on-attention-mechanism/"
}

References

  1. Attention? Attention!
  2. The Unreasonable Effectiveness of Recurrent Neural Networks
  3. “Attention and Memory in Deep Learning and NLP.”
  4. “Neural machine translation by jointly learning to align and translate.”
  5. “Attention is all you need.”
  6. Language Models are Few-Shot Learners
  7. NanoGPT — Andrej Karpathy
  8. flash-attention — Lucidrains
  9. Tiktoken — OpenAI
