Simplified Explanation of Attention — An Approach to Efficient Memory Management for LLM Serving
Sequence Model
Sequence models were introduced to process sequential data, where the order of elements matters. Recurrent Neural Networks (RNNs) are one of the most widely used architectures for handling information and tasks sequentially.
Below are a few use cases where sequence models are applied.
1. Time-Series Data: Weather data, financial data, sensor data, etc.
2. Natural Language Processing (NLP): Text translation, summarization, sentiment analysis.
3. Speech Recognition: The order of the audio input is essential for recognizing patterns and enabling audio captions.
4. Video Analysis: Video frames form sequential data, so they must be processed in order to analyze video effectively.
Recurrent Neural Networks (RNNs) are powerful, but they also come with certain limitations.
- Long-Range Input Dependencies: It is difficult for RNNs to learn and retain dependencies across long input sequences.
- Vanishing and Exploding Gradients: With long-term dependencies, gradients in backpropagation can become extremely small (vanishing gradients) or extremely large (exploding gradients).
- Computationally Intensive: RNNs can be computationally intensive on large and complex datasets. Both training and inference serving take a significant amount of time, and training requires a large number of steps.
- Lack of Parallelism: RNNs process sequence elements one at a time, which prevents parallelization and slows down training. In simple terms, even when running on GPU servers for both training and inference, RNN-based models cannot leverage the full benefits of GPUs (see the sketch after this list).
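To make the lack of parallelism concrete, here is a minimal sketch of an RNN-style forward pass (the shapes and weights are illustrative assumptions, not from any particular model): every step depends on the hidden state of the previous step, so the loop over time cannot be parallelized.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """x_seq: (seq_len, input_dim) -> hidden states of shape (seq_len, hidden_dim)."""
    hidden_dim = W_hh.shape[0]
    h = np.zeros(hidden_dim)
    outputs = []
    for x_t in x_seq:                                 # strictly sequential loop
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)      # h_t depends on h_{t-1}
        outputs.append(h)
    return np.stack(outputs)

# Illustrative shapes: 6 time steps, input_dim=4, hidden_dim=8
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
W_xh = rng.normal(size=(4, 8)) * 0.1
W_hh = rng.normal(size=(8, 8)) * 0.1
b_h = np.zeros(8)
print(rnn_forward(x, W_xh, W_hh, b_h).shape)  # (6, 8)
```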
The attention mechanism addresses the above-mentioned limitations of traditional RNNs when handling sequential data. Below are the problems that attention mechanisms solve.
Attention Mechanism
Attention is a mechanism for neural encoder-decoder models that was presented by Dzmitry Bahdanau et al. in their 2014 paper “Neural Machine Translation by Jointly Learning to Align and Translate”.
Long Range Dependencies
Solution with Attention: The attention mechanism allows the model to focus selectively on different parts of the input sequence, regardless of the input length. This enables the model to capture long-range dependencies effectively.
Flexible Contextual Information
Solution with Attention: The attention mechanism provides dynamic, flexibly sized contextual information. The model can assign different weights to different parts of the input sequence.
Parallelization
Solution with Attention: In the context of the transformer architecture, attention allows for parallelization. Different parts of the sequence can be processed concurrently on modern hardware like GPUs and TPUs.
Reducing Information Loss
Solution with Attention: Attention mechanisms help to mitigate information loss by allowing the model to selectively attend to different parts of the sequence at each layer, preserving important contextual information.
Efficiency in Natural Language Processing (NLP)
Solution with Attention: Attention mechanisms have shown significant success in NLP tasks. Models like the transformer, which employs self-attention, have become state-of-the-art in machine translation, language understanding, and other NLP applications.
Let's see how the attention mechanism decides which values are important! The important values are often referred to as attention weights, and the model learns to calculate them during the training phase.
Calculating Attention Weights
Mathematically, attention weights are calculated using a mechanism called “Scaled Dot-Product Attention”, which is commonly used in transformer models.
Before going deeper into attention, let's see how scaled dot product works!
Scaled Dot Product Operation:
Let’s consider two vectors q and k, each of dimension d. Their dot product is calculated as q · k = q1·k1 + q2·k2 + … + qd·kd, and the scaled dot product divides this value by √d.
The scaling helps prevent the dot product from becoming too large, providing numerical stability during computations.
Example
Consider two example vectors; the short sketch below (with illustrative values) computes their dot product and the scaled dot product.
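A minimal numeric sketch (the vector values are illustrative, not taken from the original post):

```python
import numpy as np

# Two illustrative vectors of dimension d = 3
q = np.array([1.0, 2.0, 3.0])
k = np.array([4.0, 5.0, 6.0])

dot = np.dot(q, k)              # 1*4 + 2*5 + 3*6 = 32.0
scaled = dot / np.sqrt(len(k))  # 32 / sqrt(3) ≈ 18.475

print(dot, scaled)
```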
In attention, each input word (token) is projected into three vectors: Q (Query), K (Key), and V (Value), together referred to as QKV.
Query, Key, and Value Matrices:
- Q (Query): Contains information to inquire about.
- K (Key): Contains information to compare against.
- V (Value): Contains information to use based on the attention.
QKV is similar to retrieving the value V for a query Q based on the key K in a database.
In a database, when you issue a query based on a certain key, you retrieve the associated value. In a neural network with QKV attention, the attention mechanism calculates a weighted sum of values based on the similarity (or compatibility) between the query and keys. This weighted sum represents the context-aware representation.
Hence, Scaled Dot-Product Attention proceeds in the following steps.
Step 1: Dot-Product
Calculate the dot product of the Query matrix with the transpose of the Key matrix (QKᵀ). This gives us a measure of how much each element in the Query matrix is related to each element in the Key matrix.
Step 2: Scaled Dot-Product Attention
To ensure the dot products don’t get too large, we scale them by dividing by the square root of the dimension of the Key vectors (√d_k).
Step 3: Softmax Activation
Apply the Softmax function to the scaled attention scores to obtain attention weights. Softmax turns these scores into probabilities.
Step 4: Weighted Sum
Use the attention weights to calculate a weighted sum of the Value matrix.
This step emphasizes the values (information) that correspond to higher attention weights.
In summary, Scaled Dot-Product Attention involves looking at how each element in the Query matrix relates to each element in the Key matrix, scaling the results, turning them into probabilities, and then using these probabilities to weight the corresponding elements in the Value matrix for the final output.
The mechanism enables the model to dynamically focus on relevant information based on the context of the task.
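Putting the four steps together, the whole operation is Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. Below is a minimal NumPy sketch of it (the shapes are illustrative assumptions):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # Steps 1-2: QK^T scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)              # Step 3: attention weights (probabilities)
    return weights @ V, weights                     # Step 4: weighted sum of the Values

# Illustrative usage: 4 tokens, head dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (4, 8) (4, 4)
```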
In the sections above, we reviewed how Scaled Dot-Product Attention works! In the transformer, multi-head attention runs multiple Scaled Dot-Product Attention operations in parallel, as sketched below.
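A rough sketch of the idea, reusing the `scaled_dot_product_attention` function from the sketch above (the projection weights and shapes are illustrative assumptions): the input is projected into per-head Q, K, and V, each head runs scaled dot-product attention independently, and the head outputs are concatenated and projected.

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """x: (seq_len, d_model); W_*: (d_model, d_model). Splits projections into heads."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def project_and_split(W):
        # Project, then reshape to (num_heads, seq_len, d_head)
        return (x @ W).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = project_and_split(W_q), project_and_split(W_k), project_and_split(W_v)
    # One scaled dot-product attention per head (function defined in the sketch above)
    heads = [scaled_dot_product_attention(Q[h], K[h], V[h])[0] for h in range(num_heads)]
    concat = np.concatenate(heads, axis=-1)   # (seq_len, d_model)
    return concat @ W_o                       # final output projection

# Illustrative usage: 4 tokens, d_model=16, 2 heads
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (4, 16)
```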
Key and Value states are used for calculating the scaled dot product attention!
KV Cache
Key and Value states are cached across token-generation steps in decoder or encoder-decoder models.
In a nutshell,
- BERT models are not generative models, i.e., they are encoder-only models. Hence, encoder-only models do not use cached KV states.
- An autoregressive GPT model, or any encoder-decoder generative model, generates tokens one at a time and therefore benefits from a KV cache.
Without a KV cache, each generation step would recalculate the attention states of all previous tokens, which makes generating every next token needlessly expensive.
By caching the previous Key and Value states, we only need to calculate the attention for the new token. The KV cache is an optimization technique that keeps the previous token states in memory and makes the per-step matrix multiplications smaller and faster.
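A minimal sketch of the idea for a single attention head (the shapes and cache layout are illustrative assumptions, not the implementation of any particular library): at each decoding step, only the new token's K and V are computed and appended to the cache, and the new token attends over the full cached history.

```python
import numpy as np

d_head = 8
rng = np.random.default_rng(2)
k_cache = np.empty((0, d_head))   # cached Keys of previously generated tokens
v_cache = np.empty((0, d_head))   # cached Values of previously generated tokens

def decode_step(q_new, k_new, v_new):
    """Append the new token's K/V to the cache and attend against the whole history."""
    global k_cache, v_cache
    k_cache = np.vstack([k_cache, k_new])    # append instead of recomputing history
    v_cache = np.vstack([v_cache, v_new])
    scores = q_new @ k_cache.T / np.sqrt(d_head)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ v_cache                 # context vector for the new token

# Generate 5 tokens: each step reuses the cache built by the earlier steps
for step in range(5):
    q, k, v = rng.normal(size=(3, d_head))
    out = decode_step(q, k, v)
print(k_cache.shape)  # (5, 8): one cached Key per generated token
```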
The HuggingFace transformers library provides a Boolean `use_cache` parameter to enable KV caching during generation.
This KV caching significantly speeds up text generation in the inference server and enables serving with higher throughput.
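A minimal usage sketch with the HuggingFace transformers library (the model name and generation settings are illustrative; `use_cache` is typically enabled by default):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Attention is", return_tensors="pt")

# use_cache=True reuses cached Key/Value states across decoding steps
outputs = model.generate(**inputs, max_new_tokens=20, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```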
But, wait! Is caching KV values in GPU memory actually handled efficiently?
The answer is no.
LLM serving becomes memory-bound, and we may frequently get an out-of-memory error!
GPU memory holds the Key-Value (KV) tensors: the KV cache of every attention head lives in GPU memory until text generation completes. By default, the HuggingFace transformers and Text Generation Inference (TGI) libraries store this cache in contiguous memory.
Because this memory is not managed efficiently, the system wastes 60% to 80% of it due to fragmentation and over-reservation.
PagedAttention — vLLM Open-Source Library
vLLM is an open-source library for fast LLM serving. It applies the operating system (OS) concept of paging to solve issues such as fragmentation and inefficient use of available memory.
OS Paging Concept
With paging, the concept of contiguous memory allocation is eliminated. Instead of dividing physical memory into fixed-size contiguous blocks, both the process’s virtual address space and physical memory are divided into fixed-size blocks called pages and frames, respectively. This allows for a more flexible and efficient use of memory.
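A toy sketch of the paging idea applied to a KV cache (purely illustrative; this is not vLLM's actual data structure): a sequence's logical KV blocks map through a block table to physical blocks that need not be contiguous in GPU memory.

```python
# Toy illustration of paging a KV cache (not vLLM's actual implementation)
BLOCK_SIZE = 4                                   # tokens stored per block
free_blocks = [5, 2, 7, 0, 3, 1, 6, 4]           # available physical blocks (any order)
block_table = []                                 # logical block index -> physical block

def append_token(token_index):
    """Allocate a new physical block only when the current block is full."""
    if token_index % BLOCK_SIZE == 0:
        block_table.append(free_blocks.pop(0))
    logical_block, offset = divmod(token_index, BLOCK_SIZE)
    return block_table[logical_block], offset    # where this token's K/V would live

for t in range(10):
    print(t, append_token(t))
# Tokens 0-3 land in physical block 5, tokens 4-7 in block 2, tokens 8-9 in block 7;
# the blocks are not contiguous, and no large contiguous region is reserved up front.
```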
This is exactly what the vLLM library implements for attention: PagedAttention, which handles GPU memory more efficiently by storing the KV cache in non-contiguous fixed-size blocks.
As per vLLM’s benchmarks, its serving backend can achieve up to 30x higher throughput compared with the HuggingFace backend.
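A minimal usage sketch of vLLM's offline Python API (the model name and sampling settings are illustrative); PagedAttention-based KV-cache management happens internally:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                      # illustrative small model
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

prompts = ["Explain the attention mechanism in one sentence."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```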
Summary
This blog post covered sequence models and their applications, delving into how Recurrent Neural Networks (RNNs) are used within this framework. It explored the limitations of RNNs, such as challenges with long-range dependencies, vanishing and exploding gradients, the absence of parallel computation, and their computational intensity.
To address these challenges, the blog explained the attention mechanism, detailing how Scaled Dot-Product Attention operates mathematically. It broke down the concepts of Query (Q), Key (K), and Value (V) in simple terms and outlined how multi-head attention works. The post also emphasized the significance of the Key-Value (KV) cache in memory and how it plays a crucial role, particularly in decoder models like GPT.
Furthermore, it highlighted the problems of frequent out-of-memory errors and inefficient memory management on GPU machines, and explained the advantages of vLLM’s PagedAttention, showcasing how it manages memory efficiently by borrowing the operating system’s paging concept for LLM serving, resulting in up to 30x higher throughput.
In upcoming blog posts, we will talk about FlashAttention!