How Much GPU Memory Do You Really Need for Efficient LLM Serving?

Jan 27, 2025 · 7 min read

Overview of LLM-GPU-Calc

🛠 Source Code: GitHub Repository

This story was written with the assistance of an AI writing program.

Introduction

Large Language Model (LLM) services often face critical questions:

✅ How many concurrent users or requests can the service handle?
✅ What level of GPU resources is necessary for smooth operation?

In this guide, I will explore how to predict and calculate GPU requirements for LLM serving using vLLM as the inference engine. I will cover:

  • Understanding KV cache and memory profiling
  • How GPU block allocation works
  • A comprehensive formula for estimating GPU memory requirements

Key Concepts

KV Cache

The KV cache stores key-value pairs for attention mechanisms in transformer models. It is instrumental in reducing the computation cost of subsequent tokens during inference. The size of the KV cache is influenced by:

  • Batch size: Number of concurrent requests processed in one forward pass.
  • Token length: Input and output sequence lengths.
Illustration of the key-value caching mechanism [Source]

Role of KV Cache in vLLM

In vLLM, the KV cache allocation directly affects the maximum number of concurrent requests and sequence lengths that can be processed. If we know the input/output length and the maximum number of concurrent requests, we can roughly estimate the GPU specifications required for inference.

The engine profiles the GPU’s memory usage to determine how many KV blocks can be allocated without causing out-of-memory (OOM) errors. Here is a breakdown of the profiling process:

  1. Profile Run: A forward pass is performed using dummy inputs to measure memory usage during inference.
  2. Memory Segmentation: The engine calculates available KV cache memory by subtracting model weights, non-torch memory, and PyTorch activation memory from total GPU memory.

Understanding Profiling in vLLM

The profiling process in vLLM is designed to balance memory for both the model and the KV cache. Here’s how it works:

Why Profiling is Necessary

In vLLM, a profile run is conducted before allocating the KV cache in order to separate the memory used for model inference from the memory available for the KV cache. Memory that grows dynamically during inference must be tracked, because intermediate operations, CUDA memory management, caching, and reservation policies can cause unexpected spikes in usage. Profiling accounts for these factors, ensuring that dynamic memory-management overhead is included in the calculations and minimizing unforeseen memory growth.

Profiling in Action

During a profile run, peak activation memory is measured by forwarding a dummy input sequence with the maximum batch tokens (max_num_batched_tokens) through the model. This simulates real-world GPU usage and helps determine memory requirements for the target environment. The following code illustrates this process:

# https://github.com/vllm-project/vllm/blob/v0.6.6.post1/vllm/worker/worker.py
# Execute a forward pass with dummy inputs to profile the memory usage
# of the model.
with memory_profiling(
        baseline_memory_in_bytes=total_gpu_memory - self.init_gpu_memory,
        weights_memory_in_bytes=self.model_runner.model_memory_usage) as result:
    self.model_runner.profile_run()
    torch.cuda.synchronize()
# https://github.com/vllm-project/vllm/blob/v0.6.6.post1/vllm/worker/model_runner.py
def profile_run(self) -> None:
    # Enable top-k sampling to reflect the accurate memory usage.
    sampling_params = SamplingParams(top_p=0.99, top_k=self.vocab_size - 1)
    max_num_batched_tokens = self.scheduler_config.max_num_batched_tokens
    max_num_seqs = self.scheduler_config.max_num_seqs

The profiling process also accounts for intermediate activation memory, which increases during inference:

  • A smaller max_num_batched_tokens reduces intermediate activations, freeing up KV cache space, but limits how many tokens the model can process in a single batch.
  • Applications must consider the maximum token length requirements to balance activation memory and KV cache size effectively.

This ensures that:

  • Memory used by the model and intermediate activations is measured.
  • Available memory for KV cache is dynamically calculated.

Formula for GPU Memory Allocation

The GPU memory available for KV cache is calculated as:

available_kv_cache_memory = total_gpu_memory * gpu_memory_utilization 
- (model_weight + non_torch_memory + pytorch_activation_peak_memory)
vLLM GPU memory allocation

Component Breakdown:

  • total_gpu_memory: Total memory of the GPU.
  • gpu_memory_utilization: Fraction of memory allocated for the engine.
  • model_weight: Memory occupied by the model parameters.
  • non_torch_memory: Memory overhead unrelated to PyTorch, dependent on GPU type and number of GPUs.
  • pytorch_activation_peak_memory: Memory required for intermediate activations during inference.
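
To make this concrete, here is a minimal sketch of the calculation in Python. The function and the example numbers (a 40 GiB GPU at 90% utilization, 15 GiB of weights, roughly 1 GiB each of non-torch overhead and peak activations) are illustrative assumptions, not vLLM internals.

# Minimal sketch: memory left for the KV cache after the engine's fixed costs.
# All values are in bytes; the example inputs are assumptions for illustration.
GiB = 1024 ** 3

def available_kv_cache_memory(total_gpu_memory, gpu_memory_utilization,
                              model_weight, non_torch_memory,
                              pytorch_activation_peak_memory):
    usable = total_gpu_memory * gpu_memory_utilization
    return usable - (model_weight + non_torch_memory +
                     pytorch_activation_peak_memory)

# Example: 40 GiB GPU, 90% utilization, 15 GiB weights, ~1 GiB non-torch,
# 1 GiB peak activations -> about 19 GiB left for the KV cache.
print(available_kv_cache_memory(40 * GiB, 0.9, 15 * GiB, 1 * GiB, 1 * GiB) / GiB)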

Model Weight

Memory occupied by the model parameters, calculated as:

model_weight = number_of_model_parameters * parameter_data_type_size

Parameter Data Type Sizes:

  • float32: 4 bytes
  • float16 or bfloat16: 2 bytes
  • int8 or fp8: 1 byte
  • int4 or awq: 0.5 byte
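
As a quick illustration, the sketch below applies this formula; the 8-billion-parameter model and the dictionary of data-type sizes are assumptions taken from the list above.

# Sketch: model weight memory = number of parameters * bytes per parameter.
DTYPE_SIZE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2,
                    "int8": 1, "fp8": 1, "int4": 0.5, "awq": 0.5}

def model_weight_bytes(num_parameters, dtype):
    return num_parameters * DTYPE_SIZE_BYTES[dtype]

# Example (assumed): an 8B-parameter model in bfloat16 needs about 16 GB for weights.
print(model_weight_bytes(8e9, "bfloat16") / 1e9, "GB")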

Non-Torch Memory

Memory overhead unrelated to PyTorch, dependent on GPU type and number of GPUs:

  • Estimated as about 1GB initially for calculation purposes.

PyTorch Activation Peak Memory

  • Peak activation memory during inference can be estimated following Reducing Activation Recomputation in Large Transformer Models (NVIDIA, 2022).
  • Activations are the tensors generated during the forward pass. During training, these activations are saved for back-propagation and gradient calculation.
  • During inference, the activation values of each layer are not stored separately; the output of the previous layer is simply used as the input of the next layer.
  • For inference, only the per-layer peak is required and dropout is disabled, whereas training uses dropout and stores the activations of every layer.
  • Therefore, I treat pytorch_activation_peak_memory as the memory used when passing through a single layer.
  • Assumption: activations are in FP16 (2 bytes per element); all measurements are in bytes.

Key Parameters from Transformer Architectures:

  • s = sequence length, b = batch size, h = hidden dimension size (a layer's input and output tensors each have s * b * h elements)
  • i = intermediate (MLP) size
  • a = number of attention heads
  • L = number of transformer layers
Transformer architecture. Each gray block represents a single transformer layer that is replicated L times [Source]

Total Activation Memory during inference:

pytorch_activation_peak_memory = (attention_block_memory + mlp_activation_memory + layer_norm_memory) * 1

For training, the per-layer sum would instead be multiplied by the number of layers (L); for inference, only one layer's peak is needed.

Expanding the terms gives:

pytorch_activation_peak_memory = sequence_length * (batch_size = 1) * (18 * hidden_size + 4 * intermediate_size)

During the profile run, vLLM forwards a dummy input whose total token count equals max_num_batched_tokens in a single batch. In other words, the batch size in pytorch_activation_peak_memory is effectively 1, and sequence_length corresponds to max_num_batched_tokens.

Memory Breakdown:

1. Attention Block Memory = 10 * s * b * h (bytes)

For inference, dropout is excluded. With FlashAttention, the memory related to softmax is largely eliminated, dropping to almost zero (https://www.determined.ai/blog/act-mem-1).

Speedup of FlashAttention over the PyTorch implementation of attention on GPT-2 [Source]
  • QK^T matrix multiply: 4 * s * b * h (storage of Q and K is required)
  • QKV matrix multiplies: 2 * s * b * h (shared input storage)
  • Attention over values (V): 2 * s * b * h (the softmax-related 2 * a * s^2 * b term is dropped)
  • Linear projection: 2 * s * b * h
  • Excluded: softmax (2 * a * s^2 * b), which vanishes with FlashAttention, plus softmax dropout (a * s^2 * b) and attention dropout (s * b * h), which are disabled during inference.

2. MLP Activation Memory = 4 * s * b * (i + h) (bytes)

  • self.gate_proj(x): 2 * s * b * h
  • self.act_fn: 2 * s * b * i
  • self.up_proj(x): 2 * s * b * h
  • self.down_proj: 2 * s * b * i

3. Layer Normalization Memory = 4 * s * b * h (bytes)

  • Each normalization operation stores its input, 2 * s * b * h.
  • There are two layer norms per layer, for a total of 4 * s * b * h.
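
Putting the three parts together, the short sketch below checks that the per-layer sum reduces to the s * b * (18 * h + 4 * i) formula above; the hidden and intermediate sizes are illustrative assumptions roughly matching a Llama-3-8B-class model.

# Sketch: per-layer activation peak in bytes (FP16 activations, batch size 1).
def activation_peak_bytes(s, h, i, b=1):
    attention_block = 10 * s * b * h      # QK^T, QKV, values, linear projection
    mlp_activation = 4 * s * b * (i + h)  # gate/up/down projections + activation fn
    layer_norm = 4 * s * b * h            # two layer norms, 2 * s * b * h each
    return attention_block + mlp_activation + layer_norm

# Example (assumed sizes): s = 8192 tokens, h = 4096, i = 14336.
s, h, i = 8192, 4096, 14336
assert activation_peak_bytes(s, h, i) == s * 1 * (18 * h + 4 * i)
print(activation_peak_bytes(s, h, i) / 1024 ** 3, "GiB")  # 1.0 GiB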

GPU Block Allocation in vLLM

For vLLM to function, it requires sufficient GPU blocks to store the KV cache. The relationship is defined as:

# https://github.com/vllm-project/vllm/blob/v0.6.6.post1/vllm/worker/worker.py
max_seq_len = block_size * num_gpu_blocks
if not is_attention_free and max_model_len > max_seq_len:
    raise ValueError(
        f"The model's max seq len ({max_model_len}) "
        "is larger than the maximum number of tokens that can be "
        f"stored in KV cache ({max_seq_len}). Try increasing "
        "`gpu_memory_utilization` or decreasing `max_model_len` when "
        "initializing the engine.")

For example, if the model’s max_model_len is 4096 and each block stores 16 tokens, then at least 4096 / 16 = 256 GPU blocks are required.
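
This check boils down to one line of arithmetic; the sketch below (assuming a block_size of 16 tokens) reproduces it.

# Sketch: minimum number of GPU blocks so that max_model_len fits in the KV cache.
import math

def min_gpu_blocks(max_model_len, block_size=16):
    return math.ceil(max_model_len / block_size)

print(min_gpu_blocks(4096))  # 256 blocks for a 4096-token context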

Estimating GPU Memory Given Concurrent Users

KV Cache Formula

The memory required for the KV cache per batch is:

kv_cache_memory_per_batch = (2 * kv_attention_heads * head_dim * num_layers * kv_data_type_size) * sequence_length

  • kv_attention_heads: Number of KV attention heads in the transformer.
  • head_dim: Dimensionality of each attention head.
  • num_layers: Number of transformer layers.
  • kv_data_type_size: Memory size of the KV cache data type (e.g., FP16 = 2 bytes).
  • sequence_length: Input + output token length.

The total KV cache size is kv_cache_memory_per_batch * batch_size, where batch_size is the number of concurrent sequences.
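
A minimal sketch of this calculation is shown below; the configuration (8 KV heads, head_dim of 128, 32 layers, FP16 KV cache) is an illustrative assumption in the style of a Llama-3-8B-class model.

# Sketch: KV cache bytes for a single sequence, then for a batch.
def kv_cache_bytes_per_seq(kv_attention_heads, head_dim, num_layers,
                           kv_data_type_size, sequence_length):
    # The factor of 2 accounts for storing both K and V.
    return (2 * kv_attention_heads * head_dim * num_layers
            * kv_data_type_size * sequence_length)

# Example (assumed): 8 KV heads, head_dim 128, 32 layers, FP16, 4096-token sequences.
per_seq = kv_cache_bytes_per_seq(8, 128, 32, 2, 4096)
print(per_seq / 1024 ** 2, "MiB per sequence")                      # 512.0 MiB
print(per_seq * 64 / 1024 ** 3, "GiB for 64 concurrent sequences")  # 32.0 GiB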

Estimating the GPU Memory Given the Number of Concurrent Users

When available_kv_cache_memory is full, requests become pending, leading to an increase in Time to First Token (TTFT). Therefore, the number of concurrent requests or connections that can be processed by a limited GPU is determined by the point at which 100% of the available KV cache memory is utilized.

The required GPU memory can be calculated by reversing the formula obtained above.

Required GPU memory = [(model_weight + non_torch_memory + pytorch_activation_peak_memory) + kv_cache_memory_per_batch * concurrent_users] / gpu_memory_utilization
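
A hedged sketch of this inversion, using the same components as before (all values in GB for readability):

# Sketch: total GPU memory needed to serve a target number of concurrent users.
def required_gpu_memory(model_weight, non_torch_memory,
                        pytorch_activation_peak_memory,
                        kv_cache_memory_per_batch, concurrent_users,
                        gpu_memory_utilization=0.9):
    fixed = model_weight + non_torch_memory + pytorch_activation_peak_memory
    return (fixed + kv_cache_memory_per_batch * concurrent_users) / gpu_memory_utilization

# Example (values in GB, matching the practical example below): 15 GB weights,
# 0.4 GB non-torch, 1 GB activations, 0.2 GB KV cache per sequence, 98 users.
print(round(required_gpu_memory(15, 0.4, 1, 0.2, 98), 2))  # 40.0 GB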

Estimating the Number of Concurrent Users Given the GPU Memory

As noted above, once available_kv_cache_memory is full, additional requests become pending and Time to First Token increases, so a limited GPU can serve requests only up to the point where 100% of the available KV cache memory is in use.

The number of concurrent users that can be supported is determined by:

max_concurrent_users = available_kv_cache_memory // kv_cache_memory_per_batch

Where:

  • kv_cache_memory_per_batch: Memory required for a single batch.
  • available_kv_cache_memory: Memory allocated for KV cache after profiling.
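
The same relationship in code, as a minimal sketch:

# Sketch: how many concurrent sequences fit in the available KV cache memory.
def max_concurrent_users(available_kv_cache_memory, kv_cache_memory_per_batch):
    return int(available_kv_cache_memory // kv_cache_memory_per_batch)

For instance, with 19.6 GB of available KV cache memory and 0.2 GB per sequence, this gives 98 concurrent users, matching the practical example below.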

Practical Example

Assume the following:

  • Total GPU memory: 40GB
  • GPU memory utilization: 90% (0.9)
  • Model weight: 15GB
  • Non-torch memory: 400MB
  • PyTorch activation peak memory: 1GB

Then:

available_kv_cache_memory = (40 * 0.9 - 15 - 0.4 - 1) GB = 19.6GB

If the KV cache for one batch requires 200MB:

max_concurrent_users = 19.6GB // 200MB = 98 users
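
For completeness, the same arithmetic in a few lines of Python (working in MB keeps the numbers exact):

# Reproducing the example above (all values in MB).
total_gpu_memory_mb = 40_000
gpu_memory_utilization = 0.9
model_weight_mb, non_torch_mb, activation_peak_mb = 15_000, 400, 1_000
kv_cache_per_batch_mb = 200

available_mb = total_gpu_memory_mb * gpu_memory_utilization - (
    model_weight_mb + non_torch_mb + activation_peak_mb)
print(available_mb / 1_000, "GB available for KV cache")    # 19.6
print(int(available_mb // kv_cache_per_batch_mb), "users")  # 98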

Conclusion

Building an accurate GPU estimator for LLM inference with vLLM involves understanding KV cache allocation, memory profiling, and GPU block management. By leveraging these techniques, you can predict GPU requirements and optimize LLM services to handle maximum concurrent requests efficiently.

Written by Doil Kim

AI Researcher & Developer | Optimizing LLMs & Building Scalable AI Solutions | Exploring MLOps & GenAI | Sharing Insights on AI