
Why you might want to use vLLM for your open source LLM serving

12 min read · Jun 24, 2025


TL;DR

vLLM is a high-performance inference engine that dramatically improves LLM serving in production deployments. Originally developed at UC Berkeley, vLLM delivers 2–4x higher throughput and significantly improved GPU utilization through two breakthrough innovations:

PagedAttention: Eliminates memory fragmentation by dividing KV cache into small, flexible pages instead of large contiguous blocks. This reduces memory waste to less than 4% and allows processing significantly more parallel requests with the same hardware.

Continuous Batching: Processes requests at the token level rather than request level, keeping GPUs fully utilized by immediately replacing finished requests with new ones from the queue.

Real-World Impact: For a Llama 2 7B model on an RTX 4090, vLLM can theoretically handle 2–4x more concurrent requests than traditional serving methods, with sub-second response times even under heavy load. Performance benchmarks show up to 24x higher throughput compared to standard frameworks.

Deployment: Simple setup using official Docker images with OpenAI-compatible API endpoints, making integration straightforward for existing applications.

Best for: Production environments requiring high throughput, low latency, and efficient GPU utilization where performance and scalability are critical.

Note: If you notice any mistakes or have suggestions for improvement, I’d love to hear from you! I’m always happy to learn from others, and I especially appreciate suggestions that include references, so I can explore the topics in more detail.

Since the release of ChatGPT by OpenAI, artificial intelligence has entered the mainstream media and captured global attention. With Meta’s launch of LLaMA, developers and companies could now build tailored applications without relying on third-party APIs. This shift is especially valuable in the EU, where strict data-protection regulations make open-source model deployments essential.

Tools like Ollama have significantly lowered the barrier to entry: you can run LLaMA-style models locally with minimal setup, even without an AI engineering background. Many friends in software development tell me they have started using Ollama not only for internal projects but also in production environments. However, almost none of them have heard of vLLM, despite its growing presence.

This blog post is by no means intended to discredit Ollama, but I want to explain why — in my opinion — vLLM is a better fit for production-grade deployments. It offers significant improvements in throughput, latency, and efficient GPU utilization thanks to innovations like PagedAttention and continuous batching.

What is vLLM?

vLLM is an open-source inference and serving engine designed to make large language models (LLMs) faster, more efficient, and easier to deploy at scale. Originally developed at UC Berkeley’s Sky Computing Lab, it has quickly become a community-driven project that’s widely adopted in both academia and industry [1].

vLLM is ideal for production use because it specifically addresses the key challenges of operating large language models: efficiency, speed, and scalability.

PagedAttention technology allows vLLM to divide KV cache memory into small, flexible pages instead of large, fixed blocks. In this way, the often substantial memory waste of other LLM serving systems is virtually eliminated, and the GPU can process significantly more parallel requests and longer contexts without reaching its limits [2].

vLLM employs continuous batching: new requests are seamlessly integrated into ongoing batches in real time. This keeps the GPU fully utilized, avoiding delays that come with fixed batch sizes. As a result, vLLM delivers sub-second response times even under heavy user load, significantly boosting hardware efficiency [2].

KV Cache and PagedAttention

To understand why vLLM is so efficient, we first need to look at the KV cache (key-value cache), a key technique in Transformer models (the backbone of an LLM). The KV cache speeds up text generation by avoiding redundant calculations at every generation step.

What are K and V (and Q) — Self-Attention in LLMs and KV Caching

I will now go into more detail about the inner workings of a Transformer. Readers are welcome to skip this part if they are already familiar with the topic or simply not interested.

Keep in mind: The following explanations use complete words for clarity, though LLMs actually process tokens (subwords).

The self-attention mechanism is the fundamental building block of modern transformer models and makes it possible to dynamically weight context relationships between words in a sequence [3]. Context relationships are particularly important because words written in the same way can have different meanings. A frequently used example of this is the word “bank” in “river bank” versus “financial bank”.

Self-attention also enables the model to adapt representations based on modifiers or surrounding context. For example, in the phrase “bright red apple,” the word “bright” shifts the network’s focus, enriching the embedding of “apple.” This dynamic weighting ensures each token’s meaning is informed by its linguistic environment.

But how are the context weights calculated?

For every input word (e.g. “bank” in the sentence “I arrived at the bank of the river”), three vectors are calculated:

  • Query (Q): Represents the current word for which context is being determined. (This vector "asks": which other words are relevant for my meaning?)
  • Key (K): Serves as an "identifier" for each word in the sequence. (This vector "offers": what do I have to offer to the other words in the sentence?)
  • Value (V): Contains the actual information of the word.

These are created by a linear transformation of the input embeddings X:

Q = X · W_Q,  K = X · W_K,  V = X · W_V

where W_Q, W_K and W_V are trainable weight matrices.

It is also important to note that before the linear transformation into our three vectors, the words/tokens already have a vector representation: a lookup table (the embedding table) maps each word/token to a vector, and positional information (i.e. where the respective word is located in the sentence) is then added to that vector.
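To make this concrete, here is a minimal NumPy sketch of the embedding lookup plus positional information (toy sizes and random numbers, not a real model configuration):

import numpy as np

vocab_size, hidden_size, seq_len = 1000, 16, 8   # toy sizes, not Llama's real config

# Lookup table: one embedding vector per token in the vocabulary
embedding_table = np.random.randn(vocab_size, hidden_size)

# Positional information: one vector per position in the sequence
positional_encoding = np.random.randn(seq_len, hidden_size)

token_ids = np.array([12, 47, 3, 901, 7, 55, 2, 40])   # e.g. "I arrived at the bank of the river"

# 1) Look up the embedding vector of each token
x = embedding_table[token_ids]                   # shape: (seq_len, hidden_size)

# 2) Add the positional information
x = x + positional_encoding                      # this x is the input to the Q/K/V projections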

For each word pair, we calculate how similar the query vector of the current word (e.g. “bank”) is to the key vectors of the other words (e.g. “river”). If the similarity is high, this word (“river”) is given a lot of attention when forming the meaning of “bank”. This similarity is calculated by multiplying the query and key vector (dot product).

The result is often called “attention scores”.

You may now ask yourself why two completely different concepts such as “bank” and “river” should be similar and therefore result in a high attention score. The LLM learns this during training. If words occur more frequently in context with each other during training, the LLM learns to weight them more strongly.

The model scales and normalizes attention scores before applying the value vector.

Scaling is done because when calculating the attention scores, the variance of the values increases with the dimension of the vectors [4]. The variance increases because these scores are the sum of many products of random variables (the individual elements of the vectors). With higher dimensions (i.e. longer vectors), more of these products add up.

The normalization (a softmax over each row of scores) turns the raw attention scores into a probability distribution: the weights for each word sum to 1 and can be interpreted as how much attention that word pays to every other word.

So our formula for the attention weights (the scaled and normalized attention scores) now looks like this:

attention weights = softmax(Q · Kᵀ / √d_k)

where d_k is the dimension of the key vectors.

Now only our value vector V remains. Each word's value vector holds that word's actual information. We multiply every value vector by its corresponding attention weight and sum the results. If "river" had a high attention weight for "bank", this weighted sum shifts the meaning of "bank" toward "river bank" in the resulting vector.
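Putting the pieces together, here is a minimal single-head self-attention sketch in NumPy. It is an illustration only: the dimensions are tiny and the weight matrices are random instead of trained.

import numpy as np

def softmax(scores):
    # Normalize each row of scores into attention weights that sum to 1
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len, hidden_size, d_k = 8, 16, 16            # toy sizes

x = np.random.randn(seq_len, hidden_size)        # token embeddings + positional information

# Trainable projection matrices (random here, learned during training in a real model)
W_q = np.random.randn(hidden_size, d_k)
W_k = np.random.randn(hidden_size, d_k)
W_v = np.random.randn(hidden_size, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Attention scores: dot product between every query and every key, scaled by sqrt(d_k)
scores = (Q @ K.T) / np.sqrt(d_k)

# Softmax turns the scores into weights, which are then applied to the value vectors
weights = softmax(scores)                        # shape: (seq_len, seq_len)
output = weights @ V                             # context-enriched vector for every token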

KV Caching

Text generation with large language models is an iterative process. The LLM generates one new token at a time, based on all previously generated tokens. (At the beginning, the input sequence consists only of the user's prompt.) Once the first token has been generated, it is appended to the input sequence. For the next token, the LLM receives the entire previous sequence as input, i.e. the prompt plus all previous outputs. Text generation stops as soon as the model predicts a special EOS (End-Of-Sequence) token.

Mathematically, the probability of a sequence (x1, x2, …, xt) is represented as the product of the conditional probabilities of each token:

P(x1, x2, …, xt) = P(x1) · P(x2 | x1) · … · P(xt | x1, …, xt−1)

[Improving Language Understanding by Generative Pre-Training, Radford et al., 2018]

Without optimization, the model would recalculate the key and value vectors required for the self-attention mechanism at each generation step for all tokens in the input sequence. This is extremely inefficient, as most of these values have already been calculated in previous steps and won’t change again [5].

The KV cache solves this problem. Instead of recalculating the key and value vectors for all previous tokens each time, we save these values in the so-called KV cache after their first calculation. At each new step, only the key and value vectors for the newly generated token are calculated and added to the cache. The values already stored in the cache are reused for the attention calculation.
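As a toy sketch of this idea (single attention head, random vectors instead of a real model, and not vLLM's actual code), a decode loop with a KV cache looks roughly like this:

import numpy as np

hidden_size = d_k = 16
W_q = np.random.randn(hidden_size, d_k)
W_k = np.random.randn(hidden_size, d_k)
W_v = np.random.randn(hidden_size, d_k)

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

k_cache, v_cache = [], []                        # the KV cache: one entry per processed token

def decode_step(x_new):
    # Compute K and V for the NEW token exactly once and store them ...
    k_cache.append(x_new @ W_k)
    v_cache.append(x_new @ W_v)
    # ... then attend over all cached keys/values; nothing is recalculated
    q = x_new @ W_q
    K, V = np.stack(k_cache), np.stack(v_cache)
    weights = softmax((K @ q) / np.sqrt(d_k))
    return weights @ V                           # context vector used to predict the next token

# Prompt tokens first ("prefill"), then generated tokens one at a time ("decode")
for _ in range(10):
    decode_step(np.random.randn(hidden_size))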

A key feature of vLLM is the way in which the KV cache is managed during text generation. In other existing LLM serving systems, a large, contiguous memory block is reserved on the GPU for each request in order to store the KV values of all tokens of the respective sequence. Variable sequence lengths create large gaps of unused memory. This fragmentation means that a considerable part of the GPU memory remains unused [6].

PagedAttention

vLLM solves this problem with the PagedAttention approach. Instead of reserving a fixed, large memory block for each request, the KV cache is divided into many small, equally sized blocks (“pages”). These blocks can be stored flexibly and independently of each other in the GPU memory. A central block table manages which logical block belongs to which physical memory area [7].
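The bookkeeping idea can be sketched like this (a simplified illustration, not vLLM's actual implementation): a pool of fixed-size physical blocks plus a per-request block table that maps logical block numbers to physical ones.

BLOCK_SIZE = 16                                  # tokens per KV-cache block ("page")

class BlockManager:
    # Toy illustration of PagedAttention-style memory bookkeeping

    def __init__(self, num_physical_blocks):
        self.free_blocks = list(range(num_physical_blocks))   # pool of physical blocks
        self.block_tables = {}                   # request id -> list of physical block ids

    def append_token(self, request_id, token_index):
        table = self.block_tables.setdefault(request_id, [])
        if token_index % BLOCK_SIZE == 0:        # current block is full, allocate a new one
            table.append(self.free_blocks.pop())
        # logical block no. = token_index // BLOCK_SIZE, physical block = table[logical block no.]

    def free(self, request_id):
        # A finished request returns its blocks immediately; no large contiguous
        # reservation is ever held, so other requests can reuse the memory right away
        self.free_blocks.extend(self.block_tables.pop(request_id))

manager = BlockManager(num_physical_blocks=1024)
for i in range(40):                              # a 40-token sequence needs 3 blocks of 16 tokens
    manager.append_token("request-1", i)
print(manager.block_tables["request-1"])         # three block ids, not necessarily contiguous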

The result: PagedAttention utilizes the GPU memory almost completely and reduces memory waste to less than 4%. This means that significantly more requests can be processed simultaneously and longer contexts can be served with the same latency and higher throughput [8].

Continuous Batching

When large language models are used in production, batching — combining multiple requests for simultaneous processing — is crucial for GPU utilization and throughput. Without batching, each request would be processed individually, leaving most of the GPU capacity unused [9]. There are three basic batching approaches that differ in their efficiency.

There is a great blog post about this on baseten.co.

Static Batching

Static batching gathers incoming requests until a predefined batch size is reached and then processes all requests at once [9].

Dynamic batching

Dynamic batching improves latency by introducing windowing: the system starts processing either when the batch is full or after a maximum period of time [9].

Continuous Batching

Continuous batching solves the core problem of LLM serving, namely that responses have very different output lengths, by moving to token-based processing. Instead of batching at the request level, the system works at the token level: in each iteration, the model weights are loaded once and applied to the next token of every active request [9].

When a request finishes, a new request from the queue immediately takes its place. This means that the same model weights can simultaneously process the fifth token of one response and the eighty-fifth token of another. The system remains continuously busy without having to wait for the longest request [9].
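The scheduling idea can be sketched as a small simulation (toy numbers, not vLLM's actual scheduler): every iteration produces one token for each active request, and a finished request is replaced immediately by the next one from the queue.

from collections import deque

MAX_BATCH_SIZE = 4

# Each toy request only differs in how many tokens its answer will need
waiting = deque({"id": i, "tokens_left": n} for i, n in enumerate([5, 85, 20, 3, 40, 7]))
active = []

step = 0
while waiting or active:
    # Fill free batch slots immediately instead of waiting for the whole batch to finish
    while waiting and len(active) < MAX_BATCH_SIZE:
        active.append(waiting.popleft())

    # One forward pass: generate the NEXT token for every active request
    for request in active:
        request["tokens_left"] -= 1

    # Requests that just produced their last token leave the batch right away
    finished = [r["id"] for r in active if r["tokens_left"] == 0]
    active = [r for r in active if r["tokens_left"] > 0]

    step += 1
    if finished:
        print(f"step {step}: requests {finished} finished, their slots are refilled next step")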

Serving a Llama 2 7B with vLLM

Now, as a theoretical example, we calculate how many requests Llama 2 7B with FP16 (Floating Point 16) precision can process simultaneously if served via vLLM. In the example, we calculate this for the Nvidia RTX 4090 (24 GB of VRAM) and Nvidia A6000 (48 GB of VRAM).

First, we need to calculate the raw memory the LLM needs in FP16. Since we load the model in FP16, each model parameter requires 16 bits, i.e. 2 bytes: 7 billion parameters × 2 bytes ≈ 14 GB. The model requires additional memory for overhead (activations, temporary buffers, system processes), which we estimate at 2 GB, for roughly 16 GB in total.

*These figures are estimates and may not be exact.

This means that 32 GB of VRAM remain free for the KV cache on the Nvidia A6000 and 8 GB of VRAM on the Nvidia RTX 4090.

Next, we need to determine how much memory each self-attention head requires to store the key and value vectors. For this, we need the so-called head dim, which is also the dimension of each key and value vector; each component of these vectors occupies 2 bytes (FP16). Typically, you can find the head dim directly in the model's config.json file.

Alternatively, you can derive the head dim from two other configuration values: the hidden size and the number of attention heads, both of which are also listed in config.json. The hidden size is the dimension of the model's internal vector representation after embedding the individual tokens (using the lookup table and positional information), but before transforming them into the K, V, and Q vectors needed for attention. Dividing the hidden size by the number of attention heads gives you the head dimension of each attention head (the hidden size is divided evenly across the attention heads).

This value is crucial for estimating the memory footprint of the KV cache and for understanding how many tokens and parallel requests our GPU is able to handle.

For Llama 2 7B we have a hidden size of 4096 and 32 attention heads, which results in a head dim of 4096 / 32 = 128.

Since each component uses 2 bytes, we multiply the head dimension by 2 bytes for both the Key and the Value vector: 128 * 2 bytes + 128 * 2 bytes = 512 bytes per KV pair and attention head.

Next, we calculate the memory a single token needs in one layer: the 512 bytes per attention head times the number of attention heads, i.e. 512 bytes * 32 heads = 16,384 bytes per token and layer.

Finally, we need to multiply this by the number of layers. For Llama 2 7B, the number of layers is 32, so we multiply 16,384 bytes by 32 layers, which results in 524,288 bytes (approx. 0.5 MB) of KV cache per token.

Now we can finally determine how many requests we can theoretically process in parallel. To do this, we divide the free memory by the sequence length times the KV cache memory per token. For example, at a sequence length of 4,096 tokens (Llama 2's maximum context length), the RTX 4090 has room for roughly 8 GB / (4,096 × 0.5 MB) ≈ 4 requests at a time, and the A6000 for roughly 32 GB / (4,096 × 0.5 MB) ≈ 16.
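If you want to play with the numbers, the whole back-of-the-envelope calculation fits in a few lines of Python. The free-VRAM figures are the rough estimates from above, and the sequence lengths are assumptions; real deployments will differ.

GiB = 1024**3

# Free VRAM after loading Llama 2 7B in FP16 (rough estimates from above)
free_vram = {"RTX 4090": 8 * GiB, "A6000": 32 * GiB}

# Per-token KV cache: (K + V) * head_dim * 2 bytes (FP16) * attention heads * layers
head_dim, n_heads, n_layers = 128, 32, 32
kv_per_token = 2 * head_dim * 2 * n_heads * n_layers      # 524,288 bytes per token

for gpu, free in free_vram.items():
    for seq_len in (2048, 4096):                          # assumed sequence lengths
        parallel = free // (seq_len * kv_per_token)
        print(f"{gpu}: ~{parallel} parallel requests at {seq_len} tokens per request")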

Note: With vLLM’s memory optimizations — especially PagedAttention and continuous batching — the actual number of parallel requests can be 2–4× higher than shown here, as vLLM reduces memory waste to near zero and enables more efficient use of GPU resources [7].

Another way to make even better use of the GPU is to quantize the model to an even lower precision.

But to finally deploy our open source LLM with vLLM, we can simply use the official Docker image from Docker Hub. Yes, it really is that simple!

  1. Make sure you have access to the Llama weights on Hugging Face.
  2. Pull the latest vLLM OpenAI-compatible Docker image:
docker pull vllm/vllm-openai:latest

  3. Run the container:

docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN>" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model meta-llama/Llama-2-7b-chat-hf
  • --runtime nvidia --gpus all: Enables GPU acceleration.
  • -v ~/.cache/huggingface:/root/.cache/huggingface: Mounts your Hugging Face cache for efficient model loading.
  • --env "HUGGING_FACE_HUB_TOKEN=...": Passes your Hugging Face token into the container.
  • -p 8000:8000: Exposes the API on port 8000.
  • --ipc=host: Shares memory with the host for better performance.
  • --model meta-llama/Llama-2-7b-chat-hf: Specifies the Llama 2 7B chat model.

There are even more parameters you could use to optimize your deployment. It would also be wise to download the model beforehand using the Hugging Face Hub CLI and mount it as a volume.

# Download the model
huggingface-cli download meta-llama/Llama-2-7b-chat-hf

# Mount the Hugging Face cache (with the downloaded model) into the container
docker run -v ~/.cache/huggingface:/root/.cache/huggingface ...
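Once the container is running, you can talk to it like any OpenAI-compatible endpoint, for example with the official openai Python client (assuming the server runs locally on port 8000 as in the command above):

from openai import OpenAI

# vLLM's server does not require an API key unless you configure one, so a placeholder works
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)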


Written by Christopher Keibel

AI Engineer | Student (M.Sc. Data Science)
