How does vLLM optimize the LLM serving system?

Natthanan Bhukan
CJ Express Tech (TILDI)
10 min read · Apr 24, 2024

Hi, my name is Tae, and I’m a Machine Learning Engineer (MLE) at CJ Express Tech (TILDI). In this article, I would like to share ideas for serving LLMs. Because serving an LLM is not the same as deploying other types of machine learning models, I will walk through the challenges, the techniques used to address them, and a demonstration of how vLLM solves these issues.

Table of contents

  • Why is serving LLM so challenging?
  • What is vLLM?
  • What is vLLM’s secret sauce?
  • Demo

Why is serving LLM so challenging?

Computational Resources

Because an LLM uses an enormous number of parameters to make a prediction, starting at around 7B and going up to 321B, deploying such a model requires intensive resources and far more optimization than the traditional methods used to deploy a machine learning model.

Latency

When a sentence or prompt is complicated, the model can take several minutes to compute a result for the client, which becomes a problem at scale or in real-world business. For instance, a company may use an LLM in a product Q&A chatbot; if it responds slowly to each question, that frustrates users. Therefore, applying techniques to reduce latency is good practice.

Cost

In a large-scale system, or one with multiple LLMs, the application can consume a large budget because LLMs need substantial resources to run. As an MLE, finding ways to better utilize those resources, for instance by lowering the cost per request, brings a financial benefit to the system.

What is vLLM?

This project comes from students at UC Berkeley who are passionate about optimizing LLM serving performance. Many systems spend a lot of resources on serving LLMs yet still get poor response times when deploying them the simple way. The vLLM team proposes a new method, inspired by the OS’s virtual memory design, that improves LLM serving performance by around 24 times while using about half the GPU memory of the traditional method. To integrate it into your system, vLLM provides a simple Python interface that lets machine learning engineers (MLEs) use it without fancy packages or dependencies.

What is vLLM’s secret sauce?

To understand how vLLM reduces latency and improves overall system performance, we first need to understand the bottleneck in LLM serving and how vLLM resolves it.

Memory Usage Issue

Fundamentally, a large language model, or LLM, is an attention-based neural network, often referred to as a transformer, with custom decoding depending on the model. So we need to understand a key concept: how an LLM generates a token.

Fig. 1: Attention formula

During the model’s autoregressive process, it computes the attention formula to weigh the previous context and select the next output token. The formula, written out after the definitions below, consists of three variables.

Query (Q): A new token in the decoder step or the last token that the model has seen

Key (K): Previous context that the model should attend to

Value (V): Weighted sum over previous context
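
For reference, this is the standard scaled dot-product attention formula shown in Fig. 1, where d_k is the dimension of the keys:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V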

Fig. 2: How Q, K, and V are computed in the attention model

Meanwhile, the Query is a vector, the dot product between the word embedding and a weight matrix. The Key and Value are matrices, since they need to store the previous context and the weighted sums over that context; together they are referred to as the KV cache.

The main reason the model stores the Key and Value is that the attention algorithm needs to compute the current Query against all previous Keys and Values before generating a new token, and recomputing them at every step would be a massive compute cost. Since the autoregressive process never changes the previous context, the model can cache these values instead of recomputing them.
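
As a rough illustration (my own sketch, not vLLM’s or Hugging Face’s actual code), a minimal greedy decoding loop with a KV cache could look like this; model_step is a hypothetical callable that takes only the newest token plus the cached keys and values and returns the next-token logits along with the new K/V entry.

# Rough sketch of autoregressive decoding with a KV cache (prompt prefill omitted for brevity).
# model_step() is a hypothetical callable: given only the newest token and the cached
# K/V entries of all previous tokens, it returns next-token logits plus the new K/V entry,
# so earlier tokens never have to be recomputed.
def generate(prompt_ids, model_step, max_new_tokens=64):
    kv_cache = []                      # grows by one entry per decoded position
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits, new_kv = model_step(token_ids[-1], kv_cache)
        kv_cache.append(new_kv)        # cache instead of recomputing old K/V
        next_token = max(range(len(logits)), key=lambda i: logits[i])  # greedy choice
        token_ids.append(next_token)
    return token_ids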

So how does this affect the model’s performance? It limits how many requests the system can serve and how long the generated sequences can be, because long texts require a huge KV cache. According to Efficient Memory Management for Large Language Model Serving with PagedAttention, around 30 percent of GPU memory is occupied by the KV cache. This leads to another issue concerning memory usage.
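
To get a feeling for the scale, here is a back-of-the-envelope calculation (my own, with assumed dimensions for a 13B-class model: 40 layers, hidden size 5120, FP16 values):

# Rough per-token KV cache size for a 13B-class model (assumed dimensions).
num_layers = 40       # assumption: transformer layers
hidden_size = 5120    # assumption: model hidden dimension
bytes_fp16 = 2        # 2 bytes per FP16 value
kv_per_token = 2 * num_layers * hidden_size * bytes_fp16   # 2 = one K and one V
print(f"{kv_per_token / 1024:.0f} KB per token")                              # ~800 KB
print(f"{kv_per_token * 2048 / 1024**3:.2f} GB for a 2048-token sequence")    # ~1.56 GB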

To get more ideas about how attention works or how the KV cache is managed, I recommend watching these videos.

Memory Fragmentation

Normally, the KV cache of a request is stored in contiguous memory, since most deep learning frameworks require tensor values to live in contiguous memory. However, the KV cache grows and shrinks over time as tokens are generated, and its final size is not known in advance. This storage format therefore causes problems and is not efficient for computation.

Fig. 3. Internal Fragmentation (https://www.geeksforgeeks.org/difference-between-internal-and-external-fragmentation/)
Fig. 4. External Fragmentation (https://www.geeksforgeeks.org/difference-between-internal-and-external-fragmentation/)
Fig. 5. KV cache memory management in existing systems (PageAttention paper)

Memory fragmentation isn’t a new issue. We could experience this problem in OS memory management. This issue relates to how OS allocates memory, which could be separated into two kinds.

  1. Internal Fragmentation: For instance, Process A requests 32 MB of memory but only uses 28 MB; the remaining 4 MB cannot be used by any other process because it is reserved for Process A.
  2. External Fragmentation: When many processes request memory blocks of varying sizes, it leaves holes, gaps of free memory between allocations that are too small or scattered to be useful.

To map this onto the NLP problem, think of a request as a process and the KV cache as the bytes it needs to store in GPU memory. Fig. 5 shows how both internal and external fragmentation appear in existing serving systems.

Overall, memory fragmentation causes underutilization of the system, because the fragmented memory cannot be used. As a result, the system cannot handle larger models or larger numbers of requests.
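
As a toy illustration (my own, not from the paper), pre-allocating each request’s KV cache slots up to a fixed maximum length wastes whatever the request does not use, which is exactly the internal fragmentation described above:

# Toy illustration of internal fragmentation from fixed-size pre-allocation.
MAX_SEQ_LEN = 2048                                       # assumption: slots reserved per request
requests = {"req_a": 150, "req_b": 900, "req_c": 40}     # actual tokens generated per request

reserved = len(requests) * MAX_SEQ_LEN
used = sum(requests.values())
print(f"reserved slots: {reserved}, used slots: {used}")
print(f"wasted: {100 * (reserved - used) / reserved:.1f}% of the reserved KV cache")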

How to overcome this issue

The vLLM team developed a new attention algorithm called PagedAttention, inspired by the OS’s virtual memory concept: think of blocks as pages, tokens as bytes, and requests as processes.

Fig. 6. Generation process for a request with PagedAttention. (https://blog.vllm.ai/2023/06/20/vllm.html)

A logical KV cache block (virtual memory) stores each token (byte) of the prompt, and a block table (page table) maps it to a physical KV cache block (physical memory). Consequently, physical GPU memory is allocated only when the system actually needs it. In theory, this eliminates internal and external fragmentation, since tokens are stored in fixed-size cache blocks that are allocated on demand, which means no more pre-allocation.
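
A minimal sketch of the idea (my own simplification, not vLLM’s actual data structures): each sequence keeps a block table that maps its logical block index to a physical block allocated on demand from a shared pool.

# Minimal sketch of a block table mapping logical KV cache blocks to physical ones.
BLOCK_SIZE = 16                          # assumption: tokens stored per KV cache block

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks   # shared pool of physical block ids
        self.logical_to_physical = []    # index = logical block, value = physical block id

    def append_token(self, num_tokens_so_far):
        # Allocate a new physical block only when the current one is full.
        if num_tokens_so_far % BLOCK_SIZE == 0:
            self.logical_to_physical.append(self.free_blocks.pop())

free_pool = list(range(1000))            # pretend the GPU holds 1000 physical blocks
table = BlockTable(free_pool)
for i in range(40):                      # a 40-token sequence only needs 3 blocks
    table.append_token(i)
print(table.logical_to_physical)         # e.g. [999, 998, 997]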

Other benefits of using PagedAttention

PagedAttention also allows the system to share the KV cache, since tokens are stored in non-contiguous memory. This enables several useful features for serving an LLM.

Fig. 7. Generation process for a request that samples multiple tokens. (https://blog.vllm.ai/2023/06/20/vllm.html)
  1. Copy-on-Write mechanism: Just like the OS’s virtual memory, PagedAttention can map logical KV cache blocks from different sequences of the same request to the same physical block, and only copy that block when one of them needs to write to it. Fig. 7 illustrates how this works, and a minimal sketch follows this list.
  2. Beam search: In NLP tasks, beam search is a common way to select a suitable output sequence. In the traditional approach, each beam candidate keeps its own KV cache without any sharing. With vLLM, beam candidates can share KV cache blocks, which increases utilization.
  3. Swapping: Like OS memory swapping, vLLM transfers evicted blocks to CPU memory.
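
Here is a rough sketch (my own, not vLLM’s implementation) of the copy-on-write idea with reference counting: two sequences share the same physical block until one of them writes, at which point the block is copied and the reference count drops.

# Rough sketch of copy-on-write KV cache blocks with reference counting.
class PhysicalBlock:
    def __init__(self, block_id):
        self.block_id = block_id
        self.ref_count = 0
        self.tokens = []

def write_token(seq_table, logical_idx, token, allocate_block):
    block = seq_table[logical_idx]
    if block.ref_count > 1:
        # Block is shared with another sequence: copy it before writing.
        block.ref_count -= 1
        new_block = allocate_block()
        new_block.tokens = list(block.tokens)
        new_block.ref_count = 1
        seq_table[logical_idx] = new_block
        block = new_block
    block.tokens.append(token)

# Two beam candidates initially share the prompt's block.
shared = PhysicalBlock(0)
shared.ref_count = 2
beam_a, beam_b = [shared], [shared]
next_id = iter(range(1, 100))
write_token(beam_a, 0, "token_x", lambda: PhysicalBlock(next(next_id)))
print(beam_a[0].block_id, beam_b[0].block_id)   # 1 0 -> beam_a got its own copy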

Demo

This section demonstrates how to serve an LLM via vLLM. The test VM has an A100 40 GB GPU, and the model is Llama-2-13b-chat-hf.

Comparing vLLM and Hugging Face

To compare memory usage between vLLM and Hugging Face, this example sends a single request and monitors GPU usage.

Fig. 8. GPU usage when running inference on an LLM via Hugging Face
Fig. 9. GPU usage when loading an LLM via vLLM (left); GPU usage when running inference on an LLM via vLLM (right)

From Figs. 8 and 9, the results look similar, but there is a difference when running inference with Hugging Face, which raises the following warning.

WARNING:root:Some parameters are on the meta device device because they were offloaded to the cpu.

This warning indicates that the VM is running out of GPU memory and needs more resources, so some parameters are offloaded to CPU memory. That is why Hugging Face uses more CPU memory than vLLM: it has to allocate additional memory there, while some GPU memory sits reserved but underused.
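
If you want to check GPU memory while reproducing this comparison, a small helper like the following (my own, using PyTorch’s torch.cuda.mem_get_info) prints the current device usage:

import torch

def print_gpu_memory(tag=""):
    # Reports overall device memory, similar to what nvidia-smi shows.
    free, total = torch.cuda.mem_get_info()
    used_gb = (total - free) / 1024**3
    print(f"[{tag}] GPU memory used: {used_gb:.2f} GB / {total / 1024**3:.2f} GB")

print_gpu_memory("before loading the model")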

Test code (HF)

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-13b-hf"

# Load the model and let Accelerate place it on the available devices.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, padding_side="left")

prompt = """What is Thailand's national food symbol."""
model_inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate up to 64 new tokens and decode only the generated part.
input_length = model_inputs.input_ids.shape[1]
generated_ids = model.generate(**model_inputs, max_new_tokens=64)

print(tokenizer.batch_decode(generated_ids[:, input_length:], skip_special_tokens=True)[0])

Test code (vLLM)

from vllm import LLM, SamplingParams

prompts = [
    "What is Thailand's national food symbol."
]

# Limit each completion to 64 new tokens.
sampling_params = SamplingParams(max_tokens=64)

# Load the model; vLLM reserves GPU memory for the KV cache up front.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Serving as REST API

vLLM provides a built-in REST API, implemented with FastAPI, that you can start with a single command.

python -m vllm.entrypoints.openai.api_server --model "meta-llama/Llama-2-13b-chat-hf" --dtype float16 --api-key "test"

This endpoint is compatible with the OpenAI API, so if your system already integrates with OpenAI, you don’t need to change your request format or write a new API layer to swap in vLLM for your existing LLM.

Fig. 10. Serving API via vLLM command (left), GPU memory usage (right)

Fig. 10 shows the API starting from the command above, along with the GPU memory usage I mentioned earlier.

Fig. 11: Test a LLM request

Fig. 11 illustrates an inference request using the OpenAI format; the code is shown below.

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="test",
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-2-13b-chat-hf",
    messages=[
        {"role": "user", "content": "What is Thailand's national food symbol."}
    ],
    temperature=0.2,
    max_tokens=64,
    top_p=1,
)

print(completion.choices[0].message.content)

Load testing

This load test is not a rigorous benchmark, but I want to highlight how GPU usage behaves with vLLM, so the imprecise setup shouldn’t matter much.

Fig. 12. Load test with the same prompt at 1 user per second (left) and 10 users per second (right)

Fig. 12 shows only a slight difference in GPU memory usage, because all requests use the same prompt and vLLM applies the copy-on-write mechanism, which lets them share the KV cache.

Fig. 13. Load test with random prompts at 1 user per second (left) and 10 users per second (right)

Next is a load test with prompts selected at random from a list. GPU memory usage changes minimally, and the throughput is quite impressive at 4.68 requests per second.

Fig. 14. Load test with random prompts at 10 users per second and longer outputs

Let’s try longer outputs (256 tokens). GPU memory rises modestly and requests per second drop, which is acceptable since longer content takes more time to generate.

Load test code

from locust import HttpUser, task
import json
import random

questions = [
    "What is Thailand's national food symbol.",
    "How to cook pad-thai",
    "Recommended vacation spots in Thailand",
    "Best mall in Bangkok",
    "How to travel by public transport in Bangkok",
]


class AskWithSameQuestion(HttpUser):
    @task
    def test(self):
        # Every user sends the identical prompt, so vLLM can share KV cache blocks.
        self.client.post(
            "/v1/chat/completions",
            data=json.dumps(
                {
                    "model": "meta-llama/Llama-2-13b-chat-hf",
                    "messages": [
                        {"role": "user", "content": "What is Thailand's national food symbol."}
                    ],
                    "temperature": 0.5,
                    "max_tokens": 64,
                    "top_p": 1,
                }
            ),
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer test",
            },
        )


class AskRandomWithSameQuestion(HttpUser):
    @task
    def test(self):
        # Each request picks a random prompt and asks for a longer completion.
        question = random.choice(questions)
        self.client.post(
            "/v1/chat/completions",
            data=json.dumps(
                {
                    "model": "meta-llama/Llama-2-13b-chat-hf",
                    "messages": [
                        {"role": "user", "content": question}
                    ],
                    "temperature": 0.5,
                    "max_tokens": 256,
                    "top_p": 1,
                }
            ),
            headers={
                "Content-Type": "application/json",
                "Authorization": "Bearer test",
            },
        )

Additional Features

  1. Continuous batching: vLLM has built-in continuous batching, which utilizes memory better and increases tokens per second. You can find more about this in my previous article, Recap DevFest Cloud Bangkok 2023: Leveraging Ray and Vertex AI for LLMOps.
  2. BentoML: If your system is based on BentoML, vLLM integrates with it seamlessly; you can find more detail in this article.
  3. Distributed inference: Running a distributed system? vLLM supports distributed inference with Ray.

Summary

vLLM shows that building something fantastic doesn’t require fancy ideas; it applies simple concepts that computer systems have used for decades. The framework brings tremendous improvements in GPU memory usage and a variety of other benefits through the PagedAttention technique. By reducing KV cache waste, the system can handle larger loads and run inference faster.

Last year, the vLLM team presented their work at Ray Summit 2023; you can get more ideas and inspiration about their work by watching the talk below.

Fast LLM Serving with vLLM and PagedAttention, from Ray Summit 2023

References

  • Efficient Memory Management for Large Language Model Serving with PagedAttention (the PagedAttention paper)
  • vLLM: Easy, Fast, and Cheap LLM Serving with PagedAttention (https://blog.vllm.ai/2023/06/20/vllm.html)
  • Difference between Internal and External Fragmentation (https://www.geeksforgeeks.org/difference-between-internal-and-external-fragmentation/)
