LLM serving challenges

Jaideep Ray · Published in Better ML · Jan 27, 2024

After training, an LLM is used for text generation (code completion, chat, etc.). The model server accepts a prompt, which is a sequence of tokens (x1, x2, …, xn), and produces a response (xn+1, …, xn+m).

How is LLM serving for a generative use case different?

  1. LLMs generate tokens one at a time, conditioned on the input prompt and the output tokens generated so far. This is repeated until the model emits a termination token; each generation step can be considered an iteration.
  2. For an input hidden-state sequence (x1, …, xn), a self-attention layer first applies linear transformations to obtain query, key and value vectors (qi, ki, vi) for each token xi in the context.
  3. The self-attention layer then computes attention weights aij by multiplying the query vector qi at position i with the key vectors of the tokens up to position i. The output at position i is the sum of the value vectors up to position i, weighted by aij. To speed up generation, the key and value vectors for all tokens in the context are kept in GPU memory, known as the KV cache (see the decode-step sketch after this list).
  4. During generation, the model parameters and the key/value matrices (the KV cache) need to be in GPU memory. This makes the inference workload memory-bound, and it is hard to fully utilize the GPU's compute capabilities.
[Equation figures: (1) query, key and value vectors for the token at position i; (2) computing attention and the output.]

  5. Serving is complete when generation reaches the maximum length or the end-of-sequence (EOS) token is emitted.
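To make the KV-cache mechanics concrete, here is a minimal NumPy sketch of a single decode step for one attention head. The single-head, single-layer setup and the names (Wq, Wk, Wv, head_dim) are simplifying assumptions for illustration, not any framework's API.

```python
import numpy as np

def decode_step(x_new, kv_cache, Wq, Wk, Wv):
    """One generation step for a single attention head.

    x_new:    hidden state of the newest token, shape (d_model,)
    kv_cache: dict with "k" and "v" arrays of shape (t, head_dim)
              for the t tokens processed so far; grows by one row per step.
    """
    # 1. Project the new token into query, key and value vectors (q_i, k_i, v_i).
    q = x_new @ Wq
    k = x_new @ Wk
    v = x_new @ Wv

    # 2. Append k and v to the cache so future steps reuse them instead of
    #    recomputing keys and values for the whole prefix.
    kv_cache["k"] = np.vstack([kv_cache["k"], k[None, :]])
    kv_cache["v"] = np.vstack([kv_cache["v"], v[None, :]])

    # 3. Attention weights of the new query against all cached keys (a_ij),
    #    then a weighted sum of the cached values gives the output.
    scores = kv_cache["k"] @ q / np.sqrt(q.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ kv_cache["v"], kv_cache

# The cache starts empty and grows by one row per generated token, e.g.:
# cache = {"k": np.zeros((0, head_dim)), "v": np.zeros((0, head_dim))}
```

A full model repeats this per layer and per attention head, which is why the cache grows so quickly with sequence length and batch size.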

Let’s see how these properties make LLM serving non-trivial. Before going deeper into the LLM serving stack, let’s understand a typical model serving engine.

For LLM serving, both scheduling & optimizing model execution has unique challenges.

  1. Unknown input and output lengths, which make batching harder.
  2. Memory: during inference, all model parameters need to be in memory (typically GPU memory). For LLM generation, the model parameters and the KV cache can easily consume all of it, so large KV caches require effective GPU memory management.
  3. Compute: both the time and memory complexity of self-attention are quadratic, O(n²), in the sequence length n. During generation, the output at position i attends over all i previous positions, an O(i) computation, so producing n tokens costs O(n²) overall.

In the following sections, we look at strategies to mitigate these three challenges:

Scheduling & Batching for unknown input & output length:

  1. Batching increases throughput by exploiting the GPU’s parallel processing capability. An inefficient batching policy introduces queuing delays: either early requests wait for later ones to form a batch, or later requests are delayed until earlier ones finish.
  2. For LLMs, some prompts need many iterations to complete while others finish quickly. With simple request-level batching this causes significant queuing delays. Iteration-level scheduling solves this by removing completed sequences from the batch immediately.
  3. Iteration-level scheduling: the server schedules work per generation step rather than per request, admitting waiting requests and evicting finished ones between iterations (sketched below).
[Figure: iteration-level response generation for an LLM.]
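Below is a rough sketch of iteration-level (continuous) batching. The model.step call, the request objects with a .tokens list, and the constants are hypothetical; the point is that admission and eviction happen between decode iterations rather than at request boundaries.

```python
from collections import deque

EOS_TOKEN = 2          # assumed id of the end-of-sequence token
MAX_LEN = 512          # assumed generation length limit
MAX_BATCH = 8          # assumed batch-size limit

def serve(model, waiting: deque):
    """Run decode iterations until every request has finished."""
    running, finished = [], []
    while running or waiting:
        # Admit waiting requests between iterations, up to the batch limit.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())

        # One decode iteration for the whole batch: returns one new token
        # per running sequence (hypothetical model API).
        new_tokens = model.step([r.tokens for r in running])
        for req, tok in zip(running, new_tokens):
            req.tokens.append(tok)

        # Evict completed sequences immediately so their slots free up
        # for waiting requests at the very next iteration.
        done = [r for r in running
                if r.tokens[-1] == EOS_TOKEN or len(r.tokens) >= MAX_LEN]
        finished.extend(done)
        running = [r for r in running if r not in done]
    return finished
```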

Optimizing compute:

  1. LLM inference benefits from the general optimization techniques applicable to any large model. Quantization (inference in lower precision) and layer fusion are popular techniques that improve memory usage and performance; a lower-precision loading sketch follows.
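As a concrete example, here is a hedged sketch of lower-precision inference with PyTorch and Hugging Face transformers: loading the weights in fp16 roughly halves the parameter memory relative to fp32. The checkpoint name and prompt are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "facebook/opt-1.3b"   # placeholder checkpoint; any causal LM works similarly

tokenizer = AutoTokenizer.from_pretrained(name)
# Load the weights directly in fp16: ~2 bytes per parameter instead of 4.
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Int8 or int4 quantization shrinks the parameter footprint further, at some cost in accuracy and kernel support.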

Optimizing memory management:

After loading a 10B-parameter model in fp16 on an A100 40 GB GPU, the weights occupy roughly 20 GB, and the remaining ~20 GB holds the KV cache for only on the order of 10⁴ tokens. With a moderate sequence length of 2,000, that is roughly 6 requests per batch.
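A back-of-the-envelope sketch of that calculation. The layer count and hidden size are assumed values for a generic ~10B decoder, and the cache is assumed to be stored in fp16; the capacity is quite sensitive to these choices, so rough estimates like the one above can differ by a small factor.

```python
# Assumed config for a generic ~10B decoder-only model (illustrative only).
n_layers = 40
d_model = 5120
bytes_per_elem = 2            # fp16 cache entries

# Per token, each layer stores one key and one value vector of size d_model.
kv_bytes_per_token = 2 * n_layers * d_model * bytes_per_elem   # ~800 KiB here

free_gpu_bytes = 20 * 1024**3      # ~20 GB left on a 40 GB A100 after the weights
max_cached_tokens = free_gpu_bytes // kv_bytes_per_token       # ~26K tokens here

seq_len = 2000
print(f"{kv_bytes_per_token / 1024:.0f} KiB of KV cache per token")
print(f"{max_cached_tokens} tokens fit, i.e. ~{max_cached_tokens // seq_len} "
      f"requests of length {seq_len} per batch")
```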

Next we discuss two attention algorithms, PagedAttention and FlashAttention, which optimize memory management and memory access respectively.

PagedAttention:

  1. The KV cache is per request and per position: each token of each request has its own key and value entries. Since the output length of a request is unknown, GPU memory is reserved up front for its KV cache, which often leads to fragmentation (internal, due to variability in sequence length, and external, due to multiple requests in the batch) and wasted memory.
  2. PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.
  3. Since PagedAttention minimizes fragmentation, it allows the system to batch more sequences together, increase GPU utilization, and thereby significantly increase the throughput.
  4. The use of a block table to access the memory blocks also enables KV cache sharing across multiple generations. In parallel sampling, where multiple outputs are generated simultaneously for the same prompt, cached KV blocks can be shared among the generations. A toy sketch of the block-table bookkeeping follows this list.
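Here is a toy sketch of the block-table idea, not vLLM's actual implementation: the KV cache is carved into fixed-size blocks, each sequence keeps a table of the physical blocks holding its tokens, and blocks are handed out on demand, so no contiguous per-request region has to be reserved up front. BLOCK_SIZE and the bookkeeping structure are illustrative assumptions.

```python
BLOCK_SIZE = 16   # tokens per KV block (assumed value)

class PagedKVCache:
    def __init__(self, num_physical_blocks: int):
        # Free list of physical block ids; the blocks themselves would live in GPU memory.
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of cached tokens

    def append_token(self, seq_id: int):
        """Reserve cache space for one more token of a sequence."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:               # current block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; preempt or swap out a sequence")
            table.append(self.free_blocks.pop())   # any free block, non-contiguous
        self.lengths[seq_id] = length + 1
        # (block id, offset) tells the attention kernel where this token's K/V go.
        return table[-1], length % BLOCK_SIZE

    def free(self, seq_id: int):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because allocation happens one block at a time, at most one partially filled block per sequence is wasted, and the same physical blocks can be referenced from several block tables to share a common prompt prefix.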

FlashAttention:

  1. The standard attention implementation uses high-bandwidth memory (HBM) to store, read and write queries, keys and values. HBM is large (40/80 GB on an A100) but much slower to access than on-chip SRAM, which is small but fast.
  2. In the standard implementation, the cost of moving queries, keys and values between HBM and SRAM dominates: they are loaded from HBM into on-chip SRAM, a single step of the attention computation is performed, the intermediate result is written back to HBM, and this repeats for every step. FlashAttention instead loads queries, keys and values once (in tiles), fuses the operations of the attention mechanism, and writes only the final output back to HBM.
  3. Since transformer inference is memory-bound, FlashAttention reduces memory accesses and thereby increases throughput; a usage sketch follows.
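In practice the fused kernel is rarely written by hand; for instance, PyTorch's torch.nn.functional.scaled_dot_product_attention can dispatch to a FlashAttention-style fused kernel on supported GPUs. The tensor shapes below are arbitrary illustration values.

```python
import torch
import torch.nn.functional as F

# Arbitrary shapes: batch=2, heads=8, sequence length=1024, head_dim=64.
q = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16)

# Fused attention: on supported hardware this avoids materializing the full
# (seq x seq) attention matrix in HBM.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)   # torch.Size([2, 8, 1024, 64])
```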

Conclusion:

  1. LLM serving is different from regular model serving. It is an active area of research focused on three problems: scheduling and batching, attention algorithm optimization, and general techniques for large-model inference optimization (quantization, layer fusion).
  2. We discussed two leading solutions, PagedAttention and FlashAttention, which optimize the memory-intensive attention mechanism that dominates generation.
