LLM Inference Series: 1. Introduction

Pierre Lienhart
Dec 22, 2023

In this blog post series, I will walk you through the different aspects and challenges of LLM inference. By LLM inference, I mean token generation using decoder-only Transformer models, since most of the challenges and their remediations stem from that particular architecture and use case. Still, many insights from this series also apply to the inference of Transformer encoder models.

I assume you already have basic knowledge of the Transformer architecture and of the scaled dot-product attention (SDPA) mechanism introduced in the famous Attention Is All You Need paper¹. However, you do not need in-depth knowledge of the motivation behind the attention mechanism.

By the end of this series, you will hopefully be able to understand terms often associated with LLM inference like key-value (KV) cache, memory-bandwidth bound, etc., to make sense of the jungle of inference optimizations (quantization, fused kernels, model architecture modifications, etc.) and configurations (batch size, which GPU to use, etc.) and finally, to link them with key performance metrics like latency, throughput and cost.

I hope that you will be able to build an insightful mental model that will enable you to make informed and quick decisions when configuring and optimizing your LLM serving solution. As always with this kind of series, this is the material I wish I had when I first started deploying LLMs to serving endpoints.

Now, let me introduce the outline of this series.

First, one needs to understand that token generation using a Transformer decoder consists of two kinds of steps with very different hardware utilization profiles: a prompt processing step followed by multiple autoregressive steps. I will refer to this distinction throughout the series.
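To fix ideas, here is a minimal, cache-free sketch of greedy decoding in PyTorch. The `model` argument is a hypothetical stand-in for any decoder-only Transformer that maps token ids to next-token logits; it is not a specific library API.

```python
import torch

def generate(model, prompt_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    # Phase 1 - prompt processing ("prefill"): a single forward pass over all
    # prompt tokens at once; lots of parallel work, typically compute-heavy.
    logits = model(prompt_ids)                       # (1, prompt_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
    generated = torch.cat([prompt_ids, next_token], dim=-1)

    # Phase 2 - autoregressive decoding: one new token per forward pass;
    # little parallel work per step, dominated by memory traffic.
    for _ in range(max_new_tokens - 1):
        logits = model(generated)                    # reprocesses the whole sequence (no cache yet)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
    return generated
```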

We will then tackle a first challenge of LLM inference: the compute requirements of the attention mechanism scale quadratically with the total sequence length. We will see how a simple caching strategy called KV caching addresses that problem. Covering the KV cache is unavoidable: when enabled, it is a key input to every autoregressive step. As we will see, KV caching is actually a trade-off and raises its own set of issues, which we will examine, along with their mitigations, in a dedicated post.
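As a rough preview of the idea, here is a minimal sketch of prefill followed by cached decoding using the Hugging Face transformers API; the GPT-2 checkpoint and the prompt are arbitrary choices for the example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids

# Prefill: process the whole prompt once and keep the per-layer keys/values.
with torch.no_grad():
    out = model(input_ids, use_cache=True)
past_key_values = out.past_key_values
next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)

# Decode: feed only the newly generated token; past keys/values are reused,
# so each step attends over the cache instead of recomputing it.
for _ in range(10):
    with torch.no_grad():
        out = model(next_token, past_key_values=past_key_values, use_cache=True)
    past_key_values = out.past_key_values
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
```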

Now that you know everything about the KV cache, we can take a deeper look at how running Transformers for inference (under)utilizes hardware. At this stage, we have to introduce the key concept of arithmetic intensity and a useful mental model called the roofline model, and link them both to key hardware characteristics like peak FLOPS and memory bandwidth and to key performance metrics like latency, throughput and cost. We will then apply this knowledge to Transformer inference and gather key insights on how to better utilize hardware and improve our performance metrics. The knowledge acquired at this stage will enable us to understand the motivation behind all the major performance optimization strategies out there.
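To give a flavor of the back-of-the-envelope reasoning involved, here is a tiny roofline calculation; the hardware numbers are illustrative assumptions, roughly in the range of an NVIDIA A100.

```python
# Illustrative hardware characteristics (assumed, A100-like).
peak_flops = 312e12          # peak FP16 throughput, FLOP/s
mem_bandwidth = 2.0e12       # memory bandwidth, bytes/s

# The "ridge point": the arithmetic intensity (FLOP per byte moved) above
# which a kernel can become compute-bound rather than memory-bandwidth bound.
ridge_point = peak_flops / mem_bandwidth   # ~156 FLOP/byte

# A batch-size-1 decoding step performs roughly 2 FLOP per FP16 weight
# (2 bytes), i.e. an arithmetic intensity of ~1 FLOP/byte: far below the
# ridge point, hence memory-bandwidth bound.
attainable = min(peak_flops, mem_bandwidth * 1.0)
print(f"ridge point: {ridge_point:.0f} FLOP/byte, "
      f"attainable at AI=1: {attainable / 1e12:.1f} TFLOP/s")
```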

Quantization has been one of the hottest optimization strategies of the past year, bringing major performance improvements. While quantization would deserve a whole post series of its own, we will dedicate a single post to giving you strong basics and making clear where quantization algorithms help and where they do not.
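As a small taste of what is to come, here is a minimal sketch of round-to-nearest absmax INT8 weight quantization; real algorithms such as GPTQ or AWQ are considerably more sophisticated.

```python
import torch

def quantize_int8(w: torch.Tensor):
    # One scale per output row (per-channel absmax quantization).
    scale = w.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4, 8)
q, scale = quantize_int8(w)
print((w - dequantize(q, scale)).abs().max())   # worst-case quantization error
```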

Finally, we need to cover how modern LLM serving frameworks work. Optimizing the model itself is unfortunately not enough to get the best performance out of LLM inference. Model servers play a key role in ensuring the best end-to-end performance by efficiently managing incoming requests and hardware resources. I hope this last post will give you useful insights for configuring your future LLM deployments.

Here is the corresponding list of posts (links will be added upon release):

  1. Introduction
  2. The two-step process behind LLMs’ responses
  3. KV caching explained
  4. KV caching: A deeper look
  5. Dissecting model performance
  6. Arithmetic intensity (and memory) is all you need
  7. Shrink all the things! A guided tour of LLM quantization
  8. Why you can’t just serve LLMs using a good old model server?

Now enough with the talking, let’s dive right in!

[1]: Attention Is All You Need (Vaswani et al., 2017)


Pierre Lienhart

GenAI solution architect @AWS - Opinions and errors are my own.