LLM Inference Optimizations — Splitwise

Don Moon · Published in Byte-Sized AI · Sep 7, 2024

The rapid rise of generative large language models (LLMs) has brought widespread adoption of power-hungry GPUs, leading to high operational costs. An LLM inference request generally consists of two phases:

  1. Prompt computation (compute-intensive)
  2. Token generation (memory-intensive)

Each phase has distinct demands in terms of latency, throughput, memory, and power. While batching and scheduling techniques [2][3] have made strides in optimizing the prompt computation phase, the token generation phase remains inefficient, underutilizing compute resources. This presents opportunities for optimizing power and cost.
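
As a rough illustration of this asymmetry, the toy arithmetic-intensity model below (my own back-of-envelope sketch, not from the paper) shows why a single decode step moves almost as many bytes as a full prefill pass while doing only a tiny fraction of the FLOPs:

```python
# Back-of-envelope sketch (toy model, not from the paper) of why the prompt
# phase is compute-intensive and the token phase is memory-intensive for a
# decoder-only transformer: roughly 2*N FLOPs per token for an N-parameter
# model, while every decode step re-reads the weights from HBM.

def arithmetic_intensity(n_params: float, prompt_len: int, bytes_per_param: int = 2):
    prefill_flops = 2 * n_params * prompt_len      # all prompt tokens in one batched pass
    prefill_bytes = n_params * bytes_per_param     # weights read once for the whole prompt
    decode_flops = 2 * n_params                    # one token per step
    decode_bytes = n_params * bytes_per_param      # weights (plus KV-cache) streamed every step
    return prefill_flops / prefill_bytes, decode_flops / decode_bytes

prefill_ai, decode_ai = arithmetic_intensity(70e9, prompt_len=1500)
print(prefill_ai, decode_ai)  # prefill intensity is ~prompt_len times higher than decode
```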

LLM inference example [1]

What is new about Splitwise?

Before Splitwise, LLM inference relied on three common batching methods to improve throughput:

  1. Request-level batching: This default method batches requests only at the start, leading to long wait times during token generation and resulting in high time-to-first-token (TTFT) and end-to-end (E2E) latencies.
  2. Continuous batching: Optimizes scheduling before each forward pass, prioritizing the prompt phase to reduce TTFT, but increases token generation time (TBT — Time Between Tokens), which negatively impacts E2E latency.
  3. Mixed batching: Runs the prompt and token phases together, improving token-phase throughput but still causing token phases to experience longer runtimes.

Batching techniques and their latency impact [1]
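
For intuition, here is a minimal sketch of the continuous-batching loop described in item 2; `forward_step` and the request fields are illustrative stand-ins rather than any real engine's API:

```python
# Minimal sketch of continuous (iteration-level) batching. `forward_step` and
# the request fields are assumed for illustration, not a real engine's API.
from collections import deque

def continuous_batching_loop(waiting: deque, forward_step, max_batch_size: int = 32):
    running = []
    while waiting or running:
        # Admit new requests before every iteration; this is what reduces TTFT
        # compared to request-level batching, at the cost of slower decode steps.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        forward_step(running)                               # one prefill/decode iteration
        running = [r for r in running if not r.finished]    # retire completed requests immediately
```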

To solve these inefficiencies, Microsoft introduced Splitwise, a model deployment and scheduling strategy that separates LLM inference into two distinct phases — prompt computation and token generation — across different machines. This approach enables phase-specific hardware optimization, improving both cost efficiency and performance.

Key Observations

Different inference services exhibit varying distributions of prompt input tokens and generated output tokens

The figure below illustrates the token distributions for two types of production traces from Microsoft. Coding LLM services generally process larger input prompts, with a median size of 1,500 tokens, due to the inclusion of substantial user-written code. In contrast, conversation services exhibit greater variability in prompt sizes, with a median of 1,020 tokens, depending on user input. For generated tokens, coding services produce significantly fewer, with a median of 13 tokens, whereas conversation services generate a much higher number, with a median of 129 tokens.

Distribution for prompt and generated tokens for two production traces: Coding, Conversation (Chatbot) [1]

Mixed continuous batching often operates with very few active tokens batched

The figure below shows the time distribution for machines processing varying numbers of active tokens within a mixed batch. During the prompt phase, active tokens are measured based on the prompt size (e.g., 100 tokens), but in the token generation phase, each request counts as a single active token since tokens are generated one at a time. For the conversation service, 60–70% of the time is spent running with 20 or fewer tokens. In the coding service, which generates fewer tokens, over 20% of the time is spent running with just a single token.
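
The counting rule can be made concrete with a short sketch (the field names are assumed for illustration):

```python
# Sketch of how "active tokens" would be counted in a mixed batch: a request
# in the prompt phase contributes its full prompt length, while a request in
# the token-generation phase contributes exactly one token per iteration.

def active_tokens(batch) -> int:
    total = 0
    for req in batch:
        if req.phase == "prompt":
            total += req.prompt_len   # e.g. 100 active tokens for a 100-token prompt
        else:                         # token-generation (decode) phase
            total += 1                # one new token per request per iteration
    return total
```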

Time distribution for machines processing varying numbers of active tokens within a mixed batch [1]

Increasing the batch size during the token generation phase leads to higher token generation throughput.

The batch size during the prompt phase should be limited to ensure optimal performance, whereas batching during the token generation phase can substantially boost throughput without compromising performance.
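
A simple roofline-style model (toy numbers, not from the paper) illustrates why: the per-step cost of decoding is dominated by streaming the weights, which is shared across every request in the batch:

```python
# Rough roofline-style sketch (assumed toy numbers, not from the paper) of why
# decode throughput scales with batch size until a compute or memory limit is hit.

def decode_tokens_per_sec(batch_size: int,
                          weight_bytes: float = 140e9,      # ~70B params at fp16
                          hbm_bw: float = 2.0e12,           # ~2 TB/s-class GPU
                          peak_flops: float = 1.0e15,
                          flops_per_token: float = 2 * 70e9):
    step_time = max(weight_bytes / hbm_bw,                          # memory-bound floor (shared by the batch)
                    batch_size * flops_per_token / peak_flops)      # compute-bound ceiling
    return batch_size / step_time                                   # tokens produced per second

for bs in (1, 8, 32, 64):
    print(bs, f"{decode_tokens_per_sec(bs):,.0f} tok/s")
# Throughput grows roughly linearly with batch size until the compute (or
# KV-cache memory) limit is reached, which is why token machines batch aggressively.
```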

Impact of batching on the throughput for the 2 LLMs [1]

For memory-intensive workloads, A100 can be more cost-effective than H100

While the industry continues to push out more powerful GPUs, these advancements come with higher costs and increased power consumption. As shown in the table below, compute performance has improved significantly, but memory bandwidth and capacity have lagged behind. For instance, NVIDIA’s latest H100 GPUs deliver 3.43× more compute power and consume 1.75× more power compared to the A100. However, the memory bandwidth has only improved by 1.64×, with no increase in memory capacity. This makes the H100 ideal for compute-intensive tasks, whereas the A100 remains better suited for memory-bound workloads.
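
Plugging the ratios quoted above into a quick performance-per-watt calculation makes the point concrete:

```python
# Quick arithmetic on the H100-vs-A100 ratios cited above: relative
# performance-per-watt for compute-bound vs. memory-bound work.
compute_ratio = 3.43   # H100 / A100 compute
bw_ratio      = 1.64   # H100 / A100 memory bandwidth
power_ratio   = 1.75   # H100 / A100 power

print("compute-bound perf/W:", round(compute_ratio / power_ratio, 2))  # ~1.96x in favor of H100
print("memory-bound  perf/W:", round(bw_ratio / power_ratio, 2))       # ~0.94x, i.e. A100 is on par or better
```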

For cost-efficiency, H100 is best suited for compute-intensive workloads while A100 is better for memory-bound workloads [1]

The table below demonstrates that token generation (TBT) experiences a lower performance impact than the prompt phase (TTFT) when using the A100 GPU instead of the H100. In fact, the A100 generally delivers cost and energy efficiency comparable to, or better than, the H100 for inference tasks.

A100 vs. H100 for a single inference request on Llama-70B [1]

Power cap can be reduced during the token generation phase without sacrificing performance

While the prompt phase efficiently utilizes the GPU’s power budget, the token generation phase does not, making it suitable for less compute-intensive hardware to achieve better performance-per-watt (Perf/W) and performance-per-dollar (Perf/$) efficiencies. As shown in the figure below, the prompt phase is more sensitive to the GPU’s power cap, whereas the token phase is less affected.
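
In practice, such a cap can be applied with NVIDIA’s standard power-limit control; the snippet below is an illustrative sketch (it requires administrator privileges), not part of Splitwise itself:

```python
# Illustrative sketch of capping GPU power on a token-generation machine with
# nvidia-smi's power-limit option. The 50% figure mirrors the per-GPU cap used
# by the Splitwise-HHcap configuration discussed later in this article.
import subprocess

def cap_gpu_power(gpu_index: int, watts: int) -> None:
    # `nvidia-smi -pl` sets the software power limit (in watts) for the given GPU.
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)],
        check=True,
    )

# e.g. cap each H100 in the token pool to roughly half of a ~700 W rating:
# cap_gpu_power(0, 350)
```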

How does Splitwise work?

Splitwise optimizes LLM inference by splitting the prompt and token generation phases across different machines. It utilizes two primary machine pools — one dedicated to prompt processing and another for token generation — along with a mixed pool that adapts based on workload demands. All machines are preloaded with the model.

When a new inference request arrives, the scheduler assigns it to two machines: one for the prompt phase and one for the token phase. The prompt machine processes the input, generates the first token, and creates a KV-cache, which is transferred to the token machine to complete the token generation. Continuous batching on the token machines maximizes efficiency, while mixed machines apply mixed continuous batching.
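
The end-to-end flow can be summarized in a short sketch; the machine objects and method names below are illustrative, not the paper’s implementation:

```python
# High-level sketch of the Splitwise request flow described above. The machine
# objects and method names are illustrative stand-ins for the real system.

def handle_request(request, prompt_machine, token_machine):
    # 1) Prompt phase: process the full prompt, producing the first token and
    #    the KV-cache on the prompt machine.
    first_token, kv_cache = prompt_machine.prefill(request.prompt)

    # 2) Transfer the KV-cache to the token machine (in Splitwise this is
    #    overlapped with prompt computation, layer by layer; shown serially here).
    token_machine.receive_kv_cache(request.id, kv_cache)

    # 3) Token phase: continue autoregressive generation on the token machine,
    #    which uses continuous batching across many such requests.
    return [first_token] + token_machine.decode(request.id, max_new_tokens=request.max_tokens)
```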

Splitwise System Architecture [1]

Splitwise — “Scheduling System”

Splitwise employs a two-level scheduling system: 1) Cluster-Level Scheduler and 2) Machine-Level Scheduler.

  1. Cluster-Level Scheduler (CLS)

The CLS manages machine pools and request routing. It oversees the prompt, token, and mixed machine pools, initially allocating resources based on anticipated request loads and token distributions. Machines can dynamically switch between pools to reduce fragmentation and meet service-level objectives (SLOs) during high-demand periods.

CLS uses the Join the Shortest Queue (JSQ) [4] strategy to simultaneously assign prompt and token machines and allows KV-cache transfers to overlap with prompt computation and minimize overhead. If queue lengths exceed a threshold, CLS taps into the mixed pool for additional resources and dynamically reallocates machines between pools. Mixed machines operate with mixed batching and return to their original pools once queues are cleared.
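
A minimal sketch of this routing logic, assuming a `pending_tokens` queue-length metric purely for illustration:

```python
# Minimal sketch of Join-the-Shortest-Queue (JSQ) routing as used by the CLS.
# The `pending_tokens` attribute and threshold handling are assumptions for illustration.

def pick_machine(pool):
    return min(pool, key=lambda m: m.pending_tokens)

def schedule(request, prompt_pool, token_pool, mixed_pool, queue_threshold):
    prompt_m = pick_machine(prompt_pool)
    token_m = pick_machine(token_pool)
    # Spill over to the mixed pool when queues grow beyond the SLO-driven threshold.
    if mixed_pool and prompt_m.pending_tokens > queue_threshold:
        prompt_m = pick_machine(mixed_pool)
    if mixed_pool and token_m.pending_tokens > queue_threshold:
        token_m = pick_machine(mixed_pool)
    return prompt_m, token_m
```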

  2. Machine-Level Scheduler (MLS)

The MLS, running on each server, manages GPU memory, oversees pending queues, determines batch sizes, and communicates with the CLS.

  • Prompt machines: MLS uses a first-come, first-served (FCFS) scheduling method, limiting batch sizes to 2,048 tokens to maintain throughput (see the sketch after this list).
  • Token machines: MLS also follows the FCFS method, batching tokens up to memory capacity. Throughput increases until memory is fully utilized, after which tokens are queued.
  • Mixed machines: To meet time-to-first-token (TTFT) objectives, MLS prioritizes prompt phases over token phases and may preempt token generation if necessary. To prevent token starvation, token priority increases over time, and preemptions are limited.
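
As referenced in the prompt-machines item above, here is a minimal sketch of FCFS batch building under the 2,048-token cap (the queue and request structures are assumed):

```python
# Sketch of FCFS batch building on a prompt machine under the 2,048-token cap
# mentioned above. Queue and request structures are assumed for illustration.
from collections import deque

def build_prompt_batch(queue: deque, max_batch_tokens: int = 2048):
    batch, tokens = [], 0
    while queue and tokens + queue[0].prompt_len <= max_batch_tokens:
        req = queue.popleft()          # strictly first-come, first-served
        batch.append(req)
        tokens += req.prompt_len
    return batch                       # remaining requests wait for the next pass
```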

Splitwise — “KV-Cache Transfer Optimization”

In Splitwise, KV-cache transfer is optimized by overlapping it with prompt computation. As each layer of the LLM is processed on the prompt machine, the corresponding KV-cache is generated and asynchronously transferred to the token machine layer by layer, while prompt computation continues. This reduces transfer overhead, allowing the token phase to start earlier and freeing up KV-cache memory on the prompt machine.

Although layer-wise KV-cache transfer requires fine-grained synchronization, which can slightly increase TTFT for smaller prompts, the overhead is minimal. Splitwise mitigates this by using serialized KV-cache transfer for smaller prompts and layer-wise transfer for larger ones, minimizing transfer and interference overheads overall.
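
A simplified sketch of the layer-wise overlap, using asyncio purely for illustration (the real system moves KV blocks directly between GPUs over the cluster interconnect; all names here are assumed):

```python
# Sketch of layer-wise KV-cache transfer overlapped with prompt computation.
# `layer.forward` and `send_kv` (an async function that ships one layer's KV
# block to the token machine) are illustrative, not the paper's implementation.
import asyncio

async def prefill_with_overlapped_transfer(layers, hidden, send_kv):
    transfers = []
    for i, layer in enumerate(layers):
        hidden, kv_block = layer.forward(hidden)                       # compute layer i of the prompt phase
        transfers.append(asyncio.create_task(send_kv(i, kv_block)))    # ship layer i's KV-cache immediately
        await asyncio.sleep(0)                                         # yield so transfers proceed alongside compute
    await asyncio.gather(*transfers)                                   # token phase can start once all layers arrive
    return hidden
```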

Splitwise uses serialized KV-cache transfer for smaller prompts and layer-wise transfer for larger prompts

Evaluation Results

LLM inference engines can be optimized with two primary goals: 1) maximizing throughput within power and cost constraints, or 2) optimizing power and cost for a specific throughput target. Across all scenarios, Splitwise consistently outperforms traditional LLM inference deployment methods.

The evaluation compares Splitwise designs against two baselines: Baseline-A100 and Baseline-H100, where clusters consist solely of DGX-A100s and DGX-H100s, respectively. Both baselines use the same mixed continuous batching technique employed by Splitwise for its mixed pool machines. The table below outlines the configurations used to evaluate Splitwise’s performance relative to these baseline systems.

In the Splitwise-HHcap configuration, DGX-H100 machines are used for both the prompt and token pools. However, the token machines are power-capped to 70% of their rated capacity, with each GPU limited to 50% of its power, optimizing energy efficiency without compromising throughput.

Evaluated Splitwise designs all normalized to DGX-A100 [1]

Throughput-optimized cluster designs (Iso-power)

As shown in the left figure (a), Splitwise dramatically improves throughput while maintaining the same power usage. For example, Splitwise-AA achieves 2.15× higher throughput than Baseline-A100 at equivalent power and cost. When compared to Baseline-H100, Splitwise-HA delivers 1.18× greater throughput at 10% lower cost and the same power.

Throughput-optimized cluster designs (Iso-cost)

As illustrated in the left figure (b), Splitwise-AA provides the highest throughput at the same cost, delivering 1.4× more throughput than Baseline-H100, though it consumes 25% more power.

Iso-throughput cluster designs (power-optimized)

In the right figure (a), power efficiency is the focus. Splitwise-HHcap matches the throughput of Baseline-H100 while reducing power consumption by 25%, making it an attractive solution for cloud service providers (CSPs).

Iso-throughput cluster designs (cost-optimized)

As seen in the right figure (b), Splitwise-AA achieves the same throughput as Baseline-H100, but with 25% lower costs.

Throughput-optimized cluster designs (left a,b) vs. iso-throughput cluster designs (right a,b) [1]

Conclusion

Splitwise offers more cost-effective and power-efficient deployment options for LLM inference. While NVIDIA plans to phase out A100 machines, Splitwise demonstrates that, in many cases, A100-based systems can still provide competitive performance and cost benefits.

References

  1. Splitwise: Efficient Generative LLM Inference Using Phase Splitting, 2024
  2. LLM Inference Optimizations #1 — Continuous Batching
  3. LLM Inference Optimizations #2 — Chunked Prefill and Decode-Maximal Batching
  4. Analysis of JSQ Policy on Soft Real-time Scheduling in Cluster, 2000
