Jaideep Ray
Published in Better ML
Aug 13, 2023


Offline Batch Inference for Large Models

Context:

  • LLMs are being used to improve multiple ML-based products: ads and search, content recommendations, detecting policy-violating content, and so on. A common ML lifecycle step in all of these is applying an LLM to the entire corpus of content (ads, search queries, document text) and evaluating the predictions. Applying LLMs at this scale is resource intensive.
  • Consider the Llama-7B model: latency is roughly 25 ms/token on a single A100 GPU [1]. For a corpus of 100 million queries with an average length of 8 tokens, that works out to about 30 days of compute on an 8×A100 box (a back-of-envelope calculation is sketched below).
    Now consider experimenting with multiple LLMs and multiple products, each with its own customizations (search queries, ads content, post titles, etc.), and we have a real problem at hand.
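
A quick back-of-envelope check of that estimate (assuming the ~25 ms/token figure above and perfect linear scaling across 8 GPUs):

```python
# Rough cost estimate for the corpus in the example above.
queries = 100_000_000        # corpus size
tokens_per_query = 8         # average query length
sec_per_token = 0.025        # ~25 ms/token on a single A100 [1]
num_gpus = 8                 # one 8xA100 box, assuming perfect linear scaling

gpu_seconds = queries * tokens_per_query * sec_per_token  # ~2.0e7 GPU-seconds
days_on_box = gpu_seconds / num_gpus / 86_400             # seconds -> days
print(f"~{days_on_box:.0f} days")                         # ~29 days
```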

Designing an efficient and scalable system for batch inference with large language models over large datasets is challenging. Here are some key considerations to keep in mind:

Figure: Model batch inference architecture [6]

  1. Compute split over CPU / GPU: Accelerators (GPUs/TPUs) often provide an order-of-magnitude improvement in latency and throughput, especially at large batch sizes. To keep GPU utilization high, carefully split the batch-processing pipeline's compute between CPU and GPU. For example, standard feature processing and lightweight sub-model inference can run on a CPU cluster while large-model inference runs on GPUs (see the sketch after this list).
  2. Distributed batch inference over multiple GPUs and nodes: To speed up batch inference further, the model can be replicated on multiple GPUs across one or more nodes, with each replica processing a shard of the data (the sketch below shards work this way). This mirrors the distributed data-parallel training paradigm and can provide near-linear scale-up.
  3. Choice of GPUs: When choosing GPUs for batch inference, consider the trade-off between cost and performance. High-end GPUs like the NVIDIA A100 offer excellent performance, but they may not be necessary for every use case. Cheaper GPUs such as the A10G can be sufficient for inference workloads with smaller memory footprints; a 7B-parameter model in fp16, for instance, needs roughly 14 GB for weights and fits comfortably in an A10G's 24 GB. [5]
Figure: Choice of GPUs for inference workloads [7]
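
To make points 1 and 2 concrete, here is a minimal sketch of a batch-inference worker, assuming PyTorch and a Hugging Face causal LM (the model name, corpus, batch size, and output path are placeholders): tokenization and batching run on CPU DataLoader workers, the forward pass runs on the GPU, and the corpus is striped across one process per GPU.

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM works


def run_shard(rank: int, world_size: int, queries: list[str], out_path: str):
    """Run inference for one shard of the corpus on one GPU."""
    device = f"cuda:{rank}"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "left"  # decoder-only models need left padding for batched generation
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.float16
    ).to(device).eval()

    # Simple striped sharding: each process handles every world_size-th query.
    shard = queries[rank::world_size]

    # CPU side: DataLoader workers batch and tokenize text off the GPU's critical path.
    def collate(batch: list[str]):
        return tokenizer(batch, return_tensors="pt", padding=True, truncation=True)

    loader = DataLoader(shard, batch_size=64, num_workers=4, collate_fn=collate)

    results = []
    with torch.inference_mode():
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            # GPU side: generate a short continuation (or score logits) per query.
            out = model.generate(**batch, max_new_tokens=16)
            results.extend(tokenizer.batch_decode(out, skip_special_tokens=True))

    with open(f"{out_path}.part{rank}", "w") as f:
        f.write("\n".join(results))


if __name__ == "__main__":
    corpus = ["example query"] * 1024  # placeholder corpus
    n_gpus = torch.cuda.device_count()
    mp.spawn(run_shard, args=(n_gpus, corpus, "preds"), nprocs=n_gpus)
```

In a real pipeline the corpus would be read from, and predictions written back to, a data store; frameworks such as Ray Data or Spark can manage the sharding and the CPU worker pool for you.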

4. GPU utilization: Higher GPU utilization means more efficient use of the GPU cluster. Some quick knobs to tune: batch size, collating predictions across several batches in memory before flushing them to the store, and dynamically constructing batches from multiple workloads where possible (a buffered-writer sketch follows).
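
One of those knobs, collating predictions before flushing, can be as simple as a small buffer in front of whatever storage client you use (`flush_to_store` below is a hypothetical callable, not a real API):

```python
class BufferedWriter:
    """Accumulate predictions in memory and flush them to storage in large chunks,
    so the GPU loop is not stalled by many small writes."""

    def __init__(self, flush_to_store, capacity: int = 50_000):
        self.flush_to_store = flush_to_store  # e.g. writes one Parquet file / DB batch
        self.capacity = capacity
        self.buffer = []

    def add(self, predictions: list) -> None:
        self.buffer.extend(predictions)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_to_store(self.buffer)
            self.buffer = []


# Usage inside the inference loop (predict and write_parquet_chunk are placeholders):
# writer = BufferedWriter(flush_to_store=write_parquet_chunk)
# for batch in loader:
#     writer.add(predict(batch))
# writer.flush()  # flush the final partial buffer
```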

5. Model optimizations: There are several active workstreams in the ML community on optimizing models for inference. Freezing weights, quantization, graph mode, op fusion, and more performant attention implementations (BetterTransformer [2]) can yield significant improvements (2-10x) in latency, throughput, or both, depending on the model architecture. TensorRT [3] and ONNX [4] can be integrated into your batch-processing pipeline for further speedups. A couple of these are sketched below.
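
A minimal sketch of two of these techniques, assuming a Hugging Face causal LM on a CUDA GPU (the model name is a placeholder, Option A needs the bitsandbytes package, and the achievable speedup depends on the model and hardware):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Option A, quantization: load weights in 8-bit to roughly halve memory versus fp16,
# which can also help throughput on memory-bound workloads.
model_8bit = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
).eval()

# Option B, graph compilation: keep fp16 weights and let torch.compile fuse ops
# and cut framework overhead in the forward pass.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16
).to("cuda").eval()
model = torch.compile(model)

# One compiled forward pass returning next-token logits, e.g. for scoring content.
inputs = tokenizer("an example ad title", return_tensors="pt").to("cuda")
with torch.inference_mode():
    logits = model(**inputs).logits
print(logits.shape)  # (batch, sequence_length, vocab_size)
```

TensorRT [3] and ONNX [4] integrations follow a similar pattern: convert or wrap the model once up front, then reuse it inside the same batch loop.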

Figure: CPU vs GPU compute split

In summary, designing an efficient and scalable system for batch inference with large language models on large datasets requires careful consideration of compute resources, distribution strategies, and hardware choices. It is well worth the effort!
