Jaideep Ray · Better ML · Feb 5, 2024

NVIDIA Triton Inference Server's concurrent execution

What is concurrent execution?

NVIDIA's Triton is one of the most popular inference servers for GPU inference. GPUs are commonly used in scenarios that require low-latency inference for transformer models. Triton supports concurrent execution of model instances: more instances mean more parallel executions, which should improve overall inference throughput as a result!

Fig: Triton loading multiple instances of multiple models on a single GPU.

When multiple requests for the same model arrive simultaneously, Triton schedules them based on the number of instances configured for that model. If the model is configured to allow N instances, the first N inference requests are immediately executed in parallel; any additional requests wait until one of these executions completes. Note that each request can itself be a batch of examples if the model supports batching.
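
For illustration, the instance count is set through the instance_group section of a model's config.pbtxt. Below is a minimal sketch, assuming a hypothetical ONNX transformer model served with two instances on GPU 0 and dynamic batching enabled:

    # config.pbtxt -- sketch for a hypothetical model named "my_transformer"
    name: "my_transformer"
    platform: "onnxruntime_onnx"   # assumption: ONNX Runtime backend
    max_batch_size: 16

    # Two copies of the model are loaded on GPU 0, so Triton can run
    # up to two inference requests for this model in parallel.
    instance_group [
      {
        count: 2
        kind: KIND_GPU
        gpus: [ 0 ]
      }
    ]

    # Optional: let Triton combine individual requests into batches.
    dynamic_batching {
      max_queue_delay_microseconds: 100
    }

Raising count is how you get more concurrent executions per GPU; the rest of this post is about when that actually helps.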

So, more instances => better throughput & latency?

Here comes a caveat: increasing the number of instances doesn't necessarily improve throughput or latency.

While trying to improve inference throughput/latency through concurrent execution, one must study the impact on the following variables:

  1. GPU memory
  2. GPU utilization
  3. GPU memcpy utilization
  4. Throughput vs Latency

GPU memory: Triton's architecture supports loading multiple instances of one or more models into GPU memory for concurrent execution, and each additional instance consumes additional GPU memory. Triton uses a separate CUDA stream per instance, which gives only partial isolation within a single GPU: instances share the device's memory, so inference of one model can impact inference of the other models loaded alongside it. A100 and H100 GPUs additionally support Multi-Instance GPU (MIG), which slices a GPU into up to seven instances, each with its own memory and physical isolation.
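
For example, a rough way to check whether there is headroom for another instance is to query the GPU's memory while the current configuration is serving traffic. A minimal sketch using the pynvml bindings (assuming the nvidia-ml-py package is installed and the model is served on GPU 0):

    # Sketch: check GPU memory headroom before raising the instance count.
    # Assumes the nvidia-ml-py package (pynvml) and a model served on GPU 0.
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)

    print(f"GPU0 memory: {mem.used / 1e9:.1f} GB used, {mem.free / 1e9:.1f} GB free")

    # Rough rule of thumb: if one loaded instance takes X GB, adding another
    # needs at least X GB free, plus workspace for activations.
    pynvml.nvmlShutdown()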

GPU utilization: If the GPU cores are already fully utilized running one instance of the model, there is no spare capacity to run another instance. Instead, the two instances share the same GPU cores, each running at roughly half speed. So increasing the number of instances might not raise GPU utilization, and might not improve throughput or latency either.
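
As a sketch, one way to watch this while sending load is to scrape the nv_gpu_utilization metric from Triton's Prometheus metrics endpoint (assuming the default metrics port 8002):

    # Sketch: read GPU utilization from Triton's metrics endpoint.
    # Assumes Triton is running locally with metrics on the default port 8002.
    import requests

    metrics = requests.get("http://localhost:8002/metrics").text
    for line in metrics.splitlines():
        # nv_gpu_utilization is reported per GPU as a fraction in [0, 1].
        if line.startswith("nv_gpu_utilization"):
            print(line)

If utilization is already close to 1.0 with the current instance count, adding more instances mainly adds contention rather than parallelism.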

GPU memcpy utilization: To run inference on the GPU, input tensors have to be copied from CPU to GPU. If the data copy is not fast enough, or CPU-side processing is slow, inference becomes bottlenecked on the CPU. GPU memcpy bandwidth is available as a GPU metric, and saturation indicates that increasing instances won't have much effect. Similarly, the data-processing stages on the CPU (e.g. tokenization) should be scaled up so they can feed the model instances on the GPU without delay.
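
One common mitigation is to parallelize the CPU-side preprocessing so the GPU instances are never starved. A rough sketch, where tokenize and triton_infer are hypothetical placeholders for the real tokenizer and Triton client call:

    # Sketch: scale CPU-side tokenization so GPU model instances stay fed.
    # `tokenize` and `triton_infer` are hypothetical placeholders for the
    # real tokenizer and Triton client call in your serving path.
    from concurrent.futures import ThreadPoolExecutor

    def tokenize(text):
        # placeholder: real code would call the actual tokenizer here
        return text.split()

    def triton_infer(tokens):
        # placeholder: real code would copy tensors to the GPU via the client
        return len(tokens)

    def preprocess_and_infer(text):
        return triton_infer(tokenize(text))

    incoming = ["example request"] * 32
    # Use more CPU workers than GPU model instances so preprocessing and the
    # CPU -> GPU copy it feeds do not become the bottleneck.
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(preprocess_and_infer, incoming))
    print(len(results), "responses")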

Throughput vs latency: Increasing batch size improves GPU utilization and inference throughput, but beyond a point it starts degrading end-to-end latency.
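
To see where that point is, it helps to sweep batch sizes and record throughput next to latency. Below is a toy sketch in which infer_batch is a hypothetical stand-in for the real client call, simulated here as a fixed overhead plus a per-example cost:

    # Sketch: measure throughput vs. latency as batch size grows.
    # `infer_batch` is a hypothetical stand-in for the real Triton client call;
    # it is simulated as fixed overhead plus a per-example cost.
    import time

    def infer_batch(batch_size):
        time.sleep(0.005 + 0.001 * batch_size)  # simulated GPU call

    for batch_size in (1, 4, 16, 64):
        start = time.perf_counter()
        infer_batch(batch_size)
        latency_ms = (time.perf_counter() - start) * 1000
        throughput = batch_size / (latency_ms / 1000)
        print(f"batch={batch_size:3d}  latency={latency_ms:6.1f} ms  "
              f"throughput={throughput:7.1f} ex/s")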

Key takeaway:
Concurrent execution of model instances can improve inference performance in most setups. It is important to understand the variables above in order to debug performance issues or squeeze out more performance.
