Batching to optimize model execution

Jaideep Ray · Better ML · Dec 22, 2022

Context:

  • GPU memory is still an order of magnitude smaller than its CPU counterpart, which makes it an expensive resource in both training and inference. In this post we look at tuning batch size as a way to make the most of available GPU memory.

Optimal batch sizes:

GPU memory stores the following during training & inference:

  1. Parameters of the neural network.
  2. [Training only] Values from forward pass that are used in backward pass.
  3. [Training only] Optimizer values such as momentums.
  4. Temporary values local to GPU kernels (workspace memory).

Increasing the batch size increases the storage needed for (2); (1) and (3) scale with model size rather than batch size.
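As a rough back-of-envelope, the parameter and optimizer terms are fixed for a given model while the activation term grows linearly with batch size. The sketch below illustrates this; the sizes, the fp32 assumption, and the "two optimizer moments per parameter" rule are illustrative assumptions, not measurements.

```python
# Back-of-envelope fp32 memory estimate. All numbers here are illustrative,
# not exact accounting for any particular model.
BYTES_PER_FLOAT = 4

def training_memory_gb(num_params, activations_per_example, batch_size):
    params = num_params * BYTES_PER_FLOAT                                  # (1) parameters
    activations = activations_per_example * batch_size * BYTES_PER_FLOAT   # (2) grows with batch size
    optimizer = 2 * num_params * BYTES_PER_FLOAT                           # (3) e.g. Adam keeps two moments
    return (params + activations + optimizer) / 1e9

# Doubling the batch size only grows the activation term:
print(training_memory_gb(100e6, 5e6, batch_size=32))   # ~1.84 GB
print(training_memory_gb(100e6, 5e6, batch_size=64))   # ~2.48 GB
```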

Beyond some batch size the GPU simply runs out of memory (OOM). Batch size also influences convergence and generalization: larger batches can generalize worse, while smaller batches slow down convergence.

1. Gradient accumulation [training-only optimization]:

In gradient accumulation, the loss and gradients are computed for each mini-batch as usual, but instead of updating the model parameters immediately, the gradients are accumulated over several consecutive mini-batches before a single update is applied. The resulting effect is that of training with a larger mini-batch.

For example, if gradients are accumulated over 4 steps and each step processes a mini-batch of 6 examples, the parameters are updated after 4 × 6 = 24 examples.

This increases the effective (global) batch size without increasing the peak GPU memory required per step.
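Below is a minimal PyTorch-style sketch of the idea. The model, the dummy data, and the accumulation count of 4 are placeholders for illustration, not part of the original post.

```python
# A minimal sketch of gradient accumulation in PyTorch.
import torch
import torch.nn as nn

model = nn.Linear(128, 10)                        # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4                                   # accumulate over 4 mini-batches

# Dummy data: 8 mini-batches of 6 examples each.
batches = [(torch.randn(6, 128), torch.randint(0, 10, (6,))) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(batches, start=1):
    loss = loss_fn(model(x), y) / accum_steps     # scale so accumulated gradients average out
    loss.backward()                               # gradients add up in .grad between updates
    if step % accum_steps == 0:
        optimizer.step()                          # one update per 4 x 6 = 24 examples
        optimizer.zero_grad()
```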

2. Cross-request batching during inference:

Real-time inference often leaves GPU memory underutilized because each request carries only a small batch. One way to build bigger batches is to pool multiple client requests and execute them together. This trades latency (pooling → executing → splitting out responses) for better GPU utilization.

[Figure] Left: underutilized GPUs. Right: request queueing to build larger batches.

The challenge with request queueing is added latency, which is critical to control for real-time requests. Strictly enforced timeouts and dynamic batch sizes can keep requests from starving in the queue.
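A minimal sketch of such a batcher is shown below. The function run_model, the batch-size cap, and the timeout value are hypothetical placeholders; a production server would pool actual tensors and run the batch on the GPU.

```python
# A sketch of cross-request batching with a queue, a batch-size cap, and a strict
# timeout. run_model, MAX_BATCH, and MAX_WAIT_S are illustrative assumptions.
import queue
import threading
import time

MAX_BATCH = 32        # dynamic upper bound on batch size
MAX_WAIT_S = 0.005    # strict timeout so queued requests never starve

request_queue = queue.Queue()   # items: (payload, done_event, result_holder)

def run_model(batch):
    # Placeholder for the real GPU forward pass over the pooled batch.
    return [f"result for {x}" for x in batch]

def batching_loop():
    while True:
        first = request_queue.get()                 # block until a request arrives
        batch = [first]
        deadline = time.monotonic() + MAX_WAIT_S
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        outputs = run_model([payload for payload, _, _ in batch])
        for (_, done, holder), out in zip(batch, outputs):
            holder.append(out)                      # hand the result back
            done.set()                              # unblock the waiting client

def infer(payload):
    done, holder = threading.Event(), []
    request_queue.put((payload, done, holder))
    done.wait()
    return holder[0]

threading.Thread(target=batching_loop, daemon=True).start()
```

Clients call infer(x) from their own threads; the background loop drains the queue either when MAX_BATCH requests have arrived or when MAX_WAIT_S elapses, whichever comes first.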

Conclusion:

  • Batch size is one of the important hyperparameters for improving training and inference throughput.
  • We looked at a training optimization (gradient accumulation) and an inference optimization (cross-request batching) to improve GPU memory utilization.
