The Complete Guide to GPU Requirements for Training and Inference of LLMs

Debasish Pradhan
6 min read · Aug 31, 2024


Introduction

Whether you’re training an LLM from scratch, fine-tuning it, or deploying an existing model, choosing the right GPU is critical for both cost and efficiency. In this blog, we’ll break down everything you need to know about GPU requirements for training and inference of LLMs on single and multiple GPUs, with different optimizers and batch sizes.

A computer processor is made up of many tiny circuits, each of which may be either OFF or ON. In terms of memory, these two states are represented by a 0 or a 1, called bits. A group of eight bits is known as a byte. One byte can represent 2^8 = 256 distinct values, i.e., the numbers 0 (00000000) through 255 (11111111). Neural networks are generally trained in the FP32 data type (sign, exponent, and mantissa), which takes 4 bytes of memory per number.

Common data types used for Model Parameters

  • float (32-bit floating point): 4 bytes per parameter
  • half/BF16 (16-bit floating point): 2 bytes per parameter
  • int8 (8-bit integer): 1 byte per parameter
  • int4 (4-bit integer): 0.5 bytes per parameter
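As a quick sanity check, here is a minimal Python sketch of the sizes above; the dictionary and its names are just for illustration.

```
# Bytes per parameter = bit width / 8
DTYPE_BITS = {"fp32": 32, "fp16/bf16": 16, "int8": 8, "int4": 4}

for dtype, bits in DTYPE_BITS.items():
    print(f"{dtype:>10}: {bits / 8} bytes per parameter")
```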

What Consumes GPU Memory?

During model training, most of the memory is consumed by four things:

1. Model parameters

Model parameters are the learnable components of a neural network. They define the network’s structure and behavior and are updated during training to minimize the loss function. Generally, we have Weight and Bias parameters.

As noted above, storing one FP32 number takes 4 bytes. Assume our model has P parameters.

Memory for Parameters (M) = Number of Parameters (P) x Precision Size (4 bytes)

M = P x 4 bytes

Similarly, for 16-bit precision: M = P x 2 bytes

We can add a scaling factor and turn this into a standard formula:

M = (P x 4 bytes) / (32 / Q) x 1.2

Here 1.2 represents a 20% overhead for the additional things loaded into GPU memory, and Q is the number of bits used to load the model, i.e. 16, 8, or 4 bits.

GPU memory required for serving Llama 70B in 16-bit:

M = (70 x 10^9 x 4 bytes) / (32 / 16) x 1.2 ≈ 168 GB

This is roughly the minimum GPU memory required for inference of a Llama 70B model.
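To make this concrete, here is a minimal Python sketch of the same formula. The function name is just for illustration; the 1.2 overhead factor and the Llama 70B figures follow the calculation above.

```
def inference_memory_gb(num_params: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate GPU memory (in GB) needed to load a model for inference.

    num_params : total number of parameters (e.g. 70e9 for Llama 70B)
    bits       : precision used to load the model (16, 8, or 4)
    overhead   : ~20% extra for additional things loaded into GPU memory
    """
    bytes_per_param = bits / 8            # e.g. 16 bits -> 2 bytes
    return num_params * bytes_per_param * overhead / 1e9

# Llama 70B served in 16-bit: 70e9 * 2 bytes * 1.2 ≈ 168 GB
print(f"{inference_memory_gb(70e9, 16):.0f} GB")
```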

2. Activations

Activations are the intermediate outputs of the neurons in each layer as the input data passes through the network. During the forward pass, each layer processes the input data, applying weights, biases, and an activation function (like ReLU, sigmoid, etc.) to produce activations. These activations then serve as inputs to the next layer.

The activations for each layer need to be stored because they are used during backpropagation to compute gradients.

Memory for Activations ≈ Number of Activations (per sample) x Batch Size x Precision Size

Note: the number of activations depends on the model architecture, the number of layers, and the sequence length. For large models, activations can require memory comparable to, or even exceeding, the parameters. Doubling the sequence length can roughly double the activation memory as well.

Approximation: there is no fixed formula for calculating the GPU memory used by activations. For larger models, the memory required for activations is roughly similar to, or slightly larger than, the memory for the parameters.

3. Gradients

Gradients are the partial derivatives of the loss function with respect to the model parameters. They indicate how much each parameter should be adjusted to minimize the loss function.

During backpropagation, the loss is propagated backward through the network, and gradients are computed for each parameter (weight and bias). The optimizer uses these gradients to update the parameters, reducing the overall loss.

The memory required to store gradients is equal to the memory needed for the parameters themselves. Since each parameter has a corresponding gradient, their memory requirements are identical.

Memory for Gradients = Memory for Parameters

4. Optimizer States

Optimizer states are additional variables maintained by certain optimization algorithms (like Adam, RMSprop) to improve the efficiency of training. These states help in updating the model parameters based on past gradients.

Different optimizers maintain different types of states. For example:

  • SGD (Stochastic Gradient Descent): No additional state; only the gradients are used to update the parameters.
  • Adam: Maintains two states for each parameter: the first moment (mean of gradients) and the second moment (mean of squared gradients). These help adapt the learning rate for each parameter dynamically. For a model with 1 million parameters, Adam requires maintaining 2 additional values (first moment and second moment) for each parameter, resulting in 2 million additional states.

Memory for Optimizer States = Number of Parameters x Precision Size x Optimizer Multiplier
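A minimal Python sketch of this formula. The SGD and Adam multipliers follow the text above; the values for momentum SGD and RMSprop are common assumptions added for illustration, and optimizer states are assumed to be stored in FP32 (4 bytes).

```
# Optimizer multiplier: how many extra values per parameter each optimizer keeps.
OPTIMIZER_MULTIPLIER = {
    "sgd": 0,        # no extra state
    "momentum": 1,   # one velocity value per parameter (assumption for illustration)
    "rmsprop": 1,    # one running average of squared gradients (assumption)
    "adam": 2,       # first moment + second moment
}

def optimizer_state_memory_gb(num_params: float, optimizer: str,
                              bytes_per_value: int = 4) -> float:
    """Memory for optimizer states, assuming states are kept in FP32 (4 bytes each)."""
    return num_params * bytes_per_value * OPTIMIZER_MULTIPLIER[optimizer] / 1e9

print(optimizer_state_memory_gb(10e9, "adam"))  # -> 80.0 GB for a 10B-parameter model
```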

Total Memory Requirements

Let’s take an example

We want to train a 10-billion-parameter model in mixed precision (2 bytes per parameter) on a single GPU.

Memory for Parameters = Number of Parameters x 2 bytes (FP16)

Memory for Parameters = 10B x 2 bytes = 20 GB

Memory for Activations = Activations per Parameter x Batch Size x Precision Size

Instead of calculating the total activation memory directly, we can estimate the per-layer activation memory, since each layer’s activations are produced and then consumed by the next layer.

Approximate Number of Neurons per Layer = sqrt(10B) ≈ 100k neurons per layer

Activation Memory for one layer ≈ 32 x 100k x 2 bytes ≈ 6.4 MB per layer

Across hundreds of layers, and once sequence length (often thousands of tokens) is taken into account, activation memory can add up to tens of GB.

So, as discussed earlier, for a batch size of 32 approximately 20–40 GB of memory is required for activations. This range roughly doubles if we double the batch size.
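Here is the same back-of-the-envelope activation estimate as a Python sketch. It follows the rough approximation above (neurons per layer ≈ sqrt(P), per-layer memory = batch size x neurons x precision); the 100-layer count is an assumption for illustration, and real activation memory also grows with sequence length.

```
import math

def rough_activation_memory_gb(num_params: float, batch_size: int,
                               bytes_per_value: int = 2,
                               num_layers: int = 100) -> float:
    """Very rough activation estimate following the approximation above.

    neurons_per_layer ≈ sqrt(num_params); real models also scale with sequence
    length, so treat this as a lower-bound sketch rather than an exact figure.
    """
    neurons_per_layer = math.sqrt(num_params)            # ~100k for a 10B model
    per_layer = batch_size * neurons_per_layer * bytes_per_value  # ~6.4 MB per layer
    return per_layer * num_layers / 1e9

# 10B parameters, batch size 32: ≈ 0.64 GB for 100 layers before sequence length
# is folded in; with sequence length, this grows into the 20–40 GB range above.
print(rough_activation_memory_gb(10e9, 32))
```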

Memory for Gradients = Memory for Parameters

Memory for Gradients = 20 GB

Memory for Optimizer States = Number of Parameters x 4 bytes (FP32) x 2 (for Adam)

Memory for Optimizer States = 10B x 4 bytes x 2 = 80 GB

Total Memory Estimate:

  1. Memory for Parameters: 20 GB
  2. Memory for Activations: ≈20–40 GB (depends on batch size)
  3. Memory for Gradients: 20 GB
  4. Memory for Optimizer States: 80 GB

Total Memory = 20 + 20 to 40 + 20 + 80 = 140 to 160 GB
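Putting the four components together, here is a minimal sketch of the single-GPU training estimate. The activation figure is taken from the rough 20–40 GB range above rather than computed exactly.

```
def training_memory_gb(num_params: float,
                       param_bytes: int = 2,           # FP16/BF16 weights
                       optimizer_multiplier: int = 2,  # Adam: two states per parameter
                       optimizer_bytes: int = 4,       # optimizer states kept in FP32
                       activation_gb: float = 30.0) -> float:  # mid-point of 20–40 GB
    params    = num_params * param_bytes / 1e9
    gradients = params                                  # same size as the parameters
    optimizer = num_params * optimizer_bytes * optimizer_multiplier / 1e9
    return params + gradients + optimizer + activation_gb

# 10B model, mixed precision, Adam: 20 + 20 + 80 + ~30 ≈ 150 GB
print(training_memory_gb(10e9))
```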

Memory calculation for multiple GPUs

To calculate the memory requirement per GPU when training on n GPUs, we need to consider how the memory is distributed across the GPUs using parallelism techniques like data parallelism and model parallelism.

Key Assumptions:

  1. Model Parallelism: The model’s parameters are divided among the GPUs, so each GPU only stores a fraction of the total model parameters. Gradients and optimizer states are similarly divided.
  2. Data Parallelism: Each GPU gets a copy of the entire model’s parameters, but the batch of data is split across the GPUs. Activations are calculated separately for each GPU’s mini-batch.

If we use model parallelism, the model parameters, gradients, and optimizer states are all distributed across the GPUs.

However, each GPU still needs to store activations for its portion of the batch. Memory for activations does not scale down with the number of GPUs since each GPU processes its own data independently.

So the activation memory required remains the same on each GPU.

Here, n is the number of GPUs.

So the total memory per GPU to train a 10-billion-parameter model in mixed precision (2 bytes) on n GPUs is:

Memory per GPU = (Memory for Parameters + Memory for Gradients + Memory for Optimizer States) / n + Memory for Activations

If we train the model on 2 GPUs, we need approximately 80 to 100 GB of memory per GPU.
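And a small sketch of the per-GPU formula, under the same assumptions: parameters, gradients, and optimizer states are sharded across n GPUs, while activations are not.

```
def per_gpu_training_memory_gb(num_params: float, num_gpus: int,
                               param_bytes: int = 2,
                               optimizer_multiplier: int = 2,
                               optimizer_bytes: int = 4,
                               activation_gb: float = 30.0) -> float:
    params    = num_params * param_bytes / 1e9
    gradients = params
    optimizer = num_params * optimizer_bytes * optimizer_multiplier / 1e9
    # Parameters, gradients, and optimizer states are sharded; activations are not.
    return (params + gradients + optimizer) / num_gpus + activation_gb

# 10B model on 2 GPUs: (20 + 20 + 80) / 2 + ~30 ≈ 90 GB per GPU (the 80–100 GB range)
print(per_gpu_training_memory_gb(10e9, num_gpus=2))
```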

Thanks for reading!

If you have any queries, suggestions, or feedback related to this blog, contact me on my LinkedIn.

References:

  1. https://training.continuumlabs.ai/infrastructure/data-and-memory/calculating-gpu-memory-for-serving-llms
  2. https://medium.com/@maxshapp/understanding-and-estimating-gpu-memory-demands-for-training-llms-in-practise-c5ef20a4baff

If you liked this post, please give it a 👏. It shows me that you appreciate my effort in creating these articles.

Please don’t forget to subscribe and follow me

Happy learning :)
