Understanding VRAM Requirements to Train/Inference with Large Language Models (LLMs)

Siddhesh Gunjal
4 min read · Oct 25, 2023


Image of DGX H100 from NVIDIA

In the ever-evolving landscape of artificial intelligence, Large Language Models (LLMs) have become pivotal in shaping the future of natural language processing tasks. These sophisticated models, however, come at a cost — a significant demand for computational resources. Among these, one of the critical components is Video Random Access Memory (VRAM), which plays a crucial role in the training process.

In this article, we will delve into the intricacies of calculating VRAM requirements for training Large Language Models. Whether you are an AI enthusiast, a data scientist, or a researcher, understanding how VRAM impacts the training of LLMs is essential for optimizing performance and ensuring efficient utilization of hardware resources.

Formula to Calculate Activations in a Transformer Neural Network

The paper "Reducing Activation Recomputation in Large Transformer Models" gives a useful formula for estimating the size of the activations in a Transformer layer.

Activations per layer = s * b * h * (34 + (5 * a * s) / h)

Where,
b: batch size
s: sequence length
l: layers
a: attention heads
h: hidden dimensions

The paper computes this figure at 16-bit precision, so it is expressed in bytes with 2 bytes per value. Dividing by 2 converts it into a raw count of activation values, which we can later multiply by whatever precision we choose. The formula used here for the activations of the network then becomes:

Activations = l * (5/2) * a * b * s² + 17 * b * h * s …………… (1)

Formula to Calculate the VRAM Requirement for LLMs

VRAM in Bits = p * (Activations + params) …………… (2)
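
Before plugging in numbers, here is a minimal Python sketch of Formula №1 and Formula №2. These helper functions are my own illustration, not from the paper; the names activations, vram_bits, and bits_to_gb are just placeholders.

def activations(l, a, b, s, h):
    # Formula №1: activation count for the network
    return l * (5 / 2) * a * b * s**2 + 17 * b * h * s

def vram_bits(params, acts, p):
    # Formula №2: VRAM in bits, where p is the precision in bits per value
    return p * (acts + params)

def bits_to_gb(bits):
    # bits -> bytes -> GB, dividing by 1024**3 as in the numbers below
    return bits / 8 / 1024**3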

Let the calculations begin (taking LLaMa-1 7B as an example):

Below is the default configuration of the LLaMa-1 7B model, so let’s calculate the VRAM required to train it with this default configuration.

VRAM requirement for Batch size 32:

params = 7*10⁹
p = 32 # precision (bits)
b = 32 # batch size
s = 2048 # sequence length
l = 32 # layers
a = 32 # attention heads
h = 4096 # hidden dimension

Substitute these values into Formula №1 to get the activations in the network.

Activations in Network = 348,160,786,432

Now substitute this value into Formula №2 to calculate the VRAM:

VRAM = p * (Activations + params)
VRAM = 32 * (348,160,786,432 + (7*10⁹))
VRAM = 11,365,145,165,824 Bits
VRAM = 1323.077 GB

We need a minimum of 1324 GB of GPU VRAM to train LLaMa-1 7B with batch size 32.
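
As a quick sanity check, the hypothetical helpers sketched after Formula №2 reproduce the same numbers:

acts = activations(l=32, a=32, b=32, s=2048, h=4096)  # 348,160,786,432
bits = vram_bits(params=7 * 10**9, acts=acts, p=32)   # 11,365,145,165,824 bits
print(round(bits_to_gb(bits), 3))                     # 1323.077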

We can also reduce the batch size if needed, but this might slow down training. The time required for training also depends on the CUDA compute capability of the GPUs we opt for; the compute capability of every NVIDIA GPU is listed on NVIDIA’s CUDA GPUs page.

VRAM requirement for other batch sizes:

Batch size 16 = 674.577 GB ~ 675 GB
Batch size 8 = 350.327 GB ~ 351 GB
Batch size 4 = 188.2 GB ~ 189 GB

Note: When we reduce the batch size, the time to train the model might increase.
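
These figures fall straight out of the same formulas; a short loop with the hypothetical helpers from earlier reproduces them:

for b in (16, 8, 4):
    acts = activations(l=32, a=32, b=b, s=2048, h=4096)
    gb = bits_to_gb(vram_bits(params=7 * 10**9, acts=acts, p=32))
    print(b, round(gb, 3))  # 674.577, 350.327, 188.202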

VRAM for Inference/Prediction with LLaMa-1 7B:

For a single instance of inference, the batch size is 1. So the configuration to run inference becomes:

params = 7*10⁹
p = 32 # precision (bits)
b = 1 # batch size
s = 2048 # sequence length
l = 32 # layers
a = 32 # attention heads
h = 4096 # hidden dimension

Substitute these values into Formula №1 to get the activations in the network.

Activations in Network = 10,880,024,576

Now substitute this value into Formula №2 to calculate the VRAM:

VRAM = p * (Activations + params)
VRAM = 32 * (10,880,024,576 + (7*10⁹))
VRAM = 572,160,786,432 Bits
VRAM = 66.6083 GB

We need a minimum of 67 GB of GPU VRAM to run a single instance of inference/prediction with LLaMa-1 7B at 32-bit precision.
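
The same hypothetical helpers reproduce this single-instance figure:

acts = activations(l=32, a=32, b=1, s=2048, h=4096)  # 10,880,024,576
gb = bits_to_gb(vram_bits(params=7 * 10**9, acts=acts, p=32))
print(round(gb, 4))                                  # 66.6083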

During deployment we can try to reduce VRAM consumption for inference by quantizing the model to 16-bit or 8-bit precision. Quantization might affect the accuracy/confidence of the model.

VRAM requirement for other precision values:

16-Bit Quantization = 33.3041 GB ~ 34 GB
8-Bit Quantization = 16.6521 GB ~ 17 GB
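
Keeping the inference activations fixed and only lowering the precision p in Formula №2 gives the same figures (again using the hypothetical helpers from earlier):

acts = activations(l=32, a=32, b=1, s=2048, h=4096)
for p in (16, 8):
    gb = bits_to_gb(vram_bits(params=7 * 10**9, acts=acts, p=p))
    print(p, round(gb, 4))  # 33.3041, 16.6521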

Note: For models with larger parameter counts such as 13B, 20B, or 50B, the VRAM requirement will vary accordingly and will depend heavily on the configuration of that specific model architecture.

So, now you’ll just have to find the configuration of your LLM and substitute those values into these formulae to calculate the VRAM requirement of your selected LLM for both training and inference.

References:

Korthikanti, V., Casper, J., Lym, S., McAfee, L., Andersch, M., Shoeybi, M., and Catanzaro, B., “Reducing Activation Recomputation in Large Transformer Models,” arXiv:2205.05198, 2022.

I hope this article helps you in determining the VRAM requirement to train and deploy your LLM in development and production.

How do you determine which server to go for when training your LLM? That would be an interesting conversation, right? Share your thoughts in the comments.

Read about my open-source package — Slackker


Siddhesh Gunjal

ML Engineer | Creator of Slackker (PyPI package) | Former Adjunct Faculty @upGrad | Former Professor @BSE (Bombay Stock Exchange)