How can you determine the amount of GPU memory required to run a Large Language Model (LLM)?

Dharil Patel
Sep 9, 2024 · 2 min read


Introduction

Estimating the GPU memory required for running Large Language Models (LLMs) is crucial for ensuring efficient performance and avoiding resource bottlenecks. Accurate calculation involves considering factors such as model size, batch size, and sequence length. Understanding these requirements helps in optimizing hardware utilization and achieving smoother execution.

Based on this information, you can choose the most suitable GPU for your needs, optimizing performance while also managing costs effectively. Selecting the right GPU helps ensure you have adequate resources without overspending.

In this blog, I’ll demonstrate the formula you can use to calculate the GPU memory needed to run any Large Language Model (LLM).

Equation

M = ((P * 4B) / (32 / Q)) * 1.2

In the above equation:

M is the GPU memory in GB (gigabytes)

P is the number of parameters in the model

4B represents the 4 bytes used per parameter

Q is the precision, in bits, at which you load the model (e.g., 16 or 32)

1.2 is a multiplier that adds 20% for additional memory overhead
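To make this concrete, here is a minimal Python sketch of the formula above. The function name estimate_gpu_memory_gb is a hypothetical helper, not a library API; it simply encodes the equation, taking P in billions of parameters so the result comes out directly in GB.

```python
def estimate_gpu_memory_gb(params_billions: float,
                           precision_bits: int,
                           overhead: float = 1.2) -> float:
    """Estimate the GPU memory (in GB) needed to serve a model.

    params_billions: number of parameters, in billions (P)
    precision_bits:  bits used to load the model (Q), e.g. 32, 16, 8, 4
    overhead:        multiplier for extra memory (1.2 = 20% overhead)
    """
    bytes_per_param = 4  # the 4B term: 4 bytes per parameter at full precision
    return (params_billions * bytes_per_param) / (32 / precision_bits) * overhead
```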

Now, let's understand this with a worked example.

Example Calculation

Suppose you have the Llama-2 model, which contains 70 billion parameters.

Number of parameters (P): 70 billion

Bytes per parameter (4B): each parameter requires 4 bytes of memory

Bits per parameter (Q): this depends on whether you load the model in 16-bit or 32-bit precision. (Note: it is good practice to load a model in 16-bit, i.e., half precision.)

Overhead: we add an extra factor for overhead (either 20% or 30%). In this blog I use 1.2 (20%), but you can adjust it accordingly.

Assuming you load your Llama-2 model in 16-bit (half) precision:

M = ((70 * 4) / (32 / 16)) * 1.2 = (280 / 2) * 1.2 = 140 * 1.2 = 168

So our answer is 168 GB. This means you would need 168 GB of GPU memory to serve the Llama-2 model with 70 billion parameters at 16-bit precision.
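Running the sketch from earlier reproduces this number, and lets you quickly compare other loading precisions with the same formula (only the 16-bit figure is worked out in this blog; the others follow directly from the equation):

```python
# Llama-2 70B at several loading precisions
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {estimate_gpu_memory_gb(70, bits):.0f} GB")

# 32-bit: 336 GB
# 16-bit: 168 GB
#  8-bit: 84 GB
#  4-bit: 42 GB
```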

Advantages

The calculation helps you identify a GPU with enough memory to serve an LLM.

It helps you manage the GPU's memory load efficiently.

It also helps reduce the cost of the GPU machine.

If you liked this article, please follow me for more blogs on AI advancements and much more.

Thank you :)

Happy Learning !!
