CUDA OOM? Learn the Basics of Larger-Than-Memory Models

Charles Curt
5 min read · Jun 15, 2023


Image credit: deepai.org

Thinking about training a machine-learning model? Is it a gigantic model eating up all your GPU’s memory? Or perhaps it’s a smaller model, but you’re attempting to feed it an ocean of data.

These problems will increasingly pop up as we push LLMs further into everything. I’ll give a general overview of these techniques, with links to the papers, so you can dig deeper into whatever you find interesting.

I’m sure there are many methods I’m missing here, but I will only cover techniques I have personally studied or tested.

Also, here’s one of my favorite papers on how ChatGPT serves inference.

Rough calculations for GPU requirements: Transformer Models

###
# Training
###

Model_Parameters_Memory = Params * Precision
Intermediate_Activations_Memory = Seq_Length * Seq_Length * Attention_Heads * Precision
Optimizer_Memory = Params * Optimizer_Requirement
Gradient_Memory = Params * Precision
Input_Memory = Input_Size * Batch_Size * Precision

Total_Memory_Training = Model_Parameters_Memory + Intermediate_Activations_Memory + Optimizer_Memory + Gradient_Memory + Input_Memory

###
# Inference
###

Model_Parameters_Memory = Params * Precision
Intermediate_Activations_Memory = Seq_Length * Seq_Length * Attention_Heads * Precision
Input_Memory = Input_Size * Batch_Size * Precision

Total_Memory_Inference = Model_Parameters_Memory + Intermediate_Activations_Memory + Input_Memory
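
To make these formulas concrete, here’s a minimal sketch in Python. The default values (fp16 precision at 2 bytes per value, an Adam-style optimizer needing roughly 8 extra bytes per parameter, and the sequence length) are illustrative assumptions, not exact numbers.

# Rough training-memory estimate, mirroring the formulas above.
# Defaults are assumptions for illustration only.
def estimate_training_memory_bytes(
    params: int,                     # number of model parameters
    precision: int = 2,              # bytes per value (2 = fp16)
    seq_length: int = 2048,
    attention_heads: int = 32,
    optimizer_requirement: int = 8,  # e.g. Adam: two fp32 moments per parameter
    input_size: int = 2048,
    batch_size: int = 1,
) -> int:
    model = params * precision
    activations = seq_length * seq_length * attention_heads * precision
    optimizer = params * optimizer_requirement
    gradients = params * precision
    inputs = input_size * batch_size * precision
    return model + activations + optimizer + gradients + inputs

# Example: a 7B-parameter model in fp16 lands around 84 GB for training.
print(estimate_training_memory_bytes(7_000_000_000) / 1e9, "GB")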

Here’s a Microsoft paper that does this much more rigorously.

There are three core concepts when dealing with larger-than-memory models.

  1. Splitting the model into different components across different machines.
  2. Offloading parts of the model off the GPU (to CPU memory or disk) until you need to process them.
  3. Shrinking the model by reducing certain aspects while keeping its behavior as close to the original as possible.

Splitting the Model

Splitting a model up is as simple as it sounds: how do you divide a model so that certain parts live on specific machines? I won’t get into finer details like row- versus column-wise partitioning.

DeepSpeed pipeline parallelism

Data Parallelism: one model, many data (Paper)

Here, we split the data across GPUs on different machines. The gradients are communicated between the devices, letting us update our model parameters efficiently.
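
Here’s a minimal data-parallel sketch using PyTorch’s DistributedDataParallel. It assumes a torchrun launch (e.g. torchrun --nproc_per_node=2 train.py), and the model, data, and hyperparameters are placeholders.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; torchrun sets LOCAL_RANK for each process.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])           # gradients are all-reduced across ranks

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device=local_rank)           # each rank would see a different data shard
loss = model(x).pow(2).mean()
loss.backward()                                       # DDP synchronizes gradients here
optimizer.step()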

Model Parallelism: one data, many models (Automatic Model Parallel)

Let’s say our model is too large to fit onto a single GPU. Enter Model Parallelism. We split the model across multiple GPUs, which can sit in a single machine or be spread across several machines. This way, we can train even larger models efficiently.
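
A naive model-parallel sketch, assuming two GPUs in one machine; the layer sizes are placeholders.

import torch
import torch.nn as nn

class TwoGPUModel(nn.Module):
    # Two halves of a network placed on two GPUs, with activations moved between them.
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

    def forward(self, x):
        x = self.part1(x.to("cuda:0"))
        return self.part2(x.to("cuda:1"))  # activations hop to the second GPU

model = TwoGPUModel()
out = model(torch.randn(8, 1024))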

Pipeline Parallelism: one data, many models, but different (Paper)

Think of an assembly line. That’s what Pipeline Parallelism is like. We place different layers of the model on different GPUs and execute the layers in a pipelined fashion. This increases our computational efficiency and GPU utilization.
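
A toy sketch of the forward-pass pipelining idea, reusing the two-GPU split from the sketch above and splitting the batch into micro-batches. Real libraries such as DeepSpeed also schedule the backward pass and overlap work far more carefully; this only illustrates the flow.

import torch

def pipelined_forward(part1, part2, batch, n_micro=4):
    # While GPU 1 works on micro-batch i, GPU 0 can already start micro-batch i+1
    # because CUDA kernel launches are asynchronous.
    outputs = []
    for micro in batch.chunk(n_micro):
        h = part1(micro.to("cuda:0", non_blocking=True))
        outputs.append(part2(h.to("cuda:1", non_blocking=True)))
    return torch.cat([o.cpu() for o in outputs])

# e.g. pipelined_forward(model.part1, model.part2, torch.randn(32, 1024))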

Offloading the Model

Model offloading refers to techniques that allow parts of a model to be stored off the primary device during training to save memory. Here are some standard model offloading techniques:

Microsoft DeepSpeed

Gradient Checkpointing: Paper

This technique reduces memory usage by discarding intermediate outputs in the forward pass and recomputing them during the backward pass. This can significantly reduce memory usage at the cost of increased computation. There are also many variations on what you recompute, and when, within this method.
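
A minimal sketch using PyTorch’s built-in torch.utils.checkpoint; the blocks and sizes are placeholders.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList([nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(12)])

def forward(x):
    for block in blocks:
        # Activations inside each block are discarded after the forward pass
        # and recomputed during backward, trading compute for memory.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(8, 1024, requires_grad=True)
forward(x).sum().backward()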

Activation Offloading:

This is a more advanced form of model offloading where the activation tensors (the outputs of each layer) are offloaded to CPU memory or disk during training. This can significantly reduce GPU memory usage, but it requires efficient management of the data movement to minimize the impact on training speed.
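
A minimal sketch of activation offloading using PyTorch’s torch.autograd.graph.save_on_cpu hook (available in recent PyTorch versions); the model and sizes are placeholders.

import torch
import torch.nn as nn
from torch.autograd.graph import save_on_cpu

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
x = torch.randn(8, 1024, device="cuda")

with save_on_cpu(pin_memory=True):  # activations saved for backward live in pinned CPU memory
    loss = model(x).pow(2).mean()
loss.backward()                     # activations are copied back to the GPU as needed here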

Parameter Offloading:

This technique offloads the model parameters to CPU memory or disk. This is typically used in conjunction with activation offloading to reduce GPU memory usage further.
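
A naive parameter-offloading sketch for inference, where the model lives in CPU memory and each layer is copied to the GPU only while it executes. Real systems like DeepSpeed add prefetching and also cover training; this is just the core idea.

import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(4096, 4096) for _ in range(24)])  # stays in CPU memory

@torch.no_grad()
def offloaded_forward(x):
    x = x.cuda()
    for layer in layers:
        layer.cuda()   # pull this layer's parameters onto the GPU
        x = layer(x)
        layer.cpu()    # push them back to free GPU memory
    return x

out = offloaded_forward(torch.randn(8, 4096))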

Shrinking the Model

The core concept here is to find a way to reduce the model size itself. This is used, for example, to get a model running on your cellphone.

Mixed Precision Training: Paper

Mixed precision training can significantly reduce memory usage by using lower-precision (e.g., half-precision) data types for certain parts of the model. This can be combined with other offloading techniques for even more significant memory savings.
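
A minimal mixed-precision training step with torch.cuda.amp; the model, optimizer, and data are placeholders.

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 1024, device="cuda")
with torch.cuda.amp.autocast():      # forward pass runs largely in half precision
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()        # loss scaling avoids fp16 gradient underflow
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()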

IMPORTANT NOTE:

Mixed precision is a key component of training speed and a big part of how GPUs are designed nowadays. If you are wondering why big companies are “wasting” money on expensive GPUs that seem barely faster than your at-home GPU, watch this video on why that impression is wrong.

Check out the graphic, and hopefully you’ll connect the dots on these “absurd” prices from Nvidia.

Combination of Methods

Out-of-Core Training: Paper

Methods: Optimizer offloading, Gradient offloading, Parameter offloading

This is a more extreme form of model offloading where the entire Model is stored on disk, and only small parts are loaded into GPU memory as needed. This allows for the training of huge models that would not fit into the memory of a single machine, even with other offloading techniques. However, it requires a fast storage system to minimize the impact on training speed.
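
A toy sketch of the out-of-core idea for inference, assuming each layer’s weights have been written to their own file on disk and only one layer at a time is materialized on the GPU; the file names and model are placeholders, and real systems stream and prefetch far more efficiently.

import torch
import torch.nn as nn

# One-time setup: write each layer's weights to its own file on disk.
layers = [nn.Linear(4096, 4096) for _ in range(24)]
for i, layer in enumerate(layers):
    torch.save(layer.state_dict(), f"layer_{i}.pt")

@torch.no_grad()
def out_of_core_forward(x):
    x = x.cuda()
    layer = nn.Linear(4096, 4096).cuda()  # a single reusable "slot" on the GPU
    for i in range(24):
        layer.load_state_dict(torch.load(f"layer_{i}.pt"))  # pull weights from disk
        x = layer(x)
    return x

out = out_of_core_forward(torch.randn(8, 4096))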

ZeRO Offloading: Paper

Methods: Optimizer offloading, Gradient offloading, Parameter offloading

Developed by Microsoft, ZeRO is a memory optimization technique that reduces the memory footprint of model parameters, gradients, and optimizer states. It does this by partitioning these elements across data parallel processes, effectively allowing the model size to scale linearly with the number of GPUs.
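
A rough sketch of a DeepSpeed ZeRO stage-3 setup with optimizer and parameter offloading to CPU. The model and config values here are placeholders and would need tuning for a real run (launched with the deepspeed CLI).

import torch.nn as nn
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {
        "stage": 3,                              # partition params, grads, and optimizer states
        "offload_optimizer": {"device": "cpu"},  # optimizer states live in CPU memory
        "offload_param": {"device": "cpu"},      # parameters are fetched to the GPU on demand
    },
}

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)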

Sharding: Paper

Methods: Optimizer offloading, Gradient offloading, Parameter offloading, and data parallel

For truly massive models and datasets, we have sharding. We split the model and the data across multiple machines, each responsible for different parts of the model and different subsets of the data. This is among the more recent methods shown here and is inspired by DeepSpeed’s offload mechanisms.
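
A minimal sketch using PyTorch’s FullyShardedDataParallel (FSDP), which shards parameters, gradients, and optimizer state across ranks. It assumes a torchrun launch, and the model is a placeholder.

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda(local_rank)
model = FSDP(model)   # each rank holds only a shard of the parameters

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device=local_rank)
loss = model(x).pow(2).mean()
loss.backward()       # gradients are reduce-scattered across the shards
optimizer.step()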
