Manage CUDA memory: the ultimate memory management strategy with PyTorch.

Soumen Sardar
6 min read · Jun 11, 2023



Section 1

Introduction

PyTorch, a popular deep learning framework, provides seamless integration with CUDA, allowing users to leverage the power of GPUs for accelerated computations. However, efficient memory management is crucial when working with large-scale models and datasets. In this article, we will explore PyTorch’s CUDA memory management options, cache cleaning methods, and library support to optimize memory usage and prevent potential memory-related issues.

Problem Statement

If you are reading this post, you have probably seen this error before:

CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 10.76 GiB total capacity; 4.29 GiB already allocated; 10.12 MiB free; 4.46 GiB reserved in total by PyTorch)

In the next few paragraphs, I will explain the various solutions to this problem and the pros and cons of each strategy. In this blog, though, I will focus on Point 7, Memory Cleanup. In my experience, poor handling of CUDA memory and the allocator cache can lead to out-of-memory errors even though you have sufficient VRAM for your batch of inputs. You can skip ahead to Section 2.

Poor handling of CUDA memory and the allocator cache can lead to out-of-memory errors even though you have sufficient GPU RAM.

The solutions most commonly suggested are as follows:

  1. Reduce Batch Size: Decrease the batch size to fit the model within available GPU memory. [Recommended]
  2. Limit Model Complexity: Reduce the model’s complexity by decreasing layers or hidden size. [NOT Recommended]
  3. Gradient Accumulation: Accumulate gradients over mini-batches to train with larger effective batch sizes. [Recommended]
  4. Data Augmentation and Loading: Apply data augmentation techniques during loading to generate augmented data on-the-fly. [Recommended]
  5. Memory Optimization Libraries: Utilize libraries like Apex or PyTorch Lightning for memory optimization techniques. [Highly Recommended]
  6. Gradient Clipping: Limit gradient magnitudes to prevent exploding gradients and stabilize training. [Recommended]
  7. Memory Cleanup: Delete unnecessary tensors, trigger PyTorch’s garbage collector, and clear the GPU cache. [Highly Recommended]
  8. Upgrade GPU or Use Multiple GPUs: Upgrade to a GPU with higher memory or distribute workload across multiple GPUs. [Slightly Recommended]

Here is a list of the pros and cons of the above approaches:

  1. Reduce Batch Size:
  • Pros: Lets training fit within limited memory and shortens each iteration (though an epoch then needs more iterations).
  • Cons: May lead to slower convergence, loss of parallelism, and noisier gradient estimates.

  2. Limit Model Complexity:
  • Pros: Reduces the memory footprint and enables training on GPUs with lower memory capacity.
  • Cons: May sacrifice model performance and limit the model's ability to capture complex patterns.

  3. Gradient Accumulation:
  • Pros: Increases the effective batch size and enables training larger models with limited memory (a sketch combining this with mixed precision and gradient clipping follows this list).
  • Cons: Slower training, since gradients are accumulated over multiple mini-batches before each update.

  4. Data Augmentation and Loading:
  • Pros: Increases the effective amount of training data and reduces memory usage by generating augmented data on-the-fly.
  • Cons: Additional computational overhead for data augmentation during training.

  5. Memory Optimization Libraries:
  • Pros: Provide automatic memory optimization techniques and simplify memory management.
  • Cons: May require additional library dependencies and come with a learning curve.

  6. Gradient Clipping:
  • Pros: Prevents exploding gradients and stabilizes training.
  • Cons: May affect the model's ability to converge or discard useful gradient information.

  7. Memory Cleanup:
  • Pros: Frees up memory by deleting unnecessary tensors and clearing the GPU cache.
  • Cons: Requires manual intervention and careful management, with a potential performance impact.

  8. Upgrade GPU or Use Multiple GPUs:
  • Pros: Enables training larger models thanks to higher memory capacity and improved parallelism.
  • Cons: Costly hardware upgrade, limited availability of GPU resources, and increased complexity of a multi-GPU setup.
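To make options 3, 5, and 6 concrete, here is a minimal sketch of a training loop that combines gradient accumulation, gradient clipping, and PyTorch's built-in mixed precision (torch.cuda.amp, one alternative to Apex or Lightning). The names model, dataloader, optimizer, criterion, epochs and the values accum_steps and max_norm are placeholders for illustration, not something from the original post.

import torch

# assumed to exist elsewhere: model, dataloader, optimizer, criterion, epochs
accum_steps = 4                        # accumulate gradients over 4 mini-batches (illustrative)
scaler = torch.cuda.amp.GradScaler()   # loss scaling for mixed-precision training

for epoch_idx in range(epochs):
    optimizer.zero_grad()
    for step, (x_batch, y_batch) in enumerate(dataloader):
        x_batch, y_batch = x_batch.cuda(), y_batch.cuda()
        # forward pass in mixed precision to save memory
        with torch.cuda.amp.autocast():
            y_hat = model(x_batch)
            loss = criterion(y_hat, y_batch) / accum_steps
        # accumulate scaled gradients
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:
            # unscale before clipping so the threshold applies to the true gradients
            scaler.unscale_(optimizer)
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

Dividing the loss by accum_steps keeps the accumulated gradient comparable in magnitude to that of a single large batch.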

Section 2

Looking at these options for solving memory errors, we can see that many of them require compromising the data, the model, or both. A few require more time, and a few cost more money. But here I want to discuss a situation where we actually have a decent GPU, yet training crashes because of poor memory handling.

While I was training on an RTX 3090 (24 GB) GPU, my training was crashing after a few epochs, not instantaneously, which is weird. It also suggests that I was not using the GPU RAM properly.

There are mainly two kinds of out-of-memory errors: (1) an instantaneous crash, or (2) a crash after a few epochs.

Fortunately, when training crashes after a few epochs, we can solve the problem by handling our GPU memory properly.

Cache Cleaning

To release unused cached memory, we can call torch.cuda.empty_cache(). But is that all you need? The recommended way is to delete the local variables (using del) first and then call torch.cuda.empty_cache().

Delete local variables first and then call torch.cuda.empty_cache() to effectively clean the cached GPU VRAM.

Example:

Let us understand this with an example of a simple training loop in PyTorch.

# run training loop
for epoch_idx in range(epochs):
    # run one epoch
    for data in dataloader:
        # get torch.Tensor data
        x_batch, y_batch = data
        # move to CUDA (.cuda() is not in-place, so reassign)
        x_batch = x_batch.cuda()
        y_batch = y_batch.cuda()
        # reset gradients from the previous step
        optimizer.zero_grad()
        # run forward pass
        y_hat = nn.parallel.data_parallel(model, x_batch)
        # compute loss
        err = compute_loss(y_batch, y_hat)
        # backpropagate error
        err.backward()
        # update model parameters
        optimizer.step()

In the above code, x_batch, y_batch, y_hat, and err are local variables. After training on a batch, we can safely delete these references.

# run training loop
for epoch_idx in range(epochs):
    # run one epoch
    for data in dataloader:
        # get torch.Tensor data
        x_batch, y_batch = data
        ...
        # update model parameters
        optimizer.step()
        # delete locals
        del x_batch
        del y_batch
        del err
        del y_hat
        # collect the Python garbage so the deleted tensors are actually released
        gc.collect()
        # then clear the CUDA cache
        torch.cuda.empty_cache()
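If you want to confirm that the cleanup is having an effect, PyTorch ships a few memory inspection helpers; here is a minimal sketch (the printed values are in bytes):

import torch

# a quick sanity check after the cleanup above
print(torch.cuda.memory_allocated())   # memory occupied by live tensors
print(torch.cuda.memory_reserved())    # memory held by the caching allocator
print(torch.cuda.memory_summary())     # detailed, human-readable report

Roughly speaking, memory_allocated() should drop once the deleted tensors are garbage-collected, while memory_reserved() should drop after torch.cuda.empty_cache() returns cached blocks to the driver.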

Section 3

Controlling the CUDA Caching Allocator

The use of a caching allocator can interfere with memory-checking tools such as cuda-memcheck. In that case, disable caching by setting PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment before debugging.

To debug memory errors using cuda-memcheck, set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching.

The behavior of the caching allocator can be controlled via the environment variable PYTORCH_CUDA_ALLOC_CONF. The format is PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>... Two of the important available options are:

Avoid Fragmentation

1. max_split_size_mb: prevents the native allocator from splitting blocks larger than this size (in MB). This can reduce fragmentation and may allow some borderline workloads to complete without running out of memory. Performance costs can range from ‘zero’ to ‘substantial’ depending on allocation patterns. The default value is unlimited, i.e. all blocks can be split.

Clean Before It's Too Late

2. garbage_collection_threshold: helps actively reclaim unused GPU memory to avoid triggering expensive sync-and-reclaim-all operations (release_cached_blocks), which can be unfavorable to latency-critical GPU applications (e.g., servers). Upon setting this threshold (e.g., 0.8), the allocator will start reclaiming GPU memory blocks if GPU memory usage exceeds the threshold (i.e., 80% of the total memory allotted to the GPU application). The algorithm prefers to free old and unused blocks first, to avoid freeing blocks that are actively being reused. The threshold value should be greater than 0.0 and less than 1.0.
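As an illustration, here is one way to set this configuration from Python. The values 512 and 0.8 are placeholder choices, not recommendations, and the variable must be set before the first CUDA allocation, ideally before importing torch:

import os

# illustrative values only; tune max_split_size_mb and the threshold for your workload
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512,garbage_collection_threshold:0.8"

import torch  # import torch only after the environment variable is set

PYTORCH_NO_CUDA_MEMORY_CACHING can be set the same way when debugging with cuda-memcheck.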

For more details, please refer to CUDA semantics — PyTorch 2.0 documentation.

Summary:

max_split_size_mb: Tells the caching allocator not to split cached blocks larger than this size (in MB). For example, with max_split_size_mb:500, blocks larger than 500 MB are kept whole rather than being split up to serve smaller allocations, which helps reduce fragmentation.

garbage_collection_threshold: Tells the caching allocator when to start reclaiming unused cached blocks, so that the expensive sync-and-reclaim-all path is avoided. Reclaiming too eagerly slows down training, while never reclaiming may cause a memory error even though you have sufficient memory. For example, with garbage_collection_threshold:0.75, the allocator starts freeing old, unused blocks once GPU memory usage exceeds 75% of the memory allotted to the process, making those blocks available for further use.

How to calculate block size?

This is another story and cannot be covered in this blog post. Please let me know if you need the formula to calculate the memory.

Thanks Given

This blog was made possible by ChatGPT, PyTorch, and StackOverflow.

Thank you for giving your precious time to this post. I hope it is helpful and makes our lives a little easier. If you made it this far and found this article helpful, please consider leaving a like and commenting with your thoughts. Please share it if you like it.

Regards,

Soumen Sardar


Soumen Sardar is Data Science Lead, AI at Smiths Detection. He holds a B.Tech in CSE, an MS in Data Science from LJMU, a PG ML certification from Stanford Online, and a Diploma in DL from IIITB.