Increase CUDA memory with Sysmem Fallback Policy

usamakenway
3 min read · Nov 2, 2023

Nvidia recently released driver 536.40, which enables the Sysmem Fallback Policy. This allows developers to use CPU RAM as overflow for GPU VRAM when running out of GPU memory: the driver shares CPU RAM with GPU VRAM as one unified CUDA memory pool. This is useful for testing inference models, for some training scenarios, and for running Stable Diffusion and LLM models.

Usefulness in training and Inference testing

This is useful for testing and development where a small VRAM boost is needed to avoid crashes, such as when training large AI models whose memory usage fluctuates. It avoids out-of-memory errors, though things may run a little slower while the shared VRAM is in use.

For example: if my GPU has 10 GB of VRAM and training allocation reaches 9.5 GB at a certain batch size, fluctuating between 9 GB and 10.5 GB, this feature will save you.

Performance takes a hit when relying heavily on CPU RAM versus dedicated GPU VRAM, so this feature is best used sparingly, only when you are about to exceed available GPU VRAM.

Previously, libraries like Accelerate offloaded layers to CPU and SSD once GPU memory was exhausted, with the help of device_map, but at a performance cost. Nvidia's shared-memory approach is faster while still providing an overflow buffer.

The accelerate library's device_map offloading
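For reference, here is a minimal sketch of that kind of offloading (an assumed example, not the article's exact code; it requires the transformers and accelerate packages, and uses the model ID mentioned later in this article):

```python
from transformers import AutoModelForCausalLM


def load_offloaded(model_id: str = "meta-llama/Llama-2-7b-chat-hf"):
    # device_map="auto" lets Accelerate place layers on the GPU first,
    # then in CPU RAM, then on disk; offload_folder holds any weights
    # that spill all the way to SSD.
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        offload_folder="offload",
    )
```

Each hop down that hierarchy (GPU to CPU RAM to SSD) costs latency, which is the performance penalty the sysmem fallback avoids for small overflows.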

Without Sysmem Fallback

Here is an example of trying to load a tensor that would require 13 GB of memory. In this case we get the CUDA out-of-memory error.

Memory error with [Prefer No Sysmem Fallback]
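The screenshot's code is not reproduced here, but a minimal PyTorch sketch of the same failure (an assumed example, not the author's exact snippet) might look like this:

```python
import torch

GB = 1024 ** 3
# A float32 tensor with this many elements occupies roughly 13 GB.
n_elements = 13 * GB // 4

if torch.cuda.is_available():
    try:
        # With [Prefer No Sysmem Fallback] on a 10-12 GB card this raises
        # an out-of-memory error; with fallback enabled it succeeds and
        # the excess spills into shared CPU RAM.
        x = torch.empty(n_elements, dtype=torch.float32, device="cuda")
        print(f"Allocated {torch.cuda.memory_allocated() / GB:.1f} GB")
    except torch.cuda.OutOfMemoryError as err:
        print(f"CUDA out of memory: {err}")
```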

It's the same as loading an LLM without a device map, such as:
AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

Setting for Sysmem Fallback

For this you need driver 536.40 or newer. If your older GPU did not get support for this driver or a newer one, you cannot use this feature, but Nvidia GPUs from the last few years all support it.

First update the driver and open Nvidia Control Panel.

If you set this in Global Settings, all apps, including games, will use it. I would recommend going to Program Settings instead and selecting the specific python.exe that belongs to your virtual environment: "venv/Scripts/python.exe".
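If you are not sure which python.exe your environment actually uses, you can print the interpreter path from inside the activated environment and select that file in Program Settings:

```python
import sys

# Run this from inside the activated virtual environment; it prints the
# exact interpreter path to pick in Nvidia Control Panel.
print(sys.executable)
```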

Setting it to [Prefer Sysmem Fallback]

Results

Now the same code runs on my RTX 3060: it required 13 GB, and it is allocating 13 GB of VRAM.

No memory error with [Prefer Sysmem Fallback], and you can see the allocated GPU memory.
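You can also cross-check the allocation from Python rather than the control panel (a sketch assuming PyTorch; it just reads the device and allocator counters). With the fallback active, the allocated figure can exceed what would fit in physical VRAM alone:

```python
import torch

GB = 1024 ** 3
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()  # bytes on the current device
    print(f"Physical VRAM:        {total / GB:.1f} GB")
    print(f"Allocated by PyTorch: {torch.cuda.memory_allocated() / GB:.1f} GB")
```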

Conclusion

Overall, Sysmem Fallback Policy gives a flexible memory boost for development and avoids crashes, but dedicated GPU VRAM remains ideal for gaming, production training and inference.

My contacts:
https://www.linkedin.com/in/usamakenway/
https://github.com/UsamaKenway
