Increase CUDA memory with Sysmem Fallback Policy
Nvidia recently released driver 536.40, which enables the Sysmem Fallback Policy. This lets applications spill over into CPU RAM when GPU VRAM runs out: the driver presents system RAM and VRAM together as one unified CUDA memory pool. It is useful for testing inference models, for some training scenarios, and for running Stable Diffusion and LLM workloads.
Usefulness in training and inference testing
This is most useful in testing and development, where a small VRAM headroom is needed to avoid crashes, such as when training large AI models whose memory usage fluctuates. It prevents out-of-memory errors, though execution slows down whenever shared system memory is in use.
For example: if your GPU has 10 GB of VRAM and training with a certain batch size allocates around 9.5 GB, fluctuating between 9 GB and 10.5 GB, this feature will save you from a crash.
Performance takes a hit when relying heavily on CPU RAM instead of dedicated GPU VRAM, so this feature is best used sparingly, only when you are about to exceed the available VRAM.
Previously, libraries like Accelerate offloaded layers to CPU RAM and SSD once GPU memory was exhausted, via the device_map argument, but at a performance cost. Nvidia's shared-memory approach is faster while still providing an overflow buffer.
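For comparison, here is a minimal sketch of that Accelerate-style offloading via device_map (assuming the transformers library is installed; the offload folder name is arbitrary):

```python
from transformers import AutoModelForCausalLM

def load_with_offload(model_id: str = "meta-llama/Llama-2-7b-chat-hf"):
    # device_map="auto" lets Accelerate place layers on the GPU first,
    # then in CPU RAM, then on disk, as each device fills up.
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        offload_folder="offload",  # layers that fit nowhere else go here
    )
```

Layers served from CPU or disk are much slower than VRAM-resident ones, which is why the driver-level shared-memory pool is attractive.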
Without Sysmem Fallback
Here is an example of trying to allocate a tensor that requires 13 GB of memory. On a GPU with less VRAM than that, we get a CUDA out-of-memory error.
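As a minimal sketch (assuming PyTorch; the 13 GB figure matches the example above), the failing allocation looks like this:

```python
import torch

GIB = 1024 ** 3
target_bytes = 13 * GIB          # the ~13 GB allocation from the example
n_elements = target_bytes // 4   # float32 takes 4 bytes per element

if torch.cuda.is_available():
    # On a GPU with less than 13 GB of free VRAM and Sysmem Fallback
    # disabled, this line raises torch.cuda.OutOfMemoryError.
    x = torch.empty(n_elements, dtype=torch.float32, device="cuda")
    print(f"Allocated {x.element_size() * x.nelement() / GIB:.1f} GiB")
else:
    print(f"CUDA not available; would request {target_bytes / GIB:.1f} GiB")
```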
It is the same as loading an LLM without a device map, as in:
AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
Setting for Sysmem Fallback
For this you need driver 536.40 or newer. If your GPU is too old to receive this driver, you cannot use the feature, but Nvidia GPUs from the last few years are supported.
First update the driver and open Nvidia Control Panel.
If you set this in Global Settings, all applications, including games, will use it. I recommend going to Program Settings instead and selecting the specific python.exe that belongs to your virtual environment, "venv/Scripts/python.exe".
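To find the exact python.exe to select in Program Settings, you can print the interpreter path from inside the activated environment:

```python
import sys

# Run this with the virtual environment activated; the printed path is
# the python.exe to add under Program Settings in the Nvidia Control Panel.
print(sys.executable)
```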
Results
Now the same code runs successfully on my RTX 3060: it required 13 GB, and 13 GB was allocated, with the overflow beyond physical VRAM served from shared system memory.
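One way to confirm the fallback is working (a sketch, assuming PyTorch) is to compare what PyTorch has allocated against the card's physical VRAM:

```python
import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    allocated = torch.cuda.memory_allocated()
    print(f"Physical VRAM: {total_bytes / 1024**3:.1f} GiB")
    print(f"Allocated by PyTorch: {allocated / 1024**3:.1f} GiB")
    # If `allocated` exceeds physical VRAM and no OOM was raised, the
    # overflow is being served from shared system memory.
else:
    allocated = 0  # no GPU present; nothing allocated
```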
Conclusion
Overall, the Sysmem Fallback Policy gives a flexible memory boost for development and avoids crashes, but dedicated GPU VRAM remains ideal for gaming, production training, and inference.
My contacts:
https://www.linkedin.com/in/usamakenway/
https://github.com/UsamaKenway