Cloud GPU instances with the largest VRAM 2022

Overcome the out of memory error with a larger GPU

Aleix López Pascual
3 min read · Oct 14, 2022

RuntimeError: CUDA out of memory

If you’re reading this, you have probably encountered this error before. So did I last week, while trying to fine-tune a Stable Diffusion model to generate artificial food images.

In recent months, we have seen a growing demand for larger GPU memory capacities, driven mainly by ever-larger deep learning models and datasets: GPT-3, DALL-E 2, Stable Diffusion… As a consequence, GPU memory has become a major bottleneck for many practitioners. There are certainly ways to reduce memory consumption, such as reducing the batch size or training at lower numerical precision (a quick sketch of both follows below). But sometimes, you just need more memory.
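If you want to try squeezing into your current GPU first, here is a minimal PyTorch sketch combining both tricks. The model, batch size, and hyperparameters are placeholders for illustration, not a recipe:

```python
import torch

# Minimal sketch of the two memory-saving levers mentioned above
# (placeholder model and data; adapt to your own training loop).
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()  # keeps fp16 gradients numerically stable

# Lever 1: a smaller batch size directly reduces activation memory.
batch = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

optimizer.zero_grad()
# Lever 2: mixed precision runs the forward pass in float16 where safe.
with torch.cuda.amp.autocast():
    loss = torch.nn.functional.mse_loss(model(batch), target)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```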

In this article, I will briefly describe the best cloud GPU instances you can use to solve this issue.

AWS

  • Catalog of Amazon EC2 instances: here. The ones relevant for deep learning are under the Accelerated Computing section.
  • Do not confuse Instance Memory (RAM) with GPU Memory (VRAM). The memory issue that causes the out of memory error is related to GPU memory.
  • The GPU memory shown in the AWS EC2 catalog is the total GPU memory: it sums the memory of all GPUs in the instance. For example, the p3.16xlarge instance lists 128 GB of GPU memory across 8 GPUs, so each GPU has 128 GB / 8 = 16 GB. It is this per-GPU memory that determines the out of memory error. In other words, upgrading from p3.2xlarge to p3.16xlarge will not solve it: both instances use NVIDIA V100 (16 GB) GPUs. (The snippet after this list shows how to verify the per-GPU memory on a running instance.)
  • At the time of writing (2022 Q4), the largest per-GPU memory you can get on AWS is 40 GB (NVIDIA A100), from a p4d.24xlarge instance (8 GPUs) at $32.7726 per hour.
  • More information about choosing the right AWS EC2 instance can be found here (source: “Choosing the right GPU for deep learning on AWS” by Shashank Prasanna, Towards Data Science).
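To double-check the per-GPU figure on whatever instance you launch, a quick check with PyTorch (assuming it is installed) does the trick:

```python
import torch

# Print the memory of each individual GPU: this per-GPU number,
# not the aggregate catalog figure, is what the OOM error depends on.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
```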

GCP

  • Catalog of Google Cloud virtual machine instances: here (accelerator-optimized section).
  • Largest GPU memory: 80 GB (NVIDIA A100), available in the A2 Ultra machine series.
  • Single-GPU instance with 80 GB at $5.0688 per hour.
  • 8-GPU instance with 80 GB per GPU at $40.5504 per hour.

Paperspace

  • Catalog of Paperspace GPU instances: here.
  • Largest GPU memory: 80 GB (NVIDIA A100; the A100 ships only in 40 GB and 80 GB variants).
  • Single-GPU instance with 80 GB at $3.09 per hour.
  • 8-GPU instance with 80 GB per GPU at $24.72 per hour.

DataCrunch.io

  • Catalog of DataCrunch.io GPU instances: here.
  • Largest GPU memory: 80 GB (NVIDIA A100).
  • Single-GPU instance with 80 GB at $1.85 per hour.
  • 8-GPU instance with 80 GB per GPU at $14.80 per hour.

Lambda Labs

  • Catalog of Lambda Labs GPU instances: here.
  • Largest GPU memory: 48 GB (NVIDIA RTX A6000).
  • Single-GPU instance with 48 GB at $0.80 per hour.
  • 4-GPU instance with 48 GB per GPU at $3.20 per hour.

Once you have chosen your instance and started training your model, I encourage you to open a terminal within your virtual machine and run `nvidia-smi -l 1`. This refreshes the report every second and gives you real-time insight into GPU utilization and memory usage. If you prefer monitoring from inside the training loop instead, see the helper below.
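As an alternative to a separate terminal, PyTorch exposes the same memory numbers programmatically. Here is a minimal helper (the function name and print format are my own choices, not part of any particular library):

```python
import torch

def log_gpu_memory(step: int) -> None:
    # Memory PyTorch has handed out to tensors vs. memory it has
    # reserved from the CUDA driver; both reported in GB for GPU 0.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"step {step}: {allocated:.2f} GB allocated, {reserved:.2f} GB reserved")
```

Calling it every few hundred training steps makes it easy to spot creeping memory growth before it turns into the dreaded RuntimeError.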

And that’s it! Thank you for reading! I hope you enjoyed this article. If you’d like, add me on LinkedIn.

