Astounding differences in the price of Cloud GPU instances

Jagane Sundar
InfinStor
Nov 4, 2020

For a variety of reasons, I was attempting to quantify the cost of ML training. It quickly became apparent that the most significant line item was going to be the cost of renting GPU instances in the cloud. What surprised me more was the difference in the prices that the three major cloud vendors charge for GPU instances.

Here is a quick look at the price of a specific class of GPU instance: 4 Nvidia V100 GPUs with 64GB of total GPU memory, a few dozen vCPUs, and a few hundred GB of main memory (the exact counts differ by cloud). In the AWS cloud, the closest instance type is p3.8xlarge. In Azure, it goes by the name Standard_NC24s_v3. In Google Cloud, we use instance type n2-standard-64 with 4 NVIDIA Tesla V100 GPUs added to it. We use a US East Coast region in each cloud.

AWS: p3.8xlarge instance in us-east-1 running Linux

On Demand: $12.24 per hour

Spot: $3.672 per hour (at about 10 PM Pacific on Nov 3, 2020)

Azure: Standard_NC24s_v3 instance in East US

Pay as you go: $12.24 per hour

Spot: $1.224 per hour

Google Cloud: n2-standard-64 with 4 NVIDIA Tesla V100 GPUs attached

Pay as you go: $3.107776 + (4 * $2.48) = $13.027776 per hour

Preemptible: $0.75328 + (4 * $0.74) = $3.71328 per hour

As the numbers above show, on-demand (pay-as-you-go) pricing is roughly $12 to $13 per hour in all three clouds. Spot pricing is much lower and varies significantly between clouds: $3.672 per hour in AWS, $3.71328 per hour in Google Cloud, and just $1.224 per hour in Azure.
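To make the comparison concrete, here is a minimal Python sketch that totals the cost of a hypothetical 100-hour training job at the rates above. The job length, and the assumption that the quoted spot prices hold for the entire run, are illustrative only.

```python
# Hourly rates (USD) from the comparison above. Spot prices are a
# snapshot from Nov 2020 and fluctuate, especially on AWS.
RATES = {
    "AWS p3.8xlarge":              {"on_demand": 12.24,     "spot": 3.672},
    "Azure Standard_NC24s_v3":     {"on_demand": 12.24,     "spot": 1.224},
    "GCP n2-standard-64 + 4xV100": {"on_demand": 13.027776, "spot": 3.71328},
}

HOURS = 100  # hypothetical length of one training job

for name, rate in RATES.items():
    on_demand = rate["on_demand"] * HOURS
    spot = rate["spot"] * HOURS
    print(f"{name}: on-demand ${on_demand:,.2f}, spot ${spot:,.2f}, "
          f"savings {1 - spot / on_demand:.0%}")
```

At these rates, the same 100-hour job costs $1,224 on demand in Azure but only $122.40 on Azure spot instances, a 90% saving.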

It is clear from the above that ML work must be done on spot instances to keep costs down. This turns out to be quite challenging. While most deep learning packages such as PyTorch and TensorFlow include facilities for checkpointing during neural network training, the same facilities are not available for pre-processing steps. For example, it is common in medical imaging to convert thousands of images from DICOM to NIfTI format. If that conversion step is interrupted, most ML pipelines end up restarting the whole step, thereby wasting resources.
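One way to make such a pre-processing step survive spot interruptions is to make it resumable: write each output atomically and skip files that are already converted when the job restarts. The sketch below is illustrative; convert_dicom_to_nifti is a hypothetical placeholder for a real converter (for example, one built on the dicom2nifti package).

```python
from pathlib import Path

def convert_dicom_to_nifti(src: Path, dst: Path) -> None:
    """Hypothetical converter stub; swap in a real DICOM-to-NIfTI tool."""
    dst.write_bytes(src.read_bytes())  # placeholder for the real conversion

def convert_all(dicom_dir: str, nifti_dir: str) -> None:
    out = Path(nifti_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in sorted(Path(dicom_dir).glob("*.dcm")):
        dst = out / (src.stem + ".nii.gz")
        if dst.exists():
            continue  # converted before the spot instance was reclaimed
        tmp = dst.with_suffix(".tmp")  # write to a temp file first...
        convert_dicom_to_nifti(src, tmp)
        tmp.rename(dst)  # ...so dst exists only once conversion completed

convert_all("raw_dicom", "nifti_out")  # illustrative directory names
```

When the spot instance is reclaimed mid-run, relaunching the same job picks up where the previous run left off instead of reconverting everything.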

Spot pricing appears to be relatively constant in Azure and Google, whereas in AWS it is actually tied to demand. This points to the relative immaturity of the spot instance offerings in Google and Azure. In time, I expect these two clouds to catch up with AWS, with spot pricing that tracks cloud utilization more closely.

The takeaway is that machine learning work should be performed on spot instances as much as possible, and pricing across the clouds should be compared carefully before each training run. Today's ML tools do not provide the capability to run workloads on spot instances, or to move workloads between clouds. Considering the savings, these two capabilities are essential for keeping the cost of ML projects down.
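On AWS, at least, the current spot price can be checked programmatically before launching a run. Here is a minimal sketch using boto3's describe_spot_price_history API; Azure and Google published their spot and preemptible prices as fixed rates at the time, so a price-history query is an AWS-specific step.

```python
import boto3

# Fetch recent spot prices for p3.8xlarge Linux instances in us-east-1.
ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_spot_price_history(
    InstanceTypes=["p3.8xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    MaxResults=10,
)
for entry in resp["SpotPriceHistory"]:
    print(entry["AvailabilityZone"], entry["SpotPrice"], entry["Timestamp"])
```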

Matt Bornstein of a16z has interesting things to say on this topic as well. Check out his posts here — https://a16z.com/author/matt-bornstein/
