How to reduce your ML model inference costs on Google Cloud

Reducing your ML inference costs can be surprisingly easy. Here are four steps I regularly recommend to companies.

Sascha Heyer
Google Cloud - Community

--

In this article, I cover the most common ways to save costs. Some are just common sense, some make clever use of another service, and others require more technical knowledge.

Cloud Run instead of Vertex AI Endpoints

One of the major disadvantages of Vertex AI Endpoints is that they cannot scale down to zero: at least one endpoint node is always up and running.

The cost for the smallest machine type, an n1-standard-4, is $0.218499 (USD) per hour. For a model that runs the whole month (730 hours), this sums up to approx. $160 (USD) per month, and you have to pay it even if you don’t get any prediction requests. It’s ridiculous. If you have just one model up and running, this is fine, but the costs add up over time as you deploy more and more models. Also keep in mind that the larger the machine, the larger the costs.
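A quick back-of-the-envelope calculation makes this concrete. The sketch below uses the n1-standard-4 price quoted above; plug in the hourly price of whatever machine type you actually deploy, and check the current Vertex AI pricing page, since list prices change.

# Rough monthly cost of an always-on endpoint node:
# hourly node price x hours in a month.
hours_per_month = 730
price_per_hour = 0.218499  # n1-standard-4, USD (price quoted above)

monthly_cost = price_per_hour * hours_per_month
print(f"${monthly_cost:,.2f} per month")  # -> $159.50 per month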

If your model does not require a GPU, you might consider deploying it to Cloud Run instead of Vertex AI Endpoints. Cloud Run can scale down to zero.
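As a minimal sketch of what that looks like: a small Flask app that loads your model at startup and serves predictions over HTTP. The model file (model.pkl), the scikit-learn-style predict() call, and the request shape are assumptions for illustration; swap in however you load and invoke your own model.

import os
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical model file baked into the container image;
# loaded once at startup, not per request.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"instances": [[1.0, 2.0], ...]}
    instances = request.get_json()["instances"]
    predictions = model.predict(instances)
    return jsonify({"predictions": list(predictions)})

if __name__ == "__main__":
    # Cloud Run tells the container which port to listen on
    # via the PORT environment variable.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))

Containerize this with a standard Dockerfile and deploy it with gcloud run deploy. Cloud Run’s minimum instance count defaults to 0, so while no prediction requests arrive, no instances run and you pay nothing.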
