How to Deploy your AutoML Model in a Cost-effective Way

Juri Sarbach
Mar 24, 2020 · 8 min read


TL;DR: If your Google Cloud AutoML Vision deployment is underutilised, consider moving it to Cloud Run.

Google Cloud’s AutoML is a fine thing. If you have a standard AI problem like image classification, object detection or entity extraction, chances are that AutoML will solve it with very little effort from your side. Just import your own dataset and retrain a state-of-the-art model with it. After a couple of hours of training, you can use your customised model right away and make predictions via an automatically deployed endpoint.

It’s a great shortcut if you need fast results without having to bother with all the data wrangling or with finding suitable neural network architectures first. As such, it very much feels like Data Science as a Service. With AutoML, Google has definitely commoditised AI, at least with respect to common computer vision or natural language processing problems.

Recently, I played around with it to build a coffee classifier web app. Take a picture of your cup of coffee and it will tell you whether it’s a cappuccino, an espresso or an americano (no flat white, sorry hipsters). Admittedly not the most useful application ever, but sufficient to show that you can indeed get a pretty accurate model for your specific use case, including an endpoint that serves your model for immediate use. It was supposed to be a fun application. But then, a glance at the Google Cloud billing report took the fun out of it quite a bit…

AutoML, as in Auto Money Leak?

The main disadvantage of deploying a model on AutoML Vision is that the infrastructure behind it doesn’t automatically scale to zero — unlike what we’re used to from serverless Google Cloud products. In fact, it doesn’t auto-scale at all. You have to decide how many nodes you want to provision and scale your deployment manually to cope with the request volume. In other words, you provision fixed capacity with at least one node running 24/7 — and with it the cost meter, regardless of usage. At current pricing, my coffee classifier costs USD 30 per node per day, or USD 900 per month.

That may be fine if you have an application with a substantial user base and a revenue model that pays your Google Cloud bill. But what if the application is used only irregularly (like my coffee classifier), or if you just want to validate a new product with a few users? The documentation laconically mentions that you should undeploy your model to save costs. With an unpredictable usage pattern, however, this is hardly practical.

Cloud Run, take over please

Fortunately, we have all the tools at hand to mitigate this shortcoming. So let’s replace the AutoML deployment with a solution that auto-scales with the request volume (down to zero if necessary) and thus strains your purse only in proportion to actual usage. Time for Cloud Run!

Here’s what we are going to do:

  • After training the model on AutoML Vision, export it instead of deploying it right away;
  • download the exported model and add it to a pre-built Docker container;
  • deploy the image on Cloud Run;
  • finally, send requests to the API from a web app on App Engine to get predictions.

This may sound like quite a bit of work, but actually, it’s done in less than 10 minutes.

Exporting the model

In order to be able to export the model later, choose Edge instead of Cloud hosted when launching the training job. (You can still choose to deploy the model on AutoML later if you wish.) Also, select Higher accuracy as the goal because we will have enough resources on Cloud Run.

Once the model has finished training, click on the Container card in the TEST & USE tab, select the destination on Cloud Storage (GCS) where to export the SavedModel and click on the EXPORT button.

Moments later, the SavedModel is stored on GCS.

Next, let’s create a deployment that serves this model so that we can get predictions via HTTP request.

Deploying to Cloud Run

As in the documentation, we use the gcr.io/automl-vision-ondevice/gcloud-container-1.12.0:latest Docker image. In contrast to the documentation, though, we copy the SavedModel into the container instead of mounting a volume and deploy it to Cloud Run instead of running it locally. To automate all of this and to get an easily repeatable process, we use Cloud Build. As a result, deploying a model or a new model version is accomplished with just one simple gcloud builds submit ... command.

First things first: IAM

Before we start, let’s get the IAM permissions right so that Cloud Build can deploy to Cloud Run. First, add the Cloud Run Admin role to the Cloud Build service account (project-number@cloudbuild.gserviceaccount.com). Then, select your Cloud Run runtime service account. By default, this is the Compute Engine default service account. I usually create a distinct Cloud Run runtime service account instead so that I can assign the least privileges needed. Whatever you prefer, add the Cloud Build service account as a member with the Service Account User role to your Cloud Run runtime service account. (On the service account page, select the Cloud Run runtime service account, show the info panel, and click on the add member button.)
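For reference, the same IAM setup can be done on the command line. A minimal sketch, using the runtime service account placeholder xxx@project-id.iam.gserviceaccount.com that also appears further below:

PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format 'value(projectNumber)')
# Allow Cloud Build to administer Cloud Run services
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com --role roles/run.admin
# Allow Cloud Build to act as the Cloud Run runtime service account
gcloud iam service-accounts add-iam-policy-binding xxx@project-id.iam.gserviceaccount.com --member serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com --role roles/iam.serviceAccountUser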

Build automation

Next, we need a Dockerfile to build an image with our saved_model.pb copied into:

https://github.com/jsarbach/automl-2-cloud-run/blob/master/cloud-run/automl-vision-cpu-1.12.0/Dockerfile
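In case the embed doesn’t render: the Dockerfile is essentially a two-liner. A minimal sketch, assuming the /tmp/mounted_model/0001 model path that the AutoML Edge serving container expects (the repository’s actual file may differ slightly):

# Pre-built TF Serving container for AutoML Vision Edge models
FROM gcr.io/automl-vision-ondevice/gcloud-container-1.12.0:latest
# Bake the exported model into the image instead of mounting a volume
COPY saved_model.pb /tmp/mounted_model/0001/saved_model.pb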

The build process itself is defined in a cloudbuild.yaml file (replace xxx@project-id.iam.gserviceaccount.com with your Cloud Run runtime service account):

https://github.com/jsarbach/automl-2-cloud-run/blob/master/cloud-run/cloudbuild.yaml
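Again, in case the embed doesn’t render, here is a sketch of what such a cloudbuild.yaml could look like; the region and memory settings are my assumptions, so adapt them to your setup:

steps:
# 1. Download the exported model next to the Dockerfile
- name: 'gcr.io/cloud-builders/gsutil'
  args: ['cp', '${_GCS_MODEL_PATH}/saved_model.pb', '${_DOCKERFOLDER}/saved_model.pb']
# 2. Build the image with the model baked in
- name: 'gcr.io/cloud-builders/docker'
  args: ['build', '-t', 'gcr.io/$PROJECT_ID/${_MODEL_NAME}:${_VERSION}', '${_DOCKERFOLDER}']
# 3. Push it to Container Registry
- name: 'gcr.io/cloud-builders/docker'
  args: ['push', 'gcr.io/$PROJECT_ID/${_MODEL_NAME}:${_VERSION}']
# 4. Deploy it to Cloud Run
- name: 'gcr.io/cloud-builders/gcloud'
  args:
  - 'run'
  - 'deploy'
  - '${_MODEL_NAME}'
  - '--image=gcr.io/$PROJECT_ID/${_MODEL_NAME}:${_VERSION}'
  - '--region=${_REGION}'
  - '--platform=managed'
  - '--memory=2Gi'
  - '--no-allow-unauthenticated'
  - '--service-account=xxx@project-id.iam.gserviceaccount.com'
substitutions:
  _VERSION: 'latest'
  _REGION: 'europe-west1'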

The cloudbuild.yaml above defines four build steps:

  1. Download the saved_model.pb from GCS to the subfolder
  2. Build the Docker image using the Dockerfile in the subfolder
  3. Push the Docker image to Container Registry
  4. Deploy the image to Cloud Run

When kicking off the build process, we need to set three mandatory substitutions:

  • _MODEL_NAME: Arbitrary name for the ML model that will be used as the Cloud Run service name; here coffee-classifier.
  • _DOCKERFOLDER: Subfolder containing the Dockerfile to use. To remain free to use another version of the automl-vision-ondevice image, or even entirely different images in the future, create a subfolder for every image. This way, you can switch to a different base image simply by changing the value of _DOCKERFOLDER when deploying, without having to touch the cloudbuild.yaml. In this example, it’s the automl-vision-cpu-1.12.0 subfolder.
  • _GCS_MODEL_PATH: GCS path to the ML model export (saved_model.pb), including the gs:// prefix.

All the other substitution variables are optional and have a default value defined in cloudbuild.yaml under substitutions.

Run!

With the cloudbuild.yaml and the Dockerfile in the subfolder, we’re ready to submit the build:

gcloud builds submit --substitutions _MODEL_NAME=coffee-classifier,_DOCKERFOLDER=automl-vision-cpu-1.12.0,_GCS_MODEL_PATH=gs://my-automl-demos-vcm/model-export/icn/tf_saved_model-coffee_classifier_20200225020250-2020-02-25T20:44:26.192Z,_VERSION=20200225020250

The optional _VERSION substitution is used to tag the Docker image with the model version so that we know which one we put inside. If omitted, it defaults to latest.

One or two minutes later, the Cloud Run service is deployed and ready to serve traffic from [CLOUD_RUN_SERVICE_URL]/v1/models/default:predict. If you deploy without authentication (by setting the --allow-unauthenticated flag instead of --no-allow-unauthenticated), you can already start using the endpoint, e.g. with curl -X POST ... from the command line or requests.post(...) in Python. However, since Cloud Run exposes a publicly accessible endpoint (URL), you probably do want to activate authentication so that only authorised requests are allowed. Hence, let’s spend another 10 minutes on this.
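Before we do, here is what an unauthenticated test request could look like (a sketch: the JSON payload follows the TF Serving request format that the AutoML Edge container expects, and YOUR_IMAGE.jpg is a placeholder):

# base64 -w 0 produces a single-line encoding (GNU coreutils; macOS’s base64 doesn’t wrap by default)
curl -X POST -H "Content-Type: application/json" -d '{"instances": [{"image_bytes": {"b64": "'$(base64 -w 0 YOUR_IMAGE.jpg)'"}, "key": "1"}]}' [CLOUD_RUN_SERVICE_URL]/v1/models/default:predict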

Adding authentication

Adding authentication to a Cloud Run service is as easy as setting the --no-allow-unauthenticated flag when deploying, as shown above. On the side of the calling service, in our case the web app on App Engine, a bit more work is required. We’re going to create a service account and assign it the Cloud Run Invoker role. We then use it in our calling service to create an ID token that we are going to add to the header of the HTTP request.

Create a service account called cloud-run-invoker with the respective role:

PROJECT_ID=$(gcloud config get-value project)
gcloud iam service-accounts create cloud-run-invoker
gcloud projects add-iam-policy-binding ${PROJECT_ID} --member serviceAccount:cloud-run-invoker@${PROJECT_ID}.iam.gserviceaccount.com --role roles/run.invoker

Since we will need the service account key to create the JWT in the web app and I don’t want to deal with handling key files, we’re going to store the key in Secret Manager (currently in beta). We can use the command line to create the key and pipe it directly into a new secret called cloud-run-invoker-key without downloading it first:

gcloud iam service-accounts keys create /dev/stdout --iam-account cloud-run-invoker@${PROJECT_ID}.iam.gserviceaccount.com | gcloud beta secrets create cloud-run-invoker-key --data-file - --replication-policy automatic

With that, we’re all set to make authorised requests against our Cloud Run service from the web app.

Using the endpoint

The demo website I created is a simple Python/Flask web app. It lets you upload an image or take a picture of your coffee with the camera, then it sends it to the Cloud Run endpoint and displays the classification result.

The core functionality — the interaction with the Cloud Run deployment — is very simple: create a payload containing the Base64-encoded image and POST it to the Cloud Run service URL. Note the headers argument used to pass the ID token:

https://github.com/jsarbach/automl-2-cloud-run/blob/master/app-engine/main.py
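In case the embed doesn’t render, the core of it looks something like this (a sketch; the function name get_prediction and the module-level constant CLOUD_RUN_SERVICE_URL are my naming, and get_token() is shown below):

import base64
import requests

def get_prediction(image_bytes):
    # Wrap the Base64-encoded image in the TF Serving request format
    payload = {'instances': [{'image_bytes': {'b64': base64.b64encode(image_bytes).decode()}, 'key': '1'}]}
    # Pass the ID token so the authenticated Cloud Run service accepts the request
    headers = {'Authorization': f'Bearer {get_token()}'}
    response = requests.post(f'{CLOUD_RUN_SERVICE_URL}/v1/models/default:predict', json=payload, headers=headers)
    return response.json()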

The get_token() function creates the ID token using the service account key we stored in Secret Manager earlier:

https://github.com/jsarbach/automl-2-cloud-run/blob/master/app-engine/main.py
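A sketch of what this could look like with the google-auth library (get_secret() is the helper shown below; the Cloud Run service URL serves as the token audience):

import json
import google.auth.transport.requests
from google.oauth2 import service_account

def get_token():
    # Build ID token credentials from the service account key stored in Secret Manager
    key_info = json.loads(get_secret('cloud-run-invoker-key'))
    credentials = service_account.IDTokenCredentials.from_service_account_info(
        key_info, target_audience=CLOUD_RUN_SERVICE_URL)
    # Fetch a fresh, signed ID token
    credentials.refresh(google.auth.transport.requests.Request())
    return credentials.token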

We use the Python API client library to retrieve the key. Note: The App Engine service account will need the Secret Manager Secret Accessor role.

https://github.com/jsarbach/automl-2-cloud-run/blob/master/app-engine/main.py
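A sketch of the retrieval, assuming a helper called get_secret and a module-level PROJECT_ID constant (both my naming); the Secret Manager API returns the payload Base64-encoded:

import base64
import googleapiclient.discovery

def get_secret(secret_id, version='latest'):
    # Access the secret version via the Secret Manager API
    service = googleapiclient.discovery.build('secretmanager', 'v1beta1')
    name = f'projects/{PROJECT_ID}/secrets/{secret_id}/versions/{version}'
    response = service.projects().secrets().versions().access(name=name).execute()
    # The payload data comes back Base64-encoded
    return base64.b64decode(response['payload']['data']).decode('utf-8')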

That’s it! We successfully moved the model deployment from AutoML to Cloud Run and used the prediction endpoint from an App Engine context.

With all the latte art on it, clearly a cappuccino

Performance considerations

So, we now have a cheaper solution, but what about performance? Typically, scale-to-zero services suffer from cold start latency when they haven’t been used for some time. Also, there are no GPUs on Cloud Run, at least for now.

Having said that, I was surprised how fast it is. I didn’t measure and compare the latency of the AutoML deployment vs. the Cloud Run deployment, so I won’t make a quantitative statement here. Anecdotal evidence from my personal end-user perspective suggests that cold starts do indeed add latency after the service has been idle for some time, whereas inference itself is very fast. For a prototype app or a use case where latency is not the main concern, I would be perfectly happy with this kind of performance.

As always, it’s a tradeoff between performance and costs. Here, we improved a lot on the cost side while sacrificing very little on the performance side. This choice is not set in stone, of course. If you decide to switch back to AutoML, you can just deploy your model there and reroute traffic. Your Cloud Run deployment will scale to zero and with it its costs.
