Model Serving at Scale with Vertex AI: custom container deployment with pre- and post-processing

Piyush Pandey
10 min read · Oct 9, 2021


What is MLOps?

MLOps, or ML Ops, is a set of practices that aims to deploy and maintain machine learning models in production reliably and efficiently. The word is a compound of "machine learning" and the continuous development practice of DevOps in the software field. Machine learning models are tested and developed in isolated experimental systems. When an algorithm is ready to be launched, MLOps is practiced between data scientists, DevOps, and machine learning engineers to transition the algorithm to production systems. Similar to DevOps or DataOps approaches, MLOps seeks to increase automation and improve the quality of production models, while also focusing on business and regulatory requirements. While MLOps started as a set of best practices, it is slowly evolving into an independent approach to ML lifecycle management.

MLOps applies to the entire lifecycle: from integrating with model generation (software development lifecycle, continuous integration/continuous delivery), orchestration, and deployment, to health, diagnostics, governance, and business metrics. According to Gartner, MLOps is a subset of ModelOps. MLOps is focused on the operationalisation of ML models, while ModelOps covers the operationalisation of all types of AI models.

What is Vertex AI?

Vertex AI is a unified machine learning platform on GCP that offers a comprehensive set of tools and products for building and managing the life cycle of ML models in a single environment. It consolidates many of the previous offerings from the legacy AI Platform and AutoML (Tables/Vision/NLP), and supplements them with several new ML products and services such as labeling tasks, pipelines, Feature Store, experiments, model registry, etc.

From the model deployment point of view, Vertex AI currently supports two types of deployment for custom-built models:

  • Pre-built container
  • Custom container

The pre-built container is intended for models built with the commonly used ML frameworks, including scikit-learn, XGBoost and TensorFlow. At prediction time, the pre-built container directly calls the predict() method of the saved model artifacts of the specified framework. The pre-built container does not support custom serving code at prediction time, such as the custom code needed for pre- and post-processing. The custom container, on the other hand, supports all ML frameworks and custom serving code. It also supports the deployment of custom models trained outside of Vertex AI. The downside of this option is that users need to build their own custom Docker container image for deployment.

In this post I will walk you through the custom container deployment process on Vertex AI with pre- and post-processing. I have chosen to deploy a simple CNN classification model, but you can pick a model of your choice; all you have to focus on is the containerisation of the application.

The main benefit of doing all this is that we can autoscale our prediction model without the headache of managing the underlying infrastructure; autoscaling is handled by Vertex AI.

Create a Docker Container Image

The first step of our deployment process is to create a Docker image for the custom container to be deployed. As far as the model artifacts are concerned, Vertex AI allows them to be stored in a Cloud Storage bucket and loaded into the container at startup time. Alternatively, the model artifacts can be embedded directly into the Docker image as part of the image content itself. We will use the embedding option in this experiment. I have placed the pre- and post-processing logic in the pre_process.py and post_process.py files respectively.

I created a folder cnn_deploy in my work directory containing all of the contents that are needed for building the docker container image:

  • model : a subfolder containing the model artifact files.
  • pre_process.py: contains the preprocessing logic
  • post_process.py: contains the post processing logic
  • Dockerfile: docker build file
  • app.py: Flask server code with the endpoint route definitions
  • tokenizer.pkl: tokenizer object used during training
  • requirements.txt: all Python library dependencies needed for the application to run

To understand the whole process, you only need to focus on the Flask web app, the Dockerfile, and the requirements.txt file.

Flask server

Flask is a Python framework that allows us to build web applications. It was developed by Armin Ronacher. A web application framework is a collection of modules and libraries that helps developers write applications without dealing with low-level details such as protocols, thread management, etc. Flask is based on the WSGI (Web Server Gateway Interface) toolkit and the Jinja2 template engine.
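The full app.py is not reproduced here; the sketch below shows the shape such a server could take, assuming the model is a saved Keras CNN and that pre_process.py and post_process.py expose simple helper functions (the helper names and the input field are illustrative, not taken from the original post).

    # app.py -- a minimal sketch; helper names and input fields are illustrative
    import pickle

    import numpy as np
    import tensorflow as tf
    from flask import Flask, jsonify, request

    from pre_process import pre_process    # hypothetical pre-processing helper
    from post_process import post_process  # hypothetical post-processing helper

    app = Flask(__name__)

    # Model artifacts are embedded in the image, so they can be loaded at startup.
    model = tf.keras.models.load_model("model")
    with open("tokenizer.pkl", "rb") as f:
        tokenizer = pickle.load(f)


    @app.route("/healthz", methods=["GET"])
    def healthz():
        # Vertex AI probes this route to verify the server is ready for traffic.
        return jsonify({"status": "healthy"}), 200


    @app.route("/predict", methods=["POST"])
    def predict():
        # Vertex AI wraps the input records in a top-level "instances" array.
        instances = request.get_json(force=True)["instances"]
        features = np.array([pre_process(instance, tokenizer) for instance in instances])
        raw_predictions = model.predict(features)
        # The response must carry the results under a top-level "predictions" key.
        return jsonify({"predictions": [post_process(p) for p in raw_predictions]})


    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=5005)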

The above Flask server contains two important functions that are required before deployment to Vertex AI:

  1. predict(): Contains the model prediction logic, which is mapped to the predict route.
  2. healthz(): Contains the health route. Vertex AI intermittently performs health checks on your HTTP server while it is running to ensure that it is ready to handle prediction requests. The service uses a health probe to send HTTP GET requests to a configurable health check path on your server.

Vertex AI has its own request and response format: we have to accept JSON requests in a specified format and send prediction responses in a specified format. Check this page for more details.

Dockerfile

A Dockerfile is a text document that contains all the commands a user could call on the command line to assemble an image. Using docker build, users can create an automated build that executes several command-line instructions in succession.
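Putting the pieces together, the Dockerfile looks roughly like this; it is reconstructed from the statements explained below, and the EXPOSE line reflects the port mentioned in the CMD explanation.

    # Dockerfile -- reconstructed from the statements described below
    FROM tensorflow/tensorflow:nightly-gpu

    WORKDIR /app
    COPY . /app

    RUN pip install --trusted-host pypi.python.org -r requirements.txt

    EXPOSE 5005
    CMD gunicorn --bind 0.0.0.0:5005 --timeout=150 app:app -w 5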

Now I will explain what each statement in the Dockerfile means:

  1. FROM tensorflow/tensorflow:nightly-gpu: This is the first instruction of the Dockerfile, and tensorflow/tensorflow:nightly-gpu is the base image, chosen because our application uses TensorFlow in the backend. TensorFlow comes pre-installed on top of a minimal Linux base image. Here I have used a GPU-based TensorFlow image; you can find more such images on Docker Hub. It is important to note that the GPU-based images are version-specific and ship with the matching CUDA and cuDNN libraries, since each TensorFlow version is compatible with a specific CUDA version (you can find more on their compatibility here). If you don't intend to use a GPU for your project, use the CPU image. Find more on TensorFlow images here.
  2. WORKDIR /app: Sets the working directory inside the container.
  3. COPY . /app: Copies all the contents listed in the application folder above into the /app directory.
  4. RUN pip install --trusted-host pypi.python.org -r requirements.txt: RUN executes a Linux command inside the image at build time. Here we install everything the application needs to run, as listed in requirements.txt.
  5. CMD gunicorn --bind 0.0.0.0:5005 --timeout=150 app:app -w 5: CMD defines the default command that runs when the container starts, and there can be only one CMD instruction in a Dockerfile. A Flask development server is not advised for production, so I run the app with Gunicorn. Gunicorn is bound to port 5005 inside the container, and the same port is exposed, which is how we specify the port the container will listen on. app:app tells Gunicorn which Flask application to run, -w 5 starts 5 workers (the default is 1), and --timeout=150 sets the worker timeout to 150 seconds (if you don't specify a timeout, it defaults to 30 seconds).

Create custom container locally
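The build command is along the following lines; the Artifact Registry image path format is the standard one, and the placeholders are explained just below.

    # Build the custom container image and tag it with the Artifact Registry path
    docker build -t $REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$IMAGE .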

This created a custom Docker image on my local machine with the specified image tag name. $PROJECT_ID and $REGION are the placeholders for my GCP project ID and region, and $REPOSITORY and $IMAGE are the names given to my Artifact Registry repository and the Docker image for the custom container.

Test Custom Container Locally
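A run command of the following shape starts the container locally; it assumes the image tag from the build step and a machine with the NVIDIA container runtime available.

    # Map host port 5005 to container port 5005 and pass the GPU through
    docker run --rm --gpus=all -p 5005:5005 \
        $REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$IMAGE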

Here I have mapped port 5005 of the local host to port 5005 of the container for serving HTTP requests. In my case I am going to use a GPU, hence I have passed gpus=all in the run command.

After the container is running, use the below command to send a request to the server.
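The request can be sent with curl. The payload shown first only illustrates the required "instances" wrapper; the inner field name is hypothetical and depends on what your model and pre-processing expect.

    # sample_input.json -- illustrative payload; the inner keys depend on your model
    {
      "instances": [
        {"text": "sample input text for the CNN classifier"}
      ]
    }

    # Send the request to the locally running container
    curl -X POST http://localhost:5005/predict \
        -H "Content-Type: application/json" \
        -d @sample_input.json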

The sample_input.json file contains a JSON request modified according to the Vertex AI request format. Note that all of the model input data fields need to be wrapped in a top-level "instances" array. This data structure is required by Vertex AI in the custom container implementation. I have adjusted the code in the predict() function of app.py as well to accommodate that change.

Specify your data as key-value pairs

The custom container is expected to return an HTTP response with JSON content carrying the prediction values. This confirms that the Docker image is built correctly and the custom container works appropriately in the local environment.

Deploy Custom Container to Vertex AI

After local testing, the custom container can now be deployed to Vertex AI. First, I created an Artifact Registry repository on GCP and pushed the Docker image to this repository. Make sure you have all the required permissions for Artifact Registry and Vertex AI; a quick Google search will tell you what you need.

Create Repo
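Creating the Docker repository in Artifact Registry can be sketched as follows:

    gcloud artifacts repositories create $REPOSITORY \
        --repository-format=docker \
        --location=$REGION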
Push repo
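Pushing the locally built image to that repository follows the standard Artifact Registry flow: configure the Docker credential helper once, then push.

    # Allow docker to authenticate against Artifact Registry, then push the image
    gcloud auth configure-docker $REGION-docker.pkg.dev
    docker push $REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$IMAGE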

Then, I imported a custom model into Vertex AI using the Docker image pushed to the Artifact Registry repository. Specify the model name in the $MODEL_NAME placeholder.
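The import can be done with gcloud ai models upload; the sketch below uses the container port and the routes defined in the Flask server.

    gcloud ai models upload \
        --region=$REGION \
        --display-name=$MODEL_NAME \
        --container-image-uri=$REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY/$IMAGE \
        --container-ports=5005 \
        --container-predict-route=/predict \
        --container-health-route=/healthz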

Note that in the above command, the container port is specified as 5005, and the health check and prediction routes are specified based on their definitions in the Flask web server code. Once the model import is complete, it can be confirmed by navigating to the Vertex AI console or by running the following command:

View details of uploaded model
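Listing the models in the region shows the uploaded model along with its model ID:

    gcloud ai models list --region=$REGION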
Model imported to Vertex AI

Finally, I created an endpoint and deployed the custom model to the endpoint for serving. Specify the endpoint name in the $ENDPOINT_NAME placeholder.

Create Endpoint
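Creating the endpoint, and listing endpoints to look up the ID assigned by Vertex AI, can be sketched as:

    gcloud ai endpoints create \
        --region=$REGION \
        --display-name=$ENDPOINT_NAME

    # List endpoints to look up the endpoint ID assigned by Vertex AI
    gcloud ai endpoints list --region=$REGION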

Now deploy the model to the endpoint. Make sure you provide the correct model ID and endpoint ID; you can see the model ID and endpoint ID by running the above two commands or in the Vertex AI console.

upload model to endpoint
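The deployment command looks roughly like this. The machine type and accelerator below are assumptions matching the description that follows: n1-standard-4 for the "standard 4-vCPU machine type", and an NVIDIA Tesla T4 for the GPU, since the post does not name the exact GPU model.

    gcloud ai endpoints deploy-model $ENDPOINT_ID \
        --region=$REGION \
        --model=$MODEL_ID \
        --display-name=$MODEL_NAME \
        --machine-type=n1-standard-4 \
        --accelerator=type=nvidia-tesla-t4,count=1 \
        --min-replica-count=1 \
        --max-replica-count=2 \
        --traffic-split=0=100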

Here, $ENDPOINT_ID is the endpoint ID assigned by Vertex AI after the endpoint was created. I specified that each node of the cluster uses the standard 4-vCPU machine type with 1 NVIDIA Tesla GPU, with a minimum of 1 node and a maximum of 2 nodes (auto-scaled from 1 to 2). This step might take around 20-30 minutes to complete, so be patient.

After some time, we can see the deployed model at our endpoint in the Vertex AI UI.

Model uploaded to endpoint for serving

Testing the custom container on Vertex AI

After the custom container was deployed, I ran the following tests to ensure that the endpoint works correctly:

  1. Test using the Python client.
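The original snippet is not reproduced here; a test along the following lines, which fetches an access token and calls the standard Vertex AI REST predict URL, exercises the endpoint (the payload field is the same illustrative one used in the local test).

    # predict_client.py -- a minimal sketch using a bearer token and the REST predict URL
    import subprocess

    import requests

    PROJECT_ID = "<your-gcp-project-id>"
    REGION = "<your-gcp-region>"
    ENDPOINT_ID = "<your-endpoint-id>"

    # Access token used for authorization (the same token can be reused in Postman/JMeter)
    token = subprocess.run(
        ["gcloud", "auth", "print-access-token"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    # Standard Vertex AI online prediction URL for a deployed endpoint
    url = (
        f"https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}"
        f"/locations/{REGION}/endpoints/{ENDPOINT_ID}:predict"
    )

    payload = {"instances": [{"text": "sample input text for the CNN classifier"}]}
    response = requests.post(url, json=payload, headers={"Authorization": f"Bearer {token}"})
    print(response.json())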

There are two important things to note here. First, the above code returns the desired output. Second, the URL and the token that are generated here will be used for authorization in Postman and in the load test using JMeter. To get the URL, please check this.

2. Test using Postman

When we test using Postman, we need three things: the JSON request, the URL and the authorization token. Below is the result of the test using Postman.

Specify the token, URL and JSON request

Load Test using Jmeter

The purpose of doing the load test is to see whether autoscaling happens or not. By default, Vertex AI autoscales the nodes if either CPU or GPU usage reaches 60%.

I sent 300 parallel requests using JMeter for 10 minutes, and the throughput was 26.5 requests/sec.

Jmeter

We can see the information related to resource utilisation inside the endpoint.

CPU utilisation reached 80%

RAM used was 5 GB

GPU utilisation reached 98% and 4 GB of VRAM was used.

Autoscaling happened because GPU reached the threshold limit of 60% first

Summary

Custom container deployment on Vertex AI is very flexible and helps us scale our model without having to worry about cluster management. It efficiently autoscales our model when we receive high traffic and is a good choice for model serving.

Acknowledgement

I would like to acknowledge my team at Cognizer for helping me out while doing this project. I would also like to thank Jinmiao Zhang for his post on Vertex AI which helped me do this project efficiently.

References

  1. https://cloud.google.com/vertex-ai
  2. https://cloud.google.com/vertex-ai/docs/predictions/use-custom-container
  3. https://cloud.google.com/vertex-ai/docs/predictions/online-predictions-custom-models
  4. https://medium.com/mlearning-ai/serverless-prediction-at-scale-part-2-custom-container-deployment-on-vertex-ai-103a43d0a290
  5. https://www.tensorflow.org/install/source#gpu
