Streamline Model Deployment on Vertex AI using ONNX

Ivan Nardini
Google Cloud - Community
8 min read · Feb 14, 2023
Fig. 1 — Vertex AI and ONNX — Image from author

Since the launch of Vertex AI, I have been deploying models faster than I ever have before. In the simplest scenario, Vertex AI provides prebuilt managed container images, allowing you to serve your models from a script.

But here is the question:

What if the existing Vertex AI prebuilt container images do not have the right dependencies, or do not support your framework at all?

How can you keep deploying your models on Vertex AI?

In this post, I will propose a possible solution to these scenarios. I will train a scikit-learn model to predict whether annual income exceeds $50K/year based on census data, using a library version that is not supported by the Vertex AI prebuilt training containers, and I will show how you can leverage ONNX to deploy it on Vertex AI. By the end, you will see how ONNX and Vertex AI are better together: using ONNX on Vertex AI not only streamlines the model deployment process, it also lets you serve models from frameworks that are not natively supported by Vertex AI, making the platform even more open.

It is important to say that the content of this article represents a first attempt. Neither the approach nor the code shared here is production-ready.

Recap: What is ONNX?

If you already know what ONNX is and how it supports the MLOps lifecycle, feel free to skip this section.

Imagine that you have a model A trained with framework X and a model B trained with framework Y. Because different ML frameworks require different serving optimizations and may require different infrastructure resources, the ML engineering team needs to provide a different deployment strategy, maintain a performant serving runtime, and run a CD pipeline for each framework. Managing this variety of strategies, runtimes and CD pipelines over time increases the operational overhead of productionizing models, and it is not an uncommon challenge for ML engineering teams deploying models at scale.

Fig. 2 — Model Deployment pipeline without ONNX (example) — Image from Author

MLOps helps manage various operational challenges by leveraging DevOps best practices. For example, you may build a common image with minimal dependencies and create CD pipelines that build the serving container incrementally depending on the training framework. But having such a process in place may require time depending on several factors — including the team’s capacity and skills, which can be limited resources that are in very high demand.

Is there a simpler way to solve the ML framework-deployment puzzle?

One option would be to introduce an interoperability standard to serialize models and make their serving code independent of the underlying training framework. Given model A trained with framework X and model B trained with framework Y, you transform both models into a common representation so that the same runtime can serve them through the same CD pipeline.

ONNX stands for Open Neural Network Exchange. It is an open-source format that provides a common and performant representation of AI models, both for deep learning (ONNX) and traditional ML models (ONNX-ML).

Fig. 3 — Model Deployment pipeline with ONNX (example) — Image from Author

ONNX represents models as computation graphs (serialized as protobuf), using built-in, versioned operators and specific data types to guarantee compatibility across implementations. Model metadata is also supported. The main components of ONNX are:

  • Converters to turn models from any supported framework into the ONNX format.
  • Runtimes to enable the execution of ONNX models on several operating systems, chip architectures and hardware accelerators.
  • Optimizers to perform graph-level optimizations on ONNX models for faster inference.

Most of those capabilities are accessible via a Python API.
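As a quick illustration, here is a minimal sketch of the ONNX Python API; it assumes a model.onnx file is already available locally.

import onnx

# Load the serialized computation graph (protobuf) and validate operators and structure.
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Inspect the versioned operator sets the model relies on and the graph itself.
print(onnx_model.opset_import)
print(onnx.helper.printable_graph(onnx_model.graph))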

Back to our scenario: once you build your scikit-learn pipeline, you can use the sklearn-onnx converter to turn it into ONNX. Then you can deploy the model to a Vertex AI Endpoint by wrapping onnxruntime in a simple REST service built with FastAPI and containerizing it. In this way you get a performant serving service that is independent of the training framework's dependencies, thanks to the ONNX conversion.

Now that you know what ONNX is and how you can benefit from using it as part of your MLOps process, let me describe to you how to train and deploy a simple scikit-learn model on Vertex AI using ONNX.

Training and deploying a simple model on Vertex AI using ONNX

After you build your pipeline using scikit-learn API, you can convert it into ONNX format using sklearn-onnx. See the pseudocode that I used:

# Libraries
import os
from typing import List, Tuple

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import Int64TensorType, StringTensorType


# Helpers
def read_data(data_path: str) -> pd.DataFrame:
    # ... (implementation elided)
    return dataframe


def build_pipeline(cat_features: List[str], num_features: List[str]) -> Pipeline:
    # ... (implementation elided)
    return pipeline


def train(pipe: Pipeline, x_train: pd.DataFrame, y_train: pd.DataFrame) -> Pipeline:
    # ... (implementation elided)
    return trained_pipeline


def get_schema(dataframe: pd.DataFrame) -> List[Tuple[str, object]]:
    """
    Get the ONNX input schema from a pandas dataframe.
    Args:
        dataframe: pandas dataframe
    Returns:
        schema: list of (column name, tensor type) tuples
    """
    schema = []
    for col, col_type in zip(dataframe.columns, dataframe.dtypes):
        if col_type == "object":
            schema_type = StringTensorType([None, 1])
        else:
            schema_type = Int64TensorType([None, 1])
        schema.append((col, schema_type))
    return schema


def save_onnx(model: Pipeline, schema: List[Tuple], file_path: str) -> None:
    """
    Convert a trained pipeline to ONNX format and save it.
    Args:
        model: trained model
        schema: ONNX input schema
        file_path: path to save the model
    Returns:
        None
    """
    gs_prefix = 'gs://'
    gcsfuse_prefix = '/gcs/'
    if file_path.startswith(gs_prefix):
        file_path = file_path.replace(gs_prefix, gcsfuse_prefix)
    if not os.path.exists(os.path.dirname(file_path)):
        os.makedirs(os.path.dirname(file_path))

    onnx_model = convert_sklearn(model, initial_types=schema)
    serialized_model = onnx_model.SerializeToString()
    with open(file_path, "wb") as f:
        f.write(serialized_model)


if __name__ == "__main__":

    # Read data
    print("Reading data...")
    df = read_data(...)

    # Build pipeline
    print("Building pipeline...")
    pipeline = build_pipeline(...)

    # Train
    print("Training...")
    train_pipe = train(...)

    # Save model (x_train and model_path come from the elided steps above)
    print("Saving model...")
    schema = get_schema(x_train)
    save_onnx(train_pipe, schema, model_path)
    print("Done!")

Notice that skl2onnx provides a convert_sklearn method that produces an ONNX model equivalent to the given scikit-learn model. convert_sklearn requires you to manually define the input names and types. The method also lets you specify the opset (operator set), which determines runtime compatibility: operators are versioned in terms of both the representation (IR) and the runtimes that implement them. You can also attach model metadata, such as a name and documentation, to the produced ONNX model.
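
For example, here is a hedged sketch of specifying the opset and metadata during conversion; it assumes the trained pipeline and schema from the script above, and opset 15 and the metadata strings are arbitrary choices.

from skl2onnx import convert_sklearn

onnx_model = convert_sklearn(
    train_pipe,                                   # trained scikit-learn pipeline from the script above
    name="census-income-classifier",              # graph name stored in the ONNX model
    initial_types=schema,                         # manually defined input names and types
    doc_string="Predicts whether annual income exceeds $50K from census data.",
    target_opset=15,                              # operator set version the serving runtime must support
)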

Once you’ve created the ONNX ModelProto representation, you can use ONNX Runtime (ORT) to serve your model. At the time of writing, Vertex AI does not support an ONNX pre-built serving container. So, you need to build a custom container compatible with Vertex AI.

First, you code the server. Below you can see a simple example using ONNX Runtime and FastAPI.

# Libraries
import os
from typing import List

import numpy as np
from fastapi import FastAPI
from onnxruntime import InferenceSession
from pydantic import BaseModel

# Variables (the routes are injected by Vertex AI as environment variables)
MODEL_PATH = "/app/model.onnx"
AIP_HEALTH_ROUTE = os.environ['AIP_HEALTH_ROUTE']
AIP_PREDICT_ROUTE = os.environ['AIP_PREDICT_ROUTE']

# initiate serving server
app = FastAPI(title="Serving Model")


# represent a data point
class Person(BaseModel):
    age: int
    workclass: str
    fnlwgt: int
    education: str
    education_num: int
    marital_status: str
    occupation: str
    relationship: str
    race: str
    sex: str
    capital_gain: int
    capital_loss: int
    hours_per_week: int
    native_country: str


# represent records
class Records(BaseModel):
    instances: List[Person]


# load model at startup
@app.on_event("startup")
def load_inference_session():
    global sess
    sess = InferenceSession(MODEL_PATH)


# check health
@app.get(AIP_HEALTH_ROUTE, status_code=200)
def health():
    return dict(status="healthy")


# get predictions
@app.post(AIP_PREDICT_ROUTE, status_code=200)
async def predict(records: Records):
    predictions = []
    for person in records.instances:
        # convert each field to a (1, 1) numpy array keyed by the ONNX input name
        data = dict(
            age=np.array([[person.age]]),
            workclass=np.array([[person.workclass]]),
            fnlwgt=np.array([[person.fnlwgt]]),
            education=np.array([[person.education]]),
            education_num=np.array([[person.education_num]]),
            marital_status=np.array([[person.marital_status]]),
            occupation=np.array([[person.occupation]]),
            relationship=np.array([[person.relationship]]),
            race=np.array([[person.race]]),
            sex=np.array([[person.sex]]),
            capital_gain=np.array([[person.capital_gain]]),
            capital_loss=np.array([[person.capital_loss]]),
            hours_per_week=np.array([[person.hours_per_week]]),
            native_country=np.array([[person.native_country]]),
        )
        # run the session; an empty output-name list returns all model outputs
        predictions.append(sess.run([], data)[0].tolist())
    return dict(predictions=predictions)

It is important to highlight that your server does not need scikit-learn to generate predictions. This is because you deployed the ONNX representation of your original pipeline and serve it with ONNX Runtime (ORT). To generate predictions, ORT requires an InferenceSession. Within the InferenceSession, ORT orchestrates an optimized execution of operator kernels via execution providers, each of which contains the set of kernels for a specific execution target (in this case, CPU). Once the InferenceSession is instantiated, the model is ready to return predictions.
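
For example, here is a minimal sketch of how you could inspect the session locally before containerizing it; it assumes model.onnx is on disk and pins the CPU execution provider.

from onnxruntime import InferenceSession

sess = InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
print(sess.get_providers())                  # execution providers actually in use
print([i.name for i in sess.get_inputs()])   # ONNX input names the request payload must match
print([o.name for o in sess.get_outputs()])  # output names, e.g. label and probabilities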

Next, you can use the Artifact Registry and Cloud Build on Google Cloud to create the custom container image using a Dockerfile. Below is the Dockerfile I used.

FROM python:3.7

COPY requirements.txt .

RUN pip3 install --upgrade pip && \
    pip3 install -r requirements.txt

COPY ./app /app

EXPOSE 8080

CMD ["uvicorn", "app.api:app", "--host", "0.0.0.0", "--port", "8080"]

where the requirements file specifies the following libraries:

numpy==1.21.6
protobuf==3.20.3
onnxruntime==1.13.1
fastapi==0.89.1
uvicorn==0.20.0
pydantic==1.10.4

And then you can use this gcloud command to build the custom container image.

gcloud builds submit . --tag=your-region-docker.pkg.dev/your-project/your-repository/scikit-learn-onnx-cpu --machine-type='n1-highcpu-32' --timeout=900s --verbosity=info

After you’ve created the ONNX serving image, you need to create a Model Resource by uploading the model to the Vertex AI Model Registry with the custom container image using the Vertex AI Python SDK.

from google.cloud import aiplatform as vertex_ai

model = vertex_ai.Model.upload(
    display_name='sklearn-onnx-model',
    description='A simple Sklearn classifier with ONNX runtime',
    serving_container_image_uri='your-onnx-serving-image',
    serving_container_predict_route='/predict',
    serving_container_health_route='/health',
    serving_container_ports=[8080],
)

By default, a new model version will be created in the Vertex AI Model Registry.
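
If you later retrain the pipeline and want to register the new artifact as another version of the same entry, a minimal sketch, assuming the model object returned by the previous upload, could look like this.

# Uploading with parent_model attaches the new artifact as a new version of the existing model.
model_v2 = vertex_ai.Model.upload(
    parent_model=model.resource_name,
    display_name='sklearn-onnx-model',
    serving_container_image_uri='your-onnx-serving-image',
    serving_container_predict_route='/predict',
    serving_container_health_route='/health',
    serving_container_ports=[8080],
)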

Finally, you need to create a Vertex AI Endpoint to deploy the registered Vertex AI model for online prediction. You can create an Endpoint resource and then deploy the model into it, or you can use the deploy method of the Model resource directly, as shown below.

endpoint = model.deploy(
    deployed_model_display_name='onnx-classifier',
    traffic_split={"0": 100},
    machine_type='n1-standard-4',
    min_replica_count=1,
    max_replica_count=1,
)
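
For completeness, here is a minimal sketch of the two-step alternative mentioned above, where the Endpoint is created first and the model is deployed into it; the endpoint display name is an assumption.

# Create the Endpoint resource first, then deploy the registered model into it.
endpoint = vertex_ai.Endpoint.create(display_name='sklearn-onnx-endpoint')

endpoint.deploy(
    model=model,
    deployed_model_display_name='onnx-classifier',
    traffic_split={"0": 100},
    machine_type='n1-standard-4',
    min_replica_count=1,
    max_replica_count=1,
)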

Deploying the model will take a few minutes. After the model is successfully deployed, you can call it to generate predictions using the predict method.

prediction = endpoint.predict(instances)
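
Here, instances is a list of JSON-serializable records matching the Person schema defined in the serving code; a hypothetical payload with two illustrative records could look like this.

# A hypothetical payload: two records matching the Person schema (values are illustrative).
instances = [
    {
        "age": 39, "workclass": "State-gov", "fnlwgt": 77516,
        "education": "Bachelors", "education_num": 13,
        "marital_status": "Never-married", "occupation": "Adm-clerical",
        "relationship": "Not-in-family", "race": "White", "sex": "Male",
        "capital_gain": 2174, "capital_loss": 0, "hours_per_week": 40,
        "native_country": "United-States",
    },
    {
        "age": 50, "workclass": "Self-emp-not-inc", "fnlwgt": 83311,
        "education": "Bachelors", "education_num": 13,
        "marital_status": "Married-civ-spouse", "occupation": "Exec-managerial",
        "relationship": "Husband", "race": "White", "sex": "Male",
        "capital_gain": 0, "capital_loss": 0, "hours_per_week": 13,
        "native_country": "United-States",
    },
]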

Below you can see an example of a prediction result.

Prediction(predictions=[[' <=50K'], [' <=50K']], deployed_model_id='1234567890', model_version_id='1', model_resource_name='projects/my-project-id/locations/my-region/models/1234567890', explanations=None)

Summary

In this article we covered how you can use ONNX together with Vertex AI to deploy your ML models. Using ONNX on Vertex AI not only simplifies the model deployment process, it also lets you serve models from frameworks that are not natively supported by Vertex AI, making the platform even more open and flexible.

Of course, I faced some limitations and issues while working with ONNX. For example, the converters do not cover the whole scikit-learn ecosystem (libraries such as Category Encoders), and not all scikit-learn estimators and methods are available; see the sklearn-onnx documentation for the supported operations. That is why the content of this article represents a first attempt: neither the approach nor the code shared here is production-ready.

Apart from some of the limitations at the time of writing, the idea of making Vertex AI even more open is exciting, and a lot remains to be explored. For example, you could measure how ONNX and its optimizer affect the performance of Vertex AI Endpoints.

In the meantime, I hope you found the article interesting. If so, clap or leave a comment. Feel free to reach out to me on LinkedIn or Twitter for further discussion, and if you have a question on Vertex AI, check out the Vertex AI Q&A initiative.

Thanks to Lee for feedback and suggestions.

References

  1. https://en.wikipedia.org/wiki/Open_Neural_Network_Exchange
  2. https://onnx.ai/index.html
  3. https://onnx.ai/onnx/intro/index.html
  4. https://onnxruntime.ai/
  5. https://onnxruntime.ai/docs/reference/compatibility.html
  6. https://odsc.medium.com/interoperable-ai-high-performance-inferencing-of-ml-and-dnn-models-using-open-source-tools-6218f5709071
  7. https://medium.com/geekculture/onnx-in-a-nutshell-4b584cbae7f5
  8. https://medium.com/moonvision/onnx-the-long-and-collaborative-road-to-machine-learning-portability-a6416a96e870
  9. https://towardsdatascience.com/creating-onnx-from-scratch-4063eab80fcd
  10. https://onnx.ai/sklearn-onnx/index.html
  11. https://github.com/onnx/onnx
  12. https://github.com/onnx/optimizer

