Model deployment with ONNX
Why ONNX:
Deep learning models have become popular thanks to easy fine-tuning and promising results. Many practical use cases demand high accuracy together with fast inference. Larger models often deliver better results but can be slow at runtime. One way to accelerate inference is to throw bigger hardware at the problem, but that comes at a higher cost. Training frameworks are also rarely the best choice for serving models in production: their compilers are tuned for training-related features, which can increase memory consumption and bloat container images with training-only dependencies. Finally, different applications have different inference requirements, so one framework may be more suitable than another and easier to deploy in a given environment.
"ONNX is an open format built to represent machine learning models. ONNX defines a common set of operators — the building blocks of machine learning and deep learning models — and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers" (see onnx.ai).
In this blog we aim to understand what ONNX is and how we can use it to improve inference times at the various levels of the inference stack. Later in this blog series we will take a deeper dive into ONNX optimizations and how to improve response times with them.
Model Inference Stack:
Model inference times can be improved at several levels. The topmost level is algorithmic: one can reduce model size (pruning), reduce floating-point precision (quantization), or search for the best architecture for the task (neural architecture search, NAS). The bottom level is the hardware where the actual inference happens; here the process can be accelerated by selecting the right hardware for the requirement.
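To make the algorithmic level a little more concrete: once a model is in ONNX form (covered below), quantization can even be applied after the fact with ONNX Runtime's quantization tooling. A minimal sketch, assuming a model.onnx file already exists (the file names are placeholders):

```python
# Sketch: post-training dynamic quantization of an existing ONNX file.
# "model.onnx" and "model.int8.onnx" are placeholder file names.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model.onnx",        # original FP32 graph
    model_output="model.int8.onnx",  # quantized graph, typically much smaller
    weight_type=QuantType.QInt8,     # store weights as 8-bit integers
)
```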
The middle level is graph compilers. Model serving time is heavily affected by the framework's native compiler and intermediate representation, which can also consume more memory, so the choice of framework and deployment architecture has a significant impact here. Frameworks like TensorFlow and PyTorch do ship native compilers, but these remain framework-specific and are not always portable or interoperable. Let us look at graph compilers in a little more detail:
Graph Compilers:
Each model can be represented as a DAG (directed acyclic graph) where each node is an operator (ReLU, add, pooling, etc.) and the edges between them represent the data dependencies between nodes. This graph serves as an intermediate representation (IR), a roadmap for execution on different devices (CPU, GPU, TPU).
The IR determines the execution flow of the model, including which steps are independent of each other. Below is an example of the ONNX graph of a scikit-learn RandomForest model, visualized with Netron.
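Such a graph can be produced with the sklearn-onnx converter; a minimal sketch, using a toy RandomForest on the iris dataset (the model and file name are placeholders):

```python
# Sketch: convert a scikit-learn RandomForest to ONNX and save it,
# so the resulting graph can be inspected in Netron (netron.app).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10).fit(X, y)

# Declare the expected input signature: a float tensor with 4 features.
initial_types = [("float_input", FloatTensorType([None, X.shape[1]]))]
onnx_model = convert_sklearn(clf, initial_types=initial_types)

with open("rf_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
```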
Graph compilers in different frameworks have been developed independently and at their own pace, each with its own native dependencies, which makes it hard for end users to improve inference performance in a portable way.
How does ONNX help?
Deploying a machine learning model to production usually requires replicating the entire ecosystem used to train the model, most of the time as a self-contained Docker image. Once a model is converted to ONNX, the production environment only needs a runtime to execute the graph defined with ONNX operators. That runtime can be developed in whatever language suits the production application: C, Java, Python, JavaScript, C#, or builds targeting ARM devices, among others.
But to make that happen, the ONNX graph needs to be saved: ONNX uses Protocol Buffers to serialize the graph into a single block. During conversion it also aims to remove redundant, unnecessary connections and to make independent operations explicit so they can run in parallel, which helps keep the model size down.
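Once serialized, the Protocol Buffers file can be loaded back, validated, and inspected with the onnx Python package. A small sketch, reusing the placeholder file from the RandomForest example above:

```python
# Sketch: load the serialized protobuf, validate it, and walk the graph.
# "rf_iris.onnx" is the placeholder file saved earlier.
import onnx

model = onnx.load("rf_iris.onnx")
onnx.checker.check_model(model)  # verify the graph is well-formed

# Human-readable dump of the intermediate representation.
print(onnx.helper.printable_graph(model.graph))

# Each node is an operator in the DAG; inputs/outputs encode the data dependencies.
for node in model.graph.node:
    print(node.op_type, list(node.input), "->", list(node.output))
```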
By providing a common representation of the computation graph, ONNX helps developers choose the right framework for their task, allows authors to focus on innovative enhancements, and enables hardware vendors to streamline optimizations for their platforms.
ONNX Runtime provides a runtime for ONNX models, which can then be used to deploy a model on your hardware for inference. It houses Execution Providers, which let us accelerate the operators on different back ends, such as the default CPU provider, CUDA for GPUs, or TensorRT.
The overall inference process then looks like this: convert the trained model to ONNX, serialize the graph, load it with ONNX Runtime on the chosen execution provider, and run predictions.
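A minimal sketch of that flow with ONNX Runtime, again reusing the placeholder RandomForest model; the providers argument is where the Execution Providers mentioned above come in:

```python
# Sketch: serve the placeholder ONNX model with ONNX Runtime.
import numpy as np
import onnxruntime as ort

# Use whatever providers this onnxruntime build offers,
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] on a GPU build.
providers = ort.get_available_providers()
session = ort.InferenceSession("rf_iris.onnx", providers=providers)

input_name = session.get_inputs()[0].name
sample = np.array([[5.1, 3.5, 1.4, 0.2]], dtype=np.float32)  # one iris-like row

# None -> fetch all outputs (label and class probabilities for this model).
outputs = session.run(None, {input_name: sample})
print(outputs[0])
```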
Conclusion:
ONNX is a straightforward tool for converting models from different frameworks and removing framework-specific dependencies from the model. This leads to predictable inference container image sizes, which in turn gives predictable behaviour in the production environment. In my experiments with Hugging Face transformer models, I could reduce the inference image size and cold-start times by cleaning up a bunch of dependencies related to training and other stages of model development. In general, ONNX can be a way to build platform- and hardware-independent inference.
GCP recently introduced ONNX-based hosting and training in its Vertex AI service; see the Medium blog linked below to learn more. Also check out Optimum by Hugging Face for optimizing and training Hugging Face models with ONNX.
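As an illustration, Optimum exposes a transformers-style API that exports a model to ONNX and serves it through ONNX Runtime; a hedged sketch, assuming optimum is installed with the onnxruntime extra (the model id is only an example):

```python
# Sketch: export a Hugging Face transformer to ONNX and run it through
# ONNX Runtime via Optimum. The export=True flag triggers the ONNX export
# on load (older Optimum versions used from_transformers=True instead).
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

inputs = tokenizer("ONNX made this deployment lighter.", return_tensors="pt")
logits = ort_model(**inputs).logits
print(logits)
```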
References:
- https://onnx.ai/
- https://netron.app/
- https://onnxruntime.ai/
- https://onnx.ai/sklearn-onnx/index.html
- https://deci.ai/blog/inference-stack-and-inference-acceleration-techniques/
- https://odsc.medium.com/interoperable-ai-high-performance-inferencing-of-ml-and-dnn-models-using-open-source-tools-6218f5709071
- https://medium.com/@vallerylancey/data-modelling-with-graph-theory-part-1-introduction-f52aaf11f974
- https://medium.com/google-cloud/streamline-model-deployment-on-vertex-ai-using-onnx-65f29786d2d0