How to Serve Models on NVIDIA Triton Inference Server* with OpenVINO Backend

OpenVINO™ toolkit · Published in OpenVINO-toolkit · May 24, 2024 · 7 min read

Triton Inference Server* is open-source software used to optimize and deploy machine learning models through model serving. OpenVINO™ is an open-source toolkit designed to optimize and deploy deep learning models on Intel® architecture. Organizations already leveraging Triton Inference Server with NVIDIA* GPUs can seamlessly transition to Intel hardware and get the benefits of OpenVINO by integrating the OpenVINO backend for Triton Inference Server, eliminating the need for a complete shift to OpenVINO™ Model Server.

By integrating the Triton Inference Server with the OpenVINO backend, organizations can take full advantage of hardware acceleration on Intel hardware to achieve high levels of inference performance. This is particularly observable on the latest 4th and 5th generation Intel® Xeon® Scalable processors, which feature Intel® Advanced Matrix Extensions (AMX) that support Bfloat16 (BF16) precision.

BF16 is a reduced-precision numeric format that enhances performance and efficiency. By representing data in 16 bits, Intel® Xeon® processors can perform dense compute operations, such as matrix multiplications, with significantly higher throughput than traditional 32-bit floating-point operations. This increases parallelism and reduces memory bandwidth requirements, which translates into improved power efficiency and better resource utilization.
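On BF16-capable Xeon processors, OpenVINO can select BF16 automatically; it can also be requested explicitly through OpenVINO's inference precision hint. The following is a minimal sketch using the OpenVINO Python API outside of Triton (the model path is illustrative and not part of this tutorial):

import openvino as ov
import openvino.properties.hint as hints

core = ov.Core()
model = core.read_model("model.xml")  # any OpenVINO IR model
# Request BF16 execution (requires a CPU with native BF16/AMX support)
compiled = core.compile_model(model, "CPU", {hints.inference_precision: ov.Type.bf16})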

By using the OpenVINO backend, organizations can unlock new realms of possibilities, leveraging the capabilities of Intel’s hardware and OpenVINO optimization, all orchestrated by Triton Inference Server’s model serving platform.

Serving a Model on Triton Server with OpenVINO Backend

This blog will show you how to deploy a model on the Triton Inference Server with the OpenVINO backend, from downloading and preparing the models to sending an inference request from the client to the server. The examples cover ONNX, PyTorch, and TensorFlow* models; the PyTorch and TensorFlow models are converted to OpenVINO format (.xml and .bin) before serving.

The following tutorials are based on the existing tutorials on Triton Inference Server GitHub but modified to use the OpenVINO backend and Intel® CPU: NVIDIA ONNX Tutorial, NVIDIA PyTorch Tutorial, NVIDIA TensorFlow Tutorial. For full NVIDIA Triton Inference Server documentation, see here.

Tested On

Ubuntu 20.04.6 LTS

Docker version 26.1.0, build 9714adc

Triton Server 24.04 (replace 24.04 with the preferred version in the commands below)

Requirements

Install Docker

Install wget:

sudo apt install wget

Deploying an ONNX Model

  1. Build the model repository and download the ONNX model.
mkdir -p model_repository/densenet_onnx/1

wget -O model_repository/densenet_onnx/1/model.onnx \
  https://contentmamluswest001.blob.core.windows.net/content/14b2744cf8d6418c87ffddc3f3127242/9502630827244d60a1214f250e3bbca7/08aed7327d694b8dbaee2c97b8d0fcba/densenet121-1.2.onnx

2. Create a new file named config.pbtxt

name: "densenet_onnx" 
backend: "openvino"
default_model_filename: "model.onnx"

3. Place the config.pbtxt file in the model repository. The structure should look as follows:

model_repository
|
+-- densenet_onnx
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.onnx

Note: Triton Inference Server requires this directory layout to locate the configuration and model files. Do not place any other folders or files in the model repository besides the required model files.

4. Run the Triton Inference Server. Make sure to update ‘/path/to/model_repository’ in the command to the location on your machine.

docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /path/to/model_repository:/models nvcr.io/nvidia/tritonserver:24.04-py3 tritonserver --model-repository=/models 
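Once the container is up, you can optionally confirm that the server is live, that the model loaded, and which input/output tensor names the server reports. Below is a quick check with the Triton Python client (run it inside the SDK container used in step 6, or anywhere the tritonclient[http] package is installed; this check is not part of the original tutorial):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_live())                 # True once the server is running
print(client.is_model_ready("densenet_onnx"))  # True once the model has loaded
# Shows the input/output tensor names and shapes Triton reports for the model
print(client.get_model_metadata("densenet_onnx"))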

5. Download the Triton Client code client.py from GitHub to a place you want to run the Triton Client from.

wget https://raw.githubusercontent.com/triton-inference-server/tutorials/main/Quick_Deploy/ONNX/client.py 

6. Run the Triton Client in the same location as the client.py file, install dependencies, and query the server.

docker run -it --rm --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:24.04-py3-sdk bash

pip install torchvision

wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

python3 client.py

7. Output

['11.549026:92' '11.232335:14' '7.528014:95' '6.923391:17' '6.576575:88'] 

Deploying a PyTorch Model

  1. Download and prepare the PyTorch model.

PyTorch models (.pt) will need to be converted to OpenVINO format. Create a downloadAndConvert.py file to download the PyTorch model and use the OpenVINO Model Converter to save a model.xml and model.bin:

import torchvision
import torch
import openvino as ov

# Download the pretrained ResNet-50 weights from torchvision
model = torchvision.models.resnet50(weights='DEFAULT')
# Convert the PyTorch model to OpenVINO format and save model.xml / model.bin
ov_model = ov.convert_model(model)
ov.save_model(ov_model, 'model.xml')

Install the dependencies:

pip install openvino 
pip install torchvision

Run downloadAndConvert.py

python3 downloadAndConvert.py 

To convert your own PyTorch model, refer to Converting a PyTorch Model.
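The input and output tensor names in the next step's config.pbtxt must match the names in the converted model. If you want to verify them, an optional helper like the following (not part of the original tutorial) prints what the converted model exposes:

import openvino as ov

# Load the converted IR and list its tensor names and shapes; these are the
# names ("x" and "x.45" for this ResNet-50 conversion) used in config.pbtxt
model = ov.Core().read_model("model.xml")
for inp in model.inputs:
    print("input:", inp.get_any_name(), inp.get_partial_shape())
for out in model.outputs:
    print("output:", out.get_any_name(), out.get_partial_shape())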

2. Create a new file named config.pbtxt

name: "resnet50 " 
backend: "openvino"
max_batch_size : 0
input [
{
name: "x"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
reshape { shape: [ 1, 3, 224, 224 ] }
}
]
output [
{
name: "x.45"
data_type: TYPE_FP32
dims: [ 1, 1000 ,1, 1]
reshape { shape: [ 1, 1000 ] }
}
]

3. Place the config.pbtxt file in the model repository along with model.xml and model.bin. The folder structure should look as follows:

model_repository
|
+-- resnet50
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.xml
        +-- model.bin

Note: Triton Inference Server requires this directory layout to locate the configuration and model files. Do not place any other folders or files in the model repository besides the required model files.

4. Run the Triton Inference Server. Make sure to update ‘/path/to/model_repository’ in the command to the location on your machine.

docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /path/to/model_repository:/models nvcr.io/nvidia/tritonserver:24.04-py3 tritonserver --model-repository=/models 

5. In another terminal, download the Triton Client code client.py from GitHub to the place you want to run the Triton Client from.

wget https://raw.githubusercontent.com/triton-inference-server/tutorials/main/Quick_Deploy/PyTorch/client.py 

In the client.py file, you’ll need to update the model input and output names to match those expected by the backend, since the converted model differs slightly from the one in the Triton tutorial. For example, change the original input name used in the PyTorch model (input__0) to the name used by the OpenVINO model (x), and update the output name to x.45 as defined in config.pbtxt.
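To illustrate the renaming, here is a minimal, self-contained sketch of the client call after the change. It sends a random tensor instead of the preprocessed image, purely to show the tensor names ("x", "x.45") and model name ("resnet50") from the config above; the actual client.py preprocesses img1.jpg:

import numpy as np
import tritonclient.http as httpclient

# Connect to the Triton server started in step 4 (HTTP port 8000)
client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy CHW float32 tensor; the real client.py sends the preprocessed image
dummy = np.random.rand(3, 224, 224).astype(np.float32)

# Names must match config.pbtxt ("x", "x.45"), not the original tutorial names
inputs = httpclient.InferInput("x", dummy.shape, datatype="FP32")
inputs.set_data_from_numpy(dummy, binary_data=True)
outputs = httpclient.InferRequestedOutput("x.45", binary_data=True, class_count=5)

results = client.infer(model_name="resnet50", inputs=[inputs], outputs=[outputs])
print(results.as_numpy("x.45"))  # top-5 entries formatted as b'score:class_id'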

6. Run the Triton Client in the same location as the client.py file, install dependencies, and query the server.

docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:24.04-py3-sdk bash

pip install torchvision

wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

python3 client.py

7. Output

[b'6.354599:14' b'4.292510:92' b'3.886345:90' b'3.333909:136'
 b'3.096908:15']

Note: OpenVINO also has an integration into TorchServe, which enables serving PyTorch models without conversion to OpenVINO IR format. See code samples here.

Deploying a TensorFlow Model

  1. Download and prepare the TensorFlow model.

Export the TensorFlow model in SavedModel format:

docker run -it --gpus all -v ${PWD}:/workspace nvcr.io/nvidia/tensorflow:24.04-tf2-py3

python3 export.py
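The export.py script comes from the upstream NVIDIA TensorFlow tutorial and is not reproduced in this post. A minimal equivalent sketch (the SavedModel directory name below is illustrative):

# export.py - export a pretrained Keras ResNet-50 as a TensorFlow SavedModel
import tensorflow as tf

model = tf.keras.applications.ResNet50(include_top=True, weights='imagenet')
# Writes a SavedModel directory; pass this path to ov.convert_model() in the next step
model.save('resnet50_saved_model')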

The model will need to be converted to OpenVINO format. Create a convert.py file to use the OpenVINO Model Converter to save a model.xml and model.bin:

import openvino as ov

# Convert the exported SavedModel directory to OpenVINO format and save model.xml / model.bin
ov_model = ov.convert_model('path_to_saved_model_dir')  # replace with the path to your SavedModel
ov.save_model(ov_model, 'model.xml')

Install the dependencies:

pip install openvino 

Run convert.py

python3 convert.py 

To convert your own TensorFlow model, refer to Converting a TensorFlow Model.

2. Create a new file named config.pbtxt

name: "resnet50" 
backend: "openvino"
max_batch_size : 0
input [
{
name: "input_1"
data_type: TYPE_FP32
dims: [-1, 224, 224, 3 ]
}
]
output [
{
name: "predictions"
data_type: TYPE_FP32
dims: [-1, 1000]
}
]

3. Place the config.pbtxt file in the model repository along with model.xml and model.bin. The structure should look as follows:

model_repository
|
+-- resnet50
    |
    +-- config.pbtxt
    +-- 1
        |
        +-- model.xml
        +-- model.bin

Note: Triton Inference Server requires this directory layout to locate the configuration and model files. Do not place any other folders or files in the model repository besides the required model files.

4. Run the Triton Inference Server. Make sure to update ‘/path/to/model_repository’ in the command to the location on your machine.

docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 -v /path/to/model_repository:/models nvcr.io/nvidia/tritonserver:24.04-py3 tritonserver --model-repository=/models 

5. In another terminal, download the Triton Client code client.py from GitHub to the place you want to run the Triton Client from.

wget https://raw.githubusercontent.com/triton-inference-server/tutorials/main/Quick_Deploy/TensorFlow/client.py 

6. Run the Triton Client in the same location as the client.py file, install dependencies, and query the server.

docker run -it --net=host -v ${PWD}:/workspace/ nvcr.io/nvidia/tritonserver:24.04-py3-sdk bash

pip install --upgrade tensorflow

pip install image

wget -O img1.jpg "https://www.hakaimagazine.com/wp-content/uploads/header-gulf-birds.jpg"

python3 client.py

7. Output

[b'0.301167:90' b'0.169790:14' b'0.161309:92' b'0.093105:94'
 b'0.058743:136' b'0.050185:11' b'0.033802:91' b'0.011760:88'
 b'0.008309:989' b'0.004927:95' b'0.004905:13' b'0.004095:317'
 b'0.004006:96' b'0.003694:12' b'0.003526:42' b'0.003390:313'
 ...
 b'0.000001:751' b'0.000001:685' b'0.000001:408' b'0.000001:116'
 b'0.000001:627' b'0.000001:933' b'0.000000:661' b'0.000000:148']

Summary

Overall, the combination of the Triton Inference Server and the OpenVINO backend provides a powerful solution for deploying and serving machine learning models with hardware acceleration, optimization, and model serving capabilities.

For further optimizations and tuning of model parameters in the ‘config.pbtxt’ on the server, please refer to the triton-inference-server/openvino_backend repository on GitHub.

If you are not already using Triton Inference Server, we recommend getting started with OpenVINO™ Model Server instead.

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
