From Setup to Deployment: A Guide to Setting Up Multiple Models on a Single NVIDIA Triton Server

Anush Pittu
Everything is Connected
6 min read · May 2, 2024

In this post, we’ll go over the benefits of utilising a Triton server for inference and walk you through the process of setting up multiple models on a single NVIDIA Triton Server.

The NVIDIA Triton Inference Server (formerly known as TensorRT Inference Server) is an open-source software solution developed by NVIDIA. It provides a cloud inference solution optimised for NVIDIA GPUs. Triton simplifies the deployment of AI models at scale in production.

Triton Inference Server is designed to deploy a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, ONNX Runtime, and many others.

Its primary use cases are:

- Serving multiple models from a single server instance.

- Dynamic model loading and unloading without server restart (see the sketch after this list).

- Ensemble inference, allowing multiple models to be chained together to produce a single result.

- Model versioning for A/B testing and rolling updates.
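As a quick illustration of the dynamic loading point above, here is a minimal sketch of loading and unloading a model through the Triton HTTP client. It assumes the server was started with --model-control-mode=explicit and uses the custom_resnet model defined later in this post; adapt the names to your setup.

import tritonclient.http as httpclient

# Requires a server started with:
#   tritonserver --model-repository /models --model-control-mode=explicit
client = httpclient.InferenceServerClient("localhost:8000")

# Load a model from the repository without restarting the server
client.load_model("custom_resnet")
print(client.is_model_ready("custom_resnet"))  # True once loading finishes

# Unload it again when it is no longer needed
client.unload_model("custom_resnet")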

Before you can use the Triton Docker image, install Docker and verify the installation. If you plan on using a GPU for inference, you must also install the NVIDIA Container Toolkit.

Prerequisites:

- TritonClient

pip install tritonclient[all]

Steps:

1. After successful installation, pull the Docker image of the Triton server using the following command:
docker pull nvcr.io/nvidia/tritonserver:<yy.mm>-py3

Where <yy.mm> is the version of Triton that you want to pull.

For a complete list of all the variants and versions of the Triton Inference Server container, visit the NGC page.

For example:

docker pull nvcr.io/nvidia/tritonserver:23.09-py3

2. Set up a folder structure like the one below:

<model-repository>/
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  <model-name>/
    [config.pbtxt]
    [<output-labels-file> ...]
    <version>/
      <model-definition-file>
    <version>/
      <model-definition-file>
    ...
  ...

Folder structure example:

model_repository/
├── text_detection
│   ├── 1
│   │   └── model.onnx
│   ├── 2
│   │   └── model.onnx
│   └── config.pbtxt
└── text_recognition
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
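If you prefer to scaffold this layout from code, here is a minimal sketch using pathlib. The model and folder names mirror the example above; copying the actual model.onnx files and writing each config.pbtxt is still up to you.

from pathlib import Path

# Create the repository layout from the example above
repo = Path("model_repository")
for model_name, versions in [("text_detection", [1, 2]), ("text_recognition", [1])]:
    for version in versions:
        # Each version gets its own numbered sub-folder holding the model file
        (repo / model_name / str(version)).mkdir(parents=True, exist_ok=True)
    # config.pbtxt sits next to the version folders; fill it in per model
    (repo / model_name / "config.pbtxt").touch()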

3. How to write the config file

The config file (config.pbtxt) helps us define parameters like:

- input data config
- output config
- max_batch_size
- instance_group (GPU/CPU)
- dynamic_batching
- etc.

For most models, the Triton feature that provides the largest performance improvement is dynamic batching.

The dynamic batcher combines individual inference requests into larger batches, greatly boosting throughput compared to processing requests one at a time. To enable the dynamic batcher in Triton, simply add the following line to the bottom of the config file:

dynamic_batching { }
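The empty block above enables dynamic batching with its default settings. The batcher can also be tuned, for example with preferred batch sizes and a maximum queue delay; the numbers below are purely illustrative and should be chosen based on your latency budget:

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}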

Config file example for a ResNet50 model:

name: "custom_resnet"
backend: "python"
max_batch_size: 128
input {
name: "INPUT"
data_type: TYPE_FP32
format: FORMAT_NCHW
dims: [ 3, 224, 224 ]
}
output {
name: "OUTPUT"
data_type: TYPE_FP32
dims: [ 1000 ]
}

instance_group [{ kind: KIND_CPU }]
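For the ONNX models in the folder-structure example (text_detection, text_recognition), the config points to the ONNX Runtime backend instead of the Python backend. Below is a hedged sketch for text_detection; the tensor names, shapes, and batch size are placeholders and must match what your exported model actually expects:

name: "text_detection"
backend: "onnxruntime"
max_batch_size: 8

input {
  name: "input_images"   # placeholder: use your model's real input name
  data_type: TYPE_FP32
  dims: [ 3, -1, -1 ]
}

output {
  name: "feature_maps"   # placeholder: use your model's real output name
  data_type: TYPE_FP32
  dims: [ 1, -1, -1 ]
}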

4. How to set up custom models using the Python backend

The Python backend (python_backend) lets Triton Inference Server serve models written in Python. To use it, the model.py file needs to follow the structure and naming conventions below:

class TritonPythonModel:
    """Your Python model must use the same class name. Every Python model
    that is created must have "TritonPythonModel" as the class name.
    """

    def initialize(self, args):
        """`initialize` is called only once when the model is being loaded.
        Implementing `initialize` function is optional. This function allows
        the model to initialize any state associated with this model.
        """
        print('Initialized...')

    def execute(self, requests):
        """`execute` must be implemented in every Python model. `execute`
        function receives a list of pb_utils.InferenceRequest as the only
        argument. This function is called when an inference is requested
        for this model.
        """
        responses = []

        # Every Python backend must iterate through list of requests and create
        # an instance of pb_utils.InferenceResponse class for each of them.
        # Reusing the same pb_utils.InferenceResponse object for multiple
        # requests may result in segmentation faults. You should avoid storing
        # any of the input Tensors in the class attributes as they will be
        # overridden in subsequent inference requests. You can make a copy of
        # the underlying NumPy array and store it if it is required.
        for request in requests:
            # Perform inference on the request and append it to responses
            # list...
            pass

        # You must return a list of pb_utils.InferenceResponse. Length
        # of this list must match the length of `requests` list.
        return responses

    def finalize(self):
        """`finalize` is called only once when the model is being unloaded.
        Implementing `finalize` function is optional. This function allows
        the model to perform any necessary clean ups before exit.
        """
        print('Cleaning up...')

And here is the model.py for the ResNet50 example:

import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from torch.utils.dlpack import to_dlpack


class TritonPythonModel:
    def initialize(self, args):
        device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu"
        device_id = args["model_instance_device_id"]
        self.device = f"{device}:{device_id}"
        self.model = (
            torch.hub.load(
                "pytorch/vision:v0.14.1",
                "resnet50",
                weights="IMAGENET1K_V2",
                skip_validation=True,
            )
            .to(self.device)
            .eval()
        )

    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "INPUT")
            with torch.no_grad():
                result = self.model(
                    torch.as_tensor(input_tensor.as_numpy(), device=self.device)
                )
            out_tensor = pb_utils.Tensor.from_dlpack("OUTPUT", to_dlpack(result))
            responses.append(pb_utils.InferenceResponse([out_tensor]))
        return responses

5. Once the image is downloaded, run it using the following Docker command:

docker run -it -d -p 8000:8000 -v {triton_repo_path}:/models {tag} /bin/bash

where “triton_repo_path” should be the absolute path to the “model_repository” folder and “tag” is the Triton image pulled earlier.

For the above example, that would be the absolute path of ‘../../model_repository’.

Inside the container, some libraries need to be installed, which can be done by running the following commands:

apt-get update && apt-get install -y libgl1

pip3 install torch==1.13.0+cu117 -f https://download.pytorch.org/whl/torch_stable.html torchvision==0.14.0+cu117 pillow opencv-python pandas

6. Now the server can be started by simply running the command below:

tritonserver --model-repository /models

Instead of running the above in the terminal, the same can be done with the following Python code:

import subprocess

# Placeholders: set these to your absolute model repository path and the pulled image tag
triton_repo_path = "/absolute/path/to/model_repository"
tag = "nvcr.io/nvidia/tritonserver:23.09-py3"

# Run the container and store the id in a variable
container_id = subprocess.check_output(
    f'docker run -d -p 8000:8000 -v {triton_repo_path}:/models {tag} /bin/bash -c "apt-get update && apt-get install -y libgl1 && pip3 install torch==1.13.0 -f https://download.pytorch.org/whl/torch_stable.html torchvision==0.14.0 pillow opencv-python pandas && tritonserver --model-repository /models"',
    shell=True,
).decode('utf-8').strip()

# Kill and remove the container at the end
subprocess.call(f'docker rm -f {container_id}', shell=True)
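Once the container is up, it can take a little while for the server and the models to come online. Here is a small sketch that polls for readiness before sending any requests; the model name matches the custom_resnet example above.

import time
import tritonclient.http as httpclient

triton_client = httpclient.InferenceServerClient("localhost:8000")

# Poll until the server and the model report ready (give up after ~2 minutes)
for _ in range(60):
    try:
        if triton_client.is_server_ready() and triton_client.is_model_ready("custom_resnet"):
            print("Server and model are ready")
            break
    except Exception:
        pass  # server not accepting connections yet
    time.sleep(2)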

Inference part:

Now that we understand how to set up the server, let’s look at the inference code for the same ResNet50 example. If you are following along, do not forget to include a ResNet50 labels file.

import warnings

import numpy as np
import torch
import tritonclient.http as httpclient

# Ignore warning messages
warnings.filterwarnings("ignore")

# Define constants and variables for the model, image URL, server URL, and label file
model_name = 'custom_resnet'
image_url = "http://images.cocodataset.org/test2017/000000557146.jpg"
url = "localhost:8000"
label_file = './model_repo/custom_models/custom_resnet/resnet50_labels.txt'

# Load utility functions for processing with NVIDIA's DeepLearningExamples
utils = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_convnets_processing_utils",
    skip_validation=True,
)

triton_client = httpclient.InferenceServerClient(url)

# Read labels from the label file and create a dictionary
with open(label_file) as f:
    labels_dict = {idx: line.strip() for idx, line in enumerate(f)}

# Define input and output names for the model
input_name = "INPUT"
output_name = "OUTPUT"

# Prepare input batch from the image URL using NVIDIA's utility function
batch = np.asarray(utils.prepare_input_from_uri(image_url))

# Create inference input and output objects for Triton
input = httpclient.InferInput(input_name, batch.shape, "FP32")
output = httpclient.InferRequestedOutput(output_name)

# Set input data from the prepared batch
input.set_data_from_numpy(batch)

# Perform a single inference using Triton Inference Server
results = triton_client.infer(
    model_name=model_name, inputs=[input], outputs=[output]
)

# Get output data from the results and find the index of the maximum value
output_data = results.as_numpy(output_name)
max_id = np.argmax(output_data, axis=1)[0]

# Print the class label corresponding to the maximum value
print("Results is class: {}".format(labels_dict[max_id]))

And on successful inference, the output should look something like this:

Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip

Results is class: TABBY

To send multiple inference requests concurrently after enabling dynamic batching as shown above, the inference code can be modified as below; it builds on the single-request script and uses gevent to fire the requests in parallel.

import gevent
from gevent import monkey

# Monkey-patch the standard library so blocking calls cooperate with gevent
monkey.patch_all()

# Example list of image URLs to classify; reuses the single image from above
image_urls = [image_url] * 8


def infer_and_print_result(image_url):
    batch = np.asarray(utils.prepare_input_from_uri(image_url))
    input = httpclient.InferInput(input_name, batch.shape, "FP32")
    output = httpclient.InferRequestedOutput(output_name)
    input.set_data_from_numpy(batch)
    results = triton_client.infer(
        model_name=model_name, inputs=[input], outputs=[output]
    )
    output_data = results.as_numpy(output_name)
    max_id = np.argmax(output_data, axis=1)[0]
    print("Results for {}: {}".format(image_url, labels_dict[max_id]))


# Fire the requests concurrently so the server has a chance to batch them
jobs = [gevent.spawn(infer_and_print_result, image_url) for image_url in image_urls]
gevent.joinall(jobs)
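One way to confirm that those concurrent requests were actually batched is to look at the model’s inference statistics, which Triton reports with execution counts broken down by batch size. A minimal sketch that simply dumps the statistics for inspection:

import json

import tritonclient.http as httpclient

triton_client = httpclient.InferenceServerClient("localhost:8000")

# Batch sizes larger than 1 in the statistics indicate that dynamic batching kicked in
stats = triton_client.get_inference_statistics(model_name="custom_resnet")
print(json.dumps(stats, indent=2))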
