Deploying a PyTorch model with Triton Inference Server in 5 minutes

Zabir Al Nazi Nabil
3 min read · Sep 28, 2022


Triton Inference Server

NVIDIA Triton Inference Server provides a cloud and edge inferencing solution optimized for both CPUs and GPUs. Triton supports multiple backends, including TensorRT, TensorFlow, PyTorch, Python, ONNX Runtime, and OpenVINO.

Triton serving PyTorch, TensorFlow, ONNX models. Image credit: NVIDIA

With Triton, it’s possible to deploy PyTorch, TensorFlow, or even XGBoost / LightGBM models. Triton can automatically optimize the model for inference on the GPU.

Installation with Docker

In this article, we will deploy a speaker recognition model called ECAPA-TDNN (https://arxiv.org/pdf/2005.07143.pdf) trained with PyTorch. But the same steps can be followed to deploy any PyTorch model.

For deep learning models, we want to make sure model inference takes place on the GPU to speed things up. I assume your system already has the NVIDIA drivers, CUDA, and nvidia-docker installed.

To pull the image and spawn the Triton Inference Server, just run:

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /path-to-model-repository:/models nvcr.io/nvidia/tritonserver:22.08-py3 tritonserver --model-repository=/models

Here, --gpus=1 gives the container access to one GPU, port 8000 is used for the HTTP endpoint, port 8001 for gRPC, and port 8002 for metrics. You have to specify the absolute path to the model repository (the path where you will put the models) in place of /path-to-model-repository.

If you want to install the latest version, you can find the tags here: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver/tags

Triton model server running

Deploying the PyTorch model

Deploying is as easy as creating a few folders inside the model repository path and placing the TorchScript model file there.

But before that, we need to convert the PyTorch model to a TorchScript model.
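A minimal sketch of the conversion, assuming model is your trained PyTorch module (here, the ECAPA-TDNN encoder) and that tracing with a dummy waveform of shape 1x48000 is appropriate for it; whether you should use torch.jit.trace or torch.jit.script depends on your model:

import torch

# Assumption: `model` is your trained PyTorch module, already loaded with weights
model.eval()

# Dummy input matching the expected shape: batch of 1, 48000 audio samples
example_input = torch.randn(1, 48000)

# Trace the model into a TorchScript module and save it as model.pt
traced = torch.jit.trace(model, example_input)
traced.save("model.pt")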

Once we have the model.pt torchscript model file, we just need to place it inside the model repository directory following a simple directory structure.

model-repository/
  - ecapatdnn
    - config.pbtxt
    - 1
      - model.pt

Here, ecapatdnn is the model name, 1 is the version number, config.pbtxt contains some details about the model input and output, and finally the model.pt file is stored inside folder “1”.

name: "ecapatdnn"platform: "pytorch_libtorch"max_batch_size: 1input[{    name: "INPUT__0"    data_type:  TYPE_FP32    dims: [-1]}]output:[{    name: "OUTPUT__0"    data_type:  TYPE_FP32    dims: [512]}]

For this model, the input is a speech signal of variable length N (hence dims: [-1]) and the output is an embedding vector of dimension 512 (not counting the batch dimension). You can specify the input and output shapes as per your model's requirements in the dims fields.

Now, let’s restart the server.

Triton server showing our deployed model

Python gRPC Client

Install the Python client for Triton with:

pip install tritonclient[all]

Here, I am using the gRPC endpoint as it's usually faster to get the response. I send an array with dimension 1x48000, where 1 is the batch size and 48000 is the length of the audio signal.
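A minimal client sketch, assuming the server is running on localhost:8001 and the model is registered as ecapatdnn with the INPUT__0 / OUTPUT__0 names from the config above (the random audio array is just a placeholder for a real waveform):

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to the Triton gRPC endpoint (assumption: server on localhost:8001)
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Dummy audio batch: 1 x 48000 float32 samples
audio = np.random.randn(1, 48000).astype(np.float32)

# Describe the input tensor and attach the data
infer_input = grpcclient.InferInput("INPUT__0", list(audio.shape), "FP32")
infer_input.set_data_from_numpy(audio)

# Ask for the embedding output
infer_output = grpcclient.InferRequestedOutput("OUTPUT__0")

# Run inference and read back the 1 x 512 embedding
response = client.infer(model_name="ecapatdnn", inputs=[infer_input], outputs=[infer_output])
embedding = response.as_numpy("OUTPUT__0")
print(embedding.shape)  # (1, 512)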

Triton inference with python client
