Convert Detectron2 model to TensorRT

Swapnesh Khare
CARS24 Data Science Blog
4 min read · Oct 12, 2022

This post covers the steps needed to convert a Detectron2 (MaskRCNN) model to TensorRT format and deploy it on Triton Inference Server.

By default, there is no easy way to deploy a Detectron2 model through Triton: no Triton backend supports it out of the box. The workaround is to use the python backend in config.pbtxt and write your own Python code, running on Triton, to load the model and run predictions. While this is a fairly easy task, as of Triton 22.08 we cannot have more than one such model deployed through Triton on Kubernetes (issue).
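For reference, that python-backend workaround looks roughly like the model.py below. This is only a sketch: the tensor names ("IMAGE", "BOXES"), the file paths, and the post-processing are assumptions, not the exact code we ran.

# model.py for Triton's python backend -- rough sketch of the workaround.
# Tensor names, file paths and post-processing are placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor


class TritonPythonModel:
    def initialize(self, args):
        # Load the trained Detectron2 model once, when Triton loads the model
        cfg = get_cfg()
        cfg.merge_from_file("/models/maskrcnn/1/config.yaml")
        cfg.MODEL.WEIGHTS = "/models/maskrcnn/1/model.pth"
        self.predictor = DefaultPredictor(cfg)

    def execute(self, requests):
        responses = []
        for request in requests:
            # Expect an HWC uint8 BGR image tensor named "IMAGE"
            image = pb_utils.get_input_tensor_by_name(request, "IMAGE").as_numpy()
            instances = self.predictor(image)["instances"].to("cpu")
            boxes = instances.pred_boxes.tensor.numpy().astype(np.float32)
            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[pb_utils.Tensor("BOXES", boxes)]
                )
            )
        return responses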

To resolve this, we need to convert our Detectron2 model to TensorRT and use the tensorrt_plan backend. This blog follows this tutorial, but with an easier setup, optimizations, and detailed steps to help others avoid getting stuck :)

The scripts required for the conversion can be found in the same tutorial repository.

Prerequisites

  1. Trained Detectron2 MaskRCNN model (.pth/.pkl file) and model config.yaml
  2. The model is trained on square images. The conversion scripts currently only support square image dimensions (ref)
  3. An example image to adjust dimensions of tensors
  4. Docker installed on a system with a GPU (adjust the docker run command for a CPU-only system)

Step 1: Create Docker image to run conversion scripts

We first need to create a Docker image with the packages required to run the conversion scripts. Some PyTorch and TensorRT dependencies are needed, for which we can use the PyTorch image from NVIDIA NGC and install the other dependencies on top of it.

We have to use onnx==1.8.1 because onnx.optimizer was moved to a separate repository in onnx 1.9.
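Something along the lines of the following Dockerfile should work; the exact package list beyond onnx==1.8.1 is an assumption and should be adjusted to whatever the conversion scripts actually import:

FROM nvcr.io/nvidia/pytorch:22.08-py3

# onnx must stay at 1.8.1 because onnx.optimizer was removed from onnx in 1.9
RUN pip install --no-cache-dir onnx==1.8.1 onnx-graphsurgeon jupyter

# Detectron2 is needed to load the trained model and its config
RUN pip install --no-cache-dir 'git+https://github.com/facebookresearch/detectron2.git'

WORKDIR /workspace

Build the image and start a container: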

docker build -t custom-pytorch:22.08 .
docker run -p 8888:8888 --gpus all -it --rm custom-pytorch:22.08

Once inside the container, we can run Jupyter:

jupyter notebook --port=8888 --ip=0.0.0.0 --allow-root --no-browser .

And access it at localhost:8888 (or <VM-IP>:8888). The login token will be shown on the terminal.

Step 2: Run conversion scripts

First, we need to upload all the scripts located here to Jupyter.

Next, download the Jupyter notebook from this repo and upload it to Jupyter. This notebook contains the commands to be executed to convert the model. Make sure to run the cells in order and make the necessary changes to the scripts as suggested in the notebook.

While running the cells, you will see warning logs about missing operators during the ONNX conversion. This is normal: ONNX currently does not support these operators, which is why we cannot use the ONNX model directly and instead create a TensorRT engine that contains them.

The generated trt file can now be used for inference.
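As an optional sanity check, the engine can be loaded back and benchmarked with trtexec (assuming it was saved as engine.trt, as in the conversion command shown later in this post):

!trtexec --loadEngine=engine.trt --useCudaGraph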

Step 3 (Optional): TensorRT Optimizations

Now that our model has been converted to TensorRT, we can optimize our model for better throughput.

Change precision

TensorRT gives us the option to change the network precision, which is FP32 by default. There are a few options to choose from, and the choice depends on whether lowering the precision distorts the model's predictions. For example, add the --fp16 flag to the trtexec conversion command in the notebook:

!trtexec --onnx=converted.onnx --saveEngine=engine.trt --useCudaGraph --fp16

This will change the precision to FP16, which can increase the throughput of the model severalfold.

[Image: Performance summary with FP32 (on NVIDIA T4)]
[Image: Performance summary with FP16 (on NVIDIA T4)]

Change batch size

By default the model has a batch size of 1. Depending on the use case and the available resources, we can change the model's batch size by adding the --batch_size flag to the Caffe2-to-ONNX conversion step (create_onnx.py):

!python create_onnx.py --onnx converted.onnx --exported_onnx model.onnx --det2_config config.yml -s new.jpg --det2_weights region.pth --batch_size 4

Step 4: Deploy TensorRT engine to Triton Inference Server

The converted TensorRT model can now be deployed on Triton using the tensorrt_plan backend. Rename the .trt file to model.plan, as expected by Triton. The model repository structure should be as follows:

<model-repo (model-repository-path)>/
  <maskrcnn (model-name)>/
    [config.pbtxt]
    <1 (version)>/
      <model.plan (model-file)>
    ...
  ...

Below is a sketch of a sample config.pbtxt (the input/output tensor names, data types, and dimensions are placeholders; they must match your engine and can be read from the trtexec log):
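name: "maskrcnn"
platform: "tensorrt_plan"
# The engine has an explicit batch dimension, so batching is expressed in the
# tensor dims below rather than via max_batch_size
max_batch_size: 0
input [
  {
    name: "input_tensor"          # placeholder: use your engine's input name
    data_type: TYPE_FP32
    dims: [ 1, 3, 1024, 1024 ]    # placeholder: batch x channels x H x W (square)
  }
]
output [
  {
    name: "detection_boxes"       # placeholder: list every output of your engine
    data_type: TYPE_FP32
    dims: [ 1, 100, 4 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]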

The Triton Server version used here is the same as that of the PyTorch image, i.e. 22.08. Other versions ship different TensorRT builds and might not be able to load the engine or the operators it needs.

To load the converted model on Triton locally, run the following command:

docker run --gpus all --rm -p8001:8001 -p8000:8000 -p8002:8002 -v <full-path-to-model-repo>:/models nvcr.io/nvidia/tritonserver:22.08-py3 tritonserver --model-repository=/models

[Image: Mask generated by Detectron2]
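Once the server is up, it can be queried with the Triton client library (pip install tritonclient[http]). The sketch below assumes the placeholder tensor names and input size from the config above and sends a dummy image; real usage needs the same pre-processing (resize, normalization) that the notebook applies:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Dummy pre-processed input: NCHW float32, square resolution the engine was built for
image = np.random.rand(1, 3, 1024, 1024).astype(np.float32)

inputs = [httpclient.InferInput("input_tensor", list(image.shape), "FP32")]
inputs[0].set_data_from_numpy(image)

outputs = [httpclient.InferRequestedOutput("detection_boxes")]

result = client.infer(model_name="maskrcnn", inputs=inputs, outputs=outputs)
print(result.as_numpy("detection_boxes").shape)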

Triton can also be deployed on Kubernetes as per this doc.

Conclusion

Detectron2 makes Object Detection simple and quick. But that simplicity is only useful if the model can be deployed at scale and with good performance :)

Authors: Swapnesh Khare, Senior ML Engineer @ CARS24, Archit Jain, Data Scientist @ CARS24

References

  1. Notebook
  2. Official conversion doc
  3. Triton on GKE
