Convert Detectron2 model to TensorRT
This post covers the steps needed to convert a Detectron2 (MaskRCNN) model to TensorRT format and deploy it on Triton Inference Server.
By default, there is no easy way to deploy a Detectron2 model through Triton: no Triton backend supports it natively. The workaround is to use the python backend in config.pbtxt and write your own Python code, running on Triton, to load the model and predict. While this is a relatively easy task, as of Triton 22.08 we cannot have more than one such model deployed through Triton on Kubernetes (issue).
To resolve this, we need to convert our Detectron2 model to TensorRT and use the tensorrt_plan backend. This blog follows this tutorial, but with an easier setup, optimizations, and detailed steps to help others avoid getting stuck :)
The scripts required for the conversion can be found in the same tutorial repository.
Prerequisites
- Trained Detectron2 MaskRCNN model (.pth/.pkl file) and the model's config.yaml
- The model must be trained on square images; the conversion scripts currently support only square image dimensions (ref)
- An example image, used to fix the dimensions of the input tensors
- Docker installed on a system with a GPU (adjust the docker run command for a CPU-only system)
Step 1: Create Docker image to run conversion scripts
We first need to create a Docker image with the packages required to run the conversion scripts. Some PyTorch and TensorRT dependencies are needed, for which we can use the pytorch image from NVIDIA NGC and install the other dependencies on top of it.
We have to use onnx==1.8.1, since onnx.optimizer was moved to a separate repository in 1.9. Create the custom image and start a container:
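A minimal Dockerfile sketch for this image. The exact pip package set is an assumption based on typical requirements of the conversion scripts; adjust it to match the tutorial repository:

```dockerfile
# Base image with PyTorch + TensorRT preinstalled (NVIDIA NGC)
FROM nvcr.io/nvidia/pytorch:22.08-py3

# onnx must stay at 1.8.1: onnx.optimizer was moved out of the main
# onnx package in 1.9, and the conversion scripts depend on it.
RUN pip install --no-cache-dir \
        onnx==1.8.1 \
        onnx-graphsurgeon \
        jupyter

WORKDIR /workspace
```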
docker build -t custom-pytorch:22.08 .
docker run -p 8888:8888 --gpus all -it --rm custom-pytorch:22.08
Once inside the container, we can run Jupyter:
jupyter notebook --port=8888 --ip=0.0.0.0 --allow-root --no-browser .
And access it at localhost:8888 (or <VM-ip>:8888). The login token will be shown on the terminal.
Step 2: Run conversion scripts
First, we need to upload all the scripts located here to Jupyter.
Next, download the Jupyter notebook from this repo and upload it to Jupyter as well. The notebook contains the commands to convert the model; run the cells in order and make the changes to the scripts suggested in the notebook.
While running the cells, you will see warning logs about missing operators during the ONNX conversion. This is normal: ONNX currently doesn't support these operators, which is why we cannot use the ONNX model directly and instead create a TensorRT engine that contains them.
The generated .trt file can now be used for inference.
Step 3 (Optional): TensorRT Optimizations
Now that our model has been converted to TensorRT, we can optimize our model for better throughput.
Change precision
TensorRT lets us change the network precision level, which is FP32 by default. There are a few options to choose from, and the choice depends on whether lowering the precision distorts the model's predictions. For example, add the --fp16 flag to the TensorRT conversion command in the notebook:
!trtexec --onnx=converted.onnx --saveEngine=engine.trt --useCudaGraph --fp16
This will change the precision to FP16, which can increase the model's throughput several-fold.
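Since lower precision can distort predictions, it is worth comparing the outputs of the FP32 and FP16 engines on a few sample images before committing to FP16. A small helper for that comparison (a sketch; the default tolerance is an arbitrary starting point to tune against your accuracy budget, not a TensorRT recommendation):

```python
import numpy as np

def precision_drift(fp32_out, fp16_out, atol=1e-2):
    """Compare outputs (e.g. detection scores) from two engines.

    Returns the maximum absolute difference and whether it stays
    within `atol`.
    """
    diff = float(np.max(np.abs(np.asarray(fp32_out) - np.asarray(fp16_out))))
    return diff, diff <= atol
```

If the drift is too large for your use case, keep FP32 or consider mixed precision.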
Change batch size
By default the model has a batch size of 1. Depending on the use case and available resources, we can change the model's batch size by adding the --batch_size flag to the Caffe2-to-ONNX conversion step:
!python create_onnx.py --onnx converted.onnx --exported_onnx model.onnx --det2_config config.yml -s new.jpg --det2_weights region.pth --batch_size 4
Step 4: Deploy TensorRT engine to Triton Inference Server
The converted TensorRT model can now be deployed on Triton using the tensorrt_plan backend. Rename the .trt file to model.plan, as expected by Triton. The model repository structure should be as follows:
<model-repo (model-repository-path)>/
<maskrcnn (model-name)>/
[config.pbtxt]
<1 (version)>/
<model.plan (model-file)>
...
...
Below is a sample config.pbtxt:
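A minimal sketch of what such a config might look like. The input/output tensor names, shapes, and types below are assumptions based on the TensorRT Detectron2 sample; verify your engine's actual tensors (e.g. with trtexec or polygraphy inspect) before using them:

```
name: "maskrcnn"
platform: "tensorrt_plan"
max_batch_size: 0
input [
  {
    name: "input_tensor"
    data_type: TYPE_FP32
    dims: [ 1, 3, 1344, 1344 ]
  }
]
output [
  {
    name: "detection_boxes"
    data_type: TYPE_FP32
    dims: [ 1, 100, 4 ]
  },
  {
    name: "detection_scores"
    data_type: TYPE_FP32
    dims: [ 1, 100 ]
  },
  {
    name: "detection_masks"
    data_type: TYPE_FP32
    dims: [ 1, 100, 28, 28 ]
  }
]
```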
The Triton Server version used here is the same as that of the pytorch image, i.e. 22.08. Other versions might not contain the operators needed by the TensorRT engine.
To load the converted model on Triton locally, run the following command:
docker run --gpus all --rm -p8001:8001 -p8000:8000 -p8002:8002 -v <full-path-to-model-repo>:/models nvcr.io/nvidia/tritonserver:22.08-py3 tritonserver --model-repository=/models
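With the server running, inference requests must supply an input tensor matching the engine's square input shape. A preprocessing sketch (the 1344×1344 size, NCHW float32 layout, and absence of normalization are assumptions; match whatever your engine was actually built with):

```python
import numpy as np

def preprocess(image, size=1344):
    """Resize (keeping aspect ratio) and pad an HWC uint8 image to a
    square NCHW float32 batch of one.

    `size` must match the square dimension the TensorRT engine was
    built with (1344 here is an assumption; check your engine).
    Uses nearest-neighbour resizing via index maps to avoid an
    OpenCV dependency.
    """
    h, w = image.shape[:2]
    scale = size / max(h, w)
    ys = np.clip((np.arange(round(h * scale)) / scale).astype(int), 0, h - 1)
    xs = np.clip((np.arange(round(w * scale)) / scale).astype(int), 0, w - 1)
    resized = image[ys][:, xs]
    # Pad bottom/right with zeros to reach the square size
    padded = np.zeros((size, size, 3), dtype=np.float32)
    padded[: resized.shape[0], : resized.shape[1]] = resized
    # HWC -> CHW, then add a batch dimension
    return np.transpose(padded, (2, 0, 1))[np.newaxis]
```

The resulting array can then be sent to the server with the tritonclient package, using your model's name and input tensor name.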
Triton can also be deployed on Kubernetes as per this doc.
Conclusion
Detectron2 makes Object Detection simple and quick. But this simplicity is only useful if the model can be deployed at scale with good performance :)
Authors: Swapnesh Khare, Senior ML Engineer @ CARS24, Archit Jain, Data Scientist @ CARS24