Accelerate PyTorch Model With TensorRT via ONNX

zong fan
4 min read · Nov 5, 2019


TensorRT optimization pipeline for inference

PyTorch is one of the most popular deep learning frameworks thanks to the simplicity and flexibility of its dynamic computation graph design. But compared with static-graph frameworks like TensorFlow, PyTorch offers fewer tools for accelerating models at inference time, since a dynamic computation graph takes more effort to optimize (that said, PyTorch 1.3 brings many new features to ease deployment, such as quantization and a mobile toolkit). Luckily, Nvidia provides TensorRT, a high-performance inference framework for GPUs that supports models from mainstream frameworks and formats such as TensorFlow, Caffe, and ONNX. By converting the PyTorch model to ONNX first, we can boost inference speed by running it through TensorRT's ONNX backend.

1. Setting up the ONNX-TensorRT environment

I prefer to run the code in a Docker container, an isolated environment that spares you many annoying dependency problems. The onnx-tensorrt repository provides Dockerfiles for building the image. First, clone the repository and download the TensorRT tar or deb package to your host machine.

git clone --recurse-submodules https://github.com/onnx/onnx-tensorrt.git
cd onnx-tensorrt
# install with tar file
cp /path/to/TensorRT-6.0.*.tar.gz .
docker build -f docker/onnx-tensorrt-tar.Dockerfile --tag=onnx-tensorrt:6.0.6 .
# or install with deb file
cp /path/to/nv-tensorrt-repo-ubuntu1x04-cudax.x-trt6.x.x.x-ga-yyyymmdd_1-1_amd64.deb .
docker build -f docker/onnx-tensorrt-deb.Dockerfile --tag=onnx-tensorrt:6.0.6 .

2. Converting PyTorch model to ONNX model

Since PyTorch has an ONNX exporter integrated into its library, it's quite easy to run the conversion directly from PyTorch. Here is an example of the conversion.

import torch


def load_model_weight(model, model_path):
    # Load the trained weights and switch the model to evaluation mode.
    model.load_state_dict(torch.load(model_path))
    model = model.eval()
    return model


def export_onnx_model(model, input_shape, onnx_path, input_names=None,
                      output_names=None, dynamic_axes=None):
    # Trace the model with a dummy input and export it to ONNX.
    inputs = torch.ones(*input_shape)
    model(inputs)
    torch.onnx.export(model, inputs, onnx_path, input_names=input_names,
                      output_names=output_names, dynamic_axes=dynamic_axes)


if __name__ == "__main__":
    from model import IR_50

    model_path = "model_checkpoint.pth"
    model = IR_50([112, 112])
    model = load_model_weight(model, model_path)
    for name, v in model.named_parameters():
        print(name)

    input_shape = (10, 3, 112, 112)  # fixed batch size of 10
    onnx_path = "test.onnx"
    # input_names = ['input']
    # output_names = ['output']
    # dynamic_axes = {'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}
    export_onnx_model(model, input_shape, onnx_path)

In this case, I'm converting a ResNet-50 based model to ONNX format with a fixed batch size of 10. Turn the dynamic_axes argument on if you want to export the ONNX model with dynamic inputs, including batch size, input width, or height. NOTE: the conversion process may fail with errors. Check that all your operations are supported, referring to this list, and that your PyTorch version is the latest. In my experience, updating PyTorch solves 99% of these problems.
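
As a quick sanity check after export, you can load the ONNX file with the onnx package and run its checker. This is a minimal sketch; the file name test.onnx comes from the script above.

import onnx

# Load the exported graph and verify that it is a well-formed ONNX model.
onnx_model = onnx.load("test.onnx")
onnx.checker.check_model(onnx_model)

# Print a human-readable summary of the graph to inspect its inputs and outputs.
print(onnx.helper.printable_graph(onnx_model.graph))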

Besides, in some cases converting a model to ONNX makes the graph more complicated; the PReLU operation is one such case. The vanilla converted ONNX structure looks like this:

Complicated PReLU in ONNX

Such a ‘complicated’ model will fail to load in TensorRT, so we need to simplify it first with this tool: onnx-simplifier. After simplification, the PReLU node returns to its normal form (a short usage sketch follows the figure).

Simplified PReLU
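
Here is a minimal sketch using onnx-simplifier's Python API (the onnxsim package must be installed; its command-line form, python3 -m onnxsim input.onnx output.onnx, does the same thing). The output file name is my own choice.

import onnx
from onnxsim import simplify

# Load the exported model and fold away redundant operators (e.g. the PReLU pattern above).
model = onnx.load("test.onnx")
model_simp, check = simplify(model)
assert check, "Simplified ONNX model could not be validated"
onnx.save(model_simp, "test_sim.onnx")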

3. Running model on TensorRT

Following this TensorRT developer guide step by step, we can run the ONNX model with TensorRT (a minimal sketch follows the notes below).

NOTE:

  • PyCUDA is required to manage host (CPU) and device (GPU) memory for the input and output tensors.
  • According to the PyCUDA documentation, the GPUArray class is another option that accepts arrays directly, without the manual allocation and copying done beforehand.
  • context.execute_async is used when asynchronous execution is needed to speed up inference, for example when large numbers of requests arrive simultaneously.
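
For reference, here is a minimal sketch of that workflow under the TensorRT 6/7 Python API: parse the ONNX file into an engine, then move tensors with PyCUDA and run the execution context. The function names and the 1 GB workspace size are my own choices rather than anything from the developer guide, and because the network is built in explicit-batch mode the call is execute_async_v2 rather than the execute_async mentioned above.

import numpy as np
import pycuda.autoinit  # creates a CUDA context on import
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def build_engine(onnx_path):
    # Parse the ONNX file and build a TensorRT engine (explicit-batch network).
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(explicit_batch) as network, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        builder.max_workspace_size = 1 << 30  # 1 GB of scratch space
        with open(onnx_path, "rb") as f:
            if not parser.parse(f.read()):
                for i in range(parser.num_errors):
                    print(parser.get_error(i))
                return None
        return builder.build_cuda_engine(network)


def infer(engine, input_array):
    # Copy the input to the GPU, run the engine asynchronously, copy the output back.
    # Assumes binding 0 is the input and binding 1 is the output.
    with engine.create_execution_context() as context:
        h_input = np.ascontiguousarray(input_array, dtype=np.float32)
        h_output = np.empty(trt.volume(engine.get_binding_shape(1)), dtype=np.float32)
        d_input = cuda.mem_alloc(h_input.nbytes)
        d_output = cuda.mem_alloc(h_output.nbytes)
        stream = cuda.Stream()
        cuda.memcpy_htod_async(d_input, h_input, stream)
        context.execute_async_v2(bindings=[int(d_input), int(d_output)],
                                 stream_handle=stream.handle)
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        stream.synchronize()
        return h_output


engine = build_engine("test.onnx")
output = infer(engine, np.ones((10, 3, 112, 112), dtype=np.float32))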

Actually, the onnx-tensorrt repository has already wrapped this snippet of code and exposes a backend API for inference; it just takes a little more effort to compile the Python module.
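
Based on the usage shown in the onnx-tensorrt README, running the exported model through that backend looks roughly like this (the device string and input shape follow the example above):

import numpy as np
import onnx
import onnx_tensorrt.backend as backend

# Prepare a TensorRT engine from the ONNX model and push one batch through it.
model = onnx.load("test.onnx")
engine = backend.prepare(model, device="CUDA:0")
input_data = np.random.random(size=(10, 3, 112, 112)).astype(np.float32)
output = engine.run(input_data)[0]
print(output.shape)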

4. Performance

We roughly tested different batch sizes to compare the average per-inference time of onnxruntime-gpu versus TensorRT. As the following graph shows, inference with TensorRT is about 10x faster than with onnxruntime-gpu once the batch size is larger than 10.
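
For context, a rough measurement on the onnxruntime-gpu side can be done with a simple timing loop like the sketch below. This is not the exact benchmark used here; the helper name is my own, and the input name is looked up from the session because the export above did not set input_names.

import time

import numpy as np
import onnxruntime as ort


def average_inference_time(onnx_path, batch_size=10, n_runs=100):
    # Time n_runs forward passes after one warm-up run and return the mean.
    sess = ort.InferenceSession(onnx_path)
    input_name = sess.get_inputs()[0].name
    data = np.random.random((batch_size, 3, 112, 112)).astype(np.float32)
    sess.run(None, {input_name: data})  # warm-up
    start = time.perf_counter()
    for _ in range(n_runs):
        sess.run(None, {input_name: data})
    return (time.perf_counter() - start) / n_runs


print("avg inference time: %.4f s" % average_inference_time("test.onnx"))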

5. Summary

By now, we have a rough sense of how much TensorRT can accelerate a PyTorch model on GPUs. This should come as no surprise: the GPUs are Nvidia's own product, so Nvidia is better positioned than anyone else to optimize for this platform. But TensorRT goes far beyond this one step. Since it flexibly supports multiple frameworks, language APIs, and custom layers, feel free to experiment with it to find the best recipe for you.
