Accelerating Model Inference with TensorRT: Tips and Best Practices for PyTorch Users

hengtao tantai
10 min read · Apr 1, 2023

TensorRT is a high-performance deep-learning inference library developed by NVIDIA. It is designed to optimize and accelerate the inference of deep neural networks on NVIDIA GPUs. TensorRT includes a set of libraries and tools for converting trained models from popular deep learning frameworks such as TensorFlow, PyTorch, and ONNX into a format that can be efficiently executed on NVIDIA GPUs.

(Image source: NVIDIA Developer Blog)

TensorRT achieves high performance by using a combination of techniques such as kernel auto-tuning, layer fusion, precision calibration, and dynamic tensor memory management. These techniques enable TensorRT to achieve higher throughput and lower latency than generic deep learning inference engines.

TensorRT is used in a wide range of applications, such as image and speech recognition, natural language processing, autonomous vehicles, and recommendation systems. Its high performance and efficient inference make it a popular choice for real-time applications where low latency is critical.

How to install TensorRT

  1. Check the system requirements: TensorRT requires an NVIDIA GPU with Compute Capability 5.3 or higher, and CUDA 10.2 or higher installed on the system.
  2. Download the TensorRT package: Go to the NVIDIA website and download the TensorRT package for your operating system and GPU architecture.
  3. Install the TensorRT package: Extract the downloaded package and run the installer script provided. The installation script will guide you through the installation process.
  4. Set up the environment variables: After the installation is complete, you need to set up the environment variables to use TensorRT. You can add the following lines to your ~/.bashrc file:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/TensorRT-<version>/lib
export PATH=$PATH:/usr/local/TensorRT-<version>/bin

Replace <version> with the version number of TensorRT that you installed.

  5. Verify the installation: You can verify the installation by running the sample programs provided with the TensorRT package. The sample programs are located in the /usr/src/tensorrt/samples directory.
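You can also confirm that the Python bindings are visible to your interpreter with a quick check (a minimal sketch; it assumes the TensorRT Python package was installed alongside the C++ libraries):

import tensorrt as trt
print(trt.__version__)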

That’s it! You have successfully installed TensorRT on your system.

TensorRT with PyTorch

To use TensorRT with PyTorch, you can follow these general steps:

  1. Train and export the PyTorch model: First, you need to train the PyTorch model and export it in a format that TensorRT can use. You can do this with the torch.onnx.export() function, which converts the PyTorch model to the ONNX format.
  2. Optimize the ONNX model for TensorRT: Once you have the ONNX model, you can use TensorRT’s trtexec tool to optimize the model for TensorRT. This tool takes the ONNX model as input and generates a TensorRT engine file that can be loaded and used for inference. The trtexec tool also allows you to specify various optimization parameters such as the precision mode, batch size, and input/output shapes.
  3. Load the optimized TensorRT engine in Python: Once you have the optimized TensorRT engine file, you can load it in Python using the tensorrt.Runtime class, whose deserialize_cuda_engine() method returns a tensorrt.ICudaEngine object (alternatively, you can build the engine directly in Python with the tensorrt.Builder class, as in the full example further below). The ICudaEngine object manages the optimized model; a short sketch of loading a serialized engine follows this list.
  4. Run inference on the TensorRT engine: Finally, you create an IExecutionContext from the ICudaEngine and use it to run inference. To do this, you need to allocate memory for the input and output tensors, copy the input data to the GPU memory, execute the inference using the execution context, and then copy the output data back to the CPU memory.

Note that the exact steps and code for using TensorRT with PyTorch may vary depending on the specific PyTorch model and use case. However, these general steps provide a good starting point for using TensorRT with PyTorch.
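As a minimal sketch of step 3, this is how a serialized engine file could be loaded; the file name my_model.engine is an assumption (for example, one produced by trtexec --onnx=my_model.onnx --saveEngine=my_model.engine):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Deserialize an engine that was built ahead of time (for example with trtexec)
with open('my_model.engine', 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

# The execution context is what actually runs inference on the engine
context = engine.create_execution_context()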

How to convert a PyTorch model to TensorRT

Here is example code that demonstrates how to convert a PyTorch model to TensorRT using the ONNX format (it assumes pycuda is installed for the device-memory handling):

# (TensorRT 8.x Python API, with pycuda for device memory)
import numpy as np
import onnx
import pycuda.autoinit  # creates a CUDA context (assumes pycuda is installed)
import pycuda.driver as cuda
import tensorrt as trt
import torch

# Define the PyTorch model
class MyModel(torch.nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.linear = torch.nn.Linear(10, 5)
        self.relu = torch.nn.ReLU()

    def forward(self, x):
        x = self.linear(x)
        x = self.relu(x)
        return x

# Create an instance of the PyTorch model
model = MyModel()

# Export the PyTorch model to ONNX with named inputs and outputs
dummy_input = torch.randn(1, 10)
onnx_filename = 'my_model.onnx'
torch.onnx.export(model, dummy_input, onnx_filename, input_names=['input'], output_names=['output'])

# Load the ONNX model
model_onnx = onnx.load(onnx_filename)

# Create a TensorRT logger, builder and an explicit-batch network (required by the ONNX parser)
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))

# Parse the ONNX model into the TensorRT network
parser = trt.OnnxParser(network, logger)
parser.parse(model_onnx.SerializeToString())

# Set up an optimization profile and builder parameters
# (a profile is only required when the network has dynamic input shapes; it is shown here for completeness)
profile = builder.create_optimization_profile()
profile.set_shape("input", (1, 10), (1, 10), (1, 10))
builder_config = builder.create_builder_config()
builder_config.max_workspace_size = 1 << 30  # 1 GiB of builder workspace
builder_config.set_flag(trt.BuilderFlag.STRICT_TYPES)

# Build the TensorRT engine from the parsed network
engine = builder.build_engine(network, builder_config)

# Allocate pinned host memory and device memory for the input and output buffers
input_shape = (1, 10)
output_shape = (1, 5)
h_input = cuda.pagelocked_empty(trt.volume(input_shape), dtype=np.float32)
h_output = cuda.pagelocked_empty(trt.volume(output_shape), dtype=np.float32)
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
stream = cuda.Stream()

# Create a TensorRT execution context
context = engine.create_execution_context()

# Run inference on the TensorRT engine
input_data = torch.randn(1, 10).numpy()
np.copyto(h_input, input_data.ravel())
cuda.memcpy_htod_async(d_input, h_input, stream)
context.execute_async_v2(bindings=[int(d_input), int(d_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(h_output, d_output, stream)
stream.synchronize()
output_data = h_output.reshape(output_shape)

# Print the output
print(output_data)
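Once the engine has been built, it can be serialized to disk so that later runs can reload it (with trt.Runtime, as sketched earlier) instead of rebuilding it; the file name here is just an example:

# Serialize the engine for later reuse
with open('my_model.engine', 'wb') as f:
    f.write(engine.serialize())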

Speed Comparison

Here is example code that compares the inference speed of a TensorRT engine (created from a PyTorch model through ONNX) with the original PyTorch model:

import time

import numpy as np
import onnx
import onnx_tensorrt.backend as backend  # assumes the onnx-tensorrt package is installed
import torch

# Define a simple PyTorch model
class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = torch.nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
        self.relu1 = torch.nn.ReLU()
        self.conv2 = torch.nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.relu2 = torch.nn.ReLU()
        self.pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = torch.nn.Linear(64 * 16 * 16, 512)
        self.relu3 = torch.nn.ReLU()
        self.fc2 = torch.nn.Linear(512, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = self.relu1(x)
        x = self.conv2(x)
        x = self.relu2(x)
        x = self.pool(x)
        x = x.view(-1, 64 * 16 * 16)
        x = self.fc1(x)
        x = self.relu3(x)
        x = self.fc2(x)
        return x

# Export the PyTorch model to ONNX
model = MyModel()
input_shape = (1, 3, 32, 32)
input_names = ['input']
output_names = ['output']
dummy_input = torch.randn(input_shape)
torch.onnx.export(model, dummy_input, 'my_model.onnx', verbose=False, input_names=input_names, output_names=output_names)

# Load the ONNX model and create a TensorRT engine from it via the onnx-tensorrt backend
model_onnx = onnx.load('my_model.onnx')
trt_engine = backend.prepare(model_onnx, device='CUDA:0')

# Measure PyTorch inference speed on the GPU (random weights are fine for a speed test)
model = model.cuda().eval()
num_iterations = 1000
total_time = 0.0
with torch.no_grad():
    for i in range(num_iterations):
        input_data = torch.randn(input_shape).cuda()
        torch.cuda.synchronize()
        start_time = time.time()
        output_data = model(input_data)
        torch.cuda.synchronize()  # wait for the GPU so the timing is accurate
        end_time = time.time()
        total_time += end_time - start_time
pytorch_fps = num_iterations / total_time
print(f"PyTorch FPS: {pytorch_fps:.2f}")

# Measure TensorRT inference speed through the same engine
num_iterations = 1000
total_time = 0.0
for i in range(num_iterations):
    input_data = np.random.randn(*input_shape).astype(np.float32)
    start_time = time.time()
    output_data = trt_engine.run(input_data)[0]
    end_time = time.time()
    total_time += end_time - start_time
tensorrt_fps = num_iterations / total_time
print(f"TensorRT FPS: {tensorrt_fps:.2f}")
print(f"Speedup: {tensorrt_fps/pytorch_fps:.2f}x")

On an RTX 3090 I got the following result:

PyTorch FPS: 512.36
TensorRT FPS: 2155.14
Speedup: 4.21x

This means that the TensorRT engine can perform inference on the given PyTorch model about 4.21 times faster than running the PyTorch model directly on the same hardware. However, the actual performance of the system may vary depending on factors such as the specific GPU model, hardware configuration, and input data size.

Running the same test on an RTX 2080 Ti gave:

PyTorch FPS: 265.83
TensorRT FPS: 1006.91
Speedup: 3.79x

Issues to be noted

When converting a PyTorch model to a TensorRT engine, there are several issues that should be noted:

  1. Precision differences: TensorRT may use different numerical precision than PyTorch (for example FP16 or INT8), which can lead to small differences in the model’s output. This is especially important to consider if the model will be used in safety-critical applications (a simple tolerance check is sketched after this list).
  2. Dynamic shapes: PyTorch models can have dynamic input shapes, meaning that the input shape can vary from one inference to the next. TensorRT requires static input shapes by default, meaning that the input shape must be known and fixed at the time of engine creation; dynamic dimensions must be explicitly declared when creating the TensorRT engine.
  3. Unsupported operations: Not all PyTorch operations are supported by TensorRT. Some operations may need to be manually implemented in TensorRT or replaced with supported operations that provide similar functionality.
  4. Memory usage: TensorRT engines require additional memory for storing intermediate results and optimization data. This means that the memory requirements of a TensorRT engine may differ from those of the original PyTorch model and should be taken into account when deploying the model.
  5. TensorRT version: The version of TensorRT used for engine creation and inference should be compatible with the version of PyTorch (and of the ONNX exporter) used to create the original model. If the versions are not compatible, the conversion process may fail or the performance of the TensorRT engine may be suboptimal.
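For the precision point, a quick sanity check is to run the same input through both the PyTorch model and the TensorRT engine and compare the two outputs within a tolerance. A minimal sketch, where torch_output and trt_output are assumed to be the two results as NumPy arrays of the same shape:

import numpy as np

# torch_output: output of the PyTorch model, e.g. model(x).detach().cpu().numpy()
# trt_output:   output copied back from the TensorRT engine for the same input
abs_diff = np.abs(torch_output - trt_output)
print("max abs diff:", abs_diff.max())
print("allclose (atol=1e-3):", np.allclose(torch_output, trt_output, atol=1e-3))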

How to solve the dynamic shapes issue when converting an NLP model to TensorRT

When converting an NLP model that has dynamic input shapes to a TensorRT engine, the issue of dynamic shapes can be solved by using TensorRT’s “dynamic shapes” feature. Here are the general steps to follow:

  1. Export the model with dynamic dimensions: Mark the variable dimensions (such as batch size and sequence length) as dynamic, for example via the dynamic_axes argument of torch.onnx.export().
  2. Specify the minimum and maximum dimensions for each input tensor: Create an optimization profile with the builder’s create_optimization_profile() method, set the minimum, optimum, and maximum shape for each dynamic input with its set_shape() method, and register the profile with the builder config using add_optimization_profile() before building the engine.
  3. Allocate input and output buffers: Allocate memory large enough for the maximum profile shapes. This can be done once at the start of the program, and the same buffers can be reused for multiple inferences.
  4. Set the dimensions of the input tensors: The actual input dimensions must be set for each inference using the set_binding_shape method of the TensorRT IExecutionContext class.
  5. Run inference: Pass the input bindings to the execute_v2 (or execute_async_v2) method of the TensorRT IExecutionContext class to run inference. The output tensor buffer will be updated with the results of the inference.

By using dynamic shapes, you can avoid re-creating the TensorRT engine every time the input shape changes. Instead, you can reuse the same engine and simply set the actual shapes for each inference. This can improve performance and reduce memory usage. The build-time side of this is sketched below, followed by example code for running inference with dynamic shapes.
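At build time, the dynamic dimensions are declared through an optimization profile. A minimal sketch of this step (the file name bert.onnx, the input names input_ids and attention_mask, and the shape ranges are assumptions for a BERT-style model exported with dynamic axes):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

# Parse an ONNX model whose batch and sequence-length dimensions were exported as dynamic
with open('bert.onnx', 'rb') as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30

# (min, opt, max) shapes for each dynamic input: batch size 1, sequence length 1 to 512
profile = builder.create_optimization_profile()
profile.set_shape('input_ids', (1, 1), (1, 128), (1, 512))
profile.set_shape('attention_mask', (1, 1), (1, 128), (1, 512))
config.add_optimization_profile(profile)

# Build the engine with the dynamic-shape profile baked in
engine = builder.build_engine(network, config)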

# Run inference with dynamic shapes
# (assumes the engine above, plus an execution context, a CUDA stream, and pinned
#  host/device buffer pairs sized for the maximum profile shapes, allocated as in
#  the earlier examples: input_buffers, output_buffers and the bindings list)
import numpy as np
import pycuda.driver as cuda
import torch

# Load example inputs (BERT-style token IDs and attention mask; dtypes must match the engine bindings)
input_ids = torch.randint(0, 30522, size=(1, 512), dtype=torch.int32)
attention_mask = torch.ones((1, 512), dtype=torch.int32)

# Set the actual input dimensions for this inference
# (bindings 0 and 1 are assumed to be input_ids and attention_mask)
context.set_binding_shape(0, tuple(input_ids.shape))
context.set_binding_shape(1, tuple(attention_mask.shape))

# Run inference with dynamic shapes
for i in range(10):
    # Copy the input tensors into the pinned host buffers, then onto the device
    np.copyto(input_buffers[0].host, input_ids.numpy().ravel())
    np.copyto(input_buffers[1].host, attention_mask.numpy().ravel())
    cuda.memcpy_htod_async(input_buffers[0].device, input_buffers[0].host, stream)
    cuda.memcpy_htod_async(input_buffers[1].device, input_buffers[1].host, stream)

    # Run inference
    context.execute_async_v2(bindings, stream.handle)

    # Copy the output tensor back to the host (the example assumes a 2-class classification head)
    output = np.empty([1, 2], dtype=np.float32)
    cuda.memcpy_dtoh_async(output, output_buffers[0].device, stream)

    # Wait for the GPU to finish processing
    stream.synchronize()

    # Print the output
    print(output)

# Destroy the engine and free the device memory
del context
del engine
for buffer in input_buffers + output_buffers:
    buffer.device.free()

Note that this code is just an example and may need to be modified for your specific use case. Also, performance may vary depending on specific hardware and software configurations.

PyTorch operations that are unsupported in TensorRT

Here is a list of some PyTorch operations that are currently unsupported in TensorRT:

  1. Control flow operations: PyTorch supports dynamic control flow operations such as loops and conditionals, while TensorRT does not.
  2. Dynamic shape operations: TensorRT requires fixed shapes for inputs and outputs, while some PyTorch operations support dynamic shapes.
  3. Some activation functions: TensorRT supports a limited set of activation functions, and some PyTorch activation functions are not supported. For example, the hardshrink and softshrink activation functions are not currently supported in TensorRT.
  4. Some pooling operations: TensorRT supports a limited set of pooling operations, and some PyTorch pooling operations are not supported. For example, the adaptive_max_pool1d and fractional_max_pool2d operations are not currently supported in TensorRT.
  5. Some padding operations: TensorRT supports a limited set of padding operations, and some PyTorch padding operations are not supported. For example, the reflection_pad2d and replication_pad2d operations are not currently supported in TensorRT.
  6. Some normalization operations: TensorRT supports a limited set of normalization operations, and some PyTorch normalization operations are not supported. For example, the group_norm operation is not currently supported in TensorRT.

It’s important to note that this is not an exhaustive list, and the set of unsupported operations may change over time as new versions of TensorRT are released. It’s always a good idea to check the TensorRT documentation and release notes to see if there are any changes or updates that may affect your model. The list of ONNX operators supported by the TensorRT ONNX parser is maintained here:

https://github.com/onnx/onnx-tensorrt/blob/main/docs/operators.md
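In practice, an easy way to find out whether a particular model contains unsupported operations is to let the ONNX parser report them (a minimal sketch; my_model.onnx is assumed to be the exported model):

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open('my_model.onnx', 'rb') as f:
    if not parser.parse(f.read()):
        # Each error typically names the ONNX node or operator that could not be converted
        for i in range(parser.num_errors):
            print(parser.get_error(i))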

Conclusion

TensorRT is an efficient, high-performance tool for accelerating deep learning inference on NVIDIA GPUs. With TensorRT, users can optimize their models for inference and achieve significant speedups, which is essential for many real-world applications, such as image and speech recognition, object detection, and natural language processing.

However, to use TensorRT effectively, users need to be aware of its limitations and best practices. This includes understanding how to convert their models to TensorRT, dealing with unsupported operations and dynamic shapes, optimizing their model’s memory usage, and ensuring that their TensorRT configuration is set up correctly.

By following these tips and investing time in testing and tuning models, users can get the most out of TensorRT and achieve significant performance improvements in their deep learning applications. With TensorRT’s advanced optimizations and support for a wide range of deep learning frameworks, it is a valuable tool for anyone looking to accelerate their deep learning workflows.
