Speeding Up Deep Learning Inference by up to 20X

If your engineering team is not using NVIDIA TensorRT (TRT) for deep learning model deployment, stop everything and read this article.

abhishek kushwaha
5 min read · Feb 19, 2023

Convert your trained Keras, TensorFlow, or PyTorch models to ONNX and TensorRT format to run inference at lightning speed on GPU. This post walks through the process with an example.

Introduction

Deep learning has become the standard modelling technique for computer vision and language problems. Almost everyone focuses on training and developing these models for their use case, but hardly anyone focuses on optimal deployment methods. Data scientists and engineers often deploy with the same libraries they trained with (PyTorch, TensorFlow, etc.), which are well suited for training but not for inference.

To cut costs and increase inference speed, you should use NVIDIA TensorRT (TRT) for deployment. Yes, the same model that was trained with TensorFlow/PyTorch needs to be converted to TRT format and then deployed on the GPU. The table below shows the speed gain from using TRT.

Inference speed comparison of a ResNet-50 TensorFlow (TF) model (FP16) on CPU with TF, GPU with TF, and GPU with TRT: an 18X gain with TRT compared to TF on GPU. (credits at the bottom)

Example

We will now walk through an example of converting a TF-trained model to TRT and then running inference with the TRT engine.

Requirements

First, install the required libraries:

pip install tensorflow-gpu==2.4.0

# Other requirements
pip install -U tf2onnx==1.8.2 pycuda
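
Before moving on, it is worth confirming that the whole stack imports cleanly. The quick check below is a minimal sketch; it assumes the TensorRT Python bindings and CUDA were installed alongside your driver as part of your TensorRT setup (they do not come from pip in this walkthrough).

import tensorflow as tf
import tf2onnx
import tensorrt as trt
import pycuda.driver as cuda  # only imported to confirm pycuda is usable

print("TensorFlow:", tf.__version__)
print("tf2onnx:", tf2onnx.__version__)
print("TensorRT:", trt.__version__)
print("GPUs visible to TF:", tf.config.list_physical_devices('GPU'))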

Step 1: Converting the model to .pb

The first step is to convert the model to a .pb file (TensorFlow SavedModel format). The following code example does this for the ResNet-50 model:

import tensorflow as tf
from tensorflow.keras.models import Model

# Make sure this is a folder path (no file extension)
save_path = '/path/to/save_model_folder'

# Load the ResNet-50 model pretrained on imagenet
model = tf.keras.applications.resnet.ResNet50(include_top=True, weights='imagenet', input_tensor=None, input_shape=None, pooling=None, classes=1000)

# Convert the Keras ResNet-50 model to a .pb file
# saved_model.pb will be saved at save_path
model.save(save_path)
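
Optionally, you can reload the SavedModel and print its serving signature to confirm the input shape before converting. This is a small sketch assuming the default serving_default signature that model.save produces; for ResNet-50 the input should be a float32 tensor of shape (None, 224, 224, 3).

import tensorflow as tf

save_path = '/path/to/save_model_folder'  # same folder as above

# Reload the SavedModel and inspect its serving signature
loaded = tf.saved_model.load(save_path)
infer = loaded.signatures['serving_default']
print(infer.structured_input_signature)
print(infer.structured_outputs)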

Step 2: Converting the .pb file to ONNX

The second step is to convert the .pb model to the ONNX format. To do this, first install tf2onnx (already covered in the requirements above).

After installing tf2onnx, there are two ways to convert the model from a .pb file to ONNX: via the command line or via the Python API. Using the command line, run the following (a Python API sketch is shown after the command):

python -m tf2onnx.convert  --saved-model /path/to/saved_model  --output resnet50.onnx
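
The same conversion can also be done from Python. Newer tf2onnx releases expose a from_keras helper; the snippet below is a sketch that assumes a tf2onnx version providing this API (roughly 1.9 and later, i.e. slightly newer than the version pinned above) and goes straight from the Keras model without the intermediate .pb file.

import tensorflow as tf
import tf2onnx

# Load the same Keras ResNet-50 model as in Step 1
model = tf.keras.applications.resnet.ResNet50(weights='imagenet')

# Describe the input (NHWC, float32) and convert directly to ONNX
spec = (tf.TensorSpec((None, 224, 224, 3), tf.float32, name='input'),)
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec,
                                           opset=13, output_path='resnet50.onnx')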

Step 3: Creating the TensorRT engine from ONNX

To create the TensorRT engine from the ONNX file, first save the code below into a file named engine.py:

# filename - engine.py
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path, shape=[1, 224, 224, 3]):
    """
    This is the function to create the TensorRT engine
    Args:
        onnx_path : Path to the ONNX file.
        shape : Shape of the input of the ONNX file.
    """
    # The ONNX parser requires a network with an explicit batch dimension
    explicit_batch = 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    with trt.Builder(TRT_LOGGER) as builder, \
            builder.create_network(explicit_batch) as network, \
            builder.create_builder_config() as config, \
            trt.OnnxParser(network, TRT_LOGGER) as parser:
        config.max_workspace_size = (256 << 20)  # 256 MiB of builder workspace
        with open(onnx_path, 'rb') as model:
            parser.parse(model.read())
        # Fix the input shape (including the batch dimension)
        network.get_input(0).shape = shape
        engine = builder.build_engine(network, config)
        return engine

def save_engine(engine, file_name):
    # Serialize the engine and write it to disk as a .plan file
    buf = engine.serialize()
    with open(file_name, 'wb') as f:
        f.write(buf)

def load_engine(trt_runtime, plan_path):
    # Read a serialized .plan file and deserialize it into an engine
    with open(plan_path, 'rb') as f:
        engine_data = f.read()
    engine = trt_runtime.deserialize_cuda_engine(engine_data)
    return engine

Now run the following code to create the TRT engine and save it:

import engine as eng  # import from the engine.py file above
from onnx import ModelProto
import tensorrt as trt

engine_name = "resnet50.plan"
onnx_path = "/path/to/resnet50.onnx"  # the ONNX file created in Step 2
batch_size = 1

# Read the ONNX model to get its input dimensions
model = ModelProto()
with open(onnx_path, "rb") as f:
    model.ParseFromString(f.read())

# The Keras ResNet-50 export is NHWC, so dims 1-3 are height, width, channels
d0 = model.graph.input[0].type.tensor_type.shape.dim[1].dim_value
d1 = model.graph.input[0].type.tensor_type.shape.dim[2].dim_value
d2 = model.graph.input[0].type.tensor_type.shape.dim[3].dim_value
shape = [batch_size, d0, d1, d2]

engine = eng.build_engine(onnx_path, shape=shape)
eng.save_engine(engine, engine_name)
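
As a quick sanity check, you can deserialize the plan you just saved and list its bindings; this simply reuses the load_engine helper from engine.py above. For our ResNet-50 you should see one input binding of shape (1, 224, 224, 3) and one output binding of shape (1, 1000).

import engine as eng
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)

# Deserialize the saved plan and print its input/output bindings
engine = eng.load_engine(trt_runtime, "resnet50.plan")
for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(i, kind, engine.get_binding_name(i), engine.get_binding_shape(i))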

Step 4: Running inference from the TensorRT engine

Save the code below in a file named inference.py. It will be used to run inference with the engine file saved above.

# inference.py
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # imported for its side effect of initializing the CUDA context
import numpy as np

def allocate_buffers(engine, batch_size, data_type):
    """
    This is the function to allocate buffers for input and output in the device
    Args:
        engine : The deserialized TensorRT engine (not a path).
        batch_size : The batch size for execution time.
        data_type : The type of the data for input and output, for example trt.float32.

    Output:
        h_input_1: Input in the host (page-locked).
        d_input_1: Input in the device.
        h_output: Output in the host (page-locked).
        d_output: Output in the device.
        stream: CUDA stream.
    """
    # Determine dimensions and create page-locked memory buffers
    # (which won't be swapped to disk) to hold host inputs/outputs.
    h_input_1 = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(0)), dtype=trt.nptype(data_type))
    h_output = cuda.pagelocked_empty(batch_size * trt.volume(engine.get_binding_shape(1)), dtype=trt.nptype(data_type))
    # Allocate device memory for inputs and outputs.
    d_input_1 = cuda.mem_alloc(h_input_1.nbytes)
    d_output = cuda.mem_alloc(h_output.nbytes)
    # Create a stream in which to copy inputs/outputs and run inference.
    stream = cuda.Stream()
    return h_input_1, d_input_1, h_output, d_output, stream

def load_images_to_buffer(pics, pagelocked_buffer):
    # Flatten the (already preprocessed) images into the page-locked host buffer
    preprocessed = np.asarray(pics).ravel()
    np.copyto(pagelocked_buffer, preprocessed)

def do_inference(engine, pics_1, h_input_1, d_input_1, h_output, d_output, stream, batch_size, height, width):
    """
    This is the function to run the inference
    Args:
        engine : The TensorRT engine.
        pics_1 : Input images to the model (already preprocessed).
        h_input_1: Input in the host.
        d_input_1: Input in the device.
        h_output: Output in the host.
        d_output: Output in the device.
        stream: CUDA stream.
        batch_size : Batch size for execution time.
        height: Height of the input image.
        width: Width of the input image.

    Output:
        The raw model output (flattened host array).
    """
    load_images_to_buffer(pics_1, h_input_1)

    with engine.create_execution_context() as context:
        # Transfer input data to the GPU.
        cuda.memcpy_htod_async(d_input_1, h_input_1, stream)

        # Run inference. The engine was built with an explicit batch dimension,
        # so use execute_async_v2 on the same stream as the copies.
        context.execute_async_v2(bindings=[int(d_input_1), int(d_output)], stream_handle=stream.handle)

        # Transfer predictions back from the GPU.
        cuda.memcpy_dtoh_async(h_output, d_output, stream)
        # Synchronize the stream.
        stream.synchronize()

    return h_output

Step 5: Inference

It's time to use our TRT model for inference. We will use an image from the internet for this purpose (download it and save it to a path). This is an image of a goldfish. The ImageNet dataset has 1000 classes, and goldfish has class id 1. Since we are using a ResNet model pretrained on ImageNet, the TRT model should be able to predict it with high confidence. Run the code below for inference.

import engine as eng
import inference as inf
import tensorrt as trt
import numpy as np
import cv2
from tensorflow.keras.applications.resnet50 import preprocess_input

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
trt_runtime = trt.Runtime(TRT_LOGGER)
input_file_path = '/path/to/goldfish.jpg'
serialized_plan_fp32 = "/path/to/resnet50.plan"
HEIGHT = 224
WIDTH = 224

image = cv2.imread(input_file_path)
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
image = cv2.resize(image, (224,224)).astype(np.float32)
image = preprocess_input(image)

engine = eng.load_engine(trt_runtime, serialized_plan_fp32)
h_input, d_input, h_output, d_output, stream = inf.allocate_buffers(engine, 1, trt.float32)
output = inf.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1, HEIGHT, WIDTH)
print("Probability of image having goldfish", output[1])

Final words

On a T4 GPU, TRT gives a 6X speed gain over standard TensorFlow inference (model.predict). This is huge, as deployment latency is reduced by a large factor, which in turn cuts cost significantly. The speed gain depends on the model architecture, precision, and various other parameters: some models give up to 40X gains while others give only 3X.
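
Your exact numbers will depend on the GPU, batch size, and precision, so it is worth measuring on your own hardware. The rough timing harness below is a sketch that assumes the engine, preprocessed image, and buffers from Step 5 are still in scope; note that do_inference creates a fresh execution context on every call, so this is a conservative estimate of TRT latency.

import time

def benchmark_trt(n_runs=100):
    # Warm-up runs so one-time CUDA/engine initialisation is not timed
    for _ in range(10):
        inf.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1, HEIGHT, WIDTH)
    start = time.perf_counter()
    for _ in range(n_runs):
        inf.do_inference(engine, image, h_input, d_input, h_output, d_output, stream, 1, HEIGHT, WIDTH)
    return (time.perf_counter() - start) / n_runs

print("Average TRT latency per image: %.2f ms" % (benchmark_trt() * 1000))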

One current limitation of TRT is that not all models are directly convertible. Some require changes to the model definition before they can be converted to TRT. If you face any problem with the conversion, you can contact me by leaving a message below.

Credits: The top image chart and some code have been taken from the official NVIDIA websites.

Model Training Tricks: Check out my post on neural network training tricks. You will be amazed by the results when you implement any of them.
