Deployment Ready Deep Learning Models with TensorRT

Raj Prasanna Ponnuraj · Published in Analytics Vidhya · 4 min read · Aug 27, 2020

Deep Learning has a wide range of applications: self-driving cars, aerial surveillance, real-time face recognition, and real-time language processing, to name a few. One requirement runs through all of them: REAL TIME. Considering the real-time performance (throughput) these models need, we have to optimize the trained model so that it is lightweight yet delivers close to training accuracy.

TensorRT is a Deep Learning Inference platform from NVIDIA. It is built on the NVIDIA CUDA programming model, which lets us leverage the massive parallel performance of NVIDIA GPUs. Models from almost all popular Deep Learning frameworks can be parsed and optimized with TensorRT for low-latency, high-throughput inference on NVIDIA GPUs.


With TensorRT, we can apply various optimizations effortlessly. The following are a few of the important ones.

1. Mixed Precision Inference

2. Layer Fusion

3. Batching

Mixed Precision Inference

Single Precision Floating Point, or FP32 for short, is the precision of choice when it comes to Deep Learning training.


FP32 uses 1 bit for the sign, 8 bits for the exponent and 23 bits for the fraction, which is ideal for all those gradient calculations and updates. During inference, though, if the model delivers close to training accuracy while being half as heavy as it was during training, we gain lower memory utilization and higher throughput. With TensorRT, we can create a production model in FP16 or INT8 precision.
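As a minimal sketch of how this is requested (assuming the TensorRT 7.x Python API), reduced precision is a builder flag; the INT8 calibrator mentioned in the comment is a hypothetical object:

```python
import tensorrt as trt

# Sketch: requesting reduced precision at engine-build time (TensorRT 7.x API).
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

if builder.platform_has_fast_fp16:
    config.set_flag(trt.BuilderFlag.FP16)     # allow half-precision kernels

# INT8 additionally needs a calibration step to map FP32 ranges to 8 bits:
# config.set_flag(trt.BuilderFlag.INT8)
# config.int8_calibrator = my_calibrator      # hypothetical calibrator object
```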

Layer Fusion

Before talking about layer fusion, let us look at how an instruction is processed. To process an instruction, the operands have to be transferred from memory to registers, the operation is then carried out by the processor, and the result is copied back to memory. With this rough picture in mind, let us look at layer fusion.


Layers in most Deep Learning models follow a sequence. For example, a Convolution layer is followed by a Batch Normalization layer, which is followed by an Activation layer. Here we have three operations to be performed one after the other. Instead of transferring data back and forth between memory and registers for each operation, with layer fusion we transfer the data from memory to registers once, perform all three operations sequentially, and transfer only the final result back to memory. Each of the three operations would otherwise need its own load and store, six transfers in total; fusion brings that down to two, saving four costly data transfer cycles. A typical fusible pattern is sketched below.
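For concreteness, here is the kind of layer sequence described above, written as a plain PyTorch module (an illustrative sketch, not taken from my experiment). TensorRT can fuse these three layers into a single kernel when the engine is built:

```python
import torch.nn as nn

# Conv -> BatchNorm -> ReLU: three separate layers in PyTorch, but a single
# fused kernel after TensorRT optimization, so the intermediate results
# never make the round trip back to memory.
conv_bn_relu = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)
```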

Batching

GPUs have thousands of processing cores; the trick is to use them efficiently. By planning a proper batch size for our input data based on the target deployment platform, we can optimally leverage the huge number of available cores. In TensorRT this is expressed at engine-build time, as sketched below.
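As a rough sketch of how this looks with the TensorRT 7.x Python API (assuming an explicit-batch network whose input tensor is named "input" and was exported with a dynamic batch dimension), an optimization profile tells the builder which batch sizes the engine should be tuned for:

```python
import tensorrt as trt

# Sketch: declaring the batch-size range the engine should support.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()

profile = builder.create_optimization_profile()
profile.set_shape("input",                # assumed input tensor name
                  min=(1, 3, 32, 32),     # smallest batch we expect
                  opt=(32, 3, 32, 32),    # batch size TensorRT tunes kernels for
                  max=(128, 3, 32, 32))   # largest batch we expect
config.add_optimization_profile(profile)
```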

My experiment with TensorRT

I experimented with TensorRT, and the results were as good as NVIDIA claims. I used an NVIDIA GeForce GTX 1650 GPU and the PyTorch container from NGC. I'll write a separate article on NGC later. The PyTorch container comes loaded with all the libraries I needed for my experiment, removing all those installation hurdles.

For this experiment I used the CIFAR10 dataset and a simple custom-built CNN. Achieving 95%+ accuracy was not the intention of this work, so I did not focus much on the architecture and hyper-parameters; I was interested in working with TensorRT and experiencing the performance boost that it gives.

Process Flow

  1. Train a CNN on CIFAR10 dataset
  2. Save the best model in .pth format
  3. Create .onnx version of the saved model
  4. From the ONNX model create a TensorRT engine and save it as a .plan file for reuse
  5. Use the TensorRT engine for high performance Deep Learning Inference
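Steps 3 and 4 look roughly like the sketch below, assuming the TensorRT 7.x Python API; the network class MyCNN, the file names, and the CIFAR10 input shape are illustrative stand-ins, not the exact code from my notebook.

```python
import torch
import tensorrt as trt

# Step 3: export the trained PyTorch model (illustrative names) to ONNX.
model = MyCNN()                                   # hypothetical CNN class
model.load_state_dict(torch.load("best_model.pth"))
model.eval()
dummy = torch.randn(1, 3, 32, 32)                 # CIFAR10-sized dummy input
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Step 4: parse the ONNX file, build a TensorRT engine and save it as a .plan.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.max_workspace_size = 1 << 30               # 1 GB scratch space
config.set_flag(trt.BuilderFlag.FP16)             # build an FP16 engine

engine = builder.build_engine(network, config)
with open("model.plan", "wb") as f:
    f.write(engine.serialize())
```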

I've attached the GitHub link to the Jupyter notebook used for this experiment.

My observations using the TensorRT engine in FP16 precision

1. There wasn't much of a performance gain when my model was very shallow. For instance, with just 6 convolution layers, each followed by a ReLU activation, and a couple of fully connected layers, I couldn't see any considerable throughput gain.

2. When I increased the depth of the model to 20 convolution layers, each followed by a ReLU, plus a couple of fully connected layers, I saw solid performance gains: roughly a 3x improvement in throughput.

3. The TensorRT engine was approximately half the size of the native PyTorch model as expected.
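For step 5 of the process flow, here is a hedged sketch of how the saved engine can be loaded and its throughput timed (assuming the TensorRT 7.x Python API plus pycuda; the batch of 1, the input shape, and the 10-class output match the fixed-shape export sketched above and are not the exact code from my notebook):

```python
import time
import numpy as np
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates a CUDA context)

# Step 5: deserialize the saved .plan file and create an execution context.
logger = trt.Logger(trt.Logger.WARNING)
with open("model.plan", "rb") as f:
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

# Host and device buffers (shapes assume the fixed batch-1 CIFAR10 export above).
h_input = np.random.randn(1, 3, 32, 32).astype(np.float32)
h_output = np.empty((1, 10), dtype=np.float32)     # 10 CIFAR10 classes
d_input = cuda.mem_alloc(h_input.nbytes)
d_output = cuda.mem_alloc(h_output.nbytes)
# (For a dynamic-batch engine built with an optimization profile, also call
# context.set_binding_shape(0, h_input.shape) before executing.)

# Rough throughput measurement: copy in, execute, copy out, repeat.
start = time.time()
for _ in range(1000):
    cuda.memcpy_htod(d_input, h_input)
    context.execute_v2([int(d_input), int(d_output)])
    cuda.memcpy_dtoh(h_output, d_output)
elapsed = time.time() - start
print(f"{1000 * h_input.shape[0] / elapsed:.1f} images/sec")
```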

Thanks for reading. Please leave your constructive comments and follow me for articles on Deep Learning.
