Accelerate your Deep Learning Pipeline with NVIDIA Toolkit

Raj Prasanna Ponnuraj · Published in Analytics Vidhya · Sep 9, 2020

Any deep learning model has two phases: training and inference, and both are equally important. The training phase is an iterative process: we iterate to find optimal hyper-parameters, an optimal neural network architecture, a model refresh schedule, and the list goes on. This iterative process is compute intensive and time consuming.

On the other hand, the deployed model should serve millions of requests with low latency. Also, in real-world scenarios it is usually a bunch of models, not one single model, acting upon a user request to produce the desired response. For instance, in a voice assistant the speech recognition, natural language processing and speech synthesis models work one after the other in sequence. Hence, it is very important that our deep learning pipeline optimally utilizes all the available compute resources to make both phases efficient.

Graphics Processing Units (GPUs) are the most efficient compute resources for parallel processing. They are massively parallel, with thousands of CUDA cores and hundreds of Tensor cores. It is up to the user to make the best use of the available GPU resources to keep the pipeline efficient. This article discusses four tools from the NVIDIA toolkit that can seamlessly integrate into your deep learning pipeline and make it more efficient.

  1. DAta loading LIbrary — DALI
  2. Purpose built pre-trained models
  3. TensorRT
  4. Triton Inference Server

DAta loading LIbrary — DALI

As discussed earlier, the GPU is a huge compute engine, but data has to be fed to the processing cores at the same rate at which they consume it. Only then are the GPUs optimally utilized; otherwise the GPU cores have to wait for data, leaving them under-utilized.

The figure below depicts a typical deep learning pipeline.

[Figure: a typical deep learning pipeline. Picture courtesy: Microsoft]

Conventionally, data loading and pre-processing are done by the CPU, and the pre-processed data is fed into the GPU for training. With the number of GPUs per CPU core increasing by the day, the CPUs can no longer pre-process data fast enough to keep the GPUs fed, thereby creating a bottleneck.

This is where DALI comes to the rescue. Only the data loading part is done by the CPU whereas the pre-processing and augmentation of data are done by the GPU.

from nvidia.dali.pipeline import Pipeline
import nvidia.dali.ops as ops
import nvidia.dali.types as types

image_dir = "data/images"  # illustrative path to the training images

class GPUPipeline(Pipeline):
    def __init__(self, batch_size, num_threads, device_id):
        super(GPUPipeline, self).__init__(batch_size, num_threads, device_id, seed=12)
        self.input = ops.FileReader(file_root=image_dir, random_shuffle=True, initial_fill=21)
        self.decode = ops.ImageDecoder(device='cpu', output_type=types.RGB)
        self.rotate = ops.Rotate(device="gpu")
        self.rng = ops.Uniform(range=(-10.0, 10.0))

    def define_graph(self):
        jpegs, labels = self.input()    # load the encoded images and labels (CPU)
        images = self.decode(jpegs)     # decode to RGB (CPU)
        angle = self.rng()              # random rotation angle per sample
        rotated_images = self.rotate(images.gpu(), angle=angle)  # augment on the GPU
        return (rotated_images, labels)

In the above code snippet, data loading and decoding are done by the CPU, whereas data augmentation (rotation in this example) is done by the GPU.
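
A minimal usage sketch for the pipeline above (the batch size and thread count are arbitrary choices for illustration):

pipe = GPUPipeline(batch_size=32, num_threads=4, device_id=0)
pipe.build()                          # build the execution graph
rotated_images, labels = pipe.run()   # one batch; the images already live on the GPU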

[Figure: DALI benchmark on DGX-1 and DGX-2. Picture courtesy: NVIDIA]

As expected, the above benchmark shows that DALI brings a bigger performance boost as the number of CPU cores per GPU goes down. The DGX-1 has 5 CPU cores per GPU, whereas the DGX-2 has only 3 CPU cores per GPU.

Purpose built pre-trained models

Transfer learning is a technique in which a deep learning model trained on a huge data set such as ImageNet is reused for other applications with a much smaller data set and minimal training. Models like YOLO and BERT are trained on huge data sets using huge compute clusters. With transfer learning, these pre-trained models can be easily adapted to our application with minimal architectural changes.
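
As a concrete (framework-level, not NVIDIA-specific) illustration, here is a minimal transfer-learning sketch in PyTorch: an ImageNet-pretrained ResNet-50 backbone is frozen and only a newly attached head is trained for a hypothetical 5-class task.

import torch.nn as nn
from torch.optim import Adam
from torchvision import models

model = models.resnet50(pretrained=True)       # backbone pre-trained on ImageNet
for param in model.parameters():
    param.requires_grad = False                # freeze the pre-trained weights

model.fc = nn.Linear(model.fc.in_features, 5)  # new head for a hypothetical 5-class task
optimizer = Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ...train only the new head on the small, task-specific data set...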

NVIDIA has taken this concept of transfer learning to the next level with purpose-built pre-trained models available on NVIDIA GPU Cloud (NGC). A few such models are:

Smart city models

  1. DashCamNet
  2. FaceDetect
  3. PeopleNet
  4. TrafficCamNet
  5. VehicleTypeNet

Health sciences models

  1. Brain Tumor Segmentation
  2. Liver and Tumor Segmentation
  3. Spleen Segmentation
  4. Chest X-ray classification

TensorRT

Deep learning has a wide range of applications such as self-driving cars, aerial surveillance, real-time face recognition and real-time language processing, to name a few. These applications have one thing in common: REAL TIME. Considering the need for real-time performance of these models, we need to optimize the trained model so that it is lightweight while delivering accuracy close to that of the original trained model.

TensorRT is a deep learning inference platform from NVIDIA. It is built on the NVIDIA CUDA programming model, which helps us leverage the massively parallel performance offered by NVIDIA GPUs. Deep learning models from almost all popular frameworks can be parsed and optimized for low-latency, high-throughput inference on NVIDIA GPUs using TensorRT.


With TensorRT, we can apply various optimizations effortlessly. The following are a few important optimizations that can be done using TensorRT (a short build sketch follows the list).

1. Mixed Precision Inference

2. Layer Fusion

3. Batching
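
As an illustrative sketch, assuming the trained model has already been exported to ONNX as model.onnx (the file names are hypothetical; the API shown is the TensorRT 8.x Python binding), an FP16 engine can be built as follows:

import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:       # hypothetical ONNX export of the trained model
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)     # enable mixed precision (FP16) inference
engine = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:       # serialized engine, ready for deployment
    f.write(engine)

Layer fusion is applied automatically during this build step.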

My observations using a TensorRT engine in FP16 precision:

1. There wasn't much of a performance gain when my model was very shallow. For instance, with just 6 convolution layers, each followed by a ReLU activation, and a couple of fully connected layers, I couldn't see any considerable throughput gain.

2. When I increased the depth of the model to 20 convolution layers, each followed by a ReLU activation, and a couple of fully connected layers, I was able to see solid performance gains: a 3x gain in throughput.

3. The TensorRT engine was approximately half the size of the native PyTorch model as expected.

Triton Inference Server

Now that we have a whole bunch of trained models, we need to deploy them in such a way that they serve millions of user requests with the lowest latency possible. Adding to the complexity, the models may not all come from the same framework.

Triton Inference Server provides an inference service via an HTTP/REST or gRPC endpoint, with the following advantages:

  1. Multiple framework support
  2. Concurrent model execution
  3. Model ensemble support
  4. Multi GPU support
  5. Batch processing of inputs

[Figure: Triton Inference Server architecture]

The trained models should be placed in a model repository. When an inference request arrives for a model, an execution instance is created for that model and Triton Inference Server schedules it onto the underlying hardware. If multiple inference requests arrive for the same model, they can either be serviced serially, or multiple execution instances of the same model can be created and serviced in parallel.
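
A minimal client-side sketch using the tritonclient Python package; the repository layout, the model name resnet50 and the tensor names input__0/output__0 are illustrative assumptions, not fixed by Triton itself.

# Assumed model repository layout (illustrative):
#   model_repository/
#   └── resnet50/
#       ├── config.pbtxt
#       └── 1/
#           └── model.plan
# Server started with: tritonserver --model-repository=/path/to/model_repository

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the request: a single FP32 image tensor in NCHW layout
infer_input = httpclient.InferInput("input__0", [1, 3, 224, 224], "FP32")
infer_input.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
requested_output = httpclient.InferRequestedOutput("output__0")

response = client.infer(model_name="resnet50",
                        inputs=[infer_input],
                        outputs=[requested_output])
print(response.as_numpy("output__0").shape)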

Multiple models, or multiple instances of the same model, can be serviced in parallel on a single GPU thanks to hardware scheduling. In addition, Triton's multi-GPU support helps scale out compute resources, thereby helping the user meet latency and throughput demands during deployment.

In the past decade deep learning has evolved a lot because of more data, powerful compute and intelligent model architectures. NVIDIA has played a major role in this journey through its GPU compute and its tools and libraries. Data preparation, training and inference: all three stages of the deep learning pipeline are accelerated by these four tools from NVIDIA.

Reference

  1. https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
  2. https://developer.nvidia.com/transfer-learning-toolkit
  3. https://developer.nvidia.com/tensorrt
  4. https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/
