How to Build and Deploy Accurate Deep Learning Models for Intelligent Image and Video Analytics

NVIDIA AI · Published in DataSeries · Sep 25, 2019

To develop an application based on object detection or classification, you'll need deep learning models. However, building these models from scratch is challenging and time-consuming, and training the models carefully with data while preserving accuracy is crucial as well. Neural networks learn from data, and what they learn is stored as "weights" within the network. Instead of training a new neural network from scratch, you can extract these weights and transfer the features learned earlier to another neural network. This is known as transfer learning. Transfer learning is often used to accelerate the training process, and it can be critical in applications where obtaining a large amount of data is not feasible.
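To make the idea concrete, here is a minimal Keras-style sketch of transferring learned weights to a new network (illustrative only; TLT handles this for you, and the file and layer names below are placeholders rather than TLT APIs):

from tensorflow.keras import layers, models

# Load a previously trained network and reuse its learned features ("weights").
base = models.load_model("pretrained_resnet18.h5")       # placeholder file name
features = base.get_layer("avg_pool").output              # placeholder layer name

# Attach a new head for the new task; only this part starts from scratch.
head = layers.Dense(3, activation="softmax", name="new_head")(features)
new_model = models.Model(inputs=base.input, outputs=head)
new_model.compile(optimizer="adam", loss="categorical_crossentropy")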

Public and third-party models are rarely comprehensive or optimized for your use case out of the box, and they typically need additional fine-tuning before you can deploy them in your application. You may also be interested in a class of objects different from those included in publicly available models. Given the complexity of most use cases and target applications, most models require incremental retraining.

In this tutorial, you'll learn how to train a high-performance neural network for object detection on a KITTI-format dataset with the NVIDIA Transfer Learning Toolkit (TLT) and deploy it to image and video analytics use cases including warehouse management, industrial inspection, smart cities, retail aisles and more. You'll see that a pruned ResNet18 network trained on the KITTI dataset is 10X smaller and 2 to 3X faster than the unpruned model.

Now let's build a ResNet18-based 3-class object detector with the KITTI dataset. The rest of the tutorial assumes you have already set up your NVIDIA NGC account and pulled the TLT container:

docker pull nvcr.io/nvidia/tlt-streamanalytics:v1.0_py2

The NGC model registry provides a number of supported pre-trained models that you can get started with.

The tutorial is divided into three main sections:

  • How to train an object detection model
  • How to prune a trained model
  • How to export a model and deploy

How to train an object detection model

The KITTI 2D object detection dataset is a popular dataset, primarily designed for autonomous driving, containing 7,481 training images and 7,518 test images. TLT works with the KITTI file format and provides a converter from KITTI to TFRecords, which allows faster iteration over the data.
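For reference, each KITTI image has a matching plain-text label file with one object per line; for 2D detection only the class name and the pixel-space box corners matter. A minimal parsing sketch follows (the sample values are illustrative):

def parse_kitti_label_line(line):
    # KITTI label fields: type, truncated, occluded, alpha, bbox (left, top,
    # right, bottom), 3D dimensions, 3D location, rotation_y.
    fields = line.split()
    return {
        "cls": fields[0].lower(),                       # e.g. "car", "pedestrian", "cyclist"
        "bbox": tuple(float(v) for v in fields[4:8]),   # left, top, right, bottom in pixels
    }

sample = "Car 0.00 0 -1.58 587.01 173.33 614.12 200.12 1.65 1.67 3.64 -0.65 1.71 46.70 -1.59"
print(parse_kitti_label_line(sample))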

Once we download the KITTI dataset, we can use the built-in dataset converter to convert it to TFRecords with the command:

tlt-dataset-convert -d <dataset conversion spec> -o <output tfrecord>

Your dataset conversion spec file should look like this:

kitti_config {
  root_directory_path: "/path/to/kitti/root/"
  image_dir_name: "data_object_image_2/training/image_2"
  label_dir_name: "training/label_2"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "/path/to/kitti/root/"

Before we can start training, we need to compose a configuration file in which we specify the hyperparameters for our experiment. The experiment configuration file, written in protobuf text format, consists of 8 key modules: dataset_config, augmentation_config, model_config, training_config, evaluation_config, bbox_rasterizer_config, postprocessing_config and cost_function_config. For example, in dataset_config we point the data sources at the KITTI TFRecords we just generated.

dataset_config {
  data_sources: {
    tfrecords_path: "/path/to/your/tfrecords/*"
    image_directory_path: "/path/to/kitti/root"
  }
  image_extension: "png"
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
  }
  target_class_mapping {
    key: "cyclist"
    value: "cyclist"
  }
  validation_fold: 0
}

The augmentation module provides basic on-the-fly data pre-processing and augmentation during training. The following sample configures the KITTI training pipeline to take 384x1248 images as input, with simple horizontal flips and basic color and translation augmentation.

augmentation_config {
  preprocessing {
    output_image_width: 1248
    output_image_height: 384
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}
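As an aside, here is a toy sketch of what the horizontal-flip augmentation above implies (not TLT's implementation): flipping the image also mirrors every bounding box's x-coordinates about the image width.

import numpy as np

def random_hflip(image, bboxes, p=0.5, rng=np.random):
    # image: HxWxC array; bboxes: list of (x1, y1, x2, y2) in pixels.
    if rng.random() >= p:
        return image, bboxes
    w = image.shape[1]
    flipped = image[:, ::-1, :]
    flipped_boxes = [(w - x2, y1, w - x1, y2) for (x1, y1, x2, y2) in bboxes]
    return flipped, flipped_boxes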

Tip: It's worth noting that if the output image height and width in the preprocessing block don't match those of the input images used when generating the TFRecords, the images will be randomly cropped or padded on the fly to fit the specified input resolution.

Model structure and the related hyperparameters are configured in the model_config module. Depending on the architecture (arch) you choose, the available backbone hyperparameters may vary. In the following example, we use ResNet18 with batch normalization enabled and the first convolutional block frozen. The weights of a frozen convolutional block do not change in the downstream task, which is especially helpful in transfer learning, where generic features are already captured in the shallow layers. Thus, we not only reuse the learned features, but also reduce training time.

model_config {
  pretrained_model_file: "/path/to/pretrained/model"
  num_layers: 18
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
}
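For intuition, freezing the first block roughly amounts to the following Keras-style operation (the layer-name prefix is hypothetical; TLT applies this internally when it builds the backbone):

def freeze_first_block(model, block_prefix="block_1"):    # hypothetical prefix
    for layer in model.layers:
        if layer.name.startswith(block_prefix):
            layer.trainable = False    # weights in this block stay fixed during training
    return model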

If you want to use the pre-trained weights we provide on NGC, you can list and download them with the NGC CLI:

ngc registry model list nvidia/iva/tlt*
ngc registry model download-version nvidia/iva/tlt_resnet18_detectnet_v2:1 -d <path_to_download_dir>

The training_config module is largely self-explanatory; it is where common hyperparameters like batch size, learning rate, regularizer and optimizer are specified.

Tip: It's good practice to start with a low regularization weight and gradually fine-tune it to narrow the gap between training and validation accuracy. Also, based on our experiments, L1 regularization seems to give a better pruning ratio, which we discuss in the following section.

training_config {
  batch_size_per_gpu: 24
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 0.0005
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-09
  }
  optimizer {
    adam {
      epsilon: 9.9e-09
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
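For intuition, here is one plausible reading of the soft_start_annealing_schedule parameters (the exact interpolation TLT uses may differ): the learning rate ramps from min_learning_rate to max_learning_rate over the first soft_start fraction of training, holds at the maximum, then anneals back down after the annealing point.

import numpy as np

def soft_start_annealing_lr(progress, min_lr=5e-6, max_lr=5e-4,
                            soft_start=0.1, annealing=0.7):
    # progress: fraction of total training completed, in [0, 1].
    log_min, log_max = np.log(min_lr), np.log(max_lr)
    if progress < soft_start:                    # warm-up phase
        t = progress / soft_start
    elif progress < annealing:                   # constant phase at max_lr
        t = 1.0
    else:                                        # annealing back towards min_lr
        t = (1.0 - progress) / (1.0 - annealing)
    return float(np.exp(log_min + t * (log_max - log_min)))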

The evaluator in the training pipeline is configured with the evaluation_config module. The mean Average Precision (mAP) calculation is based on the PASCAL VOC evaluation methods; see Everingham et al. (2010). Two modes of AP (Average Precision) calculation are supported, 'sample' and 'integrate', as used in the PASCAL VOC 2007 and 2011 challenges respectively. You can also specify a different Intersection over Union (IoU) threshold for each class.

evaluation_config {
  average_precision_mode: INTEGRATE
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.7
  }
  # similarly for pedestrian and cyclist
  # use "value: 0.5" for both classes
  # TODO for readers
}
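To make the difference between the two AP modes concrete, here is a toy sketch (illustrative, not TLT's implementation): 'sample' averages the interpolated precision at 11 fixed recall points, while 'integrate' measures the area under the precision-recall curve.

import numpy as np

def average_precision(recall, precision, mode="integrate"):
    recall, precision = np.asarray(recall, float), np.asarray(precision, float)
    if mode == "sample":
        points = np.linspace(0.0, 1.0, 11)
        # interpolated precision at each fixed recall point (0 if never reached)
        return float(np.mean([precision[recall >= r].max() if np.any(recall >= r) else 0.0
                              for r in points]))
    order = np.argsort(recall)                    # 'integrate': area under the PR curve
    return float(np.trapz(precision[order], recall[order]))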

The ground truth generator for DetectNet_v2 produces two tensors, cov and bbox. The image is divided into a grid of 16x16-pixel cells. The cov tensor (short for coverage tensor) defines which grid cells are covered by an object. The bbox tensor defines the normalized image coordinates of the object's top-left (x1, y1) and bottom-right (x2, y2) corners with respect to the grid cell. For best results, the coverage area is assumed to be an ellipse within the bbox label, with maximum confidence assigned to the cells at the center and coverage tapering off outwards.

bbox_rasterizer_config {
  target_class_config {
    key: "car"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "cyclist"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "pedestrian"
    value {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 1.0
      cov_radius_y: 1.0
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}
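To illustrate how these targets are laid out, here is a simplified rasterization sketch (it treats the whole box as covered; DetectNet_v2 actually uses the elliptical coverage region with tapering confidence described above):

import numpy as np

STRIDE = 16  # each grid cell covers a 16x16-pixel region

def rasterize_ground_truth(image_h, image_w, boxes):
    # boxes: list of (x1, y1, x2, y2) in pixels.
    gh, gw = image_h // STRIDE, image_w // STRIDE
    cov = np.zeros((1, gh, gw), dtype=np.float32)    # coverage target
    bbox = np.zeros((4, gh, gw), dtype=np.float32)   # box-corner target
    for (x1, y1, x2, y2) in boxes:
        y_lo, y_hi = int(y1) // STRIDE, min(int(y2) // STRIDE, gh - 1)
        x_lo, x_hi = int(x1) // STRIDE, min(int(x2) // STRIDE, gw - 1)
        for gy in range(y_lo, y_hi + 1):
            for gx in range(x_lo, x_hi + 1):
                cx, cy = (gx + 0.5) * STRIDE, (gy + 0.5) * STRIDE  # cell center
                cov[0, gy, gx] = 1.0                               # cell is covered
                bbox[:, gy, gx] = (x1 - cx, y1 - cy, x2 - cx, y2 - cy)
    return cov, bbox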

The post-processor module generates renderable bounding boxes from the raw detection output. This involves thresholding candidate objects on the confidence value in the coverage tensor to retain valid detections, then clustering the candidate bounding boxes for each class independently. DetectNet_v2 uses Density-Based Spatial Clustering of Applications with Noise (DBSCAN), while Faster R-CNN and SSD use Non-Maximum Suppression (NMS).

Tip: keep in mind that a larger dbscan_eps (ε) usually means more boxes are grouped together.

postprocessing_config {
  target_class_config {
    key: "car"
    value {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.13
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  target_class_config {
    key: "cyclist"
    value {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  target_class_config {
    key: "pedestrian"
    value {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
}
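A rough sketch of the per-class clustering step follows (not TLT's exact code: in practice the box coordinates are normalized, or an IoU-like distance is used, so that an eps such as 0.13-0.15 is meaningful, and dbscan_min_samples maps onto DBSCAN's min_samples parameter):

import numpy as np
from sklearn.cluster import DBSCAN

def cluster_boxes(boxes, scores, coverage_threshold=0.005, eps=0.15, min_samples=1):
    # boxes: Nx4 array of (x1, y1, x2, y2); scores: N coverage confidences.
    boxes, scores = np.asarray(boxes, float), np.asarray(scores, float)
    keep = scores >= coverage_threshold          # drop low-confidence candidates
    boxes, scores = boxes[keep], scores[keep]
    if len(boxes) == 0:
        return []
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(boxes)
    merged = []
    for label in set(labels) - {-1}:             # -1 marks DBSCAN noise
        members = labels == label
        w = scores[members] / scores[members].sum()
        merged.append((boxes[members] * w[:, None]).sum(axis=0))  # weighted mean box
    return merged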

The cost_function_config module configures the cost function to include the classes we are training for. For each class we want to train, we add a new target_classes entry to the configuration file. For the best performance with these classes, we recommend leaving the remaining parameters unchanged.

cost_function_config {
  target_classes {
    name: "car"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  # similarly for pedestrian and cyclist
  # TODO for readers
}

Great, we have finally completed our experiment config file. To kick off training, simply run:

tlt-train detectnet_v2 --gpus <num GPUs> \
                       -r <result directory> \
                       -e <spec file> \
                       -k <key>
Tip for multi-GPU training at scale

Training with more GPUs allows networks to ingest more data faster, saving precious time during development. The Transfer Learning Toolkit supports multi-GPU training, so you can train the model with several GPUs in parallel. This feature is also helpful for hyperparameter optimization.

Once training completes, evaluate the model on the validation set with tlt-evaluate:

tlt-evaluate detectnet_v2 -e <spec file> \
                          -m <model file> \
                          -k <key>

All the detection networks use mAP as a shared metric, while the classification models support various metrics including top-k accuracy, precision and recall.

In the sample evaluation output, note that lower accuracy for the cyclist class is expected, since the KITTI dataset includes far fewer cyclists than cars.

You can also visualize the predictions by calling tlt-infer. We leave it to the reader to explore the functionality of tlt-infer in the getting started guide. Without many bells and whistles, we were able to detect most cars and pedestrians in the scene.

How to prune a trained model

When using a deep learning model in production, it's important for the model to make accurate predictions, but it also matters how efficiently those predictions are made. TLT provides a key feature, model pruning, that addresses both concerns. Model pruning reduces the memory footprint of the trained deep learning model and the number of operations performed from input to output during inference. In most common Intelligent Video Analytics (IVA) use cases, model pruning reduces the number of parameters by an order of magnitude, resulting in faster inference and better utilization of compute resources.

Usually, when a model is pruned there is a slight drop in accuracy, since we may remove some previously useful connections. However, by retraining the pruned model we are able to recover this lost accuracy. One may ask: if that is the case, why not train a much smaller model from the get-go? It is much harder to train a smaller model to reach the same accuracy as a larger one, simply because you would have to get the architecture and the initialization right the first time to reach the best performance. This observation is closely related to the Lottery Ticket Hypothesis.
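Conceptually, pruning ranks structures (for example, convolutional filters) by the magnitude of their weights and removes the weakest ones. A toy illustration of that idea follows (TLT's actual criterion and graph surgery are more involved; the threshold here is only analogous to the -pth option used later):

import numpy as np

def select_filters_to_keep(conv_weights, pth=0.1):
    # conv_weights: (k_h, k_w, in_channels, out_channels) kernel tensor.
    norms = np.abs(conv_weights).sum(axis=(0, 1, 2))   # L1 norm per output filter
    norms = norms / (norms.max() + 1e-12)              # normalize to [0, 1]
    return np.nonzero(norms >= pth)[0]                 # indices of filters to keep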

Pruning has been shown to increase throughput for video processing. In our in-house experiment, a pruned ResNet-50 three-class detector was able to process close to 29 video streams concurrently at 30+ frames per second with inference running at 960x544 resolution, whereas the equivalent unpruned model processed 9 streams at 30 FPS at the same resolution. In this case, pruning reduced the size of the model by 6 times without sacrificing accuracy. In another internal experiment, a pruned ResNet18 model that was 2x smaller than the original achieved comparable accuracy on ImageNet-1000.

Without further ado, let’s prune the newly trained model using the tlt-prune command.

tlt-prune -pm /path/to/your/model.hdf5 \
          -o /path/to/output/dir \
          -eq intersection \
          -pth 0.1 \
          -k $KEY

The command will output the pruning ratio, so you may want to adjust the pruning threshold (-pth) to get the performance you desire. In this case, we got a pruned model, which is 10.8% of the original one. Almost 10 times smaller!

2019-08-23 02:15:21,960 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2019-08-23 02:15:22,868 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2019-08-23 02:15:44,073 [INFO] __main__: Pruning ratio (pruned model / original model): 0.108476362418

It's also recommended to try different equalization criteria (-eq) when your model has merging branches, such as the skip connections in ResNet18. Once the model is pruned, we create a new experiment specification file with the model_config module updated as follows:

model_config {
  pretrained_model_file: "/path/to/the/pruned/model"
  num_layers: 18
  use_batch_norm: true
  load_graph: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
  arch: "resnet"
}

Note that the load_graph option must be set to true so that the compressed DNN graph produced by pruning is loaded. We didn't set this parameter when we initially trained the model, because at that point we were only interested in the pre-trained weights, not the graph structure.

You may now retrain the pruned model using the tlt-train command:

tlt-train detectnet_v2 -e <retrain spec file> \
                       -r <output dir> \
                       -k <key> \
                       --gpus <num GPUs>

Great! You just got yourself a new model which is 10X smaller and has comparable accuracy!

How to export a model and deploy

Power efficiency and speed of response are two key metrics for deployed deep learning applications, because they directly affect the user experience and the cost of the service. TensorRT automatically optimizes trained neural networks for runtime performance. Once retraining completes, the model can be exported to a format usable by the DeepStream SDK. A small utility called the TLT converter is included; it takes a model exported from the TLT docker with tlt-export and converts it to a TensorRT engine. Optionally, the export step creates a calibration cache file used to calibrate an INT8 TensorRT engine during the conversion. The general usage of tlt-export is shown below (for the fp16 and fp32 data types, no calibration cache is generated):

tlt-export -k <key> \
           -o <output file> \
           --export_module detectnet_v2 \
           --outputs <output blobs> \
           --data_type {fp32,fp16,int8} \
           --input_dims <input dimensions> \
           --cal_data_file <calibration file> \
           --cal_cache_file <calibration cache> \
           --batches <number of calibration batches> \
           --cal_batch_size <calibration batch size> \
           --max_batch_size <max batch size> \
           --max_workspace_size <max workspace size> \
           --experiment_spec <experiment_spec_file> \
           input_file

For example, the output blob names (--outputs) for classification and detectnet_v2 are listed below:

  • classification: predictions/Softmax
  • detectnet_v2: output_bbox/BiasAdd,output_cov/Sigmoid

Supported data types are FP32, FP16 and INT8, and the maximum batch size and maximum workspace size can be specified on the command line. For an INT8 engine, model calibration is performed using a calibration data file: a subset of the training dataset that has been preprocessed and is ready for ingestion by the network. To generate a sample calibration data file, use the following command:

tlt-int8-tensorfile detectnet_v2 -e <experiment_spec_file> \
                                 -m <number_of_batches_to_cal> \
                                 -o <path_to_cal_tensorfile>

To export an INT8 model, we simply pass the calibration file we just generated via --cal_data_file:

tlt-export model.tlt \
           -k $KEY \
           --export_module detectnet_v2 \
           -o output.etlt \
           --outputs output_cov/Sigmoid,output_bbox/BiasAdd \
           --input_dims 3,384,1248 \
           --max_workspace_size 1100000 \
           --cal_data_file calibration.tensor \
           --data_type int8 \
           --batches 10 \
           --cal_cache_file calibration.bin
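For intuition, what INT8 calibration ultimately produces is a dynamic range (scale) per tensor, chosen by running the preprocessed calibration batches through the network. The toy sketch below shows simple max calibration; TensorRT's default entropy calibration is more sophisticated:

import numpy as np

def int8_scale(activations_per_batch):
    # activations_per_batch: list of arrays observed over the calibration batches.
    amax = max(np.abs(a).max() for a in activations_per_batch)
    return amax / 127.0                           # real value represented by one int8 step

def quantize(x, scale):
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)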

The pruned model achieves roughly 10 FPS with 384x1248 RGB inputs, almost 2.4x faster than the original model. You might be able to get an even smaller model, with a better speed-vs-accuracy trade-off, through iterative pruning as described in the getting started guide. Have fun training with the Transfer Learning Toolkit!

What we covered

  • We showed a step-by-step tutorial on how to train and fine-tune pre-trained models with your custom data while taking advantage of the underlying GPU architecture.
  • We introduced the Transfer Learning Toolkit, which enables neural networks to learn new, domain-specific classes of objects.
  • The pruned ResNet18 network, 10X smaller than the original, is 2 to 3X faster than the unpruned model on the KITTI dataset at batch sizes of 1, 2, 4 and 8. Pruning the network lets you pack in more complex applications, increasing throughput and stream density without compromising model accuracy.
  • The pruned models can be easily exported to NVIDIA TensorRT for optimized inference performance and scaled for deployment with the NVIDIA DeepStream SDK.
  • You can enable multi-GPU support with TLT for your application and deploy on a GPU-accelerated platform in your data center, in the cloud, on-premises, or on a local workstation for further use with DeepStream SDK plug-ins.


Authors:

Yu Wang (Sr. System Software Engineer — Deep Learning)

Varun Praveen (Sr. System Software Engineer — Deep Learning)

Farzin Aghdasi (Sr. Software Manager — Deep Learning)

Amulya Vishwanath (Sr. Product Marketing Manager, AI)
