TLT - Lightning

Lightweight Experiment and Workflow Management for NVIDIA Transfer Learning ToolKit.

Sep 25, 2020

Over the last few years, deep learning methods have been shown to outperform previous state-of-the-art machine learning techniques in several fields, with computer vision being one of the most prominent cases.

In this article, we will discuss how SmartCow leverages the power of Transfer Learning Toolkit (TLT) 2.0 and provides an easy-to-use web user interface for training complex deep learning models to solve emerging computer vision problems.

Before moving forward, let’s quickly recall some key features of TLT:

GPU optimized pre-trained weights for computer vision tasks.
Easily modify configuration files for adding new classes and retraining models with custom data.
Reduce model size using pruning functionality.

TLT can reduce engineering effort from roughly 80 weeks to about 8 weeks by training on top of NVIDIA's purpose-built models, reaching higher throughput and accuracy in less time.

In practice, this isn’t as simple as it sounds. To train even a simple SSD model you have to go through numerous steps, and generating a config file for each step is tedious.

Once training has started, you have to dig through the console logs to figure out how the model has performed so far or how long you have to wait for the final results, which is not very intuitive either.

This is where TLT-Lightning comes into the picture: a single tool to automate and organize all of these manual and complex steps.
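To give a feel for the kind of repetitive work being automated, here is a minimal sketch that renders one config file per backbone from a shared template. It is not how TLT-Lightning is implemented, and the template keys (`arch`, `num_layers`, `num_epochs`, `dataset_dir`), the file names, and the dataset path are hypothetical placeholders rather than the real TLT spec-file schema.

```python
from string import Template

# Hypothetical, simplified template. These keys are placeholders for
# illustration only and are NOT the real TLT spec-file schema.
SPEC_TEMPLATE = Template(
    "arch: $arch\n"
    "num_layers: $num_layers\n"
    "num_epochs: $num_epochs\n"
    "dataset_dir: $dataset_dir\n"
)


def render_specs(backbones, num_epochs, dataset_dir):
    """Return a dict mapping a file name to rendered config text, one per backbone."""
    specs = {}
    for arch, num_layers in backbones:
        name = f"ssd_{arch}{num_layers}.txt"  # hypothetical naming scheme
        specs[name] = SPEC_TEMPLATE.substitute(
            arch=arch,
            num_layers=num_layers,
            num_epochs=num_epochs,
            dataset_dir=dataset_dir,
        )
    return specs


if __name__ == "__main__":
    specs = render_specs(
        backbones=[("resnet", 10), ("resnet", 50)],
        num_epochs=1500,
        dataset_dir="/data/face_mask",  # hypothetical path
    )
    for name, text in specs.items():
        print(name)
        print(text)
```

Keeping the shared settings in one template means that changing, say, the epoch count only has to happen in one place, which is the main appeal of generating configs instead of editing them by hand.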

Let’s take an example. Suppose you want to train a face mask detection model. There are a few questions you should ask yourself:

What device will be used for inference?
Is high accuracy necessary for this case?
Will 10k training samples be enough?
What model architecture should be used?
What throughput (FPS) is expected?

In this article, we are going to experiment using TLT-Lightning and share the results to choose the best architecture and device combination for maximum performance at minimum cost.

In the TLT-Lightning user interface above, we filled in the project details and added 4 SSD architectures, each with 4 NVIDIA V100 GPUs, on 2 DGX worker machines for training in parallel.

Training Cards

We can clearly see each model’s training progress in real time on its training card, which also lets us keep track of the training history. We left the training running until it completed in order to compare the elapsed times, and the final results are stunning.

Overall mAP is the same; however, there is a significant difference in training time. At the time we took this observation, ResNet50 was still training, but there was no further improvement, so we stopped it.

Although we set the training to run for 1500 epochs, the loss graph flattening over time gives a very good indication that there is no further improvement and that training can be stopped to save both time and resources.
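The "loss has flattened" judgment can also be made programmatically. The following is a minimal sketch of one common plateau check, not something taken from TLT or TLT-Lightning; the function names and the `patience` and `min_delta` defaults are assumptions chosen for illustration.

```python
def has_plateaued(losses, patience=20, min_delta=1e-3):
    """Return True if the best loss in the last `patience` epochs did not
    improve on the best earlier loss by at least `min_delta`."""
    if len(losses) <= patience:
        return False  # not enough history to compare against
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < min_delta


def first_plateau_epoch(losses, patience=20, min_delta=1e-3):
    """Return the 1-based epoch at which a plateau is first detected, or None."""
    for epoch in range(1, len(losses) + 1):
        if has_plateaued(losses[:epoch], patience, min_delta):
            return epoch
    return None
```

With a per-epoch loss history in hand, `first_plateau_epoch(history_losses)` gives a candidate point at which a long run could have been stopped. A larger `patience` makes the check more conservative on noisy loss curves.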

Explain AI

Let’s go to the Explain AI section and explore the training history in depth.

Above you can see HiPlot showing correlations between epoch, loss, mAP, and AP. We are looking for the epoch where mask detection accuracy is highest and overall model loss is lowest, in order to pick the best-performing model weight file.
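The same selection can be sketched in a few lines of code. This example assumes the training history is available as a list of per-epoch records; the record keys (`epoch`, `loss`, `mask_ap`) and the sample values are hypothetical and are not measurements from our runs. It ranks epochs by the AP of the class of interest and uses the lowest loss as a tie-breaker, which is only one of several reasonable ways to combine the two criteria.

```python
def best_epoch(history, ap_key="mask_ap"):
    """Pick the record with the highest AP for the class of interest,
    breaking ties by the lowest loss."""
    if not history:
        raise ValueError("history is empty")
    return max(history, key=lambda record: (record[ap_key], -record["loss"]))


if __name__ == "__main__":
    # Made-up values, for illustration only.
    history = [
        {"epoch": 1, "loss": 2.0, "mask_ap": 0.50},
        {"epoch": 2, "loss": 1.0, "mask_ap": 0.80},
        {"epoch": 3, "loss": 0.9, "mask_ap": 0.80},
    ]
    print(best_epoch(history)["epoch"])
```

The weight file saved at the returned epoch is then the candidate to export for deployment.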

Highly scalable

TLT-Lightning uses GPU clusters to run the training and is flexible enough to run tasks on single or multiple GPUs. It provides a highly scalable platform for running multiple trainings in parallel.
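As a rough illustration of how several multi-GPU trainings can be packed onto a small cluster, here is a simple first-fit placement sketch. It is not TLT-Lightning's scheduler. The job and worker names are hypothetical, and the figure of 8 GPUs per worker is an assumption made for the example.

```python
def assign_jobs(jobs, workers):
    """Greedy first-fit: place each job on the first worker with enough free GPUs.

    jobs:    list of (job_name, gpus_needed)
    workers: dict of worker_name -> total number of GPUs
    Returns a dict of job_name -> (worker_name, [gpu indices]).
    Raises ValueError if a job does not fit on any worker.
    """
    next_free = {name: 0 for name in workers}  # next unused GPU index per worker
    placement = {}
    for job_name, gpus_needed in jobs:
        for worker, total in workers.items():
            start = next_free[worker]
            if total - start >= gpus_needed:
                placement[job_name] = (worker, list(range(start, start + gpus_needed)))
                next_free[worker] = start + gpus_needed
                break
        else:
            raise ValueError(f"no worker has {gpus_needed} free GPUs for {job_name}")
    return placement


if __name__ == "__main__":
    # Four trainings with 4 GPUs each on two workers (assumed 8 GPUs per worker).
    jobs = [("ssd_a", 4), ("ssd_b", 4), ("ssd_c", 4), ("ssd_d", 4)]
    workers = {"dgx-0": 8, "dgx-1": 8}
    for job, (worker, gpus) in assign_jobs(jobs, workers).items():
        print(job, worker, gpus)
```

Under these assumptions, two trainings land on each worker and all four run at the same time.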

Inference video

The video was recorded on a Jetson Nano using the SSD-ResNet10 architecture.

Benchmark Results

We used the default "deepstream-app" and unpruned ".etlt" model files to benchmark performance. Since the EGL sink limits the FPS to 60 even when sync=0, we decided to turn it off.

Resolution: 1920x1080
BatchSize: same as number of streams
Tracker: NvDCF
Interval: 0

Note that performance can be boosted further by pruning and retraining the model, but that is out of the scope of this blog.

Demystifying the dilemma

1. What device will be used for inference?

The lowest-powered device, the Jetson Nano, is the best fit for this use case. It can handle up to 4 streams very easily, and if high frame rates are not a priority, even 8 streams can be used.

2. Is high accuracy necessary for this case?

Yes, high accuracy is important to correctly identify whether a person is wearing a face mask. Applications include law enforcement, access authorization, and much more.

3. Will 10k training samples be enough?

Yes, 10k training samples are good enough if the device is deployed at a targeted location.

4. What model architecture should be used?

As the benchmarks above show, ResNet50 performed better than the other architectures, but the difference is very small. The choice depends on the trade-off between the number of streams and FPS.

5. Expected throughput (FPS)?

We can comfortably expect around 25 FPS, which is enough to get the job done, if we prioritize running more streams on a less powerful device.
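The trade-off between stream count and per-stream FPS comes down to simple arithmetic, sketched below. The sketch assumes that a device's aggregate throughput stays roughly constant as streams are added, which is a simplification, and the 120 FPS figure is a made-up placeholder rather than a measured benchmark result.

```python
def per_stream_fps(total_fps, num_streams):
    """Approximate FPS available to each stream when throughput is shared evenly."""
    if num_streams <= 0:
        raise ValueError("num_streams must be positive")
    return total_fps / num_streams


def max_streams(total_fps, target_fps):
    """Largest number of streams that still gets at least `target_fps` each."""
    if target_fps <= 0:
        raise ValueError("target_fps must be positive")
    return int(total_fps // target_fps)


if __name__ == "__main__":
    total = 120.0  # placeholder aggregate throughput, NOT a measured value
    print(max_streams(total, target_fps=25))
    print(per_stream_fps(total, num_streams=8))
```

Plugging in the measured aggregate FPS for a given device and model shows how many streams it can serve at the target frame rate.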

Resources

Blog written by: Nitin Rai (AI Application Developer)
Technical contributors: Rohan, Pradeep, Sreevardhan, Aditi
Special thanks to: Saurabh Jain, Chintan Shah, Eddie Seymour, Charbel Aoun, Magnus Blomkvist