Learn how to run a high-performance object detection pipeline for inference on GPUs in 10 mins.

Jun 19 · 14 min read

Object detection is a very popular application of deep learning, used in everything from simple home automation apps to safety-critical autonomous driving. GPUs have proven to be very powerful at executing deep learning training and inference, and many libraries and tools exist today to perform these tasks. In this piece, we’ll show you how to run a high-performance object detection pipeline for inference on GPUs in 10 minutes.

Our Python application takes frames from a live video stream and performs object detection on GPUs. We use a pre-trained Single Shot Detection (SSD) model with Inception V2, apply TensorRT’s optimizations, generate a runtime for our GPU, and then perform inference on the video feed to get labels and bounding boxes. The application then annotates the original frames with these bounding boxes and class labels. The resulting video feed has bounding box predictions from our object detection network overlaid on it. The same approach can be extended to other tasks such as classification and segmentation.

While knowledge of GPUs and NVIDIA software is not necessary, you should be familiar with object detection and Python programming to follow along. The software tools used include Docker containers from NVIDIA GPU Cloud (NGC) to set up our environment, OpenCV to run the feed from the camera, and TensorRT to speed up our inference. While you will benefit from simply reading this post, you need a CUDA-capable GPU and a webcam connected to your machine to run the example.

By the end of this post, you will understand the components needed to set up an end-to-end object detection inference pipeline, how to apply different optimizations on GPUs, and how to perform inference in FP16 and INT8 precision on your pipelines. For reference, all the code (and a detailed README on how to install everything) can be found on the NVIDIA GitHub page.

Test that you have a working GPU with the command nvidia-smi. The list of CUDA GPUs is on this page.
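If you prefer to run this check from Python, for example at the start of a script, a minimal sketch could look like the following. The helper name list_gpus is our own and is not part of the sample code; it simply wraps the nvidia-smi -L command, which lists the GPUs the driver can see.

```python
import shutil
import subprocess
from typing import List


def list_gpus() -> List[str]:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not installed or not on PATH
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []  # the tool is present but could not talk to the driver
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]
```

An empty list means that no working NVIDIA GPU was detected, in which case the rest of the example will not run.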

The network we are using is a Single Shot Detection network with InceptionV2 as the backbone. All the code used in this app is available in this GitHub repo.

Run the Sample!

We use Docker containers to set up the environment and package them for distribution. We can recall numerous occasions where using containers made it very easy to recover from conflicts and crashes in no time, so be sure you have Docker and NVIDIA Docker on your machine before trying out this example.

Navigate to the main object-detection-webcam folder and run the commands below to build the container and run the application:

./setup_environment.sh
python SSD_Model/detect_objects_webcam.py

This should bring up a window showing the video feed from your webcam with bounding boxes and labels overlaid as in figure 1.

Figure 1. The output on the command prompt displays the time taken for inference and the Top-1 prediction of target classes

Setup with NGC and TensorRT open source software

Let's review the setup. All the code for setup is available in setup_environment.sh. There are four key steps:

  1. Setting environment variables so Docker can see the webcam
  2. Downloading the VOC dataset to use for INT8 calibration (which we will see later in this post)
  3. Building a Docker image, from a Dockerfile, that contains all the libraries we need to run the code
  4. Starting a container from that image so we can run the application in the correct environment

Since we are using Docker containers to manage our environment, we need to give our container access to all the hardware in the host machine. Most of this is handled automatically by Docker, except the webcam which we add manually. We need to set permissions for Docker to access X11, which is used to open the GUI for webcam feed. Do this by using environment variables and setting permissions that are passed into the container during the docker run command.

Next, we download the PASCAL VOC dataset for INT8 calibration, which we cover in later sections. This dataset contains images of common household items and everyday objects.

Then we build a Docker image from a Dockerfile that describes our entire development environment. The Dockerfile installs the following components:

  1. TensorRT and required libraries
  2. TensorRT open source software, which replaces the plugins and parsers in the TensorRT installation
  3. Other dependencies for our application

Installing TensorRT is very simple with the TensorRT container from NVIDIA NGC. The container contains required libraries such as CUDA, cuDNN, and NCCL. NGC is a repository of pre-built containers that are updated monthly and tested across platforms and cloud service providers. See what’s in the TensorRT container in the release notes. Since we need to combine multiple other libraries and packages in addition to TensorRT, we will create a custom Dockerfile with the TensorRT container as the base image.

Since the newest versions of TensorRT plugins and parsers are available as open source, we are using them in our example. Plugins provide a way to use custom layers in models within TensorRT and are already included in the TensorRT container. The SSD model, for example, uses the flattenConcat plugin from the plugin repository. Strictly speaking, we did not need to use the open source versions of plugins in this example; using the versions shipped in the TensorRT container would have worked as well. Still, it’s handy to know that you can extend and customize these components to support custom layers in your models.

To get open source plugins, we clone the TensorRT GitHub repo, build the components using cmake, and replace existing versions of these components in the TensorRT container with new versions. TensorRT applications will search for the TensorRT core library, parsers, and plugins under this path.

Finally, we can install the other dependencies that we need for our application, which is mainly just OpenCV and its rendering libraries. OpenCV is a computer vision library which we use to interact with our webcam.

Use the docker build command to build all components in the Dockerfile:

Start the container to open your new development environment as shown below. In this command, we set the runtime to nvidia to let Docker know that our host machine has GPUs, mount the GitHub repo into the Docker container to access the code within, and forward the information needed to interact with the webcam and display through the remaining mounts and environment variables. For more information on the flags we used, check out the Docker documentation.

docker run --runtime=nvidia -it -v `pwd`/:/mnt --device=/dev/video0 -e DISPLAY=$DISPLAY -v $XSOCK:$XSOCK -v $XAUTH:$XAUTH -e XAUTHORITY=$XAUTH object_detection_webcam
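To make the role of each flag easier to see, here is a small sketch that assembles the same command as an argument list in Python. The function docker_run_args is hypothetical and not part of the sample, and the values in the usage note are assumptions chosen for illustration; setup_environment.sh defines the real ones.

```python
from typing import Dict, List


def docker_run_args(
    image: str, workdir: str, env: Dict[str, str], device: str = "/dev/video0"
) -> List[str]:
    """Build the `docker run` argument list described above (illustrative only)."""
    xsock, xauth = env["XSOCK"], env["XAUTH"]
    return [
        "docker", "run",
        "--runtime=nvidia",              # expose the host GPUs to the container
        "-it",                           # interactive terminal
        "-v", f"{workdir}/:/mnt",        # mount the repo so the code is visible
        f"--device={device}",            # pass the webcam device through
        "-e", f"DISPLAY={env['DISPLAY']}",
        "-v", f"{xsock}:{xsock}",        # X11 socket for the GUI window
        "-v", f"{xauth}:{xauth}",        # X11 authorization file
        "-e", f"XAUTHORITY={xauth}",
        image,
    ]
```

For example, calling docker_run_args("object_detection_webcam", "/home/me/repo", {"DISPLAY": ":0", "XSOCK": "/tmp/.X11-unix", "XAUTH": "/tmp/.docker.xauth"}) returns a list that could be passed to subprocess.run.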

Once the container starts, you can run your application using:

python detect_objects_webcam.py

Optimize Model, Build Engine for Inference

Within detect_objects_webcam.py, the pseudo code for this application is as follows, also shown in figure 2:

Figure 2. This blog will cover all the steps in this workflow, from building the TensorRT engine to plugging it into a simple application.

The first step is to download the frozen SSD object detection model from the TensorFlow model zoo. This is done in prepare_ssd_model in model.py:

The next step is to optimize this model for inference and generate a runtime that executes on your GPU. We use TensorRT, a deep learning optimizer and runtime engine, for this. TensorRT generates a runtime tailored to whichever NVIDIA GPU you run this application on. You need the application to deliver the lowest latency possible to perform inference in real time. Let’s see how to do that with TensorRT.

Convert the frozen TensorFlow graph to Universal Framework Format (UFF) using the utility available in model.py. You now import the UFF model into TensorRT using the parser, apply optimizations, and generate a runtime engine. Optimizations are applied under the hood during the build process and you don’t need to do anything to apply them. For example, TensorRT may fuse multiple layers such as convolution, ReLU, and Bias into a single layer. This is called layer fusion. Another optimization is tensor fusion or layer aggregation, in which layers that share the same input fuse into a single kernel and then their results are de-concatenated.
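To build some intuition for layer fusion, here is a toy sketch in plain Python. It is not how TensorRT implements fusion; it only shows that a convolution, a bias add, and a ReLU applied as three separate passes produce the same values as a single fused pass that never materializes the intermediate results. All function names here are our own.

```python
from typing import List


def conv1d(x: List[float], kernel: List[float]) -> List[float]:
    """Valid 1D cross-correlation, standing in for a convolution layer."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) for i in range(len(x) - k + 1)]


def unfused(x: List[float], kernel: List[float], bias: float) -> List[float]:
    """Three separate passes, each writing an intermediate list."""
    y = conv1d(x, kernel)             # pass 1: convolution
    y = [v + bias for v in y]         # pass 2: bias
    return [max(0.0, v) for v in y]   # pass 3: ReLU


def fused(x: List[float], kernel: List[float], bias: float) -> List[float]:
    """One pass: convolution, bias, and ReLU computed per output element."""
    k = len(kernel)
    return [
        max(0.0, sum(x[i + j] * kernel[j] for j in range(k)) + bias)
        for i in range(len(x) - k + 1)
    ]
```

On a GPU the benefit comes from launching one kernel instead of three and avoiding the round trips to memory for the intermediate tensors.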

To build a runtime engine you need to specify four parameters:

  1. Path to UFF file for our model
  2. Precision for inference engine (FP32, FP16, or INT8)
  3. Calibration dataset (only needed if you’re running in INT8)
  4. Batch size used during inference

See code for building the engine in engine.py. The function that builds the engine is called build_engine.
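One way to keep these four parameters together is a small configuration object that rejects invalid combinations early. This is a sketch of our own, not code from engine.py; the class name EngineConfig and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

VALID_PRECISIONS = ("FP32", "FP16", "INT8")


@dataclass(frozen=True)
class EngineConfig:
    """The four engine-building parameters listed above."""

    uff_path: str                          # path to the UFF file for the model
    precision: str = "FP32"                # FP32, FP16, or INT8
    calibration_dir: Optional[str] = None  # only needed for INT8
    batch_size: int = 1                    # batch size used during inference

    def __post_init__(self) -> None:
        if self.precision not in VALID_PRECISIONS:
            raise ValueError(f"precision must be one of {VALID_PRECISIONS}")
        if self.precision == "INT8" and not self.calibration_dir:
            raise ValueError("INT8 precision requires a calibration dataset")
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
```

With this in place, EngineConfig("ssd.uff") describes the default FP32, batch-size-one setup used in this post, while asking for INT8 without a calibration directory fails immediately instead of partway through a build.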

Inference in lower precision (FP16 and INT8) increases throughput and offers lower latency. Using FP16 precision provides several times faster performance on Tensor Cores than FP32 with effectively no drop in model accuracy. Inference in INT8 can lead to further performance gains with less than a 1% drop in model accuracy. TensorRT chooses kernels from FP32 and from any other precision that you allow. When you enable FP16 precision, TensorRT chooses kernels from both FP16 and FP32 precision. To get the highest performance possible, enable both FP16 and INT8 precision.

Calibration is used to determine the dynamic ranges of tensors in the graph so you can use the restricted range of INT8 precision effectively. More on that later.

The last parameter, batch size, is used to select the best kernels for the inference workload. You can use an engine for a smaller batch size than specified during its creation. However, the performance might not be ideal. I typically generate a few engines for the most common batch sizes that I expect and switch between them. In this example, we will be grabbing one frame at a time from the webcam, making the batch size one.

It’s also important to note that TensorRT automatically detects any specialized hardware that you have on your GPU. So if your GPU has Tensor Cores, it will automatically detect that and run your FP16 kernels on those Tensor Cores.

Let’s take a look at engine.py to see how all of those parameters work:

The build_engine function creates an object for the builder, parser, and network. The parser imports the SSD model in UFF format and places the converted graph in the network object. While we are using the UFF parser to import the converted TensorFlow model, TensorRT also includes parsers for Caffe and ONNX, both of which are available in the TensorRT open source repo. To use an ONNX version of this model, you would call the ONNX parser (OnnxParser in the Python API) instead; most of the remaining code would stay the same.

Line 71 specifies the memory that TensorRT should use to apply optimizations. This is just scratch space, and you should provide the largest size that your system allows; I provided 2 GB. The conditional code that follows sets parameters based on the precision for inference. For this first run, let’s use the default FP32 precision.

The next few lines specify the name and shape of the input nodes and output nodes for the parser. The call to parser.parse runs the parser on our UFF file using the parameters we specified above. Finally, builder.build_cuda_engine applies optimizations to the network and generates the engine object.
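The steps described above can be sketched roughly as follows. This is not the code from engine.py. It assumes the legacy TensorRT Python API that shipped when UFF was supported (builder attributes such as max_workspace_size and fp16_mode), which later TensorRT releases changed. The node names "Input" and "MarkOutput_0" and the 3x300x300 input shape are assumptions for an SSD Inception V2 graph, so check your own converted model.

```python
DEFAULT_WORKSPACE_BYTES = 2 << 30  # 2 GB of scratch space for the builder


def build_engine_sketch(uff_path, precision="FP32", batch_size=1,
                        calibrator=None, workspace_bytes=DEFAULT_WORKSPACE_BYTES):
    """Parse a UFF model and build an engine (legacy TensorRT API, illustrative)."""
    import tensorrt as trt  # imported lazily; needs a TensorRT install with UFF support

    logger = trt.Logger(trt.Logger.WARNING)
    with trt.Builder(logger) as builder, \
            builder.create_network() as network, \
            trt.UffParser() as parser:
        builder.max_workspace_size = workspace_bytes
        builder.max_batch_size = batch_size

        # FP32 kernels are always candidates; lower precisions are opt-in.
        if precision == "FP16":
            builder.fp16_mode = True
        elif precision == "INT8":
            builder.fp16_mode = True  # allow both, as suggested earlier in the post
            builder.int8_mode = True
            builder.int8_calibrator = calibrator

        # Assumed node names and input shape; adjust to match your graph.
        parser.register_input("Input", (3, 300, 300))
        parser.register_output("MarkOutput_0")
        if not parser.parse(uff_path, network):
            raise RuntimeError(f"Failed to parse UFF file: {uff_path}")

        # Optimizations such as layer fusion happen inside this call.
        return builder.build_cuda_engine(network)
```

Because the import sits inside the function, the module can be loaded on a machine without TensorRT; only calling the function requires it.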

The script engine.py has two additional key functions: save_engine and load_engine. Once you have generated an engine, you can save it to disk for future use, a process called serialization. Serialization generates a plan file that you can subsequently load from disk, generally much faster than rebuilding the engine from scratch. That’s what these load and save functions do. If you do change the parameters used to build the engine, the model used, or the GPU you use, you need to regenerate the engine as TensorRT would choose different kernels for building the engine.
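The load-or-build pattern behind these two functions can be sketched in a few lines. This is an illustration with a hypothetical helper, load_or_build, and not the code from engine.py. In a real application the bytes would come from serializing a TensorRT engine, and loading would hand them to a TensorRT runtime for deserialization; here the build step is just a callable that returns bytes.

```python
import os
from typing import Callable


def load_or_build(plan_path: str, build: Callable[[], bytes]) -> bytes:
    """Return the serialized engine at plan_path, building and saving it if missing."""
    if os.path.exists(plan_path):
        with open(plan_path, "rb") as f:
            return f.read()  # fast path: reuse the saved plan file
    plan = build()           # slow path: run the expensive build once
    with open(plan_path, "wb") as f:
        f.write(plan)
    return plan
```

Remember that a cached plan is only valid for the model, build parameters, and GPU it was created with, so it helps to encode those in the file name.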

You can download plan files for several combinations of pre-trained models, parameters and precisions from NGC models. If I am using a standard model, the first thing I generally check is if there is a plan file available on NGC to use directly in my application.

Run Inference With TensorRT Engine

We can now use the TensorRT engine to perform object detection. To use the engine in our example, we will take one frame from the webcam at a time and pass it to the TensorRT engine in inference.py, more specifically in the function infer_webcam:

This function first loads the image from the webcam (line 174) and then performs a few pre-processing steps in the function load_img_webcam. Our example shifts the order of the axes from HWC to CHW, normalizes the image so all the values fall between -1 and +1, and then flattens the array. You can also add any other preprocessing operations you need for your pipeline in this function.
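The pre-processing steps described above can be sketched with NumPy as follows. This is our own minimal version, not the code from load_img_webcam. It assumes the frame is an 8-bit HWC array that has already been resized to the model's input size, and the function name preprocess_frame is hypothetical.

```python
import numpy as np


def preprocess_frame(frame_hwc: np.ndarray) -> np.ndarray:
    """Convert an 8-bit HWC frame to a flat float32 CHW array in [-1, 1]."""
    chw = frame_hwc.transpose((2, 0, 1)).astype(np.float32)  # HWC -> CHW
    chw = chw / 127.5 - 1.0                                  # [0, 255] -> [-1, 1]
    return chw.ravel()                                       # flatten to 1D
```

For a 300x300 three-channel frame, the result is a one-dimensional array of 270,000 values, with all values of the first channel first.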

A timer starts in line 182 to measure the time it takes for our TensorRT engine to perform inference. This is useful to understand the latency of the whole inference pipeline.

We call do_inference to perform inference. This function sends our data to the TensorRT engine for inference and returns two parameters: detection_out and keepCount_out. detection_out contains all the information about the bounding box coordinates, confidence, and class labels for each detection, and keepCount_out keeps track of the total number of detections the network found.
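To show how these two outputs might be consumed, here is a sketch that unpacks a flat detection_out buffer. It assumes each detection occupies seven consecutive values in the order image_id, label, confidence, xmin, ymin, xmax, ymax; verify this layout against your own model's output before relying on it. The names Detection and parse_detections are our own.

```python
from typing import List, NamedTuple, Sequence

FIELDS_PER_DETECTION = 7  # assumed: image_id, label, confidence, xmin, ymin, xmax, ymax


class Detection(NamedTuple):
    label: int
    confidence: float
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def parse_detections(
    detection_out: Sequence[float], keep_count: int, min_confidence: float = 0.5
) -> List[Detection]:
    """Unpack the first keep_count detections and drop low-confidence ones."""
    detections = []
    for i in range(keep_count):
        start = i * FIELDS_PER_DETECTION
        _image_id, label, conf, xmin, ymin, xmax, ymax = detection_out[
            start:start + FIELDS_PER_DETECTION
        ]
        if conf >= min_confidence:
            detections.append(
                Detection(int(label), float(conf), float(xmin), float(ymin),
                          float(xmax), float(ymax))
            )
    return detections
```

The confidence threshold is a design choice: a higher value shows fewer but more reliable boxes on the video feed.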

Putting It All Together

So far we have looked at how to import a pre-trained model from TensorFlow model zoo, convert it to UFF format, apply optimizations and generate a TensorRT engine, and finally use the engine to perform inference on a single image from the webcam.

Let’s see how all these components come together in detect_objects_webcam.py:

After parsing command line arguments, prepare_ssd_model uses model.py to convert the frozen TensorFlow graph to UFF format. Then we initialize a TensorRT Inference object in line 153, which uses build_engine in engine.py, as discussed above, to build the TensorRT engine. As mentioned earlier, if we do not have an engine file already saved at our args.trt_engine_path, then we need to build one from scratch. The same goes for the UFF version of our model. We will run in default FP32 precision, which eliminates the need to provide a calibration dataset. Lastly, since we run live inference on just one webcam feed, we will keep our batch size at 1.

Now let's integrate this into the application that operates the webcam. If the camera flag is turned ‘on’ (default), the app will start a video stream using OpenCV (line 164) and enter the main loop in line 167. In this loop, we will be constantly pulling in new frames from the webcam, as shown in line 169, and then will perform inference on that frame as shown in line 172.

Finally, we overlay the bounding box results onto the original frame (lines 176–180) and display the annotated frame to the user using imshow.
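A stripped-down version of such a loop might look like the sketch below. It is not the code from detect_objects_webcam.py. The infer argument is a hypothetical callable that takes a frame and yields (label, confidence, xmin, ymin, xmax, ymax) tuples, and the box coordinates are assumed to be normalized to the range 0 to 1.

```python
from typing import Tuple


def to_pixel_box(xmin: float, ymin: float, xmax: float, ymax: float,
                 width: int, height: int) -> Tuple[int, int, int, int]:
    """Scale normalized box coordinates to pixel coordinates."""
    return (int(xmin * width), int(ymin * height),
            int(xmax * width), int(ymax * height))


def run_webcam_loop(infer, camera_index: int = 0) -> None:
    """Grab frames, run inference, draw boxes, and show the result until 'q' is pressed."""
    import cv2  # imported lazily so the helper above works without OpenCV

    cap = cv2.VideoCapture(camera_index)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break  # camera disconnected or no frame available
            height, width = frame.shape[:2]
            for label, confidence, xmin, ymin, xmax, ymax in infer(frame):
                x0, y0, x1, y1 = to_pixel_box(xmin, ymin, xmax, ymax, width, height)
                cv2.rectangle(frame, (x0, y0), (x1, y1), (0, 255, 0), 2)
                cv2.putText(frame, f"{label}: {confidence:.2f}", (x0, max(y0 - 5, 0)),
                            cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
            cv2.imshow("detections", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):
                break
    finally:
        cap.release()
        cv2.destroyAllWindows()
```

Keeping the inference step behind a single callable makes it easy to swap engines, for example to compare FP32 and INT8 builds on the same feed.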

And that’s our whole pipeline!

Inference in INT8 Precision With TensorRT

The app performs inference several times faster using TensorRT on GPUs compared to in-framework inference. However, you can make it several times faster yet. We have so far used single precision (FP32) for inference, where every number is represented using 32 bits. In FP32, activation values can lie within a range of ±3.4×10³⁸. Storing every number at this precision requires significantly more storage during execution and also results in lower performance. Most models perform with nearly identical accuracy when switched to lower precision FP16. Using models and techniques provided by NVIDIA enables you to get the highest performance possible using INT8 precision for inference. However, notice the significantly lower dynamic range that can be represented with INT8 precision in Figure 3.

Figure 3. The dynamic range of values that can be represented in FP32, FP16, and INT8 precision

To use INT8 precision and obtain accuracy similar to FP32 inference, you need to perform an additional step called calibration. During calibration, you run inference on training data that is similar to your final dataset and collect ranges for the activation values. TensorRT then calculates a scaling factor to distribute the range of INT8 values over this range of activation values for each node. Figure 4 shows that if the activation range for a node lies between -6 and +6, you want the 256 values that can be represented with INT8 to cover only this range.

Figure 4. Calibration and quantization are critical steps for converting to INT8 precision.
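To make the scaling factor concrete, here is a small worked sketch using the -6 to +6 example. It uses a simple symmetric mapping onto the integers from -127 to 127 based on the largest absolute activation value. This is an assumption made for illustration; TensorRT's entropy calibrator chooses the range with a more sophisticated method, so treat this as the idea and not the algorithm. The function names are our own.

```python
def int8_scale(max_abs_activation: float) -> float:
    """Scale factor that maps [-max_abs, +max_abs] onto [-127, 127]."""
    return max_abs_activation / 127.0


def quantize(x: float, scale: float) -> int:
    """Round to the nearest INT8 step and clip values outside the calibrated range."""
    q = round(x / scale)
    return max(-127, min(127, q))


def dequantize(q: int, scale: float) -> float:
    """Map an INT8 value back to an approximate real value."""
    return q * scale
```

With a calibrated range of 6, the scale is 6 / 127, or roughly 0.047, so each INT8 step covers about 0.047 units of activation. Values beyond the calibrated range, such as 100.0, saturate at 127, which is why choosing a range that matches the real activations matters.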

Use the command below to re-build a TensorRT engine to use INT8 for precision in your application, perform calibration, and run inference. The whole process might take a few minutes:

python detect_objects_webcam.py -p 8

You should see the same result with higher performance than that achieved with FP32 precision earlier.

Let’s look at how this is done in build_engine in engine.py. Based on the precision you enable for inference, the conditional block enables different builder modes. By default, TensorRT always chooses FP32 kernels. If you enable FP16 mode, it also tries kernels running in FP16 precision; the same goes for INT8.

However, allowing lower precision kernels doesn’t mean that they will always outperform higher precision kernels. For example, even though we set our precision mode to INT8, there may be some FP16 or FP32 kernels that end up running faster. TensorRT will choose whichever kernels give the best speed.

TensorRT detects the presence of specialized hardware, such as Tensor Cores, and will use FP16 kernels on them to get the highest performance possible. The ability of TensorRT to choose the best kernels automatically is called kernel autotuning. This makes it possible to use TensorRT across a wide variety of applications while delivering high performance.

Notice that in the INT8 conditional block we use the class SSDEntropyCalibrator. This class feeds calibration data through your model in batches during calibration. All you need to do is implement the function get_batch in calibrator.py to fetch the next batch of data from your calibration dataset. See the code for SSDEntropyCalibrator in calibrator.py.

The calibrator takes a directory of images to calibrate with and a location to store the cache file. This cache file contains all the scaling factors you need for your network activations. Because the cache is saved, you only need to run calibration once for a particular configuration and can just load this cache table for any subsequent runs.
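The batching and caching behavior can be sketched without TensorRT as shown below. This stand-in is not SSDEntropyCalibrator: a real calibrator derives from a TensorRT calibrator base class and hands back device memory for each batch, while this hypothetical CalibrationBatcher only returns lists of image paths. The method names mirror the calibrator interface to show where each piece of logic would live.

```python
import os
from typing import List, Optional, Sequence


class CalibrationBatcher:
    """TensorRT-free stand-in showing the batching and cache logic of a calibrator."""

    def __init__(self, image_paths: Sequence[str], batch_size: int, cache_file: str):
        self.image_paths = list(image_paths)
        self.batch_size = batch_size
        self.cache_file = cache_file
        self._position = 0

    def get_batch_size(self) -> int:
        return self.batch_size

    def get_batch(self) -> Optional[List[str]]:
        """Return the next full batch of image paths, or None when the data runs out."""
        end = self._position + self.batch_size
        if end > len(self.image_paths):
            return None  # a trailing partial batch is skipped
        batch = self.image_paths[self._position:end]
        self._position = end
        return batch

    def read_calibration_cache(self) -> Optional[bytes]:
        """Return cached scaling factors if present, so calibration can be skipped."""
        if os.path.exists(self.cache_file):
            with open(self.cache_file, "rb") as f:
                return f.read()
        return None

    def write_calibration_cache(self, cache: bytes) -> None:
        with open(self.cache_file, "wb") as f:
            f.write(cache)
```

Returning None from get_batch is the signal that calibration data is exhausted, and a cache hit in read_calibration_cache means the images never need to be read at all.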

And that’s all you need to do to perform INT8 calibration with TensorRT!

Where to From Here?

This post shows how to set up and run an object detection application on GPUs quickly. It covered a lot of ground including setup, deploying in INT8 precision, using the newly open sourced plugins and parsers in TensorRT, connecting to a webcam and overlaying results.

We’ll leave you with a few resources related to this post:

We hope you enjoyed reading this post as much as we enjoyed developing it. Over to you: how do you use GPUs for inference?

We are always looking for cool app ideas for blogs and tutorials. Tell us what you find most challenging by leaving a comment below.

If you run into issues with using this app, be sure to check the issues in this sample’s GitHub repo for similar issues and solutions.

If you have questions about using TensorRT, check the NVIDIA TensorRT Developer Forum first to see if other members of the TensorRT community already have a resolution. Members of the NVIDIA Registered Developer Program can also file bugs at https://developer.nvidia.com/nvidia-developer-program.



NVIDIA Authors: Gary Burnett, Solution Architect & Siddharth Sharma, Product Marketing Manager

Better Programming

Advice for programmers.

