Computer Vision in production – Nvidia DeepStream

Published in

VirtusLab

13 min read · Jan 28, 2022

We examine Nvidia DeepStream, an emerging solution for implementing computer vision within your software projects.

If you blinked while reading the title of this article, someone somewhere will have published a new research paper related to Artificial Intelligence (AI), Deep Learning (DL) — or Computer Vision (CV).

The quantity of intriguing research papers you can find in 5 minutes would well exceed your daily reading capacity. But, let us be honest, the vast majority of these studies present the AI-related state-of-the-art from 5 to 10 years back. Furthermore, every day, AI enthusiasts are exposed to thousands of blog articles displaying proofs of concept (POCs) and demos of research work applications.

One result of this never-ending stream of information is increased confusion among those businesses seeking to adopt AI developments to add real value. The maturity of, and hype around, a specific study field impacts the rate of adoption. The Gartner Hype Cycle for Artificial Intelligence neatly encapsulates this relationship.

Gartner Hype Cycle for Artificial Intelligence, 2021

The diagram places Computer Vision (CV) at the start of the slope of enlightenment, with the expectation that it will reach the plateau of productivity in 2–5 years. Interestingly, Deep Learning (DL) is thought to have already reached its peak of inflated expectations and is likely to take the same amount of time to get to the plateau of productivity. Although Computer Vision can be thought of as a traditional image analysis discipline, in reality many solutions in that domain are intertwined with Deep Learning. With neither field yet on the plateau, we may doubt the stability of solutions built upon them. Arguably, though, we already have robust products on the market that solve particular Computer Vision tasks (like image classification, object detection, or semantic segmentation) with the help of DL models.

So what do we actually expect to see when the plateau is reached? Predicting the future is rarely easy, but I personally hope that CV techniques will be employed more widely in IT systems, almost as first-class citizens in many projects. I also believe that adopting CV- and DL-related solutions, both in-house and third-party, will become easier. To make this possible, solid technologies for hosting those solutions in various contexts must emerge. Hopefully, such a tool is being developed right before our eyes.

Now it is time to introduce the main topic of this article. If you have ever considered hosting a DL model in your application, you’ve probably also thought about a cloud-based solution that makes it easy to wrap the model with a REST API. You may also want easy model deployment, that is, a way to bundle the models that Data Scientists create into an asset from which the system can serve predictions. If that’s your use case, a few solid tools will help. However, when you serve online predictions based on video or audio feeds, the situation changes, and adding an edge environment to the use case increases the complexity further. What do you do if you need to make a system production-ready under these circumstances? Well, here comes Nvidia DeepStream.

This article provides an overview of Nvidia DeepStream SDK, its foundations, features, and insights from my perspective as a person who develops and maintains services built upon DeepStream. Naturally, other authors have also described the SDK and its application. I consider this article to be the first of a series describing the internals of DeepStream. I will release the following articles if we see the topic is of interest to our readers.

What can Nvidia DeepStream be used for?

Thinking of DL models as elements within a service or a data processing pipeline helps us understand how we can deploy these models. Most commonly, a model is deployed as part of a batch process, with inputs and inference results repeatedly exchanged in memory. When a model needs to be exposed to customers, it is typically wrapped in a REST API. A few variants are available here. For example, we could use the classical request-response pattern; alternatively, we could submit a job to a queue, assuming the client will poll its status until the processing is complete. However, how to handle online stream processing is still an open question. In a general software development setting, we can readily find a suitable tool for stream processing. In the case of Deep Learning, it may not be such an easy task. One of the factors contributing to the difficulty is that DL models can handle various types of data, so the problem is really determined by the context in which the model is used. In particular, to produce object detection predictions online, we need a tool capable of both processing video streams efficiently and hosting our model. In essence, this is what DeepStream can do.

Assuming we want to build a solution for online video processing with DL models on board, the other question that must be answered is the target environment. In most cases, we would have the luxury of hosting our solution in a Data Center or in the Cloud. But what if this is not the case? At first glance, the edge seems a little bit scary. So, should we cry into our pillows? Fortunately, DeepStream can be our salvation in that case too. In fact, even relatively small ARM devices (which are typically far more power-efficient than x86 machines) can host complex DL models under the DeepStream umbrella.

https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/

After reading the above paragraphs, the obvious conclusion is that Nvidia has built a tool that sits in a particular niche. DeepStream seems to be the best, and possibly the only, answer to the requirements I stated above. But is it really as great as it looks? What kind of features and extensions does it provide? How can I adopt DeepStream in my project, and what kind of difficulties may I encounter while doing so? All those questions are valid, but let’s begin at the beginning for now.

The foundations of Nvidia DeepStream

Basic video processing foundations are necessary to build a tool to host DL models and serve inference on a video feed. Fortunately, a stable solution already exists. It is called Gstreamer. The Nvidia developers built their solution on top of this.

The concept of Gstreamer is simple, so this basic diagram should quickly give you an understanding:

https://gstreamer.freedesktop.org/documentation/tutorials/basic/concepts.html

Gstreamer has plenty of plugins that can be categorized as sources (which emit a stream), filters (which perform operations on a stream), and sinks (the terminal elements of a stream). Plugins can be arranged into pipelines to serve various needs, and much more complicated pipelines than the one shown above can be built:

https://gstreamer.freedesktop.org/documentation/tutorials/basic/short-cutting-the-pipeline.html

The source code of Gstreamer is written in C, with bindings for other languages also available. It is a relatively low-level framework, which may seem scary initially, but reassuringly the tool is very well documented. The return on the time and energy you put into learning Gstreamer is a robust, stable, and highly efficient tool.

I want to share some examples to show that Gstreamer is not a monster to be afraid of. You’ll need the following to run these examples:

Docker (preferably with the Nvidia-docker extension) installed on your computer
DeepStream docker image: nvcr.io/nvidia/deepstream:5.0.1-20.09-devel
A GPU — if you want to experiment further with DeepStream on your own

After running the container, you can try to run the first Gstreamer pipeline:

gst-launch-1.0 -e -v \
    videotestsrc pattern=ball flip=1 ! \
    video/x-raw,width=1280,height=720 ! \
    x264enc ! \
    mp4mux ! \
    filesink location=simple_example_with_ball.mp4

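The command generates a test video, encodes it with H.264, and writes it to a file named simple_example_with_ball.mp4 in your current directory. The test source runs until you interrupt it, so stop the pipeline with Ctrl+C; the -e flag makes gst-launch-1.0 send an end-of-stream event on shutdown so that the MP4 file is finalized properly. When the command is prefixed with GST_DEBUG_DUMP_DOT_DIR=. you will also find *.dot files containing a dump of the pipeline shape in the current directory.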
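As a hedged sketch of the GST_DEBUG_DUMP_DOT_DIR workflow, the snippet below dumps the pipeline graph into a separate directory and renders it to PNG. The directory name pipeline_dumps, the shortened pipeline ending in fakesink, and the presence of the Graphviz dot tool are assumptions made for illustration; Graphviz may need to be installed in the container first.

```shell
# Directory that will receive the *.dot dumps (hypothetical name).
DOT_DIR="./pipeline_dumps"
mkdir -p "$DOT_DIR"

# Run a short pipeline only if Gstreamer is available on this machine.
if command -v gst-launch-1.0 >/dev/null 2>&1; then
    # num-buffers bounds the test source, so the run ends on its own;
    # fakesink discards the frames because only the graph dump matters here.
    GST_DEBUG_DUMP_DOT_DIR="$DOT_DIR" gst-launch-1.0 \
        videotestsrc pattern=ball num-buffers=60 ! \
        video/x-raw,width=320,height=240 ! \
        fakesink
fi

# Render every dump to PNG if Graphviz is installed.
if command -v dot >/dev/null 2>&1; then
    for f in "$DOT_DIR"/*.dot; do
        [ -e "$f" ] || continue          # no dumps were produced
        dot -Tpng "$f" -o "${f%.dot}.png"
    done
fi
```

gst-launch-1.0 writes one dump per pipeline state change, so several *.dot files are expected for a single run.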

The output video should look like this:

Impressive, right? Only kidding! But let me now show you a more complicated pipeline that illustrates how Gstreamer handles the processing of multiple video streams.

gst-launch-1.0 -e -v \
    videomixer name=mix \
    sink_0::xpos=0 sink_0::ypos=0 \
    sink_1::xpos=640 sink_1::ypos=0 \
    sink_2::xpos=0 sink_2::ypos=480 \
    sink_3::xpos=640 sink_3::ypos=480 \
    ! videoconvert ! x264enc ! mp4mux \
    ! filesink location=complex_example_with_balls.mp4 \
    videotestsrc pattern=ball flip=1 foreground-color=-32128 \
    ! video/x-raw,width=640,height=480 ! mix. \
    videotestsrc pattern=ball flip=1 foreground-color=-16348 \
    ! video/x-raw,width=640,height=480 ! mix. \
    videotestsrc pattern=ball flip=1 foreground-color=-8192 \
    ! video/x-raw,width=640,height=480 ! mix. \
    videotestsrc pattern=ball flip=1 foreground-color=-128 \
    ! video/x-raw,width=640,height=480 ! mix.

Looks complicated, but the explanation is really quite simple:

videomixer (named mix) is a plugin that can combine multiple streams; it has a dynamic number of sink pads (sink pads are the inputs of the plugin), each configured with a position on the output image
the output of videomixer is connected sequentially to the elements responsible for encoding the video into H.264 format
as in the first example, the pipeline ends with a filesink element that dumps the video into a file
the last lines define the input video streams; again, we create some dummy video for the sake of a quick demo
each videotestsrc is configured and connected to mix (written as mix. with a trailing dot, which is how gst-launch refers to a named element); the pipeline’s description may be confusing at first, but it is just a simple way to linearise a non-linear structure
for further reference, visit the Gstreamer documentation page
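To make the linearisation idea more concrete, here is a minimal sketch that generates the same kind of description for an arbitrary grid of test streams. The helper grid_pipeline and the output name grid.mp4 are hypothetical, introduced only for illustration; the elements and pad properties are the ones used in the example above.

```shell
# Print a gst-launch-1.0 description for a COLS x ROWS grid of test streams.
# grid_pipeline is a hypothetical helper, not part of Gstreamer.
# Usage: grid_pipeline COLS ROWS TILE_WIDTH TILE_HEIGHT
grid_pipeline() {
    cols=$1; rows=$2; w=$3; h=$4
    pads=""; branches=""; i=0; r=0
    while [ "$r" -lt "$rows" ]; do
        c=0
        while [ "$c" -lt "$cols" ]; do
            # One mixer sink pad per tile, positioned by column and row.
            pads="$pads sink_$i::xpos=$((c * w)) sink_$i::ypos=$((r * h))"
            # One source branch per tile, linked back to the mixer by name.
            branches="$branches videotestsrc pattern=ball ! video/x-raw,width=$w,height=$h ! mix."
            i=$((i + 1)); c=$((c + 1))
        done
        r=$((r + 1))
    done
    # The mixer and its output chain come first, the input branches follow.
    printf 'videomixer name=mix%s ! videoconvert ! x264enc ! mp4mux ! filesink location=grid.mp4%s\n' \
        "$pads" "$branches"
}

# Show the description for the 2 x 2 layout used in the article.
grid_pipeline 2 2 640 480
```

Assuming Gstreamer is installed, the result can be handed to the launcher unquoted so that the shell splits it into arguments, for example: gst-launch-1.0 -e $(grid_pipeline 3 2 640 480)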

The resulting pipeline looks like this:

And the output video is slightly more entertaining than the first one.

Honestly speaking, those examples were elementary. They were simply meant to illustrate Gstreamer’s video processing potential. If you want to learn more, please visit the Gstreamer documentation, especially the tutorials.

Bringing the original topic back into the foreground: how does Nvidia DeepStream relate to Gstreamer? The most straightforward explanation can be read below:

Nvidia chose to extend Gstreamer with proprietary plugins and extensions that use its own libraries (like CUDA and TensorRT) to get the best out of its GPU cards. Since many DeepStream releases have now been published, we can see that Nvidia succeeded in delivering a solid and performant piece of work.
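A rough sketch of what this looks like in practice is shown below: an ordinary gst-launch-1.0 pipeline in which some elements come from DeepStream. The element names (nvv4l2decoder, nvstreammux, nvinfer, nvvideoconvert, nvdsosd) are DeepStream plugins; the install prefix and the sample stream and config paths are assumptions about the layout of the 5.0.1 container and may differ in your image.

```shell
# Assumed install prefix inside the DeepStream 5.0.1 container; adjust if yours differs.
DS_ROOT="${DS_ROOT:-/opt/nvidia/deepstream/deepstream-5.0}"
STREAM="$DS_ROOT/samples/streams/sample_720p.h264"
CONFIG="$DS_ROOT/samples/configs/deepstream-app/config_infer_primary.txt"

# Run only where the DeepStream plugins and the sample files are actually present.
if command -v gst-inspect-1.0 >/dev/null 2>&1 \
        && gst-inspect-1.0 nvinfer >/dev/null 2>&1 \
        && [ -f "$STREAM" ] && [ -f "$CONFIG" ]; then
    # filesrc/h264parse: stock Gstreamer elements reading the H.264 file
    # nvv4l2decoder:     hardware-accelerated decoding
    # nvstreammux:       batches frames from one or more sources
    # nvinfer:           runs the model described in the config file
    # nvdsosd:           draws the detections on the frames
    # fakesink:          discards the result, so no display is needed
    gst-launch-1.0 -e \
        filesrc location="$STREAM" ! h264parse ! nvv4l2decoder ! mux.sink_0 \
        nvstreammux name=mux batch-size=1 width=1280 height=720 ! \
        nvinfer config-file-path="$CONFIG" ! \
        nvvideoconvert ! nvdsosd ! \
        fakesink
else
    echo "DeepStream plugins or sample files not found; skipping the run." >&2
fi
```

The structure is the same source, filter, sink chain as in the earlier examples; only the filters in the middle are Nvidia-specific.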

DeepStream and its features

I’ve shared my positive thoughts about DeepStream several times already, but I didn’t provide any justification for those claims. That does not mean they came out of the blue. The first aspect I am curious about relates to performance. A quick glimpse at the official report provided by Nvidia shows you how good a job their developers have done in terms of optimization. The graphs illustrate inference for one of the object detection models, PeopleNET, which was pruned and quantized using Nvidia proprietary tools.

https://docs.nvidia.com/metropolis/deepstream/dev-guide/text/DS_Performance.html

Performance is excellent even on ARM devices, such as the Nvidia Xavier. Plenty of evidence for DeepStream’s performance can also be found in reports from people not directly related to Nvidia. For example, on YouTube, it is pretty easy to search for object detection demos showing the brilliant performance of the tool.

I must admit that when I first realized the level of DeepStream performance, I was shocked, and I would not be surprised if people reading this post who have experience working with DL models shared the same feelings.

Why is DeepStream so performant? One factor has already been mentioned: its foundation is a low-level tool that is highly efficient in video processing. But saying that is almost like saying nothing. Nvidia built Gstreamer extensions related to hosting DL models and also boosted the performance of the native Gstreamer plugins. This resulted in a custom stream decoder that also unleashes the video decoding power of GPUs. Nvidia has taken their hardware and low-level libraries and mixed them with Gstreamer, positioning themselves as providers of end-to-end foundations for DL platforms.

It is clear that Nvidia is aiming for more advanced solutions, in which DeepStream is only one component of the whole ecosystem.

https://developer.nvidia.com/deepstream-sdk

For everyone working with ML and DL models daily, it is clear that a tool to deploy models is not enough. As ML solutions become more mature, there is a solid demand to put MLOps in place. Only the combination of MLOps tools, deployment tools, analytical tools, and a flexible environment for Data Science experiments can be seen as a good baseline for an AI-oriented team to work efficiently. Does Nvidia already provide all of those components? No, at least not to the full extent. However, some components are already available, and further ones are probably yet to come. In that context, the Transfer Learning Toolkit (since renamed TAO Toolkit) is perhaps worth describing briefly.

This solution allows users to select a pre-trained model that handles Computer Vision tasks and fine-tune it to fit their own needs. Apart from model training, it supports model pruning and quantization. Those techniques are (apart from integration with hardware) the main factors contributing to performance. Without them, the throughput of model inference under DeepStream will not be significantly better than inference in frameworks like TensorFlow or PyTorch. It is also worth mentioning that Nvidia uses its own optimized model representation, produced by TensorRT.

https://developer.nvidia.com/blog/tensorrt-3-faster-tensorflow-inference/

TensorRT allows us to merge or transform the inner DL model building blocks making the inference highly efficient on a specific hardware type (which is why the conversion is done on the same kind of hardware meant to host the deployed model).

Within the toolkit, the user may pick one of the baseline models prepared by Nvidia:

https://developer.nvidia.com/tao-toolkit

Additionally, taking a slightly broader historical context into account, Nvidia seems to be improving the solution over time. Thinking back to my first encounters with their AI-related technologies, I must admit that they have made significant progress in advancing their tools. Considering also the Gartner diagram cited in the introduction to this post, the best is probably yet to come.

DeepStream: a silver bullet?

The obvious answer is no, as no tool fits that description. However, as an engineer who has worked with DeepStream for quite a while, I can pinpoint the main issues that need to be solved in future development on the Nvidia side. First of all, the solution is not as flexible as it may seem. For example, we often want to deploy a custom model in the DeepStream pipeline. For that to work, the model needs to be converted either to an intermediate format (like ONNX or UFF) or to the target format, a TensorRT engine. This process is often complicated, especially when the model has custom building blocks not supported by either the intermediate format or TensorRT. If that’s the case, custom extensions need to be prepared. As a result, translating the most recent state-of-the-art models is not easy.
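For the simple case where every layer is supported, the conversion step can be sketched with trtexec, the command-line tool shipped with TensorRT. The file names model.onnx and model.engine are hypothetical, and the tool may not be on your PATH (it is often found under /usr/src/tensorrt/bin, which is an assumption about the installation). Because the resulting engine is hardware-specific, the command should be run on the same kind of device that will host the model.

```shell
# Hypothetical input and output files; override through the environment if needed.
ONNX_MODEL="${ONNX_MODEL:-model.onnx}"
ENGINE_FILE="${ENGINE_FILE:-model.engine}"

if command -v trtexec >/dev/null 2>&1 && [ -f "$ONNX_MODEL" ]; then
    # --onnx:       the intermediate-format model to parse
    # --saveEngine: where to write the serialized TensorRT engine
    # --fp16:       allow half-precision kernels where the hardware supports them
    trtexec --onnx="$ONNX_MODEL" --saveEngine="$ENGINE_FILE" --fp16
else
    echo "trtexec or $ONNX_MODEL not found; skipping the conversion." >&2
fi
```

When the model contains an operation that TensorRT cannot parse, this is the step that fails, and it is where the custom extensions mentioned above come in.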

Another problem to mention is the quality of documentation. It is getting better with every release, but sometimes, to fully understand how DeepStream works, I needed to dive into the source code of the SDK (which is available in the devel version of the DeepStream docker image). Should more help be required, the Nvidia developer forum is there to come to the rescue.

Most developers would prefer to create services in high-level programming languages rather than C/C++. DeepStream, on the other hand, is a low-level tool natively based on C and C++. There is, for instance, a Python library wrapping the tool. Still, my personal preference is to write the pipeline definition in C rather than in Python when everything, including memory management, must be done as if in C. So, either you are comfortable with creating modules of your service in a low-level programming language, or you need to wait for the release of more robust high-level wrappers.

Clearly, neither DeepStream nor the Transfer Learning Toolkit is stable yet. We should regard them as tools still under heavy development. That is a healthy state (and expected when compared with the Gartner Hype Cycle diagram), but it comes with a price: changes in the API. Nvidia tries to keep its solutions backwards compatible, but because changes are sometimes made at the plugin design level (like the change of responsibility for nvstreammux), there can be a lot of confusion. The pace of development indicates that there will be further changes soon, so if you build services based on DeepStream now, be ready for them; otherwise, maintenance may be a tough experience.

Final thoughts

DeepStream is definitely worthy of attention. But, just like most emerging software projects, it has its issues. Nevertheless, it also shows tremendous potential.

If the pace of development is kept up, we can expect it to be a strong foundation for many great solutions in 2–3 years. Fingers crossed that Nvidia invests funds and effort in further work on the project.

I hope this article turned out to be useful for you. If you liked it, please share your thoughts in the comments section. Having your feedback will help me improve future content.

If you are interested in working on projects where DeepStream is included in the tech stack, check out the open positions at VirtusLab.