Use FFmpeg to Decode H.264 Stream with NVIDIA GPU Acceleration

zong fan
Sep 8, 2019


NVIDIA hardware acceleration architecture (Source: https://devblogs.nvidia.com/nvidia-ffmpeg-transcoding-guide/)

FFmpeg is a very popular and powerful video manipulation tool that can be applied to many tasks, including decoding and transcoding. It has been widely integrated into other tools that need to deal with video streams; for example, OpenCV can use FFmpeg as the backend for cv2.VideoCapture() and cv2.VideoWriter(). However, decoding and encoding are really computation-intensive, so using the CPU as the computational device is relatively inefficient. If we use OpenCV to decode an RTSP stream with VideoCapture(), the CPU usage is very high, like this:

Fortunately, many devices, including NVIDIA GPUs, have high-performance hardware to accelerate video pipelines. So here we will introduce how to use the NVIDIA Video Codec SDK to boost transcoding, with an example showing how to extract frames from a video stream, which can be applied in many deep learning computer vision tasks.

1. Installation

The core dependencies are nv-codec-headers and CUDA; you also need the FFmpeg/libav sources and an NVIDIA driver. Follow these steps to finish the installation.
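First install the headers and fetch the FFmpeg sources. A typical sequence looks like this (a sketch using the usual repository URLs; adjust paths and versions to your system):

$ git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git
$ cd nv-codec-headers && sudo make install && cd ..
$ git clone https://git.ffmpeg.org/ffmpeg.git && cd ffmpeg

Then configure and build FFmpeg with the NVIDIA codecs enabled: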

$ ./configure --enable-cuda --enable-cuvid --enable-nvenc --enable-nonfree --enable-libnpp --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64
$ make -j 10
$ sudo make install

2. CPU decode vs. GPU decode

After installation, we can decode a real-time H.264 RTSP video stream to check whether the build succeeded. For a better comparison, we first run FFmpeg with the default software decoder and raw video output to see its CPU usage.

CPU

$ ./ffmpeg -y -vsync 0 -i rtsp://admin:startdt123@192.168.2.71/Streaming/Channels/1 -f rawvideo test.mp4

A brief explanation of this command:

  • -y means to overwrite output files without asking
  • -vsync 0 prevents frame duplication
  • -i specifies the video input source; here it is an RTSP stream
  • -f forces the input or output format. Normally the format is auto-detected for input files and guessed from the file extension for output files, so this parameter determines which decoder or encoder to use.

And you will see the following output:

The CPU usage looks like this (by the way, my CPU is an Intel Core i7-6800K).

GPU

Now let’s add the NVIDIA hardware decoder flags to this command to offload the decoding to the GPU.

$ ./ffmpeg -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i rtsp://admin:startdt123@192.168.2.71/Streaming/Channels/1 -f rawvideo -c:v h264_nvenc test.mp4

  • -hwaccel cuvid means to keep the decoded frames in GPU memory
  • -c:v h264_cuvid and -c:v h264_nvenc select the NVIDIA hardware-accelerated H.264 decoder and encoder
  • nvdec can be used in place of cuvid, as in the example below
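For instance, an equivalent command using the newer nvdec hwaccel (a sketch; on recent FFmpeg builds it should behave like the cuvid form):

$ ./ffmpeg -y -vsync 0 -hwaccel nvdec -i rtsp://admin:startdt123@192.168.2.71/Streaming/Channels/1 -c:v h264_nvenc test.mp4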

Now we can see that the transcoding work has been moved to the GPU, and the CPU usage has dropped to 2%. Here is the memory allocation mechanism with and without -hwaccel:

Memory flow without hwaccel (Source: https://devblogs.nvidia.com/nvidia-ffmpeg-transcoding-guide/)
Memory flow with hwaccel (Source: https://devblogs.nvidia.com/nvidia-ffmpeg-transcoding-guide/)

But what happens if we put only part of the transcoding flow on the GPU, such as the decoding? Consider the following command, which seems to put decoding on the GPU and encoding on the CPU:

$ ./ffmpeg -y -vsync 0 -hwaccel nvdec -i rtsp://admin:startdt123@192.168.2.71/Streaming/Channels/1 -f rawvideo test.mp4

The CPU usage is higher than in the previous case, but still much lower than doing everything on the CPU. So to put all of the computation on the GPU, you need flags that select both the NVIDIA decoder and the NVIDIA encoder.

3. How to read decoded frames into a Python script directly

The next question is how to extract the decoded images from the RTSP stream in Python. This is quite important, since most of our deep learning applications are developed in Python, so it's desirable to parse the decoded video byte stream in Python to get back the original RGB images. One way is to use a named pipe. However, the NVIDIA hardware codecs only work with YUV formats: if you set the output to a specific pixel format like bgr24, the command still works, but the conversion is handled by a software path on the CPU instead. Here we make a compromise: read the YUV bytes and use TensorFlow to do the YUV-to-RGB conversion on the GPU. In fact, FFmpeg also allows custom conversions and other operations defined under the -filter_complex flag, which is a more elegant way.

Note that the number of bytes you read for each frame must correspond to the pixel format of the output stream. See this post for a detailed explanation of the RGB and YUV formats.
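Here is a minimal sketch of the pipe idea using the child process's stdout (assumptions beyond the article: a 1920x1080 stream, NV12 output at 1.5 bytes per pixel, and an ffmpeg binary on PATH; adjust the URL and size to your camera):

import subprocess

import numpy as np

# Assumed stream parameters; set these to match your camera.
RTSP_URL = "rtsp://admin:startdt123@192.168.2.71/Streaming/Channels/1"
WIDTH, HEIGHT = 1920, 1080
FRAME_SIZE = WIDTH * HEIGHT * 3 // 2  # NV12: full-res Y plane + half-res interleaved UV plane

cmd = [
    "ffmpeg",
    "-c:v", "h264_cuvid",                  # decode on the GPU; frames are copied back for the pipe
    "-i", RTSP_URL,
    "-f", "rawvideo", "-pix_fmt", "nv12",  # raw NV12 frames
    "pipe:1",                              # write raw frames to stdout
]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, bufsize=10 * FRAME_SIZE)

while True:
    raw = proc.stdout.read(FRAME_SIZE)
    if len(raw) < FRAME_SIZE:
        break  # stream ended or was interrupted
    # One NV12 frame: Y plane on top, interleaved UV plane below.
    yuv = np.frombuffer(raw, dtype=np.uint8).reshape(HEIGHT * 3 // 2, WIDTH)
    # ... convert YUV to RGB on the GPU (e.g., in TensorFlow) and feed your model ...

proc.stdout.close()
proc.wait()

Reading exactly FRAME_SIZE bytes per iteration is what keeps the frames aligned; if the size does not match the pixel format, every subsequent frame will be shifted garbage.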

4. One more thing

Up to this point, we can roughly pipe FFmpeg-decoded RTSP streams into a Python application. But there are still many problems you will meet in a practical application, and the most plaguing one is probably the missed-packet error; one common mitigation is shown below. Fortunately, as I said in the beginning, FFmpeg is a powerful library, and by tuning the command-line parameters you can actually get a really stable and smooth video pipeline. This post has a good summary of the best settings for FFmpeg with nvenc. If you need more complete and fine-grained control over the transcoding pipeline, I believe the exhaustive official FFmpeg documentation won't disappoint you, even if it may be a little onerous.
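For example, when the missed packets come from RTP over UDP, forcing RTSP over TCP often helps (a commonly used option, not a universal fix):

$ ./ffmpeg -rtsp_transport tcp -y -vsync 0 -hwaccel cuvid -c:v h264_cuvid -i rtsp://admin:startdt123@192.168.2.71/Streaming/Channels/1 -f rawvideo -c:v h264_nvenc test.mp4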
