Hardware Accelerated Transcoding — A marriage between FFmpeg, Containers, and Nvidia

Eric Luu · Published in GumGum Tech Blog · 11 min read · Apr 24, 2024
Photo by Thomas Foster on Unsplash

FFmpeg is a versatile multimedia framework capable of handling virtually any media in existence. Decoding, encoding, transcoding, muxing, demuxing, streaming, filtering: it can do it all. More than a tool, it powers massive online streaming platforms like YouTube, Facebook, Instagram, Disney, and Netflix. It’s so ubiquitous that FFmpeg has even found its way to Mars, powering the Perseverance rover’s stream in 2021!

Very commonly, FFmpeg is used to transcode videos: converting formats, remapping audio channels, changing resolutions, and so on. This process is software-based, granting compatibility with practically every cloud instance out there (like EC2). Why add a GPU and give up that versatility? Well, the answer is very simple. Money!

To clarify, the goal is to save costs by having an instance work much faster so that it needs less uptime. For example, a g4dn.xlarge costs about twice as much as an octa-core instance (like a c7i.2xlarge), but it’ll transcode roughly 3–4x faster!

Another consideration is accelerating other operations using that same GPU. Very commonly, FFmpeg is a means to an end; that is, you’re using FFmpeg to process a file for your service. Why not accelerate those other steps as well? After all, a micro-service does not entail micro levels of processing. With the emergence of machine learning and AI, the usage of GPU instances has grown too. In a case like running a model for video analysis, it only makes sense to move transcoding operations to the GPU rather than keeping them on the CPU.

For the Verity team at GumGum, we integrate FFmpeg into our video pipelines, granting us incredible flexibility and speed when manipulating content for contextual analysis. It’s also allowed us to avoid expensive instances that pair a powerful CPU with a powerful GPU, as well as entire micro-services that would’ve been dedicated to this transcoding.

How Hardware Accelerated Transcoding Works

An important item to know is that the encoder & decoder on an Nvidia GPU are independent pieces of hardware — meaning they can run separately from the main CUDA processing cores. Built for speed, these NVENC / NVDEC cores were made to alleviate the computational strain of video processing (like streaming gameplay).

As such, there are three distinctions between CPU and GPU transcoding: speed, storage space, and quality.

CPU transcoding is slow, but has a minimal space footprint and great quality. It also scales vertically, so the more cores you can throw at it, the better.

GPU-based transcoding is almost the opposite: it’s incredibly fast, but can triple disk space usage with varying image quality. Proper settings can match CPU quality except at the highest levels.

Consider a simple FFmpeg use case: Take an .mp4, scale to 720p, and output it as another file. Here FFmpeg will perform several steps.

  1. Split the container into individual streams (audio, video)
  2. Decode the streams into their raw formats
  3. Apply filters to said streams (like scaling to 720p to reduce file size)
  4. Encode streams in specified formats
  5. Mux the streams back into a single file
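
For reference, the software-only version of this entire pipeline fits in one command (a minimal sketch; input.mp4 and output.mp4 are placeholder names):

# Demux, decode, scale to 720p, re-encode (libx264 video, native aac audio), and mux
ffmpeg -i input.mp4 -vf scale=-2:720 -c:v libx264 -c:a aac output.mp4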

Hardware acceleration happens in the decoding, filtering (for supported filters), and encoding steps.

Audio processing is still done on the CPU as NVENC/NVDEC are only for video.

After decoding, raw video frames are sent to VRAM, allowing for GPU-accelerated filters. Post-filtering, the frames are encoded and sent back to the main system’s RAM to be muxed and finished.

If certain filters or transformations cannot be done on the GPU, FFmpeg can be configured to send the decoded frames back into system memory / RAM as needed. Just remember that each transfer is costly, so try to keep the data in one place as much as possible.
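
As a sketch of that round trip (hflip stands in for any CPU-only filter; input.mp4 is a placeholder):

# Decode on the GPU, download frames to system RAM for a CPU-only filter,
# then upload them back to VRAM so NVENC can encode
ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i input.mp4 \
  -vf "hwdownload,format=nv12,hflip,hwupload_cuda" \
  -c:v h264_nvenc output.mp4

Each hwdownload / hwupload_cuda hop is one of those costly transfers, so chains like this should be the exception, not the rule.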

Now let’s take a look at building a container capable of transcoding on our GPU.

Building an Accelerated FFmpeg

There’s a long list of packages needed to build an accelerated FFmpeg: toolkits, drivers, etc. For simplicity, we’ll use a g4dn.xlarge instance running on Ubuntu and equipped with Nvidia’s Deep Learning Base AMI, which starts us off with Nvidia drivers and Docker preinstalled.

Once we launch with this AMI, we’re going to modify Docker’s default runtime to use Nvidia’s, allowing for GPU capabilities in the container.

After launching the instance, edit /etc/docker/daemon.json like so:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}

Then restart Docker to apply the changes: sudo service docker restart
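
To confirm the change took effect, we can ask Docker which runtime is now the default (a quick sanity check; the grep is just to trim the output):

docker info | grep -i runtime

This should report nvidia as the default runtime alongside the list of available runtimes.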

With that done, we can take a look at a sample Dockerfile that I’ve adapted from Nvidia’s FFmpeg guide.

FROM nvidia/cuda:11.8.0-runtime-ubuntu22.04

# Make interactions with installation not wait for our input
ENV DEBIAN_FRONTEND=noninteractive

# Install build tools and libraries for video/audio encoding (x264, libmp3lame), and SSL support for handling URLs
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    openssl \
    libssl-dev \
    yasm \
    cmake \
    libtool \
    libc6 libc6-dev \
    unzip \
    wget \
    libnuma1 libnuma-dev \
    pkg-config \
    nvidia-cuda-toolkit \
    git \
    libx264-163 libx264-dev libmp3lame-dev \
    zlib1g-dev

# Install Nvidia Codec Headers: files ffmpeg uses to enable hardware acceleration
RUN git clone https://git.videolan.org/git/ffmpeg/nv-codec-headers.git /opt/nv-codec-headers && \
    cd /opt/nv-codec-headers && \
    git checkout sdk/12.0 && \
    make install

# Clone a specific FFmpeg release, then configure it for Nvidia acceleration and our codecs, installing to /opt/ffmpeg
RUN git clone https://git.ffmpeg.org/ffmpeg.git /opt/ffmpeg && \
    cd /opt/ffmpeg && \
    git checkout release/6.0

RUN cd /opt/ffmpeg && \
    ./configure --enable-nonfree --enable-cuda-nvcc --enable-nvenc \
    --nvccflags="-gencode arch=compute_52,code=sm_52 -O2" --enable-libnpp \
    --enable-gpl --enable-libx264 --enable-libmp3lame --enable-openssl \
    --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 \
    --enable-shared --prefix=/opt/ffmpeg && \
    make -j 8 && make install

# Clean up the codec headers now that FFmpeg is built
RUN rm -rf /opt/nv-codec-headers

# Set up environment variables and link files to enable ffmpeg & ffprobe from the command line
ENV PATH="${PATH}:/opt/ffmpeg"
ENV LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/opt/ffmpeg/lib"
ENV NVIDIA_VISIBLE_DEVICES="all"
ENV NVIDIA_DRIVER_CAPABILITIES="compute,utility,video"

From there, we can create a small docker-compose.yml file:

---
version: "3"
services:
  accelerated-ffmpeg:
    build:
      context: ./
    platform: linux/amd64
    user: root
    privileged: true
    stdin_open: true
    tty: true

And launch our container via docker compose run --rm accelerated-ffmpeg bash, which should show a similar output upon entry:

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

While inside, we can run nvidia-smi and ffmpeg -version to verify everything’s working:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12     CUDA Version: 12.2   |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       On  | 00000000:00:1E.0 Off |                    0 |
| N/A   20C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory  |
|        ID   ID                                                             Usage       |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
ffmpeg version n6.0.1-6-gcd49ee45ba Copyright (c) 2000-2023 the FFmpeg developers
built with gcc 11 (Ubuntu 11.4.0-1ubuntu1~22.04)
configuration: --enable-nonfree --enable-cuda-nvcc --enable-nvenc --nvccflags='-gencode arch=compute_52,code=sm_52 -O2' --enable-libnpp --enable-gpl --enable-libx264 --enable-libmp3lame --enable-openssl --extra-cflags=-I/usr/local/cuda/include --extra-ldflags=-L/usr/local/cuda/lib64 --enable-shared --prefix=/opt/ffmpeg
libavutil 58. 2.100 / 58. 2.100
libavcodec 60. 3.100 / 60. 3.100
libavformat 60. 3.100 / 60. 3.100
libavdevice 60. 1.100 / 60. 1.100
libavfilter 9. 3.100 / 9. 3.100
libswscale 7. 1.100 / 7. 1.100
libswresample 4. 10.100 / 4. 10.100
libpostproc 57. 1.100 / 57. 1.100
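
While we’re here, it’s also worth confirming the build actually picked up the Nvidia codecs (a quick check; this isn’t part of the original guide):

ffmpeg -hide_banner -encoders | grep nvenc
ffmpeg -hide_banner -decoders | grep cuvid

The first should list h264_nvenc among its results, and the second should list the CUVID hardware decoders.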

You might be wondering: we started from an Ubuntu CUDA 11.8 image, so why does nvidia-smi report CUDA 12.2?

What you’re seeing is the magic of the Nvidia Container Toolkit. This library mounts the host’s drivers and libraries for use inside our container, so when we run nvidia-smi, what we’re seeing is the host machine’s driver. CUDA has a certain amount of backwards compatibility, so as long as nvidia-smi works fine both inside and outside the container, you’re golden.

Now, this was a gross simplification of the runtime vs. driver API distinction, but it’s enough to continue working with FFmpeg.
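
If you want to see the toolkit at work from the host itself (a quick demo; any CUDA base image will do):

# The base image ships no driver, yet nvidia-smi works because the
# toolkit mounts the host's driver libraries into the container
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi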

Benchmarking FFmpeg: CPU vs. GPU

To test our accelerated build of FFmpeg, we’ll do some transcoding on this public video.

https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4

Let’s take a look at our CPU transcoding command. We’ll be using a c7i.2xlarge instance, equipped with 8 CPU cores and 16 GB of DDR5 RAM. With these cutting-edge specs, this instance is well suited for CPU-intensive workloads.

ffmpeg -y -i "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4" \
  -reset_timestamps 1 \
  -map 0:v:0 \
  -vf scale=1280:-2 \
  -c:v:0 libx264 \
  -map 0:a:0 \
  -c:a:0 libmp3lame \
  "cpu.mp4"
  • -y Automatically overwrites existing output files
  • -i specifies input (either URL or path)
  • -reset_timestamps 1 resets timestamps to start at 0 in output
  • -map 0:v:0 maps the first video stream of input to the first output stream
  • -vf scale=1280:-2 rescales the video to 1280px wide, with -2 choosing an even height that preserves the aspect ratio
  • -c:v:0 libx264 specifies libx264 encoder for the video stream
  • -map 0:a:0 maps first audio stream in input to first audio stream in output
  • -c:a:0 libmp3lame specifies libmp3lame for the first audio stream
  • “cpu.mp4” name of output file

This command took 1 minute and 57 seconds to complete, resulting in a transcoded file of 153 MB.

Now let’s take a look at our GPU version, running on a g4dn.xlarge.

ffmpeg -y \
  -hwaccel cuda \
  -hwaccel_output_format cuda \
  -i "https://commondatastorage.googleapis.com/gtv-videos-bucket/sample/TearsOfSteel.mp4" \
  -reset_timestamps 1 \
  -map 0:v:0 \
  -vf "scale_npp=1280:-2:interp_algo=super" \
  -c:v:0 h264_nvenc \
  -preset p7 \
  -tune:v hq \
  -rc:v vbr \
  -cq:v 19 \
  -b:v 0 \
  -map 0:a:0 \
  -c:a:0 libmp3lame \
  -fps_mode passthrough \
  "transcoded.mp4"
  • -hwaccel cuda Use hardware acceleration for decoding video
  • -hwaccel_output_format cuda Keeps decoded frames in GPU VRAM.
  • -vf scale_npp=1280:-2:interp_algo=super Same scaling, but accelerated with scale_npp. interp_algo=super uses the supersampling algorithm while scaling, greatly improving image quality when downscaling.
  • -c:v:0 h264_nvenc Encodes video with NVENC in h264 format.
  • -preset p7 High-quality encoder preset
  • -tune:v hq Prioritizes higher quality over speed
  • -rc:v vbr Allows for variable bit rate
  • -cq:v 19 Chooses constant quality setting of 19 (lower is better, but results in a larger file size). Diminishing returns below 19.
  • -b:v 0 Set the bit rate to auto. This with the variable bit rate and constant quality means the encoder tries to maintain the constant quality by adjusting the bit rate as needed.
  • -fps_mode passthrough Prevents duplicated frames in the output (the option Nvidia recommends). In FFmpeg versions ≤ 5.1 this was -vsync 0

Our GPU-accelerated command took 44 seconds to complete, resulting in a transcoded file of 422 MB. That’s 2.65x faster!

All these extra argument flags (-cq:v 19, interp_algo=super) are there to increase image quality at the cost of speed/storage. For example, our 422 MB file is ~2.75x larger than our CPU result, so change settings as you see fit.

(Frame comparison: try to guess which still is from the CPU and which is from the GPU. The second one is from GPU transcoding! Hard to tell, right?)
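
If you’d rather measure than eyeball it, FFmpeg’s built-in ssim and psnr filters can score the two outputs against each other (a sketch, assuming both files have matching dimensions and frame counts):

# Compare the CPU and GPU outputs frame by frame; an SSIM near 1.0 means
# the two transcodes are nearly indistinguishable
ffmpeg -i cpu.mp4 -i transcoded.mp4 -lavfi ssim -f null -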

During transcoding, we can also query our GPU to see our usage:

watch -n 1 nvidia-smi --query-gpu=utilization.gpu,utilization.encoder,utilization.decoder,power.draw,memory.total,memory.used,memory.free --format=csv

utilization.gpu [%], utilization.encoder [%], utilization.decoder [%], power.draw [W], memory.total [MiB], memory.used [MiB], memory.free [MiB]
5 %, 52 %, 15 %, 37.49 W, 15360 MiB, 174 MiB, 14756 MiB

Keep in mind that the encoder and decoder are, once again, separate pieces of hardware, so it’s not uncommon to see GPU utilization near 0%.

Helpful Flags for FFmpeg:

  • -loglevel error Will only show error messages instead of showing everything like stream information, duration, speed, etc.
  • -hide_banner Hides the banner that displays the FFmpeg version, what it was built with, configuration settings, etc.
  • -nostdin Disables interaction with the FFmpeg process, useful in scripts.
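
Putting those together, a script-friendly invocation might look like this (a sketch; file names are placeholders):

# Quiet, non-interactive remux suitable for cron jobs or CI pipelines
ffmpeg -hide_banner -loglevel error -nostdin -y -i input.mp4 -c copy output.mp4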

Productionization on Kubernetes

Staying within the AWS ecosystem allows the usage of their EKS Optimized AMIs, specifically, their GPU-accelerated versions.

Simply use this command with your EKS version to grab the correct AMI ID

export EKS_VERSION=1.26
export REGION=us-east-1
aws ssm get-parameter --name /aws/service/eks/optimized-ami/${EKS_VERSION}/amazon-linux-2-gpu/recommended/image_id --region ${REGION} --query "Parameter.Value" --output text

Just be mindful that the CUDA and driver versions on that AMI must be compatible with the FFmpeg version you compile, and that the AMI must run on a GPU-based instance (g4dn, for example).

To make sure that the nodes (EC2 machines) with this GPU-optimized AMI are matched to pods that contain our accelerated FFmpeg, we’ll need to set affinities on our nodes and pods, taints on our nodes, and tolerations on our pods.

Affinities affect scheduling preference, meaning nodes that match the pod’s key and value will be favored (this can also be made a hard requirement, so the pod is not scheduled when no matching node is available).

Taints are set at the node level and contain a key and value. A tainted node will not accept a pod unless the pod matches; this is used to repel pods that don’t have matching tolerations.

Tolerations are placed at the pod level and are the other half of the pair: if a pod’s tolerations don’t match a node’s taints, the pod will not be placed on that node.

(Diagram: the affinity for Pod #1 matches Node #1, so it is scheduled there. For Pod #2, the affinity does not match Node #1, so it will not be placed there; it lands on Node #2 not just because the affinity matches, but because the taints and tolerations match as well.)
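
Here’s what that pairing can look like in practice (a minimal sketch with hypothetical names; the dedicated=gpu-ffmpeg key/value, node name, and image are ours, not from any AWS guide):

# Label and taint the GPU node so only tolerating, affine pods land on it
kubectl label nodes my-gpu-node dedicated=gpu-ffmpeg
kubectl taint nodes my-gpu-node dedicated=gpu-ffmpeg:NoSchedule

# A pod that both tolerates the taint and requires the matching label
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: accelerated-ffmpeg
spec:
  tolerations:
    - key: dedicated
      operator: Equal
      value: gpu-ffmpeg
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: dedicated
                operator: In
                values: ["gpu-ffmpeg"]
  containers:
    - name: ffmpeg
      image: my-registry/accelerated-ffmpeg:latest
EOF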

Side note: AWS’s installation guide will tell you to install the Nvidia k8s device plugin, but this is not required. The plugin lets you track GPUs on your cluster and request GPUs like CPU resources, but the AMI already comes with libnvidia-container-tools, drivers, and the Nvidia runtime, which is all that’s needed to run GPU workloads. Essentially, if you don’t want to install it, you don’t have to.

Closing Notes

For all its capabilities and syntax, FFmpeg might as well be its own programming language. The example we’ve discussed touches on just a few basics, leaving out advanced topics like complex filter options, buffer sizes, leveraging multiple GPUs, and managing multiple streams. This guide is not designed to be a deep dive into FFmpeg, but a starting point. With a basic understanding of hardware acceleration and a sample Dockerfile, you’ll be well equipped to experiment with FFmpeg rather than debugging build issues along the way.

Additionally, if you have a use case for FFmpeg in general, it’s worth taking a look at an accelerated version. Not only can it provide a cost-effective way to transcode, but a GPU opens a lot of doors in terms of capabilities.

With that I’ll leave you to dive into the FFmpeg world of filters and streams ✌️

