I was recently working with ffmpeg and NVIDIA T4 GPUs on GKE for an encoding pipeline. To get started with GPUs on GKE, the NVIDIA drivers need to be installed on the nodes. After installation, ffmpeg should be able to use NVIDIA GPU capabilities such as yadif_cuda. One of the filters we needed was scale, which has a GPU-accelerated version called scale_npp. However, scale_npp produced corrupt video when used.
It turns out that the driver installed by the daemonset provided by GKE is version 410.79, which has some problems with NVIDIA T4 GPUs. Running the same commands on an NVIDIA Quadro RTX 5000 with the same driver produced non-corrupt video.
The incompatibility seems to appear only when scale_npp actually needs to do scaling. If the source dimensions equal the output dimensions, the video is not corrupt. When the source dimensions differ from the output dimensions, the video is corrupt.
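The two cases can be compared with a pair of commands along the following lines. This is only a sketch: the file names, the 1920x1080 source size, the 1280x720 target, the hwaccel flags, and the h264_nvenc encoder are assumptions for illustration, not the exact commands from our pipeline. The helper only prints the command so it can be inspected before running it on a GPU node.

```bash
#!/usr/bin/env bash
# Build (but do not run) an ffmpeg command that decodes on the GPU,
# scales with scale_npp, and encodes with NVENC.
# File names and the encoder choice are illustrative assumptions.
scale_npp_cmd() {
  local input="$1" width="$2" height="$3" output="$4"
  echo "ffmpeg -y -hwaccel cuda -hwaccel_output_format cuda -i ${input}" \
       "-vf scale_npp=${width}:${height} -c:v h264_nvenc ${output}"
}

# Same size as the source (assumed to be 1920x1080): output was not corrupt.
scale_npp_cmd input.mp4 1920 1080 same_size.mp4
# Different size: output was corrupt on the T4 with driver 410.79.
scale_npp_cmd input.mp4 1280 720 scaled.mp4
```

Running both printed commands against the same input and comparing the results is a quick way to tell whether a given driver version is affected.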
Taking a look at the daemonset, the driver installation process is:
- Attempt to download and install a precompiled driver
- Fall back to compiling and installing (this always errored)
The location of the precompiled driver was hardcoded to a preformatted path on a Google Cloud Storage bucket. The script would take the region and the default driver version it specified and attempt to download the driver from there.
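The flow described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual GKE installer script: the function names are made up, and the bucket naming pattern is inferred from the single asia bucket, so other regions may not follow it. The download and compile steps are passed in by name so the control flow can be read on its own.

```bash
#!/usr/bin/env bash
# Hypothetical reconstruction of the installer's flow; not the real GKE script.

# Build the bucket prefix for a region and driver version.
# Naming pattern is an assumption inferred from gs://nvidia-drivers-asia-public/tesla.
driver_prefix() {
  local region="$1" version="$2"
  echo "gs://nvidia-drivers-${region}-public/tesla/${version}"
}

# Try the precompiled driver first, then fall back to compiling.
# $1 and $2 are the names of the commands that perform each step.
install_driver() {
  local download_cmd="$1" compile_cmd="$2" region="$3" version="$4"
  if "$download_cmd" "$(driver_prefix "$region" "$version")"; then
    echo "installed precompiled driver ${version}"
  elif "$compile_cmd" "$version"; then
    echo "compiled and installed driver ${version}"
  else
    echo "driver installation failed for ${version}" >&2
    return 1
  fi
}
```

Because both inputs to the path are fixed (the region) or defaulted (the version), the only practical lever is the version, which is why the precompiled path matters so much when the compile fallback fails.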
Fortunately, the driver version could be configured from the env of the daemonset, but the exact driver version needed to be provided.
By running gsutil ls gs://nvidia-drivers-asia-public/tesla, you can list the downloadable drivers by version number.
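To make the listing easier to scan, the version numbers can be pulled out and sorted. The helper below is a small sketch that assumes each line of the gsutil output looks like gs://<bucket>/tesla/<version>/; the function name is made up for illustration.

```bash
#!/usr/bin/env bash
# Extract version numbers from `gsutil ls` output and sort them oldest to newest.
# Assumes each input line looks like gs://<bucket>/tesla/<version>/
list_driver_versions() {
  sed -n 's#^gs://[^/]*/tesla/\([0-9][0-9.]*\)/\{0,1\}$#\1#p' | sort -V
}

# Usage (requires gsutil and read access to the bucket):
#   gsutil ls gs://nvidia-drivers-asia-public/tesla | list_driver_versions
```

Lines that do not end in a version number, such as the tesla/ prefix itself, are dropped.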
After choosing the driver version you want, 440.64.00 in our case, make a copy of the daemonset, set the environment variable NVIDIA_DRIVER_VERSION=440.64.00, and re-install the daemonset.
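Once the daemonset has been re-installed, it is worth confirming that the nodes picked up the new driver. The following is a minimal sketch that assumes nvidia-smi is available wherever it runs (for example, inside a GPU pod on the node); the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Compare the driver version reported by nvidia-smi with the expected one.
# Assumes nvidia-smi is on the PATH in the environment where this runs.
check_driver_version() {
  local expected="$1" actual
  actual="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if [ "$actual" = "$expected" ]; then
    echo "driver ${actual} OK"
  else
    echo "expected ${expected}, found ${actual:-none}" >&2
    return 1
  fi
}

# Usage:
#   check_driver_version 440.64.00
```

If the check fails, the node is probably still running the driver from the original daemonset.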