Custom Installing NVIDIA Drivers for GKE

I was recently working with ffmpeg and NVIDIA T4 GPUs on GKE for an encoding pipeline. To get started with GPUs on GKE, the NVIDIA drivers need to be installed on the nodes. After installing, ffmpeg should be able to access NVIDIA GPU capabilities like nvenc, nvdec, yadif_cuda, etc. One of the filters we needed was scale, which has a GPU-accelerated version called scale_npp.

The problem

Unfortunately, scale_npp produced corrupt video when used.

It turns out that the driver installed by the daemonset provided by GKE is version 410.79, which has some problems with NVIDIA T4 GPUs. Running the same commands on an NVIDIA Quadro RTX 5000 with the same driver produced non-corrupt video.

The incompatibility seems to appear only when scale_npp actually needs to do scaling. If the source dimensions equal the output dimensions, the video is not corrupt. When the source dimensions differ from the output dimensions, the video is corrupt.
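To make the two cases concrete, here is a rough sketch of the kind of ffmpeg invocation involved. It is not the exact command from our pipeline: the file names, the 1920x1080 source size, and the choice of h264_nvenc are assumptions for illustration, and it needs an ffmpeg build with NVENC and libnpp support. The helper only prints the command so the two variants can be compared side by side.

```shell
# Hypothetical helper: prints an ffmpeg command that scales on the GPU with scale_npp.
# $1 = input file, $2 = output width, $3 = output height, $4 = output file
build_scale_cmd() {
  printf 'ffmpeg -hwaccel cuda -hwaccel_output_format cuda -i %s -vf scale_npp=%s:%s -c:v h264_nvenc %s\n' \
    "$1" "$2" "$3" "$4"
}

# Assume input.mp4 is 1920x1080 (illustrative only).
# Output size equals source size: no real scaling, video was not corrupt.
build_scale_cmd input.mp4 1920 1080 same_size.mp4

# Output size differs from source size: real scaling, video was corrupt
# on the T4 with driver 410.79.
build_scale_cmd input.mp4 1280 720 downscaled.mp4
```

Copy either printed line into a shell on a GPU node to run it.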

The Fix

Taking a look at the daemonset, the driver installation process is:

  1. Attempt to download and install a precompiled driver
  2. Fall back to compiling and installing (this always errored)
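The two steps above can be sketched roughly as follows. This is not the daemonset's actual script: the function names are hypothetical stand-ins, and the download step is stubbed to fail so that the fallback path is visible.

```shell
# Hypothetical stand-ins for the installer's two steps (not the real script).
download_precompiled_driver() {
  # Stub: pretend the precompiled driver for version $1 could not be fetched.
  return 1
}

compile_driver() {
  # Stub: the real step would build the kernel module for version $1.
  echo "compiling driver $1"
}

install_driver() {
  # Step 1: try the precompiled driver. Step 2: fall back to compiling.
  if download_precompiled_driver "$1"; then
    echo "installed precompiled driver $1"
  else
    echo "precompiled driver $1 unavailable, falling back to compiling" >&2
    compile_driver "$1"
  fi
}

install_driver 410.79
```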

The precompiled driver location was hardcoded to a preformatted download path in a Google Cloud Storage bucket. The script takes the region and the default driver version it specifies and attempts to download the matching driver.

Fortunately, the driver version can be configured through the env of the daemonset, but the exact driver version needs to be provided.

Running gsutil ls gs://nvidia-drivers-asia-public/tesla lists the downloadable drivers by version number.


After choosing the driver version you want, 440.64.00 in our case, make a copy of the daemonset, set the environment variable NVIDIA_DRIVER_VERSION=440.64.00, and re-install the daemonset.
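As a minimal sketch of that change, the snippet below prints the env entry to add under the installer container's env list in your copy of the daemonset. The helper name and the manifest file name are hypothetical, and the exact position of the env list depends on the daemonset you copied, so treat this as a guide, not a drop-in patch.

```shell
DRIVER_VERSION="440.64.00"
MANIFEST="daemonset-custom.yaml"   # hypothetical name for your local copy

# Hypothetical helper: prints the YAML env entry for the chosen driver version.
driver_env_entry() {
  printf '%s\n' "- name: NVIDIA_DRIVER_VERSION"
  printf '  value: "%s"\n' "$1"
}

driver_env_entry "$DRIVER_VERSION"

# After adding the entry to the copy, re-install it (needs cluster access):
# kubectl apply -f "$MANIFEST"
```

Quoting the value keeps YAML from trying to interpret the version as a number.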
