Running TensorFlow (with GPU) on Kubernetes

While GPUs are a staple of deep learning, deploying on GPUs makes everything more complicated, including your Kubernetes cluster. This quick guide will walk through adding basic single-GPU support to Kubernetes.

The guide assumes that Kubernetes is already running on Ubuntu. An LTS release is preferable, ideally 14.04, which NVIDIA recommends for driver hosts. Warning: Ubuntu 14.04 is not well supported by Kubernetes, so feel free to use a different distro. This guide also assumes that the proper GPU drivers and CUDA version have been installed. Plenty of other guides cover those topics.

TL;DR: start with nvidia-docker, then whittle away its functionality until just plain docker remains. Then add that functionality to Kubernetes.

Working without nvidia-docker

A common way to run containerized GPU applications is to use nvidia-docker. Here is an example of running TensorFlow with full GPU support inside a container.

Import TensorFlow using a GPU-supported container and nvidia-docker.
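The command itself is short. Here is a minimal sketch that assembles it in Python so it can be inspected before being run. The image tag is an assumption for illustration, not necessarily the one used originally.

```python
import shlex

# Assumption: this image tag is illustrative; substitute whichever
# GPU-enabled TensorFlow image you actually use.
IMAGE = "gcr.io/tensorflow/tensorflow:latest-gpu"


def nvidia_docker_command(image=IMAGE):
    """Build the argv that imports TensorFlow inside a GPU-enabled container."""
    return ["nvidia-docker", "run", "--rm", image,
            "python", "-c", "import tensorflow"]


# Print the shell form of the command. On a GPU host, pass the list to
# subprocess.run(...) instead to actually execute it.
print(" ".join(shlex.quote(arg) for arg in nvidia_docker_command()))
```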

Simple! If all goes well the output should look something like this:

TensorFlow successfully identified the appropriate drivers and libraries.

Unfortunately it’s not currently possible to use nvidia-docker directly from Kubernetes. Additionally, Kubernetes does not support the nvidia-docker-plugin since Kubernetes does not use Docker’s volume mechanism.

The goal is to manually replicate the functionality provided by nvidia-docker (and its plugin). For demonstration, query the nvidia-docker-plugin REST API for the command line arguments:

Accessing the nvidia-docker-plugin REST API which returns docker CLI flags.
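A rough sketch of that query, assuming the plugin listens on its default local port and serves the flags at the path shown (both are assumptions to check against your install). The sample reply is made up for illustration; the real one lists the devices and driver volume of the host.

```python
import shlex
from urllib.request import urlopen

# Assumption: default plugin address and path; adjust to match your
# nvidia-docker-plugin setup.
PLUGIN_CLI_URL = "http://localhost:3476/docker/cli"


def parse_flags(text):
    """Split the plugin's whitespace-separated reply into individual flags."""
    return shlex.split(text)


def fetch_docker_cli_flags(url=PLUGIN_CLI_URL, timeout=5):
    """Ask nvidia-docker-plugin for the docker CLI flags as a list."""
    with urlopen(url, timeout=timeout) as response:
        return parse_flags(response.read().decode("utf-8"))


# Illustrative reply only (not captured from a real host).
sample = "--device=/dev/nvidiactl --device=/dev/nvidia0"
print(parse_flags(sample))
```

On a host where the plugin is running, `fetch_docker_cli_flags()` returns the real list.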

These flags feed into plain docker, running the same Python command:
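As a sketch, the plugin's flags are simply spliced between `docker run` and the image name. The image tag and the placeholder flags below are assumptions for illustration.

```python
import shlex

IMAGE = "gcr.io/tensorflow/tensorflow:latest-gpu"  # assumption: illustrative tag


def docker_command(plugin_flags, image=IMAGE):
    """Splice the plugin-provided flags into a plain `docker run` invocation."""
    return (["docker", "run", "--rm"] + list(plugin_flags)
            + [image, "python", "-c", "import tensorflow"])


# Placeholder flags for illustration; in practice use the plugin's real reply.
flags = shlex.split("--device=/dev/nvidiactl --device=/dev/nvidia0")
print(" ".join(shlex.quote(arg) for arg in docker_command(flags)))
```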

If all goes well, TensorFlow should find everything correctly and you should see the same output as before.

Finally, remove the dependency on nvidia-docker-plugin by manually specifying the driver path and manually mounting the devices and CUDA volumes.

Example of running a GPU-enabled container without nvidia-docker.
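A minimal sketch of what that looks like, under a few assumptions: the driver files live under nvidia-docker's volume directory on the host, every `/dev/nvidia*` entry should be exposed to the container, and the image looks for the driver under `/usr/local/nvidia`. The helper names are hypothetical.

```python
import glob
import os
import shlex

IMAGE = "gcr.io/tensorflow/tensorflow:latest-gpu"  # assumption: illustrative tag
# Assumption: where nvidia-docker keeps its driver volumes on the host.
DRIVER_ROOT = "/var/lib/nvidia-docker/volumes/nvidia_driver"


def find_devices(pattern="/dev/nvidia*"):
    """List the NVIDIA device nodes present on this host."""
    return sorted(glob.glob(pattern))


def find_driver_dir(root=DRIVER_ROOT):
    """Return the first driver directory under root, or None if none exists."""
    if not os.path.isdir(root):
        return None
    versions = sorted(os.listdir(root))
    return os.path.join(root, versions[0]) if versions else None


def manual_gpu_command(devices, driver_dir, image=IMAGE):
    """Build a plain `docker run` with explicit device and driver mounts."""
    cmd = ["docker", "run", "--rm"]
    for dev in devices:
        cmd += ["--device", "{0}:{0}".format(dev)]
    # Mount the driver files read-only at the path the image is assumed to use.
    cmd += ["-v", "{}:/usr/local/nvidia:ro".format(driver_dir)]
    return cmd + [image, "python", "-c", "import tensorflow"]


devices, driver_dir = find_devices(), find_driver_dir()
if devices and driver_dir:
    print(" ".join(shlex.quote(a) for a in manual_gpu_command(devices, driver_dir)))
else:
    print("no NVIDIA devices or driver volume found on this host")
```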

Note that this still uses nvidia-docker’s driver volume for discovery. Although Kubernetes cannot call the plugin directly, it can read the same driver files from the host filesystem.

Enabling GPU devices

With the knowledge of what Docker needs to run a GPU-enabled container, it is straightforward to add this to Kubernetes. The first step is to enable an experimental flag on all of the GPU nodes. In the Kubelet options (found in /etc/default/kubelet if you use upstart for services), add --experimental-nvidia-gpus=1. This does two things. First, it exposes the node’s GPU resources to the scheduler. Second, when a GPU resource is requested, it adds the appropriate device flags to the docker command. This post describes a little more about what this flag does and why it exists:

http://blog.clarifai.com/how-to-scale-your-gpu-cloud-infrastructure-with-kubernetes

The full GPU proposal, including the existing flag and future steps can be found here:

https://github.com/kubernetes/community/blob/master/contributors/design-proposals/gpu-support.md
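For the Kubelet flag mentioned above, here is a small sketch of adding it to the options file without duplicating it on a second run. It assumes the options live in a single `KUBELET_OPTS="..."` line, which may not match your setup, and the sample file contents are hypothetical.

```python
GPU_FLAG = "--experimental-nvidia-gpus=1"


def add_gpu_flag(config_text, var="KUBELET_OPTS"):
    """Append the GPU flag to the kubelet options line if it is not already set."""
    prefix = var + '="'
    out = []
    for line in config_text.splitlines():
        is_opts = (line.startswith(prefix) and line.endswith('"')
                   and len(line) > len(prefix))
        if is_opts and GPU_FLAG not in line:
            # Drop the closing quote, add the flag, then close the quote again.
            line = line[:-1].rstrip() + " " + GPU_FLAG + '"'
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical file contents, for illustration only.
before = 'KUBELET_OPTS="--logtostderr=true"\n'
print(add_gpu_flag(before), end="")
```

After writing the result back to /etc/default/kubelet, restart the Kubelet so the flag takes effect.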

Pod Spec

With the device flags added by the experimental GPU flag, the final step is to add the necessary volumes to the pod spec. A sample pod spec is provided below:

GPU-enabled Kubernetes pod spec.
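A hedged sketch of such a spec, generated as JSON from Python. The pod name, image tag, driver path placeholder and the alpha GPU resource name are all assumptions to adapt to your cluster and Kubernetes version.

```python
import json

# Assumptions, all to be adapted: the driver path on the node (replace the
# placeholder with your driver directory) and the alpha GPU resource name.
DRIVER_DIR = "/var/lib/nvidia-docker/volumes/nvidia_driver/<driver-version>"
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def gpu_pod_spec(name="tensorflow-gpu",
                 image="gcr.io/tensorflow/tensorflow:latest-gpu",
                 driver_dir=DRIVER_DIR):
    """Return a pod manifest that requests one GPU and mounts the driver files."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": name,
                "image": image,
                "command": ["python", "-c", "import tensorflow"],
                # Requesting the GPU is what triggers the device flags.
                "resources": {"limits": {GPU_RESOURCE: 1}},
                "volumeMounts": [{
                    "name": "nvidia-driver",
                    "mountPath": "/usr/local/nvidia",
                    "readOnly": True,
                }],
            }],
            # hostPath stands in for the docker volume Kubernetes cannot use.
            "volumes": [{"name": "nvidia-driver",
                         "hostPath": {"path": driver_dir}}],
        },
    }


print(json.dumps(gpu_pod_spec(), indent=2))
```

kubectl accepts JSON manifests as well as YAML, so the printed output can be saved to a file and passed to `kubectl create -f`.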

If set up correctly, the output should match that of the nvidia-docker container at the beginning:

TensorFlow successfully identified the appropriate drivers and libraries.

Conclusion

Hopefully this guide helps someone wade through these undocumented features to make use of GPUs in their cluster.

Follow me on Twitter for more posts like these. If you’d like help getting this into production, I do consulting.