Containerising GPU workloads

Ramnath Nayak · cloudnativeinfra · Jul 28, 2019

The Containerisation Revolution

Containers have been a powerful catalyst for the adoption of automation, CI/CD and DevOps practices in general. They have provided a concrete implementation for an abstract aspect of the DevOps philosophy: the separation of roles between Dev and Ops teams.

If you are a Developer, your focus is on building what goes inside the container. If you are an Ops person, your focus is on deploying, scaling and monitoring, making sure the containers are running and healthy.

The container provides a clean separation between the build and deployment roles without sending the two teams down divergent paths: the container becomes the unit of deployment and the focal point that unifies Dev and Ops.

For any team looking to adopt DevOps, there is now a clear paved road: you check your code changes into a Git repo, which triggers a build pipeline that produces a container, which then gets deployed to production.

The anti-pattern

With the advent of Machine Learning and AI workloads, the nature of the problem domain demands large-scale mathematical operations that need to happen in parallel.

Traditional CPU architectures are not optimal for these workloads; they run much faster on the highly parallel architecture of GPUs. For such workloads, the performance gap between GPU and CPU can be several orders of magnitude, to the point where the GPU is indispensable.

Here lies the paradox: on the one hand, you want to adopt the paved road of the container as the unit of deployment. On the other hand, you need the specialisation of GPUs, which the container cannot leverage on its own, because a container is essentially a form of virtualisation at the OS level and provides no direct mechanism for accessing specialised hardware like GPUs on the host.
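To appreciate the gap: without any plugin, getting a GPU into a plain Docker container means hand-mapping the Nvidia device files and driver libraries from the host. A rough sketch of what that looks like (the device names, driver path and my-gpu-image are placeholders that vary from host to host):

# Hand-wiring a GPU into a vanilla Docker container: brittle and host-specific
docker run --rm \
  --device=/dev/nvidiactl \
  --device=/dev/nvidia-uvm \
  --device=/dev/nvidia0 \
  -v /usr/lib/nvidia-418:/usr/local/nvidia/lib64 \
  my-gpu-image

Every driver upgrade or host change breaks this wiring, which is exactly the problem the plugin below solves.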

Nvidia-Docker

Thankfully, the stalwarts at Nvidia have come up with a plugin for Docker (cleverly named Nvidia Docker) that allows you to build containers that can leverage the underlying Nvidia GPUs on the host server.

Nvidia Docker lets you do both: follow the paved road of deploying with containers and leverage the GPUs, so that your ML or graphics-processing workloads can take advantage of them.

Nvidia Docker Architecture (courtesy Nvidia)

Most software vendors have GPU containers built using Nvidia-Docker that you can use as base images to extend. For example, instead of tensorflow:latest, use tensorflow:latest-gpu and your container will be able to use the GPU without you having to do anything else.
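As a sketch, extending a GPU-enabled base image looks like any other Dockerfile. The tensorflow/tensorflow:latest-gpu tag from Docker Hub and the train.py script here are illustrative:

# Dockerfile: build on a GPU-enabled TensorFlow base image
FROM tensorflow/tensorflow:latest-gpu

# Your application code goes on top; no GPU-specific setup is needed here
WORKDIR /app
COPY train.py .

CMD ["python", "train.py"]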

You will then need to use nvidia-docker instead of plain docker to run your applications so that they make use of the plugin. Remember that Nvidia Docker containers do not run on vanilla Docker, so if you also want a CPU-based version of the application, you will need to build a separate non-GPU image.
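For illustration, assuming an image named my-app, the run commands would look like this (with nvidia-docker2, invoking the Nvidia runtime through plain docker is equivalent to using the wrapper):

# Run the GPU build through the plugin wrapper...
nvidia-docker run --rm my-app:latest-gpu

# ...or, with nvidia-docker2, select the Nvidia runtime explicitly
docker run --runtime=nvidia --rm my-app:latest-gpu

# The separately built CPU-only image runs on vanilla docker as usual
docker run --rm my-app:latest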

Running GPU workloads in the cloud

Most cloud providers offer deep learning images with nvidia-docker pre-installed. If you want to build your own image (with a specific version of the CUDA drivers, for example), here is an Ansible playbook you can use to install your chosen version of the Nvidia CUDA drivers, Docker and Nvidia Docker 2.

Just add your host's IP address to /etc/ansible/hosts under the name gpuvm and invoke ansible-playbook gpu.yml.
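For example, the inventory entry might look like this (the address is a placeholder; ansible_user=ubuntu assumes a stock Ubuntu cloud image, matching the user the playbook adds to the docker group):

# /etc/ansible/hosts
[gpuvm]
203.0.113.10 ansible_user=ubuntu

And gpu.yml itself: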

- name: Install CUDA, Docker and NVIDIA Docker 2
  hosts: gpuvm
  become: True
  vars:
    repo_url: "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-repo-ubuntu1804_10.1.168-1_amd64.deb"
    key_url: "https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub"
  tasks:
    - name: Install CUDA GPG Key
      apt_key:
        url: "{{ key_url }}"
        state: present

    - name: Install CUDA repo metadata
      apt:
        deb: "{{ repo_url }}"
        state: present

    - name: Install CUDA and pre-req packages
      apt:
        name: ['gcc', 'linux-headers-{{ ansible_kernel }}', 'cuda']
        update_cache: yes
        state: present

    - name: Install Docker GPG Key
      apt_key:
        url: https://download.docker.com/linux/ubuntu/gpg
        state: present

    - name: Install Docker repo
      apt_repository:
        repo: "deb [arch=amd64] https://download.docker.com/linux/ubuntu {{ ansible_distribution_release }} stable"
        state: present

    - name: Install Docker and pre-req packages
      apt:
        name: ['apt-transport-https', 'ca-certificates', 'curl', 'software-properties-common', 'docker-ce']
        update_cache: yes
        state: present

    - name: Add ubuntu user to docker group
      user:
        name: ubuntu
        groups: docker
        append: yes

    - name: Install NVIDIA Docker GPG Key
      apt_key:
        url: https://nvidia.github.io/nvidia-docker/gpgkey
        state: present

    - name: Install NVIDIA Docker repo
      get_url:
        url: https://nvidia.github.io/nvidia-docker/ubuntu{{ ansible_distribution_version }}/nvidia-docker.list
        dest: /etc/apt/sources.list.d/nvidia-docker.list

    - name: Install NVIDIA Docker 2
      apt:
        name: nvidia-docker2
        update_cache: yes
        state: present

    - name: Reboot
      reboot:
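Once the host comes back from the reboot, a quick sanity check confirms everything is wired up (the nvidia/cuda:10.1-base image tag is illustrative; pick one that matches the driver version you installed):

# On the host: the driver should see the GPU
nvidia-smi

# Inside a container: the Nvidia runtime should pass the GPU through
docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi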
