Journey to Deep Learning: Nvidia GPU passthrough to LXC Container

Mamy André-Ratsimbazafy
13 min read · Feb 12, 2017


This article was first published on my personal blog at: https://andre-ratsimbazafy.com/cuda-gpu-passthrough-to-a-lxc-container/

So in the past few months I got serious with machine learning and bought a GTX 1070 for a very nice price:

  • Full name — Inno3D Ichill X4 (Herculez Airboss) GTX 1070, for the bargain price of 350€ taxes included (full price was 470€)

I happily installed it in my headless server, which is powered by Proxmox.
It's detected, and here is my 8-hour journey to pass the proverbial GPU bucket to my Archlinux Machine Learning container.

Now first of all, you are lucky: due to the sheer size of the Data Science Bowl 2017 data (a 70GB 7z file, 140GB uncompressed), my 256GB SSD was a bit tiny, so I will document a full installation of Proxmox.

My initial setup had:

  • SSD 256GB Samsung EVO 850
  • Proxmox rootfs 50GB
  • Proxmox “local” lvm (to store iso, backups, container template) 50GB
  • LVM thin provisioning (everything else, especially my Machine Learning container)

The NAS disk (6TB) is entirely passed through to a NAS virtual machine.

Why did I have to reinstall Proxmox? Couldn't I just shrink my rootfs and local storage partitions?
Good question. I actually created my Proxmox root partition with the XFS filesystem, which cannot be shrunk.
I did that for performance reasons after a lengthy review of the LKML mailing list and various threads on ext4 performance bugs, but I guess I could have gotten the same performance by removing barriers from ext4.

Oh well …

So here we go.

First of all, if you're considering a headless server, I strongly recommend getting an IPMI-compatible motherboard. Supermicro is probably the best, or maybe check ASRock Rack.
IPMI allows you to connect remotely to the server, even at the BIOS stage, view the actual display, use your mouse/keyboard and USB devices (including portable HDDs), load ISOs, and even record everything you did.
That's also called KVM-over-IP (KVM for Keyboard, Video, Mouse, not Kernel Virtual Machine).

I will also assume that you have an Nvidia GPU: at least until Vega and the ROCm compute platform support Theano, TensorFlow, Caffe and Torch, Radeons just aren't an option.
And actually it's the proprietary driver that is problematic, with the special devices it creates in /dev/*

My screenshots will use IPMI, but you can reproduce the initial steps if you have an actual monitor + keyboard connected to your server.

Step 1 — Installing Proxmox

Why Proxmox

Proxmox is a popular distribution built to manage multiple virtual machines and containers. It's based on Debian Linux.
I chose it because I wanted:

  • Linux container virtualization for speed
  • The possibility to load VM appliances (Storage like Rockstor, Xpenology or OpenMediaVault for example)
  • GPU and PCI passthrough
  • The possibility to load a Windows VM, passthrough the GPU to it and use it like a regular PC
  • Bonus point if there was a browser-based ZFS or BTRFS management GUI for my storage need

Other similar distributions include:

  • SmartOS (Solaris-based): GPU passthrough seemed impossible
  • CoreOS: specialized in Docker management, AFAIK cannot run regular VMs
  • XenServer: no containers
  • VMware vSphere: no containers

Unfortunately, though SmartOS and Proxmox support ZFS, neither had a nice GUI to manage my disks.

Small note: I wasn’t aware of Kubernetes 1 year ago, but it has no KVM anyway.

Part 1: Installing Proxmox

Go to https://www.proxmox.com/en/ and download the latest ISO.

Before starting the server, load the Proxmox ISO.
With a Supermicro motherboard, you only need to enter your IPMI IP in your browser. This can be done even if the server is off: as long as there is power, IPMI is on and can remotely power on the server.
To find the IP you can do an IP scan (google that).
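
If you'd rather not guess, a quick ping scan from any machine on the LAN usually reveals the IPMI interface; a minimal sketch with nmap, assuming a 192.168.1.0/24 subnet (adjust to yours):

$ # Ping scan of the local subnet, no port scan (-sn)
$ nmap -sn 192.168.1.0/24
$ # The IPMI interface typically shows up with a Supermicro/ATEN vendor MAC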

You can also use their Java-based client here as an alternative.
It will scan for the proper IP automatically, so that's a win.
However the remote viewing doesn't work for me, so I use the web client; your mileage may vary.

On the web client you will have the following screen:

We can connect the ISO directly there, but I usually click on the black screen to start a remote connection and connect the ISO from there:

Don’t forget to click on “Plug In”.

Start the server and load the boot menu (F11 in my case, cf. your BIOS POST).

If your motherboard supports UEFI, choose the "UEFI: ATEN Virtual CDROM YSOJ" option to load the ISO.

You should have the following screen. If yes, congrats, grab a beer: you just managed to boot a server without a mouse, keyboard, CD, USB drive or screen.

Now let me remind you: YOU WILL LOSE ALL THE DATA ON THE DISK YOU INSTALL PROXMOX ON.
Disclaimer: I am not responsible for any data loss; back up your data.

Next, choose Install Proxmox VE and you run into your first IPMI bug:

The “Accept” button is not there.
I whipped up my trusty VMware Fusion:

It's an "I agree" button! And the shortcut is "Alt + G".

Next is choosing your installation media, don’t choose the wrong target disk.

Filesystem-wise you have default ext4, ext3, xfs and zfs in RAID0, RAID1, RAID10, RAIDZ-1 (equivalent to RAID5), RAIDZ-2 (equivalent to RAID6), RAIDZ-3.

Stay with the default ext4: it's the most used FS, so support is always improving. If you're curious like me, check on Google with a combination of "ext4 performance slow degradation" to make sure your performance/risk ratio is best.
ZFS is very nice for a file server, except that there is no GUI for its advanced capabilities in Proxmox.
Ext3 is tried and true but it doesn’t support SSD wear-leveling and it’s the only choice that doesn’t support extents for large files.
XFS is great, except that if you later want to shrink your filesystem you’ll be out of luck like I was.
Do your homework on RAID and RAIDZ.

Now regarding the rest, you probably want to maximize the space for your Machine Learning VM to hold the GBs of data.
Lesson learned: 256GB is not enough for Machine Learning.
I do have lots of HDDs I could use, but they would be slow to load/store data; I don't even want to uncompress a 70GB 7z file on a HDD.

Swap is what the system uses when it's out of memory; this is useful if you overprovision your VMs' and containers' memory. If there is no more RAM + swap, the Linux OOM killer goes into hunter mode and kills processes to free memory, and you don't want that to happen in the middle of your computation.
My system currently has 16GB of RAM (and can go up to 64GB). Besides my machine learning VM and Proxmox, the only ones that will be running non-stop won’t need more than 4GB, so 4GB it is.
Regarding maxroot, minfree, etc, documentation is there.

Next screen will be for your timezone, next one for your password.
Since the display issue is still there, use Tab to validate the offscreen "Next" button.

You will then see the installation screen with Proxmox features.

After 2 minutes, installation ended successfully with this screen:

Press enter to reboot. You can now remove the ISO.

After reboot you can finally use the web interface.
Let's say the static IP and hostname you chose were 192.168.1.10 and pve.home (the default one);
in your favorite browser, type either https://192.168.1.10:8006 or https://pve.home:8006

Login with root and your password.

Aaaaannnd congratulations!

Part 2: Preparing Proxmox for Nvidia GPU/CUDA passthrough

/offtopic mode on
Now if you came to this article through Google, you probably saw online
that you have to use the same OS for the container as for the host (i.e. Debian), or that the permissions for
/dev/nvidia0, /dev/nvidiactl and /dev/nvidia-uvm should be nogroup:nobody.

This is wrong, and I lost my Sunday on this, but at least I can use Archlinux
instead of Debian in my container and gain in terms of customization,
compilation and maintenance ease.

Linux is Linux because of the kernel, everything else is flavor (a.k.a userland).
nobody:nogroup is what is displayed in the container when the group:user
of a shared folder/file exists on the host but not inside the container.
I'm pretty sure having 123:456 as group:user on the host (aka nogroup:nobody in the container) won't help for GPU passthrough.
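
To see what is actually happening, compare the numeric IDs on both sides: an unprivileged container maps its users to a high host range (100000+ by default on Proxmox), so any host ID outside that range has no name inside the container and shows up as the overflow user nobody:nogroup. A quick illustration, using the device nodes we will create later:

$ # On the host, the nodes are owned by plain root (uid/gid 0):
$ ls -ln /dev/nvidia0        # -> crw-rw-rw- 1 0 0 195, 0 ... /dev/nvidia0
$ # Inside an unprivileged container, host uid 0 is outside the mapped
$ # 100000+ range, so it resolves to the overflow user:
$ ls -l /dev/nvidia0         # -> crw-rw-rw- 1 nobody nobody 195, 0 ...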

/ontopic mode on
Now let's get to business: we will need the command line and a container.

Container
Go to your local storage.

Click on Templates, and you will be able to download a container template for your favorite distro (from Gentoo to CentOS).

I will use Archlinux for myself.
The only thing you need to remember is: you do not need to install the Nvidia and CUDA kernel drivers/modules in the container.
They will be installed on the host and passed through to the container.
Actually, everything that has to do with the kernel must be done at the host level (the required linux-headers included).

Create your container via the Create CT button on the top right.

Follow the steps, don’t forget to change the CPU and RAM to the max possible. You can overprovision your RAM provided you have the swap.
You can check online for further details if needed.
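
If you prefer the command line, the GUI steps boil down to a single pct create call; a sketch under my settings (the template filename is a placeholder, list yours with pveam list local):

$ # Create container 100 from a downloaded Archlinux template
$ pct create 100 local:vztmpl/archlinux-base_amd64.tar.gz \
    --ostype archlinux \
    --hostname MachineLearning \
    --memory 16384 --swap 16384 \
    --rootfs local-lvm:192 \
    --net0 name=eth0,bridge=vmbr0,ip=dhcp \
    --unprivileged 1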

Once your container is ready you can start it to check if everything is working.
If yes, congrats, you virtualized an OS through containerization. You are now Docker without being Docker.

Please note that GPU passthrough will also work for unprivileged containers.
And multiple containers can access the GPU seamlessly.

Host GPU configuration

Let’s go to Proxmox command line

There are two ways: either click on one of the Shell buttons,

or SSH into the server (which is enabled by default).

We will follow the instructions on Debian wiki for Nvidia and adapt them for our case.

Open /etc/apt/sources.list with your favorite editor, nano or vi. If you never used vi, this is the wrong time to start.

$ nano /etc/apt/sources.list

Add the PVE (Proxmox) pve-no-subscription repository; it's needed to get the kernel headers to compile the nvidia module.
Add the Jessie backports repository to get the latest nvidia driver:

# security updates
deb http://security.debian.org jessie/updates main contrib

# PVE pve-no-subscription repository provided by proxmox.com,
# NOT recommended for production use
deb http://download.proxmox.com/debian jessie pve-no-subscription
# jessie-backports
deb http://httpredir.debian.org/debian jessie-backports main contrib non-free

Save the file.

Load the package listing
Update all packages to latest version (especially the kernel)

$ apt-get update
$ apt-get dist-upgrade

If the kernel was updated, reboot:


$ shutdown -r now

Reopen a console
Verify your kernel version
Verify available headers

$ uname -r
$ apt-cache search pve-header

Install the corresponding header, which should be the latest since you dist-upgraded and restarted.

$ apt-get install pve-headers-4.4.35-2-pve
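
If you don't want to copy the version string by hand, the header package name can usually be derived from the running kernel (assuming the pve-headers-<kernel version> naming scheme, which matches what apt-cache showed above):

$ # e.g. uname -r prints 4.4.35-2-pve
$ apt-get install pve-headers-$(uname -r)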

Install nvidia driver from backport

$ apt-get install -t jessie-backports nvidia-driver

Every mandatory package is set. I also suggest installing monitoring tools, namely nvidia-smi, i7z, htop, iotop and lm-sensors:

  • i7z is a monitoring tool for Intel CPUs (temperature, frequency)
  • nvidia-smi for the NVIDIA GPU; that is actually what we will use to confirm the GPU setup
  • htop is an ncurses top (CPU, RAM monitoring)
  • iotop to monitor disk access speed and potential contention

$ apt-get install i7z nvidia-smi htop iotop lm-sensors

One last thing before the final restart of your server's life:
the Nvidia driver will create special files in /dev/* called:

  • /dev/nvidia0 : corresponding to the GPU (second GPU would get /dev/nvidia1)
  • /dev/nvidiactl : not really sure what it’s for, control?
  • /dev/nvidia-uvm : the CUDA driver

and sometimes you will get:

  • /dev/nvidia-uvm-tools
  • /dev/nvidia-modeset

Those files are created when the nvidia module is loaded (when an application accesses the GPU).
The NVIDIA module must be loaded on the host before it is usable in a container.
So to make sure it's loaded at boot, edit /etc/modules-load.d/modules.conf:

$ nano /etc/modules-load.d/modules.conf

Add nvidia and nvidia_uvm

# /etc/modules: kernel modules to load at boot time.

# This file contains the names of kernel modules that should be loaded
# at boot time, one per line. Lines beginning with “#” are ignored.
nvidia
nvidia_uvm

Then update the initramfs so it takes the new modules into account:

$ update-initramfs -u
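
Before the reboot, you can optionally check by hand that both modules load cleanly:

$ # Load the modules once and confirm they are listed
$ modprobe nvidia
$ modprobe nvidia_uvm
$ lsmod | grep nvidia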

Then, for some reason nvidia and nvidia_uvm do not automatically create their nodes in /dev/*.
They only create them when an X server or nvidia-smi is used, so add the following in /etc/udev/rules.d/70-nvidia.rules:

# /etc/udev/rules.d/70-nvidia.rules
# Create /dev/nvidia0, /dev/nvidia1 … and /dev/nvidiactl when the nvidia module is loaded
KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
# Create the CUDA node when the nvidia_uvm CUDA module is loaded
KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"
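
You can also run the two RUN commands once by hand right now; they are exactly what the rules will execute at boot (assuming nvidia-modprobe was pulled in with the driver):

$ /usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*
$ /usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*
$ ls -l /dev/nvidia*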

Reboot the server.

Open a new command line:

$ nvidia-smi

should give you a similar output and create the device in /dev/*:

Note the device major number 195 for those nodes; we will need to allow the containers read-write access to those devices via cgroups.
(Remember what I said about nogroup:nobody being nonsense: it's just a cgroup device-permission issue.)

Lastly we need the major number for the CUDA device; we can get it by loading the module:

$ modprobe nvidia-uvm
$ ls /dev/nvidia* -l
crw-rw-rw- 1 root root 243, 0 Jan 16 02:20 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195, 0 Jan 16 02:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan 16 02:16 /dev/nvidiactl

The major number here is 243.
Please note that from time to time it changed after I installed some packages.
If CUDA stops working at some point, that may be the place to check.
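
A quick way to re-check the current major numbers whenever something breaks is to read them straight from /proc/devices and compare them with the numbers in your container config:

$ # Character-device majors registered by the nvidia modules
$ # (195 and 243 in my case, matching the allow lines below)
$ grep -i nvidia /proc/devices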

Last step — Container config
Your container config will be in /etc/pve/lxc.
If your container ID is 100, open the 100.conf file.

Add the lines below the "GPU Passthrough config" comment.
They map the card devices from the host into the container and allow access via the cgroup lines.


# Deep Learning Container (CUDA, cuDNN, OpenCL support)

arch: amd64
cpulimit: 8
cpuunits: 1024
hostname: MachineLearning
memory: 16384
net0: bridge=vmbr0,gw=192.168.1.1,hwaddr=36:39:64:66:36:66,ip=192.168.1.200/24,name=eth0,type=veth
onboot: 0
ostype: archlinux
rootfs: local-lvm:vm-400-disk-1,size=192G
swap: 16384
unprivileged: 1


# GPU Passthrough config
lxc.cgroup.devices.allow: c 195:* rwm
lxc.cgroup.devices.allow: c 243:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file

Restart your container
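
From the host shell this can be done with pct (container ID 100 in this example):

$ pct stop 100
$ pct start 100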

You should now have the following on the host:

$ ls /dev/nvidia* -l
crw-rw-rw- 1 root root 243, 0 Jan 16 21:05 /dev/nvidia-uvm
crw-rw-rw- 1 root root 195, 0 Jan 16 21:05 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Jan 16 21:05 /dev/nvidiactl

And on the container :


$ ls /dev/nvidia* -l
-rw-r--r-- 1 root root 0 16.01.2017 20:11 /dev/nvidia-modeset
crw-rw-rw- 1 nobody nobody 243, 0 16.01.2017 20:05 /dev/nvidia-uvm
-rw-r--r-- 1 root root 0 16.01.2017 20:11 /dev/nvidia-uvm-tools
crw-rw-rw- 1 nobody nobody 195, 0 16.01.2017 20:05 /dev/nvidia0
crw-rw-rw- 1 nobody nobody 195, 255 16.01.2017 20:05 /dev/nvidiactl

Install the following on your container:

  • CUDA
  • cuDNN
  • nvidia-smi / nvidia-utils
  • cnmem

You do not need to install the NVIDIA driver itself unless your distro forces you to.
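
On Arch, that roughly boils down to the following; a sketch only, since depending on your repositories libcudnn and cnmem may have to come from the AUR:

$ # Inside the container; the kernel driver itself stays on the host
$ pacman -S cuda nvidia-utils opencl-nvidia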

Here are my NVIDIA / CUDA packages on my Arch container


$ pacman -Qs nvidia
local/cuda 8.0.44-3
    NVIDIA's GPU programming toolkit
local/libcudnn 5.1.5-1
    NVIDIA CUDA Deep Neural Network library
local/libvdpau 1.1.1-2
    Nvidia VDPAU library
local/libxnvctrl 375.26-1
    NVIDIA NV-CONTROL X extension
local/nvidia-settings 375.26-1
    Tool for configuring the NVIDIA graphics driver
local/nvidia-utils 375.26-2
    NVIDIA drivers utilities
local/opencl-nvidia 375.26-2
    OpenCL implemention for NVIDIA
local/pycuda-headers 2016.1.2-6
    Python wrapper for Nvidia CUDA
local/python-pycuda 2016.1.2-6
    Python wrapper for Nvidia CUDA
$ pacman -Qs cuda
local/cnmem 1.0.0-1
    A simple memory manager for CUDA designed to help Deep Learning frameworks manage memory
local/cuda 8.0.44-3
    NVIDIA's GPU programming toolkit
local/libcudnn 5.1.5-1
    NVIDIA CUDA Deep Neural Network library
local/pycuda-headers 2016.1.2-6
    Python wrapper for Nvidia CUDA
local/python-pycuda 2016.1.2-6
    Python wrapper for Nvidia CUDA

You can now test with nvidia-smi that the GPU is properly passed through.
nvidia-smi / nvidia-utils must be the same version as the host nvidia driver (in my case 375.26).

I recommend blocking updates of the Nvidia/CUDA-related packages on both the host and the container, to keep them in sync.
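
One way to do that, assuming apt on the Proxmox host and pacman in the Arch container:

$ # Host: pin the driver at its current version
$ apt-mark hold nvidia-driver
$ # Container: add the packages to IgnorePkg in /etc/pacman.conf, e.g.
$ #   IgnorePkg = nvidia-utils opencl-nvidia cuda libcudnn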

Now you can configure your environment (Numpy, Pandas, Openblas, MKL, Scikit-learn, Jupyter, Theano, Keras, Lasagne, Torch, TensorFlow, YouNameIt).
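
For Theano specifically, the GPU device and the CNMeM memory pool you will see in the output below are enabled through ~/.theanorc; a minimal sketch matching my setup (old CUDA backend, 95% initial pool):

# ~/.theanorc
[global]
device = gpu
floatX = float32

[lib]
cnmem = 0.95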

For example, using the test script from the Theano documentation, you can confirm GPU acceleration with cuDNN:

$ python gpu_test.py
Using gpu device 0: GeForce GTX 1070 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5105)
/usr/lib/python3.6/site-packages/theano/sandbox/cuda/__init__.py:600: UserWarning: Your cuDNN version is more recent than the one Theano officially supports. If you see any problems, try updating Theano or downgrading cuDNN to version 5.
warnings.warn(warn)
[GpuElemwise{exp,no_inplace}(<CudaNdarrayType(float32, vector)>), HostFromGpu(GpuElemwise{exp,no_inplace}.0)]
Looping 1000 times took 0.294226 seconds
Result is [ 1.23178029 1.61879349 1.52278066 …, 2.20771813 2.29967761
1.62323296]
Used the gpu
