Defining User Restrictions for GPUs
How to share multiple GPUs between users without OOM errors?
We are using Lambda Labs' 4-GPU workstations at the University of Southern California. Since multiple users can run jobs on the server simultaneously, handling Out of Memory (OOM) errors and fairness is a bit challenging. The solutions we could find online are as follows:
1. NVIDIA's GPU virtualization software, which can be added on top of a virtualization hypervisor. However, it is only supported on the NVIDIA Turing™, Volta™, Pascal™, and Maxwell™ GPU architectures as of now.
2. Slurm, an open-source and highly scalable job scheduling system for large and small Linux clusters. It supports GPU resource scheduling and seems like a very suitable choice for large teams. It also supports per-GPU memory allocation, which lets multiple processes run on a single GPU when there is enough free memory and the compute units are under-utilized (a minimal example is sketched below).
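For reference, a minimal Slurm GRES setup for a 4-GPU node might look like the following sketch. This is not part of our setup; the node name gpu-node01 and the job command are placeholder assumptions.
# /etc/slurm/gres.conf -- declare the four GPUs as generic resources
Name=gpu File=/dev/nvidia[0-3]
# /etc/slurm/slurm.conf -- enable GPU scheduling and advertise the GPUs on the node
GresTypes=gpu
NodeName=gpu-node01 Gres=gpu:4
# A user then requests a single GPU for a job:
srun --gres=gpu:1 python train.py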
In the rest of the article, we explain what we did to share multiple GPUs between users. On a multi-GPU workstation, creating a user group for each GPU device file lets us grant a specific user permission to use a specific GPU; each user can then access only the permitted GPUs. Below, we walk through the commands needed to set up these restrictions on a 4-GPU workstation.
Step 1: Create the groups and add the users to the groups. We need to create 4 groups as we have 4 GPUs:
# Adding groups
sudo groupadd nvidia0
sudo groupadd nvidia1
sudo groupadd nvidia2
sudo groupadd nvidia3

# Adding users to the groups
sudo usermod -a -G nvidia0 olivia
sudo usermod -a -G nvidia1 peter
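You can verify the memberships with id (olivia and peter are the example users from above; they may need to log out and back in before the new groups take effect):
id olivia    # should list nvidia0 among the groups
id peter     # should list nvidia1 among the groups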
Step 2: Create a config file at /etc/modprobe.d/nvidia.conf with the following content:
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=0 NVreg_DeviceFileMode=0777 NVreg_ModifyDeviceFiles=0
This config file passes four NVIDIA driver parameters to the Linux kernel. One issue with NVIDIA's device files is that they are regenerated every session, which would destroy the user access restrictions we are going to set on them. NVreg_ModifyDeviceFiles=0 prevents the driver from recreating or modifying the device files on its own, so the permissions we set on them persist. The other parameters set the default user ID, group ID, and 0777 access permissions for the device files. If there is already an NVIDIA config file at /etc/modprobe.d/, keep a backup of it and replace its content with the line above.
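Depending on the distribution, you may also need to regenerate the initramfs (for example, sudo update-initramfs -u on Ubuntu) so the options take effect if the module is loaded at early boot. After a reboot, you can check whether the parameters were picked up; the loaded NVIDIA module parameters are usually exposed under /proc:
# Check the loaded NVIDIA module parameters (output format may vary by driver version)
grep -E "ModifyDeviceFiles|DeviceFile" /proc/driver/nvidia/params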
Step 3: Create a script at /etc/init.d/gpu-restriction that loads the NVIDIA driver and creates the device files:
#!/bin/bash
### BEGIN INIT INFO
# Provides: gpu-restriction
# Required-Start: $all
# Required-Stop:
# Default-Start: 2 3 4 5
# Default-Stop:
# Short-Description: Start daemon at boot time
# Description: Enable service provided by daemon.
# permissions if needed.
### END INIT INFO

set -e

start() {
/sbin/modprobe --ignore-install nvidia; /sbin/modprobe nvidia_uvm; test -c /dev/nvidia-uvm || mknod -m 777 /dev/nvidia-uvm c $(cat /proc/devices | while read major device; do if [ "$device" == "nvidia-uvm" ]; then echo $major; break; fi ; done) 0 && chown :root /dev/nvidia-uvm; test -c /dev/nvidiactl || mknod -m 777 /dev/nvidiactl c 195 255 && chown :root /dev/nvidiactl; devid=-1; for dev in $(ls -d /sys/bus/pci/devices/*); do vendorid=$(cat $dev/vendor); if [ "$vendorid" == "0x10de" ]; then class=$(cat $dev/class); classid=${class%%00}; if [ "$classid" == "0x0300" -o "$classid" == "0x0302" ]; then devid=$((devid+1)); test -c /dev/nvidia${devid} || mknod -m 660 /dev/nvidia${devid} c 195 ${devid} && chown :nvidia${devid} /dev/nvidia${devid}; fi; fi; done
}

stop() {
:
}

case "$1" in
start)
start
;;
stop)
stop
;;
restart)
stop
start
;;
status)
# code to check status of app comes here
# example: status program_name
;;
*)
echo "Usage: $0 {start|stop|status|restart}"
esac

exit 0
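Make sure the script is executable before registering it:
sudo chmod +x /etc/init.d/gpu-restriction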
Then, enter the following commands, which tell Linux to run the script after every reboot:
sudo update-rc.d gpu-restriction defaults
sudo update-rc.d gpu-restriction enable
Now we have a Linux service that we can start with the following command:
sudo service gpu-restriction start
Reboot the machine. You are all set now!
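To double-check the restrictions, you can inspect the device files and run nvidia-smi as one of the restricted users (olivia is the example user from Step 1):
# Each GPU device file should be group-owned by its nvidiaN group with mode 660
ls -l /dev/nvidia*
# Run as a restricted user; GPUs whose device files the user cannot open
# should be inaccessible (e.g., reported with insufficient permissions)
sudo -u olivia nvidia-smi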
Inspecting the script:
The core part of the script, which loads the NVIDIA driver and creates the device files, is:
/sbin/modprobe --ignore-install nvidia
/sbin/modprobe nvidia_uvm

# Create the UVM and control device files with open (777) permissions
test -c /dev/nvidia-uvm || mknod -m 777 /dev/nvidia-uvm c $(cat /proc/devices | while read major device; do if [ "$device" == "nvidia-uvm" ]; then echo $major; break; fi; done) 0 && chown :root /dev/nvidia-uvm
test -c /dev/nvidiactl || mknod -m 777 /dev/nvidiactl c 195 255 && chown :root /dev/nvidiactl

# Create one restricted (660) device file per NVIDIA GPU, owned by its nvidiaN group
devid=-1
for dev in $(ls -d /sys/bus/pci/devices/*); do
    vendorid=$(cat $dev/vendor)
    if [ "$vendorid" == "0x10de" ]; then
        class=$(cat $dev/class)
        classid=${class%%00}
        if [ "$classid" == "0x0300" -o "$classid" == "0x0302" ]; then
            devid=$((devid+1))
            test -c /dev/nvidia${devid} || mknod -m 660 /dev/nvidia${devid} c 195 ${devid} && chown :nvidia${devid} /dev/nvidia${devid}
        fi
    fi
done
The script first loads the NVIDIA kernel modules nvidia and nvidia_uvm. To set the user access restrictions, it loops over all PCI Express devices and checks whether their vendor and class IDs match an NVIDIA GPU card. For each match, it creates the GPU's device file, assigns it to the corresponding nvidiaN group, and sets its access permissions to 660. This is how the device files /dev/nvidia0, …, /dev/nvidia3 are created.
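If you want to see the same information the loop reads, you can list the NVIDIA PCI devices directly (0x10de is NVIDIA's PCI vendor ID, and classes 0x0300/0x0302 correspond to VGA/3D controllers):
# Sysfs entries whose vendor ID is 0x10de (NVIDIA)
grep -l 0x10de /sys/bus/pci/devices/*/vendor
# Or filter by vendor with lspci
lspci -d 10de: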
We recommend first putting the above commands in an executable file and running it to check that everything works. I would appreciate it if you let me know about any issues you find in this tutorial.