Motivation
The School of Computing provides GPU computing resources on Ubuntu machines. However, users are not provided with sudo privileges and therefore cannot install software packages. This means that the GPU resources cannot be effectively exploited for training deep learning models.
A common workaround is to compile the required packages by hand, but this leads to dependency hell and costs the developer a lot of time resolving compilation problems.
To this end, I will introduce JuNest, which uses Linux namespaces to create a partially isolated Linux environment so that root privileges are no longer required to install packages.
Advantages of using JuNest
- Can install packages without sudo privilege
- Packages are released on a rolling basis, so that the latest versions are always available
- JuNest is based on Arch Linux, so Arch Linux users will feel at home
Disadvantages of using JuNest
- Takes time to configure the GPU driver (to be covered later)
- Takes time to install an SSH server in the JuNest environment (to be covered later)
In short, if you have to manually compile many packages, JuNest is a better choice.
Step 1: Install JuNest and enter the JuNest environment
Download and install JuNest:
wget https://github.com/fsquillace/junest/archive/refs/heads/master.zip
unzip master.zip
rm -f master.zip
mkdir -p ~/.local/share
mv junest-master ~/.local/share/junest
export PATH=~/.local/share/junest/bin:$PATH
junest setup
Enter the JuNest environment:
junest
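The `export PATH=...` line above only lasts for the current shell session. Below is a minimal sketch of making it persistent, assuming your login shell reads a startup file such as `~/.bashrc` (adjust the file name for Zsh or other shells). The function name `add_junest_to_path` is hypothetical and not part of JuNest.

```shell
# Hypothetical helper: append the JuNest PATH line to a shell startup file,
# but only if the exact line is not already there.
add_junest_to_path() {
    rc_file="$1"
    line='export PATH="$HOME/.local/share/junest/bin:$PATH"'
    touch "$rc_file"
    # -F: fixed string, -x: match the whole line, -q: quiet
    if ! grep -Fxq "$line" "$rc_file"; then
        printf '%s\n' "$line" >> "$rc_file"
    fi
}
```

For example, `add_junest_to_path ~/.bashrc` can be run any number of times and adds the line at most once.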
Step 2: Configure the JuNest environment
P.S.: JuNest is based on Arch Linux, which uses `pacman` as the package manager. Before you proceed, you may want to get familiar with `pacman` commands.
After you enter the JuNest environment, you will have the sudo privilege since it is an isolated environment:
sudo whoami
If the command prints `root`, you can proceed with the following configuration.
Modify the mirror URL to speed up the download of packages (based on your location):
echo 'Server = https://download.nus.edu.sg/mirror/archlinux/$repo/os/$arch' | sudo tee /etc/pacman.d/mirrorlist
Refresh package list:
sudo pacman -Syyu
sudo pacman -S archlinux-keyring
Install necessary packages:
sudo pacman -S --ignore sudo glibc base wget exa openssh neofetch ldns make autoconf unzip base-devel byobu man htop zsh aria2 mosh vim nano bind git python
sudo pacman -S --ignore sudo linux-headers cuda nvidia-dkms nvidia-utils opencl-nvidia
During installation you will see the prompt `sudo is in IgnorePkg/IgnoreGroup. Install anyway? [Y/n]`. At this prompt you need to type `n` and press Enter; at all other prompts you just need to press Enter.
Install yay for installing AUR packages:
pacman -S --needed git base-devel
git clone https://aur.archlinux.org/yay.git
cd yay
makepkg -si
Generate locale for the JuNest environment:
The locale of the JuNest environment needs to be the same as the host's; otherwise the command line sometimes appears garbled.
First check the current locale:
locale
On my current machine, the locale is set to `en_SG.UTF-8`, so we need to generate the same locale.
sudo nano /etc/locale.gen
Press Ctrl+W to search in the editor, then type `en_SG.UTF-8`. After locating the corresponding line, uncomment it by removing the leading `#`. Then press Ctrl+X, followed by Y and Enter, to save and exit.
Generate locale:
sudo locale-gen
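The manual edit in nano can also be done non-interactively. The following is a minimal sketch, assuming the entry in `/etc/locale.gen` looks like `#en_SG.UTF-8 UTF-8` (a `#` immediately followed by the locale name and a space). The function name `enable_locale` is hypothetical. It works as a filter: it reads the file contents on standard input and prints the edited contents.

```shell
# Hypothetical helper: uncomment the entry for one locale in locale.gen-style text.
# Reads the file contents on stdin and writes the result to stdout.
enable_locale() {
    # Escape dots so that they match literally in the sed pattern.
    escaped=$(printf '%s' "$1" | sed 's/\./\\./g')
    # Drop the leading "#" only on lines that start with "#<locale> ".
    sed "s/^#\(${escaped} \)/\1/"
}
```

One possible usage is `enable_locale en_SG.UTF-8 < /etc/locale.gen > /tmp/locale.gen`, followed by `sudo cp /tmp/locale.gen /etc/locale.gen` and `sudo locale-gen`.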
Install Oh My Zsh (for a better Shell environment):
sh -c "$(curl -fsSL https://raw.github.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
Step 3: Configure the GPU driver
The Nvidia GPU driver needs to be configured because JuNest is based on Arch Linux, a rolling-release distribution whose packages are always up to date. As a result, the version of the Nvidia driver in JuNest is usually higher than the one in the host environment. The version mismatch breaks GPU access in Python:
>>> import torch
>>> torch.cuda.is_available()
/usr/lib/python3.10/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at /build/python-pytorch/src/pytorch-1.9.0-opt-cuda/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
False
The solution is to downgrade the Nvidia GPU driver in JuNest to match the version in the host environment.
Check the host Nvidia GPU driver version:
cat /proc/driver/nvidia/version
Find the version string in the output. On my current machine, the version is `495.29.05`.
Then check the Nvidia GPU driver version in JuNest:
pacman -Qi nvidia-utils
Find the version string in the output. On my current machine, the version is `515.65.01`. Therefore, there is a mismatch between the two versions, and we need to downgrade the Nvidia GPU driver version in JuNest. The three packages to be downgraded are `nvidia-dkms`, `nvidia-utils` and `opencl-nvidia`.
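The two version checks can be scripted. The sketch below assumes that the `NVRM version` line of `/proc/driver/nvidia/version` contains the driver version as dotted numbers, and that `pacman -Qi` prints a `Version` field of the form `<version>-<pkgrel>`. All three function names are hypothetical.

```shell
# Hypothetical helpers for comparing the host and JuNest driver versions.

# Read the text of /proc/driver/nvidia/version on stdin and print the driver version.
host_driver_version() {
    grep 'NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Read the output of `pacman -Qi <package>` on stdin and print the upstream
# version, i.e. the "Version" field without the trailing "-<pkgrel>".
pacman_pkg_version() {
    sed -n 's/^Version[[:space:]]*:[[:space:]]*//p' | sed 's/-[^-]*$//' | head -n 1
}

# Print "match" or "mismatch" for two version strings.
compare_versions() {
    if [ "$1" = "$2" ]; then echo "match"; else echo "mismatch"; fi
}
```

For example, `host_driver_version < /proc/driver/nvidia/version` gives the host version and `pacman -Qi nvidia-utils | pacman_pkg_version` gives the JuNest version; passing both to `compare_versions` tells you whether a downgrade is needed.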
Now comes the tricky part. There are many ways to downgrade the packages, each with some complexity, so we need to learn some prerequisites.
Rough steps for downgrading the packages
Step 1: Check if the target version exists in the history of the Arch Linux package repository
(a) Exists: go to step 2
(b) Does not exist: go to step 3
Step 2: Downgrade the packages directly with the `downgrade` command
(a) Successful: Done
(b) Unsuccessful: go to step 6
Step 3: Check out the `PKGBUILD` of the closest version in the Arch Linux package repository
Step 4: Modify the version string in the `PKGBUILD` to download the proper version of the driver
Step 5: Run `makepkg -s` and install the package
(a) Successful: Done
(b) Unsuccessful: go to step 6
Step 6: Downgrade the Linux kernel to match the host environment, and re-run the original command
Downgrade directly
Install `downgrade`:
yay -S downgrade
Show a list of the past versions:
sudo downgrade nvidia-dkms
If the target version exists, press Ctrl+C to cancel, then downgrade the three packages in one command:
sudo downgrade nvidia-dkms nvidia-utils opencl-nvidia
Add the packages to the ignore list to prevent them from being updated in the future:
sudo nano /etc/pacman.conf
Find the line that contains `IgnorePkg`, then change it to:
IgnorePkg = nvidia-dkms nvidia-utils opencl-nvidia cuda
Manually build the packages with PKGBUILD
Navigate to the nvidia-dkms page in the Arch Linux package repository. In the top-right corner, click ‘Source Files’ to go to the GitHub repository. Then click ‘History’ to show the version history.
Find the commit SHA of the closest version (assume that the SHA is `a9bd057f15d5764d432a73c187b2c3ff715d0732`):
mkdir tmp
cd tmp
git init
git remote add origin https://github.com/archlinux/svntogit-packages.git
git fetch origin packages/nvidia-utils
git reset --hard a9bd057f15d5764d432a73c187b2c3ff715d0732
cd ..
mv tmp/trunk nvidia-utils
rm -rf tmp
cd nvidia-utils
Inside the directory there is a `PKGBUILD` script. Now we are going to modify the version string to download the proper version of the driver.
nano PKGBUILD
After opening the editor, there are two things worth noting here:
- We need to change `pkgver` to the target version (on my current machine it is `495.29.05`)
- After changing `pkgver`, the corresponding `source` URL will change, so the `sha512sums` will change accordingly. For simplicity, we can disable the check by replacing the original sha512sum with `'SKIP'`
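The `pkgver` change can also be scripted instead of edited by hand. This is a minimal sketch, assuming the `PKGBUILD` sets the version on a single line that starts with `pkgver=`. The function name `set_pkgver` is hypothetical.

```shell
# Hypothetical helper: read a PKGBUILD on stdin and print it with pkgver replaced.
set_pkgver() {
    # Only the line that starts with "pkgver=" is rewritten; all others pass through.
    sed "s/^pkgver=.*/pkgver=$1/"
}
```

One possible usage is `set_pkgver 495.29.05 < PKGBUILD > PKGBUILD.new` followed by `mv PKGBUILD.new PKGBUILD`. As an alternative to editing the checksums, `makepkg` also accepts the `--skipchecksums` option, which skips checksum verification of the source files.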
Run the `PKGBUILD` script:
makepkg -s
If the build is successful, you can install the packages:
sudo pacman -U *.tar.zst
Downgrade the Linux kernel if necessary
The installation may fail because the old GPU driver may not be compatible with the latest version of the Linux kernel, which is the default choice of Arch Linux. If this happens, you need to downgrade the Linux kernel as well.
Outside the JuNest environment, check the Linux kernel version of the host:
uname -r
On my current machine, the output is `5.4.0-126-generic`, so we know that GPU driver version `495.29.05` is compatible with Linux kernel 5.4. Note that the kernel version installed in JuNest may not be exactly the same as the host's.
Then we need to downgrade the Linux kernel in JuNest.
sudo nano /etc/pacman.conf
At the end of the file, add these two lines to add the `kernel-lts` unofficial user repository:
[kernel-lts]
Server = https://repo.m2x.dev/current/$repo/$arch
Refresh the package list:
sudo pacman -Sy
List all the available Linux kernel versions:
sudo pacman -Ss 'linux-lts.*-headers'
According to the output, we know that Linux kernel 5.4 is available, so we can install it by:
sudo pacman -S linux-lts54 linux-lts54-headers
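The package name can be derived from the host kernel string. The sketch below assumes that the repository names its packages `linux-lts<major><minor>`, as with `linux-lts54` above; check the output of the `pacman -Ss` command to confirm that the package exists. The function name `lts_package_name` is hypothetical.

```shell
# Hypothetical helper: map a `uname -r` string to a kernel-lts package name.
# Assumes packages are named linux-lts<major><minor>, e.g. linux-lts54.
lts_package_name() {
    # Keep only "major.minor", then drop the dot: 5.4.0-126-generic -> 5.4 -> 54
    major_minor=$(printf '%s' "$1" | cut -d. -f1,2)
    printf 'linux-lts%s\n' "$(printf '%s' "$major_minor" | tr -d '.')"
}
```

For example, `lts_package_name 5.4.0-126-generic` prints `linux-lts54`, and appending `-headers` gives the name of the matching headers package.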
Step 4: Connect to JuNest on your local device using Mosh
Assume that your original SSH config looks like this:
Host xgpe9
User ayaka
Hostname xgpe9
Port 22
Then you can start a Mosh connection to the JuNest environment directly by:
mosh --server 'PATH=/home/ayaka/.local/share/junest/bin:$PATH TZ=Asia/Singapore LANG=en_SG.UTF-8 SHELL=/usr/bin/zsh junest -- mosh-server' xgpe9 -- byobu
What the command does:
- Connect to the host server via SSH
- Open a Mosh server in the JuNest environment to accept connections
- Set the timezone in the Mosh server to Asia/Singapore (change the value according to your timezone)
- Set the locale to `en_SG.UTF-8` (change the value to the locale you found in the locale step above)
- Set the shell to Zsh
- On your local device, connect to the Mosh server in the JuNest environment
Step 5: Install and run an SSH server in the JuNest environment
So far, the method we use to enter the JuNest environment is to SSH into the host and run the `junest` command. However, this does not work with some tools that assist development on servers, such as VSCode Remote-SSH. Therefore, it is useful to run an SSH server in the JuNest environment so that we can SSH directly into it.
Download the `PKGBUILD` script of openssh, then follow these steps:
After installation, configure the custom SSH:
mkdir ~/custom_ssh
ssh-keygen -f ~/custom_ssh/ssh_host_rsa_key -N '' -t rsa
ssh-keygen -f ~/custom_ssh/ssh_host_dsa_key -N '' -t dsa
cd ~/custom_ssh
Create the SSH config file:
nano ~/custom_ssh/sshd_config
Add the following contents:
Port 2222
HostKey /home/ayaka/custom_ssh/ssh_host_rsa_key
HostKey /home/ayaka/custom_ssh/ssh_host_dsa_key
AuthorizedKeysFile .ssh/authorized_keys
ChallengeResponseAuthentication no
Subsystem sftp /usr/lib/ssh/sftp-server
PasswordAuthentication yes
(You need to change `/home/ayaka` to your home directory, and `Port 2222` to your desired port.)
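To avoid editing the paths by hand, the same configuration can be generated from the home directory and port. This is a minimal sketch that prints exactly the settings listed above. The function name `print_sshd_config` is hypothetical.

```shell
# Hypothetical helper: print an sshd_config with the settings listed above
# for a given home directory and port.
print_sshd_config() {
    home_dir="$1"
    port="$2"
    cat <<EOF
Port ${port}
HostKey ${home_dir}/custom_ssh/ssh_host_rsa_key
HostKey ${home_dir}/custom_ssh/ssh_host_dsa_key
AuthorizedKeysFile .ssh/authorized_keys
ChallengeResponseAuthentication no
Subsystem sftp /usr/lib/ssh/sftp-server
PasswordAuthentication yes
EOF
}
```

For example, `print_sshd_config "$HOME" 2222 > ~/custom_ssh/sshd_config` writes the file for the current user.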
Run the SSH server:
/usr/bin/sshd -Def ~/custom_ssh/sshd_config
List all the open TCP ports to verify that the SSH server is running:
sudo ss -ntlp
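Instead of reading the list by eye, the check can be automated. The sketch below assumes the usual `ss -ntl` column layout, where the first column is the state and the fourth column is the local address and port. The function name `port_is_listening` is hypothetical.

```shell
# Hypothetical helper: read `ss -ntl` output on stdin and succeed if some
# socket is listening on the given TCP port.
port_is_listening() {
    awk -v port="$1" '
        # Column 4 is "Local Address:Port"; compare the part after the last colon.
        $1 == "LISTEN" { n = split($4, parts, ":"); if (parts[n] == port) found = 1 }
        END { exit found ? 0 : 1 }
    '
}
```

For example, `ss -ntl | port_is_listening 2222` exits with status 0 when the SSH server is listening on port 2222 and with status 1 otherwise.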
Modify the SSH settings on your own device:
nano ~/.ssh/config
Add the following server configuration:
Host xgpe9-junest
User ayaka
Hostname xgpe9
Port 2222
Connect to the JuNest environment using VSCode Remote SSH:
F1 → Connect to Host → xgpe9-junest