Build PyTorch from Source with CUDA 12.2.1 on Ubuntu 22.04

Zhanwen Chen
Published in Repro Repo
Aug 9, 2023 · 20 min read

The NVIDIA 535 driver provides excellent backward compatibility across CUDA versions. Meanwhile, as of this writing, PyTorch does not fully support CUDA 12 (see their CUDA 12 support progress here). Today, we are going to learn how to go from zero to building the latest PyTorch with CUDA 12.2.

There have been notable improvements in the CUDA/cuDNN ecosystem. For one, the runfiles are better, and the GPG keys now work.

0. (Optional) Remove previous cuDNN, CUDA, and NVIDIA driver installations.

If you need to delete the old versions (the CUDA runfile installer requires removal), here are the steps to remove cuDNN first, then CUDA, and lastly the NVIDIA driver (you may or may not need to remove the driver, depending on the CUDA vs. NVIDIA driver compatibility matrix).

# 1. Uninstall CUDNN
sudo apt remove libcudnn8 libcudnn8-dev libcudnn8-samples
sudo rm -rf /usr/src/cudnn_samples_v8

# 2. Uninstall CUDA
sudo /usr/local/cuda/bin/cuda-uninstaller # for previous runfile installations
sudo apt purge "*cuda*" "*cublas*" "*cufft*" "*cufile*" "*curand*" \
"*cusolver*" "*cusparse*" "*gds-tools*" "*npp*" "*nvjpeg*" "nsight*" "*nvvm*" # Run this anyway
sudo rm -rf /usr/local/cuda-11.8/ # Remove the OLD version. Change 11.8 to 12.0 or whatever version you want to remove.

# 3. Uninstall NVIDIA driver
sudo /usr/bin/nvidia-uninstall # for previous runfile installations
sudo apt remove "*nvidia*" "libxnvctrl*"

# Cleanup apt
sudo apt clean
sudo apt autoclean
sudo apt autoremove

# Reboot for the changes to take effect
sudo reboot

1. Install NVIDIA Driver 535

If you are on Ubuntu Desktop (with GUI), you can use the Additional Drivers interface for installing 535 (use the server-only option if you are installing this on a server instead of a desktop). If you do this, you can skip the rest of part 1.

First, update your system dependencies.

sudo apt update # optional but recommended
sudo apt full-upgrade # optional but recommended
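
If you are on a headless server (or prefer the command line), you also need to install the driver itself. Here is a minimal sketch using Ubuntu's standard packages (assuming nvidia-driver-535 is available in your apt sources):

ubuntu-drivers devices # optional: list detected GPUs and recommended drivers
sudo apt install nvidia-driver-535 # or nvidia-driver-535-server on servers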

Then restart your computer with

sudo reboot

and see if your NVIDIA cards are successfully detected:

nvidia-smi

You should see something like

Wed Aug  9 15:18:21 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA RTX A5000 On | 00000000:0F:00.0 Off | Off |
| 30% 29C P8 15W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA RTX A5000 On | 00000000:1F:00.0 Off | Off |
| 30% 29C P8 17W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA RTX A5000 On | 00000000:22:00.0 Off | Off |
| 30% 32C P8 19W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA RTX A5000 On | 00000000:25:00.0 Off | Off |
| 30% 32C P8 20W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 4 NVIDIA RTX A5000 On | 00000000:26:00.0 Off | Off |
| 30% 31C P8 23W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 5 NVIDIA RTX A5000 On | 00000000:88:00.0 Off | Off |
| 30% 29C P8 18W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 6 NVIDIA RTX A5000 On | 00000000:98:00.0 Off | Off |
| 30% 32C P8 19W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 7 NVIDIA RTX A5000 On | 00000000:9B:00.0 Off | Off |
| 30% 32C P8 18W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 8 NVIDIA RTX A5000 On | 00000000:9E:00.0 Off | Off |
| 30% 30C P8 19W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 9 NVIDIA RTX A5000 On | 00000000:9F:00.0 Off | Off |
| 30% 32C P8 24W / 170W | 1MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

2. Install CUDA

Installing with the runfile is actually better in this case, because the local/network .deb approach requires some serious wrangling of Ubuntu PGP keys, which is extremely frustrating and can break your apt lists so badly that you can no longer update or upgrade from your apt sources.

Uninstall the old version with

sudo /usr/local/cuda-11.8/bin/cuda-uninstaller
2.1 Download and Install CUDA 12.2

First, download the correct CUDA 12.2 installer for your system. For example, for Ubuntu 22.04 x86_64, it is:
wget https://developer.download.nvidia.com/compute/cuda/12.2.1/local_installers/cuda_12.2.1_535.86.10_linux.run

Then install it with

sudo sh cuda_12.2.1_535.86.10_linux.run

Accept the EULA, but on the next screen, deselect the bundled driver (you already installed 535 in step 1). The installation process is silent and takes a while.

Lastly, edit your environment to include the CUDA binaries. Add the following lines to the bottom of your .bashrc file:

export PATH="/usr/local/cuda/bin:$PATH"
export LD_LIBRARY_PATH="/usr/local/cuda/lib64:$LD_LIBRARY_PATH"

And then refresh your environment with

source ~/.bashrc
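
You can quickly sanity-check that nvcc is now on your PATH:

nvcc --version # should report "Cuda compilation tools, release 12.2"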

(Optional) Register the CUDA library path with the dynamic linker. Note that an /etc/ld.so.conf.d/cuda-12.2.conf file already exists, so be careful with hard-coded version numbers.

sudo bash -c "echo /usr/local/cuda/lib64 > /etc/ld.so.conf.d/cuda.conf" # write a new conf.d entry; do NOT overwrite /etc/ld.so.conf itself
sudo ldconfig

Notice that /etc/ld.so.conf.d/cuda-12.2.conf also adds a static 12.2 path to the library search path. You may need to delete it to make sure CUDA always resolves to the version controlled by the /usr/local/cuda symlink.
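
For example (assuming the file name on your system matches; check with ls /etc/ld.so.conf.d/):

sudo rm /etc/ld.so.conf.d/cuda-12.2.conf
sudo ldconfig # rebuild the linker cache after removing the file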

You can verify that the ldconfig worked with:

ldconfig -p | grep cuda

Check the number of available CUDA libraries; it should be 66:

ldconfig -p | grep cuda | wc -l

2.2 Verify the CUDA Version

First, verify the CUDA version:
cat /usr/local/cuda/version.json | grep version -B 2 | grep SDK -B 3 -A 2

It should say something about 12.2:

   "cuda" : {
"name" : "CUDA SDK",
"version" : "12.2.20230726"
--

2.3 Verify the CUDA 12.2 Installation with the Samples

The samples are no longer under /usr/local/cuda/samples; they are now on GitHub at https://github.com/nvidia/cuda-samples. Make sure you check out the version of the samples that matches your CUDA version; otherwise, you may encounter errors:

git clone --single-branch --branch v12.2 https://github.com/NVIDIA/cuda-samples.git

Now go to it:

cd cuda-samples/

First, install the FreeImage dependency for the code samples.

sudo apt install cmake pkg-config libfreeimage-dev

Get your compute capability number from https://developer.nvidia.com/cuda-gpus. For example, the A5000 has 8.6, which translates to 86. Compile the samples:

# using all your available cores
make clean && make SMS="86" TARGET_ARCH=x86_64 -j$(nproc) > compile.log 2>&1 &

Then monitor the compilation process:

tail -f compile.log

Note that this compilation may take a minute. Once the bottom of the log ends with “Finished building CUDA samples,” check the log to ensure there are no errors.
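
A quick way to scan the log for failures:

grep -iE "error|fatal" compile.log # no matching lines usually means a clean build

If successful, let’s run some of the compiled samples: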

DeviceQuery

~/cuda-samples/Samples/1_Utilities/deviceQuery/deviceQuery

The last line should say Result = PASS.

MatrixMultiplication

~/cuda-samples/Samples/4_CUDA_Libraries/matrixMulCUBLAS/matrixMulCUBLAS

You should see something like

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "Ampere" with compute capability 8.6

GPU Device 0: "NVIDIA RTX A5000" with compute capability 8.6

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 8011.13 GFlop/s, Time= 0.025 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

p2pBandwidthLatencyTest:

If you have multiple GPUs, you might be interested in seeing their communication links.

~/cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest

To make sense of the result, see your NVLink matrix with:

nvidia-smi nvlink -c
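
You can also print the full GPU topology matrix, which shows whether each GPU pair communicates over NVLink, a PCIe switch, or across the CPU:

nvidia-smi topo -m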

2.4 Cleanup

rm -rf ~/cuda_12.2.1_535.86.10_linux.run ~/cuda-samples

3. Install cuDNN 8

3.1 Download cuDNN 8

First, download the deb file from NVIDIA. This is easy if you are on an Ubuntu Desktop or another Linux distribution with a web browser. Go to the NVIDIA Current cuDNN Version Download page. If the current version is no longer 8.x.x, go to the cuDNN archive page to download previous cuDNN versions.

But if you are on an Ubuntu Server, you must use your logged-in cookie, because cuDNN downloads are behind NVIDIA’s authentication wall. Open Chrome (you might be able to do this with another browser, but I prefer Chrome’s dev tools). Log in to the cuDNN archive page and open the “Network” tab in the dev tools. Click the download link for the Deb file for your architecture (most likely x86_64). Copy the request as cURL (see https://stackoverflow.com/a/42028789). Make sure you are not doing “Copy all as cURL.” Paste it into your terminal, but with the extra output option -o cudnn.deb:

# Do not type this unmodified. This is just an example!
curl 'https://developer.download.nvidia.com/compute/cudnn/secure/8.9.4/local_installers/12.x/cudnn-local-repo-ubuntu2204-8.9.4.25_1.0-1_amd64.deb?YOUR_OWN_HASH' -H 'authority: developer.download.nvidia.com' -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7' -H 'accept-language: en-US,en;q=0.9,zh-CN;q=0.8,zh;q=0.7,zh-TW;q=0.6,de;q=0.5' -H 'cookie: YOUR_OWN_COOKIES!!!!!!!!' -H 'dnt: 1' -H 'referer: https://developer.nvidia.com/' -H 'sec-ch-ua: "Chromium";v="110", "Not A(Brand";v="24", "Google Chrome";v="110"' -H 'sec-ch-ua-mobile: ?0' -H 'sec-ch-ua-platform: "macOS"' -H 'sec-fetch-dest: document' -H 'sec-fetch-mode: navigate' -H 'sec-fetch-site: same-site' -H 'sec-fetch-user: ?1' -H 'upgrade-insecure-requests: 1' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36' --compressed -o cudnn.deb

The resulting cudnn.deb should be about 875MB.

3.2 Install cuDNN 8

Then use apt to install it as a local repository:

sudo apt install ./cudnn.deb

This installs the cuDNN package repository and its signing key onto your computer as a local apt source. For apt to recognize the new source, you need to copy the repo GPG key into place. Notice in the installer output that there’s a line that says something like:

sudo cp /var/cudnn-local-repo-ubuntu2204-8.9.4.25/cudnn-local-3C3A81D3-keyring.gpg /usr/share/keyrings/

Having copied the key, now update your sources lists:

sudo apt update

You should see a list of (local) installation candidates with:

apt search cudnn

You should see output like

$ apt search cudnn
Sorting... Done
Full Text Search... Done
cudnn-local-repo-ubuntu2204-8.9.4.25/now 1.0-1 amd64 [installed,local]
cudnn-local repository configuration files

libcudnn8/unknown 8.9.4.25-1+cuda12.2 amd64 [upgradable from: 8.9.3.28-1+cuda12.1]
cuDNN runtime libraries

libcudnn8-dev/unknown 8.9.4.25-1+cuda12.2 amd64 [upgradable from: 8.9.3.28-1+cuda12.1]
cuDNN development libraries and headers

libcudnn8-samples/unknown 8.9.4.25-1+cuda12.2 amd64 [upgradable from: 8.9.3.28-1+cuda12.1]
cuDNN samples

Now install the three cudnn8 packages from the local repo:

sudo apt install libcudnn8 libcudnn8-dev libcudnn8-samples

You should be all set.

3.3 Verify cuDNN Installation on System

Issue the following command to make sure that the correct cuDNN libraries have been installed on your system:

cat /usr/include/x86_64-linux-gnu/cudnn_version_v8.h | grep CUDNN_MAJOR -A 2

The output should look something like this:

#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

/* cannot use constexpr here since this is a C-only file */

However, given so many moving parts, the version headers alone don’t guarantee a working installation. To double-check, proceed to the next verification step: verify cuDNN with the sample programs.

3.4 Verify cuDNN Installation with the Sample Programs

cuDNN ships with its own samples (which you already got by installing the libcudnn8-samples apt package). Note that the sample location and dependencies have changed from cuDNN 7.x to cuDNN 8.x, so you can no longer use the v7 instructions. Like the CUDA 12.2 samples, the cuDNN v8 samples also require FreeImage.

The second and third differences are the samples path and the make structure. The samples now live in /usr/src/cudnn_samples_v8, and you can no longer compile all of them with a single sudo make command (which you can still do with the CUDA samples). Let’s run the mnistCUDNN check:

cd /usr/src/cudnn_samples_v8/mnistCUDNN
sudo make -j$(nproc)
./mnistCUDNN

The output should have a Test passed! on the last line:

Executing: mnistCUDNN
cudnnGetVersion() : 8904 , CUDNN_VERSION from cudnn.h : 8904 (8.9.4)
Host compiler version : GCC 11.4.0

There are 10 CUDA capable devices on your machine :
device 0 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=0
device 1 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=1
device 2 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=2
device 3 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=3
device 4 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=4
device 5 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=5
device 6 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=6
device 7 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=7
device 8 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=8
device 9 : sms 64 Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=9
Using device 0

Testing single precision
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.063488 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.067584 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.100352 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.135168 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 1.193984 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 2.981888 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 128848 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.097280 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.126976 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.130048 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 1.271808 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 3.746816 time requiring 128848 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 4.895744 time requiring 2450080 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.038912 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.046080 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.049152 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.101376 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.103424 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.118784 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 128848 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.088064 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.098304 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.103424 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.117760 time requiring 128000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.117760 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.122880 time requiring 128848 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006

Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 4608 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.026624 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.043008 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.072704 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.086016 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.091136 time requiring 4608 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.095232 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 1536 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.100352 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.107520 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.115712 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.117760 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.128000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.224256 time requiring 1536 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 4608 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.029696 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.037888 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.071680 time requiring 4608 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.073728 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.081920 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.095232 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 1536 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.087040 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.089088 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.093184 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.105472 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.106496 time requiring 64000 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.205824 time requiring 1536 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006

Result of classification: 1 3 5

Test passed!

Note that the sms number here means the number of Streaming Multiprocessors on your GPU, not the architecture version of the Streaming Multiprocessor.

3.5 Cleanup

Remove the files we just downloaded:

cd /usr/src/cudnn_samples_v8/mnistCUDNN/
sudo make clean # clean the sample build artifacts under /usr/src/cudnn_samples_v8
rm ~/cudnn.deb

4. Install Miniconda3

You should use the latest installer from the repo, https://repo.anaconda.com/miniconda, instead of the stale versions on https://docs.conda.io/en/latest/miniconda.html#linux-installers. Pro tip: when you have to agree to a long license agreement on the command line, you can hit q to skip to the bottom, where you type yes/accept, etc.

cd ~
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sha256sum Miniconda3-latest-Linux-x86_64.sh
# Verify that the output matches the one online
sh Miniconda3-latest-Linux-x86_64.sh

Say yes to conda init. It will save you lots of time.

Lastly, clean up with

rm ~/Miniconda3-latest-Linux-x86_64.sh

5. Build PyTorch from Source

We are finally in the final stretch: building from source. Some instructions are adapted from https://github.com/pytorch/pytorch#from-source.

5.1 Install PyTorch Prerequisites

Use a new conda environment with a custom Python version. The environment name below encodes the exact versions used in this guide, in case you want to strictly adhere to them:

$ source ~/.bashrc
(base) $ conda create -n pt_source_2.0.1_cu12.2.1_535.86.10_cudnn8.9.3.28_intelpy310

Now be sure to run all of the following steps inside your newly created environment, so nothing gets installed into your base environment:

(base) $ conda activate pt_source_2.0.1_cu12.2.1_535.86.10_cudnn8.9.3.28_intelpy310

Your shell prompt will now show pt_source_2.0.1_cu12.2.1_535.86.10_cudnn8.9.3.28_intelpy310 instead of base.

(pt_source_2.0.1_cu12.2.1_535.86.10_cudnn8.9.3.28_intelpy310) $ # conda environment activated!

Now install the PyTorch Linux dependencies:

conda install -c intel intelpython3_full python=3.10 mkl-dpcpp mkl-include intel-openmp intel-fortran-rt dpcpp-cpp-rt numpy
conda install -c pkgs/main cmake ninja astunparse expecttest hypothesis psutil pyyaml requests setuptools typing-extensions sympy filelock networkx jinja2 fsspec
pip install types-dataclasses

Install the NVIDIA Container Toolkit and Docker

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

Then update your package lists:

sudo apt update

Install the base toolkit:

sudo apt install nvidia-container-toolkit-base

Generate the CDI specification:

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
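
If your nvidia-ctk version supports it, you can list the devices in the generated spec:

nvidia-ctk cdi list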

Install Docker:

curl https://get.docker.com | sh   && sudo systemctl --now enable docker

Install the full NVIDIA Container Toolkit:

sudo apt install nvidia-container-toolkit

Configure the Docker runtime:

sudo nvidia-ctk runtime configure --runtime=docker

Restart Docker:

sudo systemctl restart docker

Test with

sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-runtime-ubuntu20.04 nvidia-smi # docker run pulls the image automatically

Add your user to the docker group so you can run Docker without sudo:

sudo groupadd docker
sudo usermod -aG docker ${USER}
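
The group change takes effect at your next login; to pick it up immediately and re-run the test without sudo:

newgrp docker
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-runtime-ubuntu20.04 nvidia-smi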

Build MAGMA

Because the magma-cuda122 conda package is not yet available, build MAGMA from source:

git clone --single-branch --branch v2.7.1 --depth 1 https://bitbucket.org/icl/magma.git
cd magma

echo -e "GPU_TARGET = sm_86\nBACKEND = cuda\nFORT = false" > make.inc
make generate

export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib:/usr/local/cuda/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}"
export CUDA_DIR="/usr/local/cuda-12.2"
export CONDA_LIB=${CONDA_PREFIX}/lib

# be careful here; MAGMA didn't accept sm_89, so I had to round it down to the major version, sm_80
make clean && rm -rf build/

TARGETARCH=amd64 cmake -H. -Bbuild -DUSE_FORTRAN=OFF -DGPU_TARGET="Ampere" -DBUILD_SHARED_LIBS=OFF -DBUILD_STATIC_LIBS=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_C_FLAGS="-fPIC" -DMKLROOT=${CONDA_PREFIX} -DCUDA_NVCC_FLAGS="-Xfatbin;-compress-all;-DHAVE_CUBLAS;-std=c++11;--threads=0;" -GNinja

sudo mkdir -p /usr/local/magma/include /usr/local/magma/lib/pkgconfig # create the target directories for the copies below

sudo cmake --build build -j $(nproc) --target install

sudo cp build/include/* /usr/local/magma/include/
sudo cp build/lib/*.so /usr/local/magma/lib/
sudo cp build/lib/*.a /usr/local/magma/lib/
sudo cp build/lib/pkgconfig/*.pc /usr/local/magma/lib/pkgconfig/
sudo cp /usr/local/magma/include/* ${CONDA_PREFIX}/include/
sudo cp /usr/local/magma/lib/*.a ${CONDA_PREFIX}/lib/
sudo cp /usr/local/magma/lib/*.so ${CONDA_PREFIX}/lib/
sudo cp /usr/local/magma/lib/pkgconfig/*.pc ${CONDA_PREFIX}/lib/pkgconfig/
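
A quick sanity check that the MAGMA artifacts landed in your conda environment:

ls ${CONDA_PREFIX}/lib | grep magma # expect libmagma.a (plus libmagma.so if you built shared libs)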

Clone the PyTorch repository (this takes a long time because of the recursive submodule clone):

cd
git clone --recursive --single-branch --branch v2.0.1 https://github.com/pytorch/pytorch.git
cd pytorch

Optionally, sync and update the PyTorch submodules in the folder:

# optional
git submodule sync
git submodule update --init --recursive

Install NUMA

sudo apt install libnuma-dev libnuma1

Apply a patch. Save the diff below in the PyTorch repo root as a file called sclarkson.diff (from this discussion: https://github.com/pytorch/pytorch/issues/90448#issuecomment-1342070946):

--- a/torch/csrc/distributed/c10d/ProcessGroupGloo.hpp
+++ b/torch/csrc/distributed/c10d/ProcessGroupGloo.hpp
@@ -125,7 +125,7 @@
}

void wait(const std::vector<std::string>& keys) override {
- store_->wait(keys, Store::kDefaultTimeout);
+ store_->wait(keys, ::c10d::Store::kDefaultTimeout);
}

void wait(

Then, under the PyTorch project root, apply it:

patch -p1 -i sclarkson.diff

Install NCCL

NCCL downloads are also behind NVIDIA’s authentication wall, so a plain wget will not work. Use the same cURL trick as in the cuDNN installation: visit https://developer.nvidia.com/downloads/compute/machine-learning/nccl/secure/2.18.3/ubuntu2204/x86_64/nccl-local-repo-ubuntu2204-2.18.3-cuda12.2_1.0-1_amd64.deb/ in your logged-in browser, copy the request as cURL, and save the output as nccl.deb. Note that the captive survey is bugged, so you cannot bypass it a second time.

Then

sudo apt install ./nccl.deb

Update sources

sudo cp /var/nccl-local-repo-ubuntu2204-2.18.3-cuda12.2/nccl-local-F4B315EB-keyring.gpg /usr/share/keyrings/
sudo apt update
sudo apt install libnccl-dev libnccl2
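
You can confirm the installed NCCL version with dpkg:

dpkg -l | grep nccl # expect libnccl2 and libnccl-dev at 2.18.3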

Finally, set the build environment variables and start the build:

cd ~/pytorch
export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib:/usr/local/cuda/targets/x86_64-linux/lib${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}"
export _GLIBCXX_USE_CXX11_ABI=1
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
export LD_PRELOAD="${CONDA_PREFIX}/lib/libiomp5.so"
export MKL_OPENMP_LIBRARY="${CONDA_PREFIX}/lib/libiomp5.so"
export OpenMP_GNU_FLAG_CANDIDATES="-fopenmp=libiomp5"
export CMAKE_INCLUDE_PATH="${CONDA_PREFIX}/include"
export MKL_OPENMP_TYPE="Intel"
export MKL_THREADING="OMP"
export USE_SYSTEM_NCCL=ON
export TORCH_CUDA_ARCH_LIST="8.6"
export CUDA_HOME="/usr/local/cuda"

ln -sf /usr/lib/x86_64-linux-gnu/libstdc++.so.6 ${CONDA_PREFIX}/lib

python setup.py develop

This is going to take a little while.

In development, set CUDA_MODULE_LOADING=LAZY for faster program startup. In training/production, set CUDA_MODULE_LOADING=EAGER. To set this, do:

# Either User-only
echo "export CUDA_MODULE_LOADING=EAGER" >> ~/.bashrc
source ~/.bashrc

# Or system-wide
# redirect symbols don't work for sudo, so use tee
echo "CUDA_MODULE_LOADING=EAGER" | sudo tee -a /etc/environment
# Reboot with sudo reboot for it to take effect
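
You can confirm the setting is visible to new processes:

python -c "import os; print(os.environ.get('CUDA_MODULE_LOADING'))"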

Verify PyTorch installation with collect_env

python -m torch.utils.collect_env

You MUST leave the PyTorch source directory to do this. Otherwise, Python will pick up the torch package under your ~/pytorch build folder, which is not importable and will give you an error even if you installed PyTorch correctly.

# You are probably under ~/pytorch right now. Get out of there!
$ cd # This is equivalent to cd ~, but you can go elsewhere
$ python
>>> import torch;
>>> torch.rand(2, 3, device='cuda') @ torch.rand(3, 2, device='cuda') # Check CUDA is working
>>> torch.svd(torch.rand(3, 3, device='cuda')) # Check MAGMA-CUDA is working
>>> exit() # Get out of the Python shell.

The output should look something like this (except the numbers won’t be the same because of randomness) before exiting the Python shell:

tensor([[0.5708, 0.6166],
[0.6130, 0.7249]], device='cuda:0')


torch.return_types.svd(
U=tensor([[ 0.7157, 0.6304, 0.3007],
[ 0.4860, -0.1404, -0.8626],
[ 0.5016, -0.7635, 0.4069]], device='cuda:0'),
S=tensor([1.3666, 0.4503, 0.3194], device='cuda:0'),
V=tensor([[ 0.5648, 0.2773, -0.7773],
[ 0.7025, -0.6558, 0.2765],
[ 0.4330, 0.7022, 0.5652]], device='cuda:0'))

6. (Optional) Install Pillow-SIMD and libjpeg-turbo

If your code uses Pillow, you can further optimize your vision pipeline with libjpeg-turbo and Pillow-SIMD (although torchvision.io is supposed to be better if you are writing your own code).
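
To see whether the swap actually helps, you can time JPEG decoding before and after installing Pillow-SIMD. A minimal sketch, assuming a local test image called test.jpg (a hypothetical file name):

$ python
>>> import timeit
>>> timeit.timeit("Image.open('test.jpg').load()", setup="from PIL import Image", number=100) # seconds for 100 decodes; lower is better
>>> exit()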

First, build libjpeg-turbo from source:

wget https://github.com/libjpeg-turbo/libjpeg-turbo/archive/refs/tags/3.0.0.tar.gz

Unzip it

tar -xzf 3.0.0.tar.gz

Go to it:

cd libjpeg-turbo-3.0.0/

Install the yasm dependency:

sudo apt install yasm

Build from source and install it into your conda environment’s lib directory:

mkdir build
cd build
cmake -G"Unix Makefiles" -DCMAKE_INSTALL_PREFIX:PATH=${CONDA_PREFIX} ..
make
make install

Install Pillow-SIMD

Go back

cd

Download pillow-simd

wget https://github.com/uploadcare/pillow-simd/archive/refs/tags/9.5.0.tar.gz

Unzip

tar -xzf 9.5.0.tar.gz

Go to it:

cd pillow-simd-9.5.0/

Build and install it against your conda environment:

CPATH=${CONDA_PREFIX}/include LIBRARY_PATH=${CONDA_PREFIX}/lib CC="cc -mavx2" python setup.py develop

Make sure your libjpeg-turbo and Pillow-SIMD are installed.

$ python
>>> from PIL import __version__, features
>>> print(f"PIL version: {__version__}")
>>> features.check_feature('libjpeg_turbo')
>>> exit()

You should see something like this:

>>> from PIL import __version__, features
>>> print(f"PIL version: {__version__}")
PIL version: 9.5.0
>>> features.check_feature('libjpeg_turbo')
True

You are all set!

7. Lastly, install torchvision from source after installing Pillow-SIMD and libjpeg-turbo

Follow the instructions on https://github.com/pytorch/vision and find a version that matches your built PyTorch version. Here we use 0.15.2 to match PyTorch version v2.0.1:

cd ~
git clone --single-branch --branch v0.15.2 https://github.com/pytorch/vision.git
cd vision
python setup.py install
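
As with PyTorch, leave the source directory before verifying the build:

cd ~
python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__)"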

Voila!

TODOs:

  1. Implement a USE_STATIC_MKL build option to skirt the hard-coded GNU (non-Intel) threading issue.
