Install CUDA 10.0 and cuDNN 7.5.0 for PyTorch on Ubuntu 18.04 LTS

Published in Repro Repo · 8 min read · Mar 25, 2019

Now that TensorFlow finally supports (and requires) CUDA 10.0, it’s time to upgrade CUDA and cuDNN. We are using 10.0 instead of 10.1 because PyTorch has no support for CUDA 10.1 at all, at least when building from source, due to the lack of magma-cuda101.

I thought the process would be as easy as following my own guides for 9.0 and 9.2, but NVIDIA said no. NVIDIA overhauled the installation process for the worse. They came out with more install options, but none of them actually works “out of the box” as of writing.

I found a fix: install nvidia-driver-410 from the graphics-drivers ppa before touching the cuda package. The reason is that the default nvidia-driver-410 has a broken dependency on xserver-xorg-core which causes apt-get to remove essential graphics packages such as ubuntu-desktop and pretty much everything else you actually need. This issue is documented on this Reddit thread, this StackOverflow thread, and this NVIDIA devTalk thread. For later reference, nvidia-driver-418 (for CUDA 10.1) has the same problem.

WARNING: before proceeding, you should back up your data, because even though this was the only successful attempt for me, it may turn out differently for you given the intricate nature of the Linux graphics stack. I tried 10 other paths and had to reinstall Ubuntu after each one.

0. IMPORTANT: Install `nvidia-driver-410` from the `graphics-drivers` ppa

As mentioned above, the default nvidia-driver-410 package may give you a permanent black screen because it requires xserver-xorg-core. Instead, do this first:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-driver-410
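Since the failure mode described above is apt-get quietly removing desktop packages, it can be worth dry-running an install before committing to it. The sketch below assumes the output format of apt-get's simulation mode (`apt-get -s install ...`), where each planned removal is printed on a line starting with `Remv`. The function name and the sample text (including its version strings) are made up for illustration and do not come from a real run.

```python
def packages_to_remove(simulation_output):
    """Return the package names a simulated apt-get run plans to remove.

    Assumes `apt-get -s` output, where each removal is a line of the form
    "Remv <package> [<version>]".
    """
    removed = []
    for line in simulation_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Remv":
            removed.append(parts[1])
    return removed


# Hypothetical simulation output, for illustration only.
SAMPLE = """\
Remv ubuntu-desktop [1.0]
Remv xserver-xorg-core [1.0]
Inst nvidia-driver-410 (410.104 bionic [amd64])
Conf nvidia-driver-410 (410.104 bionic [amd64])
"""

print(packages_to_remove(SAMPLE))
```

In practice the text would come from something like `apt-get -s install nvidia-driver-410`. A list without ubuntu-desktop or xserver-xorg-core in it is what you want to see before running the install for real.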

Now do a reboot to apply the changes. After restarting, you should be able to log in as usual (no black screen, hopefully), and running nvidia-smi should give you the correct driver version — 410.xx like this:

nvidia-smi
Mon Mar 25 16:30:54 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1060    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   55C    P2    24W /  N/A |    193MiB /  6078MiB |      1%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1227      G   /usr/lib/xorg/Xorg                           117MiB |
|    0      1402      G   /usr/bin/gnome-shell                          73MiB |
+-----------------------------------------------------------------------------+
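If you would rather check the driver version from a script than by eye, a minimal sketch could look like this. It only relies on the `Driver Version:` field in the nvidia-smi header shown above; the helper names are my own, and the header string is copied from that output so the snippet runs without a GPU.

```python
import re


def driver_version(nvidia_smi_output):
    """Extract the 'Driver Version' field from nvidia-smi's header, or None."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", nvidia_smi_output)
    return match.group(1) if match else None


def is_410_series(version):
    """True when the version string belongs to the 410.xx driver series."""
    return version is not None and version.split(".")[0] == "410"


# Header line copied from the nvidia-smi output above.
HEADER = "| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |"

version = driver_version(HEADER)
print(version, is_410_series(version))
```

To run it against the live driver, the text could be obtained with `subprocess.check_output(["nvidia-smi"], universal_newlines=True)`.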

1. Install CUDA 10.0 Using the Local .deb Option

Now we can get on with CUDA. This is very different from the 9.0 and 9.2 instructions: we now use the .deb instead of the runfile option, though just like before, we are ignoring the included graphics driver in this .deb approach.

Download the CUDA .deb file from the official site. Then, simply follow the instructions on the download page:

sudo dpkg -i cuda-repo-ubuntu1804-10-0-local-10.0.130-410.48_1.0-1_amd64.deb
sudo apt-key add /var/cuda-repo-10-0-local-10.0.130-410.48/7fa2af80.pub
sudo apt-get update
sudo apt-get install cuda

Then restart again to confirm that you don’t get a black screen. If all is well, add these two lines to the bottom of your ~/.bashrc to complete the post-installation configuration:

# CUDA Config - ~/.bashrc
export PATH=/usr/local/cuda-10.0/bin:/usr/local/cuda-10.0/NsightCompute-1.0${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

To apply this change, reload ~/.bashrc with

source ~/.bashrc
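A quick way to confirm that the new shell environment actually picked up those entries is to inspect the two variables. This is a minimal sketch, assuming the default /usr/local/cuda-10.0 install prefix used throughout this guide; the function name is hypothetical.

```python
import os

CUDA_HOME = "/usr/local/cuda-10.0"  # default install prefix assumed in this guide

# (variable, directory) pairs the post-installation step is meant to add.
EXPECTED = [
    ("PATH", CUDA_HOME + "/bin"),
    ("LD_LIBRARY_PATH", CUDA_HOME + "/lib64"),
]


def missing_cuda_entries(env):
    """Return the (variable, directory) pairs from EXPECTED that env lacks."""
    missing = []
    for variable, directory in EXPECTED:
        entries = env.get(variable, "").split(":")
        if directory not in entries:
            missing.append((variable, directory))
    return missing


# An empty list means both variables contain the CUDA directories.
print(missing_cuda_entries(os.environ))
```

If the list is not empty, the usual culprits are a typo in ~/.bashrc or a terminal that was opened before the file was reloaded.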

Now, let’s verify that our CUDA installation is complete. To do this, we need to compile the provided code samples like this (warning: this could take 5–15 minutes):

cd /usr/local/cuda-10.0/samples
sudo make

Ignore the numerous warnings about deprecated architectures (sm_20 and such ancient GPUs). After it completes, let’s run two tests: deviceQuery and matrixMulCUBLAS. First, try deviceQuery:

/usr/local/cuda-10.0/samples/bin/x86_64/linux/release/deviceQuery

You should get something like

$ /usr/local/cuda-10.0/samples/bin/x86_64/linux/release/deviceQuery
/usr/local/cuda-10.0/samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1060"
  CUDA Driver Version / Runtime Version          10.0 / 10.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 6078 MBytes (6373572608 bytes)
  (10) Multiprocessors, (128) CUDA Cores/MP:     1280 CUDA Cores
  GPU Max Clock rate:                            1671 MHz (1.67 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.0, CUDA Runtime Version = 10.0, NumDevs = 1
Result = PASS

Just to make sure we’ve configured CUDA correctly, run a computation-based test:

/usr/local/cuda-10.0/samples/bin/x86_64/linux/release/matrixMulCUBLAS

If your installation is successful, you should see something like this:

$ /usr/local/cuda-10.0/samples/bin/x86_64/linux/release/matrixMulCUBLAS
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GeForce GTX 1060" with compute capability 6.1
GPU Device 0: "GeForce GTX 1060" with compute capability 6.1
MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 2350.06 GFlop/s, Time= 0.084 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
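Both samples report success on a line of their own, so the check is easy to automate. The sketch below is just one way to do it: it looks for the two PASS formats these samples print ("Result = PASS" from deviceQuery and "... with CPU results: PASS" from matrixMulCUBLAS). The function name is hypothetical.

```python
def sample_passed(output):
    """True if a CUDA sample's output contains one of the known PASS lines.

    Recognizes "Result = PASS" (deviceQuery) and a line ending in
    "results: PASS" (matrixMulCUBLAS).
    """
    for line in output.splitlines():
        stripped = line.strip()
        if stripped == "Result = PASS" or stripped.endswith("results: PASS"):
            return True
    return False


print(sample_passed("Computing result using host CPU...done.\n"
                    "Comparing CUBLAS Matrix Multiply with CPU results: PASS\n"))
```

For a real run, the output could be captured with `subprocess.check_output([path_to_sample], universal_newlines=True)`, where `path_to_sample` is one of the two binaries above.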

If this test didn’t pass for you, please leave a comment and I’ll share some debugging ideas.

2. Install cuDNN 7.5.0 Using .deb Files

The instructions for cuDNN haven’t changed from my previous guides. The .deb files option remains the easiest and most reliable one in that it allows us to verify our installation.

The following steps are pretty much the same as the official instructions using .deb files (strange that cuDNN has better documentation than CUDA).

Go to the cuDNN download page (registration required) and select the latest cuDNN 7.5.* version made for CUDA 10.0.
Download all 3 .deb files: the runtime library, the developer library, and the code samples library for Ubuntu 18.04.
In your download folder, install them in the same order:

$ sudo dpkg -i libcudnn7_7.5.0.56-1+cuda10.0_amd64.deb (the runtime library),

$ sudo dpkg -i libcudnn7-dev_7.5.0.56-1+cuda10.0_amd64.deb (the developer library), and

$ sudo dpkg -i libcudnn7-doc_7.5.0.56-1+cuda10.0_amd64.deb (the code samples).
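The download page lists builds for several CUDA versions, so it is easy to end up with three files that don't belong together. As a sanity check before installing, one could compare the version part of the filenames. This sketch assumes the usual Debian naming scheme `<package>_<version>_<arch>.deb`; the helper names are my own.

```python
def parse_deb_filename(filename):
    """Split '<package>_<version>_<arch>.deb' into (package, version, arch)."""
    stem = filename[:-len(".deb")] if filename.endswith(".deb") else filename
    package, version, arch = stem.split("_")
    return package, version, arch


def versions_match(filenames):
    """True when every file carries the same version string (incl. CUDA tag)."""
    versions = {parse_deb_filename(name)[1] for name in filenames}
    return len(versions) == 1


CUDNN_DEBS = [
    "libcudnn7_7.5.0.56-1+cuda10.0_amd64.deb",      # runtime library
    "libcudnn7-dev_7.5.0.56-1+cuda10.0_amd64.deb",  # developer library
    "libcudnn7-doc_7.5.0.56-1+cuda10.0_amd64.deb",  # code samples
]

print(versions_match(CUDNN_DEBS))
```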

Now we can verify the cuDNN installation (below is just the official guide, which surprisingly works out of the box):

Go to the MNIST example code: cd /usr/src/cudnn_samples_v7/mnistCUDNN/.
Compile the MNIST example: sudo make clean && sudo make.
Run the MNIST example: ./mnistCUDNN. If your installation is successful, you should see Test passed! at the end of the output, like this:

$ ./mnistCUDNN 
cudnnGetVersion() : 7500 , CUDNN_VERSION from cudnn.h : 7500 (7.5.0)
Host compiler version : GCC 7.3.0
There are 1 CUDA capable devices on your machine :
device 0 : sms 10  Capabilities 6.1, SmClock 1670.5 Mhz, MemSize (Mb) 6078, MemClock 4004.0 Mhz, Ecc=0, boardGroupID=0
Using device 0
Testing single precision
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.014336 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.030304 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.031744 time requiring 57600 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.081920 time requiring 207360 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.114688 time requiring 2057744 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
Testing half precision (math in single precision)
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm ...
Fastest algorithm is Algo 1
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.016096 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.023552 time requiring 3464 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.028672 time requiring 28800 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.082944 time requiring 207360 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.116736 time requiring 2057744 memory
Resulting weights from Softmax:
0.0000001 1.0000000 0.0000001 0.0000000 0.0000563 0.0000001 0.0000012 0.0000017 0.0000010 0.0000001 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000714 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 1.0000000 0.0000154 0.0000000 0.0000012 0.0000006
Result of classification: 1 3 5
Test passed!
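The first line of that output reports the version as an integer (7500) next to the readable form (7.5.0). For cuDNN 7 the integer is major * 1000 + minor * 100 + patch, so it can be decoded and checked together with the number of passed tests. This is a minimal sketch with hypothetical function names, and the sample text is abbreviated from the output above.

```python
import re


def decode_cudnn_version(code):
    """Decode cuDNN 7's integer version: major * 1000 + minor * 100 + patch."""
    major, rest = divmod(code, 1000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def summarize_mnist_output(output):
    """Return (decoded version or None, number of 'Test passed!' markers)."""
    match = re.search(r"cudnnGetVersion\(\)\s*:\s*(\d+)", output)
    version = decode_cudnn_version(int(match.group(1))) if match else None
    return version, output.count("Test passed!")


# Abbreviated from the mnistCUDNN output above.
SAMPLE = (
    "cudnnGetVersion() : 7500 , CUDNN_VERSION from cudnn.h : 7500 (7.5.0)\n"
    "Result of classification: 1 3 5\nTest passed!\n"
    "Result of classification: 1 3 5\nTest passed!\n"
)

print(summarize_mnist_output(SAMPLE))
```

A full run should report two passes: one for single precision and one for half precision.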

You should be all set!

3. Verify Installation

I’m still working on building PyTorch from source, so for now let’s verify the installation with TensorFlow-GPU instead. The line to look for in the log is the device mapping for GPU:0.

>>> import tensorflow as tf
>>> sess = \
... tf.Session(config=tf.ConfigProto(log_device_placement=True))
2019-03-25 18:54:35.869335: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-03-25 18:54:35.890758: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2808000000 Hz
2019-03-25 18:54:35.891524: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5626d2799c60 executing computations on platform Host. Devices:
2019-03-25 18:54:35.891560: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2019-03-25 18:54:35.988733: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-03-25 18:54:35.989415: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5626d4b838f0 executing computations on platform CUDA. Devices:
2019-03-25 18:54:35.989431: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): GeForce GTX 1060, Compute Capability 6.1
2019-03-25 18:54:35.989666: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 5.94GiB freeMemory: 5.68GiB
2019-03-25 18:54:35.989678: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2019-03-25 18:54:35.990503: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-03-25 18:54:35.990513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2019-03-25 18:54:35.990536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2019-03-25 18:54:35.990669: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5512 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1
2019-03-25 18:54:35.991430: I tensorflow/core/common_runtime/direct_session.cc:317] Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1
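The log is noisy, and the only part that matters here is whether a plain GPU device (not just the XLA_GPU one) shows up in the device mapping. A minimal sketch for pulling that out of captured log text follows. It assumes the mapping line format shown above, and the function name is hypothetical.

```python
import re

# Matches ".../device:GPU:<n> -> device: <n>, name: <name>," but not XLA_GPU lines.
GPU_MAPPING = re.compile(r"/device:GPU:(\d+) -> device: \d+, name: ([^,]+),")


def mapped_gpus(log_text):
    """Return {gpu_index: device_name} for GPUs listed in the device mapping."""
    return {int(index): name for index, name in GPU_MAPPING.findall(log_text)}


# Two lines copied from the device mapping above.
LOG = (
    "/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device\n"
    "/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, "
    "name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1\n"
)

print(mapped_gpus(LOG))
```

An empty dictionary would mean TensorFlow fell back to the CPU. Inside a TensorFlow 1.x session, `tf.test.is_gpu_available()` should give a more direct yes/no answer.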

Written by Zhanwen Chen
