Install CUDA 9.0 and cuDNN 7.0 for TensorFlow/PyTorch (GPU) on Ubuntu 16.04

Note: I just wrote a post on installing CUDA 9.2 and cuDNN 7.1 here.

You do not have to spend weeks going through official docs while figuring out how to “temporarily add the number ‘3’ and the word ‘nomodeset’ to the end of the system’s kernel boot parameters.” Others suggest turning off lightdm and nouveau which you haven’t even heard of. But what if you don’t actually have to do any of these?

It turns out that you can disregard them. I followed someone’s gist, clarified and updated the instructions, and recorded my steps.

Here’s what I did:

1. Install NVIDIA Graphics Driver via apt-get

CUDA 9.0 requires NVIDIA driver version 384 or above. To install the driver, use apt-get instead of the CUDA runfile:

sudo apt-get install nvidia-384 nvidia-modprobe

, and then you will be prompted to disable Secure Boot. Select Disable.

Reboot your machine but enter BIOS to disable Secure Boot. Typically you can enter BIOS by hitting F12 rapidly as soon as the system restarts.

Afterwards, you can check the installation with the nvidia-smi command, which will report all your CUDA-capable devices in the system, like this:

Wed Apr 11 23:34:18 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1060 Off | 00000000:01:00.0 Off | N/A |
| N/A 60C P5 8W / N/A | 242MiB / 6072MiB | 3% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1003 G /usr/lib/xorg/Xorg 177MiB |
| 0 1646 G compiz 60MiB |
| 0 2230 G /usr/lib/firefox/firefox 1MiB |
+-----------------------------------------------------------------------------+

2. Install CUDA 9.0 via Runfile

Installing CUDA from runfile is actually a breeze, compared to apt-get which involves adding NVIDIA repos and messing up your configuration.

The CUDA runfile installer can be downloaded from NVIDIA’s website, or using wget in case you can’t find it easily on NVIDIA:

(Note: the version I downloaded at first was 384.81 but NVidia constantly releases new minor versions. I recommend the latest which is 384.130 as of 8/29/2018)

$ cd
$ wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda_9.0.176_384.81_linux-run

What you download is a package the following three components:

  1. an NVIDIA driver installer, but usually of stale version;
  2. the actual CUDA installer;
  3. the CUDA samples installer;

I suggest extracting the above three components and executing 2 and 3 separately (remember we installed the driver ourselves already). To extract them, execute the runfile installer with --extract option:

$ chmod +x cuda_9.0.176_384.81_linux-run
$ ./cuda_9.0.176_384.81_linux-run --extract=$HOME

You should have unpacked three components: NVIDIA-Linux-x86_64-384.81.run (1. NVIDIA driver that we ignore), cuda-linux.9.0.176-22781540.run (2. CUDA 9.0 installer), and cuda-samples.9.0.176-22781540-linux.run (3. CUDA 9.0 Samples).

Execute the second one to install the CUDA Toolkit 9.0:

$ sudo ./cuda-linux.9.0.176-22781540.run

You now have to accept the license by scrolling down to the bottom (hit the “d” key on your keyboard) and enter “accept”. Next accept the defaults.

To verify our CUDA installation, install the sample tests by

$ sudo ./cuda-samples.9.0.176-22781540-linux.run

After the installation finishes, configure the runtime library.

$ sudo bash -c "echo /usr/local/cuda/lib64/ > /etc/ld.so.conf.d/cuda.conf"
$ sudo ldconfig

It is also recommended for Ubuntu users to append string /usr/local/cuda/bin to system file /etc/environment so that nvcc will be included in $PATH. This will take effect after reboot. To do that, you just have to

$ sudo vim /etc/environment

and then add :/usr/local/cuda/bin (including the ":") at the end of the PATH="/blah:/blah/blah" string (inside the quotes).

After a reboot, let's test our installation by making and invoking our tests:

$ cd /usr/local/cuda-9.0/samples
$ sudo make

It’s a long process with many irrelevant warnings about deprecated architectures (sm_20 and such ancient GPUs). After it completes, run deviceQuery and p2pBandwidthLatencyTest:

$ cd /usr/local/cuda/samples/bin/x86_64/linux/release
$ ./deviceQuery

The result of running deviceQuery should look something like this:

./deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1060"
CUDA Driver Version / Runtime Version 9.0 / 9.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6073 MBytes (6367739904 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1671 MHz (1.67 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

Cleanup: if ./deviceQuery works, remember to rm the 4 files (1 downloaded and 3 extracted).

3. Install cuDNN 7.0

The recommended way to install cuDNN 7.0 is to download all 3 .deb files. I had previously recommended using the .tgz installation approach, but found out that it didn’t allow verification by running code samples (no way to install the code samples .deb after .tgz installation).

The following steps are pretty much the same as the installation guide using .deb files (strange that the cuDNN guide is better than the CUDA one).

  1. Go to the cuDNN download page (need registration) and select the latest cuDNN 7.0.* version made for CUDA 9.0.
  2. Download all 3 .deb files: the runtime library, the developer library, and the code samples library for Ubuntu 16.04.
  3. In your download folder, install them in the same order:

$ sudo dpkg -i libcudnn7_7.0.5.15–1+cuda9.0_amd64.deb (the runtime library),

$ sudo dpkg -i libcudnn7-dev_7.0.5.15–1+cuda9.0_amd64.deb (the developer library), and

$ sudo dpkg -i libcudnn7-doc_7.0.5.15–1+cuda9.0_amd64.deb (the code samples).

Now we can verify the cuDNN installation (below is just the official guide, which surprisingly works out of the box):

  1. Copy the code samples somewhere you have write access: cp -r /usr/src/cudnn_samples_v7/ ~.
  2. Go to the MNIST example code: cd ~/cudnn_samples_v7/mnistCUDNN.
  3. Compile the MNIST example: make clean && make.
  4. Run the MNIST example: ./mnistCUDNN. If your installation is successful, you should see Test passed! at the end of the output.

Do NOT Install cuda-command-line-tools

Contrary to the official TensorFlow installation docs, you don’t need to install cuda-command-line-tools because it’s already installed in this version of CUDA and cuDNN. If you apt-get it, you won’t find it.

Configure the CUDA and cuDNN library paths

What you do need to do, however, is exporting environment variables LD_LIBRARY_PATH in your .bashrc file:

# put the following line in the end or your .bashrc file
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/extras/CUPTI/lib64"

And source it by source ~/.bashrc.

Install and Verify TensorFlow-GPU from binary. If you want to go even crazier and take advantage of extra CPU instructions (which is useful for preprocessing et al), follow this guide for building from source.

  1. Install TensorFlow-GPU with pip3: pip3 install --upgrade tensorflow-gpu.
  2. Try it in Python:
$ python3
>>> import tensorflow as tf
>>> sess = \
tf.Session(config=tf.ConfigProto(log_device_placement=True))

If successful, your output should look something like this (I have a 1060)

2018-04-13 13:27:02.243226: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-13 13:27:02.383829: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-13 13:27:02.384253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 5.93GiB freeMemory: 5.52GiB
2018-04-13 13:27:02.384266: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-13 13:27:02.567194: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-13 13:27:02.567223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-13 13:27:02.567255: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-13 13:27:02.567493: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5300 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1)
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1
2018-04-13 13:27:02.624330: I tensorflow/core/common_runtime/direct_session.cc:297] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060, pci bus id: 0000:01:00.0, compute capability: 6.1