I have a buddy who has spent a good amount of money and time setting up a GPU cluster. It was a non-trivial effort to bring the beast online. After that, I was asked to get TensorFlow to work with it.
I started right away. Along the way, I was surprised to hit multiple roadblocks, and I spent a lot of time searching for solutions. Only recently was I able to put everything together. I am sharing my experience here in the hope that it helps you.
Here is my setup:
- OS: Ubuntu 16.04 (codename: xenial)
- GPU: Nvidia GeForce GTX 1080 Ti
- Target versions: CUDA 9.1.85, Driver Version 390.30
I can choose the CUDA and driver versions when I set up the graphics card. The default is CUDA 9.1. The first source of confusion was the many conflicting opinions on whether TensorFlow would work only with CUDA 8, only with 9.0, or also with 9.1, the latest version. After a few trials, I found that 9.1 worked fine.
Install Nvidia CUDA Toolkit:
- Nvidia’s installation guide is good enough. I focused on the following sections
- Pre-installation Actions
- CUDA toolkit is available at: https://developer.nvidia.com/cuda-downloads. Based on my platform, this was my choice
- After I downloaded the base installer, I followed the “Installation Instruction” on the same page. Note that I used sudo apt-get install cuda-9.1 to pin my version. This turned out to be good practice, especially because I had multiple CUDA versions on my platform.
- The installation might take quite a few minutes. When it was completed, I updated my ~/.bash_profile with:
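The exact lines I added are not reproduced here. As a rough way to check the result, the sketch below assumes the usual post-installation setup from Nvidia's guide, where the CUDA bin directory is on PATH and the lib64 directory is on LD_LIBRARY_PATH. The /usr/local/cuda-9.1 prefix and the helper names are assumptions for illustration, not part of any Nvidia tool.

```python
import os


def dirs_in_var(value):
    """Split a colon-separated variable such as PATH into its entries."""
    return [entry.rstrip("/") for entry in value.split(":") if entry]


def check_cuda_env(env, cuda_home="/usr/local/cuda-9.1"):
    """Report whether the assumed CUDA directories appear in env.

    cuda_home is an assumed install prefix; adjust it to your machine.
    """
    expected = {
        "PATH": cuda_home + "/bin",
        "LD_LIBRARY_PATH": cuda_home + "/lib64",
    }
    return {
        var: path in dirs_in_var(env.get(var, ""))
        for var, path in expected.items()
    }


if __name__ == "__main__":
    # Check the environment of the current shell session.
    for var, ok in check_cuda_env(os.environ).items():
        print(var, "OK" if ok else "has no CUDA entry")
```

Run it from a fresh login shell so that the updated ~/.bash_profile has been read.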
Install cuDNN 7.0.5
- cuDNN is part of Nvidia’s Deep Learning SDK, and TensorFlow’s GPU build requires it.
- The installation page is at https://developer.nvidia.com/cudnn. I needed to register as a user, and so will you.
- There are 3 libraries: Runtime library, Developer library and Code samples. Please install all of them. I had no problem following the installation instructions, so I won’t repeat them here.
How to Check CUDA Versions
- $ nvcc --version, and my result is like: “…release 9.1, V9.1.85”
- Alternatively, I can run $ cat /usr/local/cuda/version.txt and get “CUDA Version 9.1.85”
- $ nvidia-smi, and my result is like: “…NVIDIA-SMI 390.30, Driver Version: 390.30”
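If you want to run these three checks in one go, here is a minimal sketch that calls the same commands and pulls out the version strings. The parser functions and their regular expressions are my own illustration; they are written against the output fragments quoted above, so they may need adjusting if your tools print something different.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Extract e.g. '9.1.85' from '...release 9.1, V9.1.85'."""
    match = re.search(r"release\s+[\d.]+,\s+V([\d.]+)", text)
    return match.group(1) if match else None


def parse_version_txt(text):
    """Extract e.g. '9.1.85' from 'CUDA Version 9.1.85'."""
    match = re.search(r"CUDA Version\s+([\d.]+)", text)
    return match.group(1) if match else None


def parse_driver_version(text):
    """Extract e.g. '390.30' from '...Driver Version: 390.30'."""
    match = re.search(r"Driver Version:\s*([\d.]+)", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    checks = [
        (["nvcc", "--version"], parse_nvcc_release),
        (["cat", "/usr/local/cuda/version.txt"], parse_version_txt),
        (["nvidia-smi"], parse_driver_version),
    ]
    for cmd, parser in checks:
        try:
            out = subprocess.check_output(cmd, universal_newlines=True)
        except (OSError, subprocess.CalledProcessError) as err:
            # The tool is missing or returned an error; report and move on.
            print(" ".join(cmd), "failed:", err)
            continue
        print(" ".join(cmd), "->", parser(out))
```

A result of None for any command means the expected pattern was not found, which is worth a closer look before moving on.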
Verify Device Versions before Installing TensorFlow
- Verify that /usr/local/cuda/ is symlinked to /usr/local/cuda-9.1/
- $ cd /usr/local/cuda/samples/
- $ make clean && make
- The make process takes a few minutes. Please go get a coffee. After that, go to /usr/local/cuda/samples/bin/x86_64/linux/release/
- Run $ ./deviceQuery and my result is: “deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, CUDA Runtime Version = 9.1, NumDevs = 1, Result = PASS”
- An introduction to the complete list of CUDA C++ samples is here
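When checking more than one machine, reading the deviceQuery summary by eye gets tedious. The sketch below parses the “key = value” pairs from the summary text quoted above and checks for a PASS result and at least one device. The function names are hypothetical, and the parsing is based only on the result line shown in this post.

```python
import re


def parse_device_query(text):
    """Collect the 'key = value' pairs from deviceQuery's summary text."""
    pairs = re.findall(r"([A-Za-z][A-Za-z ]*?)\s*=\s*([^,\n]+)", text)
    return {key.strip(): value.strip() for key, value in pairs}


def device_query_ok(text):
    """True if the summary reports PASS and at least one device."""
    info = parse_device_query(text)
    num_devs = info.get("NumDevs", "0")
    return (
        info.get("Result") == "PASS"
        and num_devs.isdigit()
        and int(num_devs) >= 1
    )


if __name__ == "__main__":
    # Summary text as quoted in this post.
    sample = (
        "deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.1, "
        "CUDA Runtime Version = 9.1, NumDevs = 1, Result = PASS"
    )
    print(parse_device_query(sample))
    print("OK" if device_query_ok(sample) else "needs attention")
```

To use it on a real run, save the output of ./deviceQuery to a file and pass the file contents to device_query_ok.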
Install TensorFlow 1.5 from Source
- This was a big discovery: I was not able to install TF from its pre-built binary. As it turned out, I had to build it from source. In addition, TensorFlow 1.6 did not work for me, so I had to stick with 1.5.
- TensorFlow provides its own installation instructions: https://www.tensorflow.org/install/install_sources. They are very detailed but easy to follow, so I won’t repeat them here.
- I was using Anaconda 3 (Python 3.6.3) and a virtual environment. During the ./configure process, I mostly accepted the defaults, except for the following:
Do you wish to build TensorFlow with CUDA support? [y/N]: y
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.0.5
- After TF was installed, I verified my installation with:
>>> import tensorflow as tf
>>> tf.__version__ # My answer is '1.5.0'
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello)) # Prints b'Hello, TensorFlow!' under Python 3
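The session above only shows that TensorFlow imports and runs; it does not by itself show that the GPU is being used. One way to check, sketched below, is to list the devices TensorFlow can see through device_lib.list_local_devices() from TF 1.x and keep the ones whose device_type is "GPU". The two helper names are my own, and the TensorFlow import is kept inside the second function so the filtering logic can be read and tried without a GPU build.

```python
def gpu_device_names(devices):
    """Return the names of entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def local_gpu_device_names():
    """List the GPU devices visible to the installed TensorFlow (TF 1.x)."""
    # Imported here so this module loads even where TensorFlow is absent.
    from tensorflow.python.client import device_lib

    return gpu_device_names(device_lib.list_local_devices())


if __name__ == "__main__":
    try:
        names = local_gpu_device_names()
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        print(names if names else "No GPU visible to TensorFlow")
```

An empty list on a machine with a working GPU usually means the build was configured without CUDA support, so the ./configure answers are the first thing to recheck.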
To summarize, there were two big gotchas. First, TF can work with CUDA 9.1, so you do not have to wade through dozens of debates on this topic on Stack Overflow, Ask Ubuntu, Google Groups and the like. Second, you have to install TensorFlow from source. The simple pip3 install tensorflow-gpu will not do the magic for you. (At least it did not work for me.)