If you're like me, you're always itching to try out the latest versions of whatever software you use. Sadly, that often means figuring out how to iron out a few kinks to get things to mix and match. Here's a guide to help you prepare your shiny new Ubuntu for deep learning.
This guide is for those who need to get the most out of their Nvidia-based hardware. I assume you have a fresh install of Ubuntu 18.04. If you are just playing around and won't need your GPU, you're probably better off installing the pip package instead. If you are an AMD person, you need SYCL (OpenCL) instead of CUDA, but I can't help much with that.
Installing CUDA: good news, bad news
So, the first thing you should try is to install the tensorflow-gpu pip package, as described in the official install guide. That has never worked for me, though, so I'll only show how to install from source. This has the added advantage that the compiled binaries will take advantage of all optimizations your machine supports.
If you follow the official guide to building Tensorflow from source, you'll notice it recommends CUDA 9.0. If you were on Ubuntu 17.04, the latest officially supported release, you should indeed stick with the 9.0 version from the archives; in that case the prebuilt pip package will likely work. But since 18.04 isn't supported anyway, there's no reason not to use the latest CUDA, which right now is 9.1.
The good news is that Ubuntu 18.04 has added CUDA to its multiverse repository. That means you don't need to mess around with adding third-party repositories and all the inevitable version clashes that come with it. You can simply install everything using
sudo apt install nvidia-390 nvidia-cuda-toolkit libcupti-dev gcc-6 python3-numpy python3-dev python3-pip python3-wheel
This installs the (currently) latest graphics driver, CUDA itself, CUPTI (which for some reason doesn't ship with the CUDA package as it should), GCC 6 (the latest version able to compile CUDA code), and a handful of Python 3 development essentials. If, for some reason, you need Python 2, just omit the 3 in the package names.
The bad news is that, because it has become a native package, CUDA is installed in a rather non-standard way.
I can’t blame either Nvidia or Canonical. The Nvidia way was the most self-contained option: it kept things consistent across distros, made it easy for third-party dependencies to locate CUDA, and let folks on non-standard distros retain compatibility. Everything lived under /usr/local/cuda-*.*, so multiple versions could coexist without relying on a package manager.
But that’s not how native packages work. Because they can rely on the package manager’s bookkeeping, they install into the root system paths, such as /usr/lib. Imagine if each new package appended to the PATH and LD_LIBRARY_PATH environment variables: looking up binaries, headers and libraries would soon become slow and prone to obscure namespace clashes. The standard layout makes it easy for all tools to know where to look for their dependencies, but it’s only viable because the package manager tracks which package installed which files. It’s a human-friendliness/scalability tradeoff.
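To make the scaling argument concrete, here's a purely illustrative Python sketch (not part of the install) that mimics how a shell resolves a binary by scanning PATH entries in order; with one entry per package, every lookup of a missing binary pays the full cost:

```python
import os

def which(name, path_dirs):
    """Mimic a shell's binary lookup: scan each PATH entry in order."""
    for i, d in enumerate(path_dirs):
        candidate = os.path.join(d, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate, i + 1  # found after scanning i + 1 directories
    return None, len(path_dirs)

# Hypothetical worst case: a thousand per-package bin directories.
path_dirs = [f"/opt/pkg{i}/bin" for i in range(1000)] + ["/usr/bin"]
found, scanned = which("definitely-not-installed", path_dirs)
print(scanned)  # a miss scans all 1001 directories
```

The `/opt/pkg*/bin` paths are made up for the demonstration; the point is simply that lookup cost grows linearly with the number of search-path entries, which is why installing into a few shared root paths scales better.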
Tensorflow will probably eventually update its configuration tool to work with this new installation format, but meanwhile we need to emulate the old way for it to work. The following commands should do the trick:
sudo mkdir -p /usr/local/cuda /usr/local/cuda/extras/CUPTI /usr/local/cuda/nvvm
sudo ln -s /usr/bin /usr/local/cuda/bin
sudo ln -s /usr/include /usr/local/cuda/include
sudo ln -s /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64
sudo ln -s /usr/local/cuda/lib64 /usr/local/cuda/lib
sudo ln -s /usr/include /usr/local/cuda/extras/CUPTI/include
sudo ln -s /usr/lib/x86_64-linux-gnu /usr/local/cuda/extras/CUPTI/lib64
sudo ln -s /usr/lib/nvidia-cuda-toolkit/libdevice /usr/local/cuda/nvvm/libdevice
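If you want to sanity-check that the links ended up where Tensorflow expects them, a small helper like this can do it (my own convenience sketch, not an official tool; it just compares each symlink against the targets from the commands above):

```python
import os

# Symlinks created above: path relative to the CUDA root -> expected target.
EXPECTED_LINKS = {
    "bin": "/usr/bin",
    "include": "/usr/include",
    "lib64": "/usr/lib/x86_64-linux-gnu",
    "lib": "/usr/local/cuda/lib64",
    "extras/CUPTI/include": "/usr/include",
    "extras/CUPTI/lib64": "/usr/lib/x86_64-linux-gnu",
    "nvvm/libdevice": "/usr/lib/nvidia-cuda-toolkit/libdevice",
}

def check_cuda_layout(root="/usr/local/cuda", expected=EXPECTED_LINKS):
    """Return a list of (link, problem) pairs; an empty list means all OK."""
    problems = []
    for rel, target in expected.items():
        link = os.path.join(root, rel)
        if not os.path.islink(link):
            problems.append((link, "missing or not a symlink"))
        elif os.readlink(link) != target:
            problems.append((link, "points to " + os.readlink(link)))
    return problems

if __name__ == "__main__":
    for link, problem in check_cuda_layout():
        print(link + ": " + problem)
```

If the script prints nothing, the emulated layout matches what the commands above set up.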
Note: this guide assumes you’ve got a 64 bit system (very likely).
Why not do it the traditional way?
You might be wondering: why stick with the multiverse package when the Nvidia-provided one is so much easier to deal with? The main reason is that the Nvidia package simply does not work on the new Ubuntu, due to version clashes with some packages, in particular the graphics driver. When I tried it that way I ended up with a broken Ubuntu and had to reinstall from scratch. Despite the extra work it requires, the multiverse package is actually quite up to date and should cause fewer headaches long-term, since it doesn’t depend as much on Nvidia’s care.
Installing additional Nvidia libraries
To install cuDNN, simply copy the files over to the /usr/local/cuda directory you created. Assuming you’ve extracted the .tgz into your Downloads folder, run the following from inside the extracted cuda directory (the symlinks must be created inside the lib64 directory, hence the cd):
sudo cp include/* /usr/local/cuda/include/
sudo cp lib64/libcudnn.so.7.1.4 lib64/libcudnn_static.a /usr/local/cuda/lib64/
cd /usr/local/cuda/lib64
sudo ln -s libcudnn.so.7.1.4 libcudnn.so.7
sudo ln -s libcudnn.so.7 libcudnn.so
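To confirm the symlink chain resolves the way the dynamic loader will see it, you can follow it hop by hop; this is just a quick check I like to run (adjust the version number if your cuDNN differs):

```python
import os

def resolve_chain(path):
    """Follow a symlink chain, returning every hop down to the real file."""
    hops = [path]
    while os.path.islink(hops[-1]):
        target = os.readlink(hops[-1])
        if not os.path.isabs(target):
            # Relative link targets resolve against the link's own directory.
            target = os.path.join(os.path.dirname(hops[-1]), target)
        hops.append(os.path.normpath(target))
    return hops

# On the real system this should print the chain
# libcudnn.so -> libcudnn.so.7 -> libcudnn.so.7.1.4
for hop in resolve_chain("/usr/local/cuda/lib64/libcudnn.so"):
    print(hop)
```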
EDIT: As per Ian Jason Min’s comment, I’ve updated this segment, which wasn’t correct because symbolic links don’t mix well with actual directories. Sorry about that. On the plus side, I took the opportunity to turn some files into symbolic links, which saves about 600MB of space (and avoids a warning from apt).
This will actually copy them over to the root system paths, which is not ideal because they won’t be tracked by any package manager, but they’re just a few self-contained files, so we can live with that. If you have a more robust procedure in mind, feel free to comment.
To install NCCL, a little more work is needed. From inside the extracted nccl directory:
sudo mkdir -p /usr/local/cuda/nccl/lib /usr/local/cuda/nccl/include
sudo cp *.txt /usr/local/cuda/nccl
sudo cp include/*.h /usr/include/
sudo cp lib/libnccl.so.2.1.15 lib/libnccl_static.a /usr/lib/x86_64-linux-gnu/
sudo ln -s /usr/include/nccl.h /usr/local/cuda/nccl/include/nccl.h
cd /usr/lib/x86_64-linux-gnu
sudo ln -s libnccl.so.2.1.15 libnccl.so.2
sudo ln -s libnccl.so.2 libnccl.so
for i in libnccl*; do sudo ln -s /usr/lib/x86_64-linux-gnu/$i /usr/local/cuda/nccl/lib/$i; done
EDIT: Once again Ian Jason Min saved the day and pointed out a couple of missing details in the above segment.
This will also install into the root paths, but, again, shouldn’t be a big deal. I’ll show a few commands to undo all of this later on.
One last note: Tensorflow can also use TensorRT to speed up inference, but I couldn’t make it work with this setup; the configuration tool complains about a version incompatibility I couldn’t resolve. If someone figures it out, I’ll update this section.
Installing Bazel
The official guide recommends installing Bazel with the binary installer, but I find the custom apt repository easier and better, since it keeps Bazel updated. The instructions are easy enough to follow, but I’ll copy them here for your convenience:
sudo apt install openjdk-8-jdk
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
sudo apt update && sudo apt install bazel
One more thing
The configuration script assumes there’s a python binary in your environment. By default, Ubuntu 18.04 no longer comes with Python 2, and the Python 3 binary is called python3. To resolve this, I like to use
sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 100 --slave /usr/bin/pip pip /usr/bin/pip3
This way, whenever you call python you get Python 3. 😊
Note that if you ever install Python 2, python will continue to point to Python 3; Python 2 will be accessible via python2.
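A quick way to confirm the alternative took effect is to start python (not python3) and check which major version answers:

```python
import sys

# If the update-alternatives command worked, running this under the
# plain `python` command should report major version 3.
print(sys.version_info.major)
```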
Compiling Tensorflow
Now we can finally move on to the good stuff. First, clone the Tensorflow repository:
git clone https://github.com/tensorflow/tensorflow
However, unlike what the official guide recommends, you should stick with the master branch. The latest release (currently 1.8) has a bug that prevents some code from compiling with GCC 6; the official builds are compiled with GCC 4.8, which is presumably why the breakage slipped into the release. The fix has since been merged into master, so it should compile fine. In case you run into issues, I built at commit #d0f5bc1 (there have been plenty of newer commits since, some of which may break something).
The next step is to run the configuration tool with ./configure. Here are my inputs:
You have bazel 0.13.0 installed.
Please specify the location of python. [Default is /usr/bin/python]: /usr/bin/python3
Found possible Python library paths:
Please input the desired Python library path to use. Default is [/usr/local/lib/python3.6/dist-packages]
Do you wish to build TensorFlow with jemalloc as malloc support? [Y/n]: y
jemalloc as malloc support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Google Cloud Platform support? [Y/n]: n
No Google Cloud Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Hadoop File System support? [Y/n]: n
No Hadoop File System support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Amazon S3 File System support? [Y/n]: n
No Amazon S3 File System support will be enabled for TensorFlow.
Do you wish to build TensorFlow with Apache Kafka Platform support? [Y/n]: n
No Apache Kafka Platform support will be enabled for TensorFlow.
Do you wish to build TensorFlow with XLA JIT support? [y/N]: y
XLA JIT support will be enabled for TensorFlow.
Do you wish to build TensorFlow with GDR support? [y/N]: n
No GDR support will be enabled for TensorFlow.
Do you wish to build TensorFlow with VERBS support? [y/N]: n
No VERBS support will be enabled for TensorFlow.
Do you wish to build TensorFlow with OpenCL SYCL support? [y/N]: n
No OpenCL SYCL support will be enabled for TensorFlow.
Do you wish to build TensorFlow with CUDA support? [y/N]: y
CUDA support will be enabled for TensorFlow.
Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to default to CUDA 9.0]: 9.1
Please specify the location where CUDA 9.1 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Please specify the cuDNN version you want to use. [Leave empty to default to cuDNN 7.0]: 7.1
Please specify the location where cuDNN 7 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:
Do you wish to build TensorFlow with TensorRT support? [y/N]: n
No TensorRT support will be enabled for TensorFlow.
Please specify the NCCL version you want to use. [Leave empty to default to NCCL 1.3]: 2.1.15
Please specify the location where NCCL 2 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /usr/local/cuda/nccl
Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size. [Default is: 3.5,5.2] 5.2
Do you want to use clang as CUDA compiler? [y/N]: n
nvcc will be used as CUDA compiler.
Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/x86_64-linux-gnu-gcc-7]: /usr/bin/gcc-6
Do you wish to build TensorFlow with MPI support? [y/N]: n
No MPI support will be enabled for TensorFlow.
Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]:
Would you like to interactively configure ./WORKSPACE for Android builds? [y/N]: n
Not configuring the WORKSPACE for Android builds.
Preconfigured Bazel build configs. You can use any of the below by adding "--config=<>" to your build command. See tools/bazel.rc for more details.
--config=mkl # Build with MKL support.
--config=monolithic # Config for mostly static monolithic build.
In particular, note the answers that differ from the defaults:
- Python path: /usr/bin/python3
- GCC path: /usr/bin/gcc-6
- NCCL path: /usr/local/cuda/nccl
- Check your latest CUDA compute capability at the Nvidia link above.
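If you're unsure what to type for the compute capability, the deviceQuery sample that ships with the CUDA toolkit prints it. Here's a small parser sketch for that output; the sample text below is made up for illustration, and the exact wording can vary between CUDA versions:

```python
import re

def parse_compute_capability(devicequery_output):
    """Extract 'major.minor' capability strings from deviceQuery-style text."""
    return re.findall(
        r"CUDA Capability Major/Minor version number:\s*(\d+\.\d+)",
        devicequery_output,
    )

# Hypothetical deviceQuery output fragment, for illustration only:
sample = """
Device 0: "GeForce GTX 970"
  CUDA Capability Major/Minor version number:    5.2
"""
print(parse_compute_capability(sample))  # ['5.2']
```

Whatever values come out, join them with commas when answering the configure prompt.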
NOTE: It has been reported that newer commits require Keras just to compile. Although this looks like an oversight by some dev, for now it’s best to sidestep the issue and install Keras first:
pip install keras
Now, to compile, just run
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
Note: if you get an error like "C++ compilation of rule ‘@double_conversion//:double-conversion’ failed", then it might be useful to pass an additional argument.
This step will probably take a long time. After it finishes, if all goes well, you can build the tensorflow pip package with
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
and install it with
pip install /tmp/tensorflow_pkg/tensorflow*.whl
Check that your build works by changing into another directory (cd) and running python, then:
import tensorflow as tf
hello = tf.constant('Hello, Tensorflow!')
sess = tf.Session()
print(sess.run(hello))
You should get Hello, Tensorflow! as the output.
How to undo this
As much as some of these commands may have spooked you, not much damage is done to your system if you follow this guide. If you run into issues and want to undo everything CUDA-related so you can start from scratch or try something else, just run the following lines:
sudo rm /usr/include/nccl.h /usr/include/cudnn.h /usr/lib/x86_64-linux-gnu/libnccl* /usr/lib/x86_64-linux-gnu/libcudnn*
sudo rm -r /usr/local/cuda
sudo apt remove nvidia-390 nvidia-cuda-toolkit libcupti-dev gcc-6
You may wish to omit the nvidia-390 bit, since it’s usually a good idea to keep the proprietary driver whether you are using CUDA or not.
To uninstall the Tensorflow package, use
pip uninstall tensorflow
If all went well, you now have a heavily optimized, cutting-edge build of Tensorflow on your Ubuntu 18.04. The only thing that could make it faster is TensorRT, but I couldn’t figure out how to make it work with this setup. Feel free to make suggestions or ask for help in the comments. I hope this guide saves a few people some time; it took me the better part of a day to figure all of this out.