Battling through an Nvidia GPU setup in the name of machine learning.

Samir Moussa
8 min read · Nov 14, 2017


Objective: Install an Nvidia GeForce GTX 1080 Ti with CUDA 9.0.
Audience: Know basic Linux.

It was always going to be tricky, but usually a wander through Ask Ubuntu, Stack Overflow or the Nvidia dev blogs would have made setting up a custom-built GPU machine less brutal. The problem I faced was understanding the function of the numerous libraries circling the installation process, and knowing which information still applied as the technology continuously changes and help online becomes outdated.

I recently (August ‘17) purchased a new GPU rig for just under £1700 to do some machine learning and deep learning research. It took me a day to assemble and several days to set up the software. It was my first ever computer build, so I had to make sure I didn’t screw things up. Most of my time was spent not on plugging in hardware but on orchestrating software installation. It frustrated me, so here we are.

This post aims to make sense of how one can install CUDA, from what drivers mean to why GPU tests are making your 1080 look like a 670. I’ll explain why trying to repurpose hardware is most definitely going to mess things up (and you’re most definitely going to mess things up), so I’ve provided info on how to start afresh. Take all of this information in with a pinch of salt, because by the time you finish reading, some of it will no longer be relevant.

Some photos. (Note: I decided to use Debian instead of Ubuntu.)

GPU stands for Graphics Processing Unit. It’s for graphics. It’s to help you play high quality games without the lag through beastly parallel processing. But you’re not going to be using it for gaming or graphics are you? Right now it’s plugged into your motherboard and only your BIOS and Linux kernel know about it.

lspci | grep -i nvidia

Do you see it?
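If grep finds it, you can also pull out the card’s PCI bus ID, which comes in handy later if you ever need to pin a driver to a specific device in an X configuration file. A minimal sketch (the sample line below is illustrative; on a real system you’d pipe `lspci | grep -i nvidia` in instead, and your output will differ):

```shell
# extract the PCI bus ID (the first field) from an lspci line;
# the sample line is illustrative -- your card's line will differ
sample='01:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti]'
busid=$(printf '%s\n' "$sample" | cut -d' ' -f1)
echo "$busid"   # -> 01:00.0
```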

BIOS: Basic Input Output System. It’s firmware (really software) already on your motherboard that directly controls your hardware, like overclocking, fan speed and pretty lights.

You most likely also have a monitor which needs to be connected. You have two options: plug its HDMI or VGA cable directly into the ports at the rear of your GPU, or use the motherboard’s display ports. Do the latter. To fully utilise the GPU’s processing power, you want your monitor to use your CPU’s built-in graphics component, saving CUDA operations and cat predictions for the dedicated computing device, the Nvidia card.

Luckily, your monitor picks up the default driver already installed on the motherboard chipset, Intel CPU die or AMD whatever. In any case, if you’ve specifically bought a chipset without integrated graphics, then you have no option but to use your Nvidia card for visual display. Let’s assume you have an integrated graphics component.

Device driver: A piece of software required in order to interact with the device. Each device has its own set of drivers. Some versions of software will only work on specific versions of drivers.

Let’s say you’re logged in and are now sitting in front of a blinking terminal. Your OS should know nothing about any Nvidia drivers because you haven’t installed anything yet. If it does, it means your OS or some libraries have interfered by setting default configurations during their installation. I moved from Ubuntu to Debian 9 “Stretch” so that I wouldn’t have to deal with uninstalling bloatware and to keep track of what is going on by installing things myself.

Our objective at the moment is to install two components: the Nvidia proprietary device driver and CUDA. First we must uninstall any existing Nvidia device drivers and CUDA toolkits. If your computer explodes during installation, come up here to start afresh.

sudo /usr/local/cuda-9.0/bin/uninstall_cuda_9.0.pl
sudo /usr/bin/nvidia-uninstall
sudo apt-get remove --purge nvidia*
sudo apt-get autoremove
sudo reboot

There are many ways one can install libraries or packages — through package managers (apt, apt-get, aptitude) or simple shell scripts. Nvidia has various packages for various distributions, but let’s cut out the middleman by downloading and using their runfile. The runfile works for a range of Linux distributions; I just went with the Ubuntu section.

After downloading, execute the command bel—wait!

sudo sh ~/Downloads/cuda_9.0.176_384.81_linux.run

If you execute the command above, the Nvidia installer will throw a fit complaining you’re already running an X server.

Wot.

The Nvidia device driver (or CUDA driver as Nvidia calls it) cannot be installed when at least one of two things is present: some other open source driver is taking its place, like Nouveau, or an existing graphical user interface is active (you are logged in and can see your wallpaper). We need to stop both, so let’s start with the first.

Nouveau: An open-source graphics device driver for Nvidia cards sitting as a placeholder for Nvidia’s proprietary driver. We don’t really want to use this.

You probably have Nouveau installed, but we can blacklist it. Your system will then give precedence to Nvidia’s proprietary driver once it is installed. If you run the command below and see some output, you have Nouveau loaded:

lsmod | grep nouveau

So edit /etc/modprobe.d/blacklist-nouveau.conf with vim or something:

# inside blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0

Save and exit the file. Regenerate the RAM initialisation image:

sudo update-initramfs -u
sudo reboot
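The blacklist step above can be sketched in one go. I’m writing to a temporary directory here so you can dry-run it; on a real system the target is /etc/modprobe.d/blacklist-nouveau.conf, written with sudo:

```shell
# dry-run sketch of the Nouveau blacklist step; swap "$demo" for
# /etc/modprobe.d (via sudo tee) on a real system, then run
# `sudo update-initramfs -u` and reboot
demo=$(mktemp -d)
printf 'blacklist nouveau\noptions nouveau modeset=0\n' > "$demo/blacklist-nouveau.conf"
grep -c '^' "$demo/blacklist-nouveau.conf"   # -> 2 (both lines written)
```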

We’ve dealt with the driver placeholder and now we need to disable the GUI (screen). Remember, we can’t install the Nvidia driver or CUDA without disabling the display. So our task is to disable the GUI, enter text mode, install things then recover the GUI. When you’re at your login screen, press CTRL+ALT+F1 or CTRL+ALT+F4 to enter text mode. Now log in.

It’s important to understand how the computer decides to display your GUI. There are these things called display managers and another thing called X.

Xorg (or X Window System or X11 or X) server: Software that manages GUI windows and allows you to do clicky things. It’s a server run by a display manager and needs to be constantly running in the background.

Your desktop environment (Unity, GNOME, KDE, etc.) comes with a display manager (lightdm, gdm, etc.) that starts up the X server when you first switch on the computer. You get the login screen, you log in and so on. By default, your Linux machine will already be running one of these display managers. We want to temporarily disable it.

You can see which one is running like this:

cat /etc/X11/default-display-manager
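That file just contains the path of the display manager binary, so you can derive the name of the service to stop from it. A sketch (the sample value below stands in for the file contents; on a real system you’d read it with the cat command above):

```shell
# derive the display manager's service name from its configured path;
# the sample value is illustrative -- read yours from
# /etc/X11/default-display-manager instead
dm_path='/usr/sbin/gdm3'
dm=$(basename "$dm_path")
echo "$dm"   # -> gdm3, so the stop command is: sudo /etc/init.d/gdm3 stop
```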

Where is this all configured? At /etc/X11/xorg.conf, obviously. But I don’t have this file! Well, yeah: nowadays, depending on your Linux distribution and version, you might have a package called Bumblebee instead.

Bumblebee: Linux version of Nvidia Optimus which dynamically manages power to the GPU. It’s a little server that takes control over which driver to use for which graphics card for which monitor. Bumblebee = xorg.conf(s) + battery management stuff.

You might not have Bumblebee. In fact you don’t need it because it’s best used in laptops which have abysmal battery lives. One might have multiple drivers, GPUs and configuration files and Bumblebee manages them all.

The point here is that we first need to disable the running X server (the graphics display) and then edit something like an xorg.conf file which might be telling our computer to, by default, use an Nvidia driver (Nouveau or Nvidia’s proprietary one) and GPU for display. No. Use my integrated graphics component for display thank you very much.

We’ve already blacklisted Nouveau, so to disable the X server:

sudo /etc/init.d/gdm3 stop

If you have /etc/X11/xorg.conf, /etc/bumblebee/bumblebee.conf or something similar, make sure the default driver for your first (main) monitor is whatever your integrated graphics driver name is. In my case it’s intel. Nvidia’s one is (wait for it) nvidia.
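For reference, a minimal Device section pinning the display to the integrated Intel graphics might look like the sketch below. The Identifier is arbitrary and the BusID is illustrative; take yours from lspci (xorg.conf wants it in decimal PCI:bus:device:function form):

```
Section "Device"
    Identifier "IntegratedGraphics"
    Driver     "intel"
    BusID      "PCI:0:2:0"
EndSection
```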

Stuck? Have a look here at lines 22, 55 and 67.

Finally! We’re ready to install the Nvidia driver and CUDA toolkit. There’s also sample code that comes with CUDA which will be useful later, so install that too. Just bear in mind a few things before you continue. The installer will ask if you want to install OpenGL. Don’t. Your integrated graphics might already be using OpenGL, or your system might already have OpenGL configurations set, and installing it at this point will overwrite those settings. Since we want our integrated graphics component rather than our GPU to render our cat videos, skip messing with OpenGL. You can do what you like with it later.

Someone somewhere might tell you to run nvidia-xconfig. Don’t do this because we’ve already configured our system’s X configuration at /etc/X11/xorg.conf (or /etc/bumblebee/bumblebee.conf). Running nvidia-xconfig will overwrite those changes.

sudo sh ~/Downloads/cuda_9.0.176_384.81_linux.run
(... do stuff ...)
sudo reboot

Now you’re good to go. What’ll happen is your display will use your integrated graphics component while CUDA programs will be able to run on your new shiny GPU through the Nvidia device driver.

But wait! Ughhh, what noowww??!

There’s some final path linking and testing to be done. Trust me, you’ll want to test your GPU to understand why it’s not working to its official specs. I found the Nvidia Installation Guide for Linux to be very helpful.

Add these to your path in .bashrc (or .zshrc, depending on your shell):

export PATH=/usr/local/cuda-9.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-9.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
export CUDA_VISIBLE_DEVICES=0
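The odd-looking ${PATH:+:${PATH}} is shell parameter expansion: it expands to a colon plus the old value only if the variable was already set and non-empty, so you never end up with a dangling colon in your path. A quick demonstration with a throwaway variable:

```shell
# ${VAR:+word} expands to `word` only when VAR is set and non-empty;
# this is what keeps the exports above from producing a trailing ':'
unset DEMO
echo "unset:${DEMO:+:${DEMO}}"    # -> unset:
DEMO=/usr/local/cuda-9.0/bin
echo "set:${DEMO:+:${DEMO}}"      # -> set::/usr/local/cuda-9.0/bin
```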

If you also downloaded the CUDA sample code files during your installation, then you’ll find a ~/NVIDIA_CUDA-9.0_Samples/ directory. It has a set of code snippets that run device queries, math operations and simulations. You’ll find these useful for probing your device and looking at C++ code utilising CUDA.

Now test your GPU performance:

cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/bandwidthTest
make
./bandwidthTest

I noticed my Host to Device Bandwidth was clocking at around 6.5 GB/s. It’s the speed at which data can be moved between the host and device. Surely it’s supposed to be over 11 GB/s? I finally figured out that the test was also taking into account device attachment time.

Typically, the GPU will be attached when one or more jobs are using the device. In this case its device file will be opened. Otherwise, the device file is closed. It’s like a server automagically starting to serve requests and shutting down when it’s not in use. This is called persistence mode.

The quick bandwidth test was also counting the time taken to open the device file. Running sudo nvidia-persistenced puts the device into persistence mode. Now try the bandwidth test again. Phew. Usually this isn’t an issue with everyday tasks, because processes that use the GPU are placed on its running process list and the device file won’t be closed until that list is empty.

Er, I have nothing more to say. Enjoy your new toy!

Please share if you liked reading this post. Is there anything else you’d like me to write about? Any questions? Let me know!

Follow me @samir_moussa
