Set up a GPU instance (p2.xlarge: Ubuntu 16.04 + K80 GPU) for deep learning on AWS

Roger Xu Jiang
5 min read · May 9, 2017


As is well known, GPUs can greatly accelerate the training of neural network models, and most deep learning frameworks include GPU support. Instead of buying a GPU and managing a server yourself, I find it fairly convenient to use one of the GPU instances on AWS, such as p2.xlarge.

This post explains how to set up a GPU instance (p2.xlarge, Ubuntu 16.04 with a K80 GPU) for deep learning on AWS.

You will need to register an AWS account and request a p2.xlarge Ubuntu 16.04 instance, which comes with an NVIDIA K80 GPU and 61 GB of RAM. AWS provides two pricing options, “on-demand” and “spot”; I will try to compare them in a separate post. For now, you can start with “on-demand”, which costs $0.90/hour as of August 2017.

Then ssh to the instance:

ssh -i your_aws_key.pem ubuntu@your_public_ip
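If you reconnect often, a host entry in ~/.ssh/config saves retyping the flags (the alias aws-gpu here is my own choice of name; keep your actual key path and public IP):

```
Host aws-gpu
    HostName your_public_ip
    User ubuntu
    IdentityFile ~/your_aws_key.pem
```

After that, connecting is simply `ssh aws-gpu`.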

Update the package list and upgrade the actual packages for your new server:

sudo apt-get update 
sudo apt-get dist-upgrade

I find it very handy to use htop to monitor CPU and memory usage. You can install it with apt-get:

sudo apt-get install htop

I always use Anaconda to manage my Python packages and environments, and highly recommend that you do the same:

cd /tmp
curl -O https://repo.continuum.io/archive/Anaconda2-4.3.0-Linux-x86_64.sh
bash Anaconda2-4.3.0-Linux-x86_64.sh

Create a virtual environment named py2 using Anaconda and activate it (pinning the Python version so the environment gets its own interpreter and pip):

conda create -n py2 python=2.7
source activate py2

To check your GPU information:

lspci -nnk | grep VGA -A8

This shows the following on a p2.xlarge instance:

00:1e.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80] [10de:106c]
Kernel driver in use: nvidia
Kernel modules: nvidia_375_drm, nvidia_375
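If you want just the model name out of that output, a small sed filter does the job. This is only an illustrative sketch, not part of the original setup; the sample line from above is hard-coded here so the parsing is visible, but on the instance you would pipe the lspci output in instead:

```shell
# Extract the GPU model name from an lspci line (sample hard-coded here;
# on the instance, replace the echo with: lspci -nnk | grep VGA -A8).
sample='00:1e.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)'
echo "$sample" | sed -n 's/.*\[\(Tesla [^]]*\)\].*/\1/p'
# prints: Tesla K80
```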

Download the CUDA network installer, add the CUDA repository to the apt package manager, and install CUDA with apt-get:

curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

Download cuDNN, extract it, and copy it into the same directory where CUDA is installed. (NVIDIA requires you to register before downloading cuDNN, so the link below is unlikely to work for you directly.)

cd /tmp/
curl -O https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod_20161129/8.0/cudnn-8.0-linux-x64-v5.1-tgz
sudo tar -xvf cudnn-8.0-linux-x64-v5.1-tgz -C /usr/local

At runtime, TensorFlow looks up the CUDA and cuDNN libraries through a few environment variables. Add the following lines to ~/.bashrc:

export CUDA_HOME=/usr/local/cuda
export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$CUDA_HOME/lib"
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"

Run source ~/.bashrc so that the environment variables take effect.
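As a quick sanity check that the exports took effect, you can test whether the CUDA bin directory actually landed on your PATH. A small sketch (it only inspects the variables set above):

```shell
# Re-create the relevant exports (normally done by sourcing ~/.bashrc)
# and confirm the CUDA bin directory is on PATH.
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"

case ":$PATH:" in
  *":$CUDA_HOME/bin:"*) echo "CUDA on PATH" ;;
  *)                    echo "CUDA missing from PATH" ;;
esac
# prints: CUDA on PATH
```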

The GPU version of TensorFlow installed via conda or pip is often not optimized for your hardware. A better way is to build the binary installation package (a *.whl file) yourself using Bazel. First, you need the Oracle Java Development Kit (JDK) 8, which Bazel requires:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

Add the Bazel distribution URI as a package source:

echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -

Install Bazel:

sudo apt-get install bazel

Install the TensorFlow dependencies:

sudo apt-get install python-numpy python-dev python-pip python-wheel libcurl3-dev

Clone TensorFlow from GitHub and configure the build:

git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
./configure

The script will ask you to specify many build options, including the path to Python, the path to the CUDA libraries, and so on. Here are the answers I used:

  1. Please specify the location of python. [Default is /home/ubuntu/anaconda2/bin/python]: /usr/bin/python2.7
  2. Please input the desired Python library path to use. Default is [/usr/local/lib/python2.7/dist-packages] /usr/local/lib/python2.7/dist-packages
  3. Do you wish to build TensorFlow with MKL support? [y/N] y
  4. Do you wish to download MKL LIB from the web? [Y/n] y
  5. Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: -march=native
  6. Do you wish to use jemalloc as the malloc implementation? [Y/n] n
  7. Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] n
  8. Do you wish to build TensorFlow with Hadoop File System support? [y/N] n
  9. Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N] n
  10. Do you wish to build TensorFlow with VERBS support? [y/N] n
  11. Do you wish to build TensorFlow with OpenCL support? [y/N] n
  12. Do you wish to build TensorFlow with CUDA support? [y/N] y
  13. Do you want to use clang as CUDA compiler? [y/N] n
  14. Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 8.0
  15. Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /usr/local/cuda
  16. Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:/usr/bin/gcc
  17. Please specify the cuDNN version you want to use. [Leave empty to use system default]: 5.1.10
  18. Please specify the location where cuDNN 5.1.10 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:/usr/local/cuda
  19. Please specify a list of comma-separated Cuda compute capabilities you want to build with.
    You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
    Please note that each additional compute capability significantly increases your build time and binary size.
    [Default is: "3.5,5.2"]: 3.7
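The 3.7 entered in the last prompt is the compute capability of the K80; if you adapt this guide to another instance type, the value changes with the GPU. A tiny, purely illustrative lookup sketch (values taken from NVIDIA's compute-capability table):

```shell
# Map a GPU model to its CUDA compute capability (illustrative subset).
gpu="Tesla K80"   # the GPU on p2 instances
case "$gpu" in
  "Tesla K80") echo "3.7" ;;   # p2 instances
  "Tesla M60") echo "5.2" ;;   # g3 instances
  *)           echo "unknown: check https://developer.nvidia.com/cuda-gpus" ;;
esac
# prints: 3.7
```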

Note:

  1. With Anaconda installed, the default Python is /home/ubuntu/anaconda2/bin/python; you have to change it to /usr/bin/python2.7.
  2. I enabled almost all of the optional features that can speed up TensorFlow, except for VERBS, which gave me an error when I later tried to install the resulting package with pip.
  3. In my tests, a TensorFlow build with XLA support was actually slightly slower on GPU than one without it!

Use Bazel to build the `build_pip_package` tool, which will then generate the pip package:

bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package

Note: it’s important to include the option --config=cuda here; otherwise your TensorFlow build won’t have GPU support.

bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg

Finally, pip install the resulting *.whl package from /tmp/tensorflow_pkg:

pip install /tmp/tensorflow_pkg/*.whl

(I’m assuming your terminal is still in the py2 virtual environment; otherwise, run source activate py2. You will always need to activate py2 to use TensorFlow in the future.)
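To confirm the wheel you just built really has GPU support, TensorFlow 1.x can list its local devices: on the instance (inside py2), run the python one-liner in the first comment below and look for a /gpu:0 entry. Since that requires the GPU box itself, this sketch greps a captured sample of the output instead:

```shell
# On the instance, run:
#   python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"
# and look for a device named "/gpu:0". Here, a captured sample of the
# device list stands in for the real output:
sample='/job:localhost/replica:0/task:0/cpu:0
/job:localhost/replica:0/task:0/gpu:0'
echo "$sample" | grep -q '/gpu:0' && echo "GPU device found"
# prints: GPU device found
```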

Read another post of mine, Run tensorflow in Jupyter notebook on AWS, for how to develop and train your neural network models on AWS.
