Set up a GPU instance (p2.xlarge: Ubuntu 16.04 + K80 GPU) for deep learning on AWS
As is well known, GPUs can greatly accelerate the training of neural network models, and most deep learning frameworks include GPU support. Instead of buying a GPU and managing a server yourself, I find it fairly convenient to use one of the GPU instances on AWS, such as p2.xlarge.
This post gives an introduction on how to set up a GPU instance (p2.xlarge with Ubuntu 16.04 and a K80 GPU) for deep learning on AWS.
You will need to register an AWS account and request a p2.xlarge Ubuntu 16.04 instance, which comes with an NVIDIA K80 GPU and 61 GB of RAM. AWS provides two pricing options, “on-demand” and “spot”. I will try to create a separate post comparing them in the future. For now, you can start with “on-demand”, which costs $0.9/hour as of August 2017.
Then ssh into the instance:
ssh -i your_aws_key.pem ubuntu@your_public_ip
Update the package list and upgrade the actual packages for your new server:
sudo apt-get update
sudo apt-get dist-upgrade
I find it very handy to use htop to monitor CPU and memory usage. You can install it with apt-get:
sudo apt-get install htop
I always use anaconda to manage my Python packages and environments, and I highly recommend that you do the same:
cd /tmp
curl -O https://repo.continuum.io/archive/Anaconda2-4.3.0-Linux-x86_64.sh
bash Anaconda2-4.3.0-Linux-x86_64.sh
Create a virtual environment named py2 using anaconda and activate it:
conda create -n py2
source activate py2
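As a quick sanity check, you can confirm which interpreter is active after activating the environment; the path shown in the comment is an assumption based on the default Anaconda install location:

```python
# Print the interpreter currently in use; inside the py2 environment this
# should point into anaconda's envs directory, e.g.
# /home/ubuntu/anaconda2/envs/py2/bin/python (exact path is an assumption).
import sys
print(sys.executable)
```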
To check your GPU information:
lspci -nnk | grep VGA -A8
This shows the following on a p2.xlarge instance:
00:1e.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
Subsystem: NVIDIA Corporation GK210GL [Tesla K80] [10de:106c]
Kernel driver in use: nvidia
Kernel modules: nvidia_375_drm, nvidia_375
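Once the NVIDIA driver is in place, nvidia-smi gives a more detailed live view of the GPU (utilization, memory, driver version). A small guarded sketch, so it degrades gracefully on a machine without the driver:

```shell
# Query the GPU with nvidia-smi (shipped with the NVIDIA driver).
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_status="$(nvidia-smi 2>&1)" || gpu_status="nvidia-smi failed to run"
else
  gpu_status="nvidia-smi not found (is the NVIDIA driver installed?)"
fi
echo "$gpu_status"
```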
Download the CUDA network installer, add the CUDA repository to the apt-get package manager, and install CUDA using `apt-get`:
curl -O http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
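To confirm the toolkit installed correctly you can ask nvcc for its version. A hedged sketch, guarded because nvcc lives under /usr/local/cuda/bin and may not be on your PATH until the environment variables are set up later:

```shell
# Report the installed CUDA toolkit version, if nvcc is reachable.
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version
  cuda_check="nvcc found"
else
  cuda_check="nvcc not on PATH yet; try /usr/local/cuda/bin/nvcc --version"
fi
echo "$cuda_check"
```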
Download cuDNN, then extract it into the same directory where CUDA is installed. (NVIDIA requires you to register before downloading cuDNN, so the following link will most likely not work for you.)
cd /tmp/
curl -O https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod_20161129/8.0/cudnn-8.0-linux-x64-v5.1.tgz
sudo tar -xvf cudnn-8.0-linux-x64-v5.1.tgz -C /usr/local
At run time, tensorflow looks up the CUDA and cuDNN libraries through a few environment variables. The following lines have to be added to ~/.bashrc:
export CUDA_HOME=/usr/local/cuda
export DYLD_LIBRARY_PATH="$DYLD_LIBRARY_PATH:$CUDA_HOME/lib"
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
Run source ~/.bashrc so that the environment variables take effect.
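To double-check that the PATH manipulation did what you expect, here is a self-contained sketch that recreates the exports and verifies that CUDA's bin directory ended up on PATH (it sets the variables itself, so it runs anywhere):

```shell
# Recreate the exports from ~/.bashrc and confirm CUDA's bin
# directory is now on PATH.
export CUDA_HOME=/usr/local/cuda
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
case ":$PATH:" in
  *":$CUDA_HOME/bin:"*) path_ok="yes" ;;
  *) path_ok="no" ;;
esac
echo "CUDA bin on PATH: $path_ok"
```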
The GPU version of tensorflow installed using conda or pip is often not optimized for your hardware. A better way is to build the binary installation package (a *.whl file) using bazel. First, you need to install bazel and the Oracle Java Development Kit (JDK) 8:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
Add Bazel distribution URI as a package source:
echo "deb [arch=amd64] http://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
curl https://bazel.build/bazel-release.pub.gpg | sudo apt-key add -
Install Bazel:
sudo apt-get install bazel
Install the TensorFlow build dependencies:
sudo apt-get install python-numpy python-dev python-pip python-wheel libcurl3-dev
Clone tensorflow from GitHub and configure the installation:
git clone https://github.com/tensorflow/tensorflow.git
cd tensorflow
./configure
This will ask you to specify many options for the build, including the path to python, the paths to the CUDA libraries, and so on.
- Please specify the location of python. [Default is /home/ubuntu/anaconda2/bin/python]: /usr/bin/python2.7
- Please input the desired Python library path to use. Default is [/usr/local/lib/python2.7/dist-packages] /usr/local/lib/python2.7/dist-packages
- Do you wish to build TensorFlow with MKL support? [y/N] y
- Do you wish to download MKL LIB from the web? [Y/n] y
- Please specify optimization flags to use during compilation when bazel option "--config=opt" is specified [Default is -march=native]: -march=native
- Do you wish to use jemalloc as the malloc implementation? [Y/n] n
- Do you wish to build TensorFlow with Google Cloud Platform support? [y/N] n
- Do you wish to build TensorFlow with Hadoop File System support? [y/N] n
- Do you wish to build TensorFlow with the XLA just-in-time compiler (experimental)? [y/N] n
- Do you wish to build TensorFlow with VERBS support? [y/N] n
- Do you wish to build TensorFlow with OpenCL support? [y/N] n
- Do you wish to build TensorFlow with CUDA support? [y/N] y
- Do you want to use clang as CUDA compiler? [y/N] n
- Please specify the CUDA SDK version you want to use, e.g. 7.0. [Leave empty to use system default]: 8.0
- Please specify the location where CUDA 8.0 toolkit is installed. Refer to README.md for more details. [Default is /usr/local/cuda]: /usr/local/cuda
- Please specify which gcc should be used by nvcc as the host compiler. [Default is /usr/bin/gcc]:/usr/bin/gcc
- Please specify the cuDNN version you want to use. [Leave empty to use system default]: 5.1.10
- Please specify the location where cuDNN 5.1.10 library is installed. Refer to README.md for more details. [Default is /usr/local/cuda]:/usr/local/cuda
- Please specify a list of comma-separated Cuda compute capabilities you want to build with.
You can find the compute capability of your device at: https://developer.nvidia.com/cuda-gpus.
Please note that each additional compute capability significantly increases your build time and binary size.
[Default is: "3.5,5.2"]: 3.7
Note:
- With anaconda installed, the default python will be /home/ubuntu/anaconda2/bin/python; you have to change it to /usr/bin/python2.7.
- I enabled almost all of the options that can speed up tensorflow, except for VERBS, which gave me an error when I later tried to install the package with pip.
- Running tensorflow with XLA support on the GPU turned out to be slightly slower than tensorflow without XLA support!
Use bazel to build the `build_pip_package` script, which will later be used to build the pip package.
bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
Note: it’s important to have the option --config=cuda here; otherwise your tensorflow won’t have GPU support.
Then run the generated script to produce the wheel file under /tmp/tensorflow_pkg:
bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg
pip install the *.whl package at /tmp/tensorflow_pkg/*.whl:
pip install /tmp/tensorflow_pkg/*.whl
(I’m assuming your terminal is still in the py2 virtual environment; otherwise, run source activate py2. Of course, you will always have to activate the py2 virtual environment to use tensorflow in the future.)
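A quick way to confirm the build actually has GPU support is to list the devices tensorflow can see. This is a hedged sketch using the TF 1.x device_lib API from the era of this post, guarded so it degrades gracefully where tensorflow is absent:

```python
# List the devices tensorflow can see; a working GPU build on a
# p2.xlarge should report a GPU device (the Tesla K80) next to the CPU.
try:
    from tensorflow.python.client import device_lib
    devices = [d.name for d in device_lib.list_local_devices()]
except ImportError:
    devices = []  # tensorflow is not installed in this environment
print(devices)
```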
Read another post of mine, Run tensorflow in Jupyter notebook on AWS, to learn how to develop and train your neural network models on AWS.