GCP + GPUs 💙 Kubernetes (and Tensorflow)

Samuel Cozannet
Google Cloud - Community
6 min read · May 4, 2017

When I wrote how to deploy Tensorflow on Kubernetes a couple of weeks ago, I took advantage of AWS as my cloud substrate. A reader commented and asked for instructions to set up the GPU cluster on the Google Cloud Platform.

I promised Vikas an answer, so here it is: a method to deploy Kubernetes with GPUs on the Google Cloud Platform.

Requirements

To replicate this post, you will need:

  • Understanding of the tooling Canonical develops and uses: Ubuntu and Juju;
  • An admin account on GCP and enough quota to add at least 3 GPUs;
  • Understanding of the tooling for Kubernetes: kubectl;
  • An Ubuntu 16.04 or higher, CentOS 6+, macOS, or Windows 7+ machine with Juju installed. This blog will focus on Ubuntu, but you can follow the guidelines for other OSes via this presentation.

If you experience any issues deploying this, or if you have specific requirements, connect with me on IRC: I am SaMnCo on Freenode #juju, and the rest of the CDK team is also available there to help.

Preparing your environment

First of all, let's install Juju on your machine, along with a couple of useful tools:

# Juju (devel PPA), plus jq and git
sudo add-apt-repository -y ppa:juju/devel
sudo apt update
sudo apt install -yqq juju jq git
# Google Cloud SDK
export SDK_SRC=https://dl.google.com/dl/cloudsdk/channels/rapid/downloads
export SDK_VERSION=154.0.0
wget ${SDK_SRC}/google-cloud-sdk-${SDK_VERSION}-linux-x86_64.tar.gz
tar xfz google-cloud-sdk-${SDK_VERSION}-linux-x86_64.tar.gz && \
  google-cloud-sdk/install.sh
# The install script is interactive
rm google-cloud-sdk-${SDK_VERSION}-linux-x86_64.tar.gz

Follow the instructions from this page to prepare a project and your GCP credentials for Juju. For reference, here is a hedged sketch of the service-account flow that page describes (all names below are placeholders, and the exact IAM roles may differ):
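# all names here are illustrative placeholders; adjust to your project
export PROJECT_ID=my-juju-project
gcloud iam service-accounts create juju-gce --display-name "Juju"
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member serviceAccount:juju-gce@${PROJECT_ID}.iam.gserviceaccount.com \
  --role roles/compute.instanceAdmin.v1
gcloud iam service-accounts keys create ~/juju-gce.json \
  --iam-account juju-gce@${PROJECT_ID}.iam.gserviceaccount.com

Once the JSON key is on disk, register the credential: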

juju add-credential google

The above line is interactive and will guide you through the process; point it at the JSON key when asked. Finally, let's download kubectl and helm. kubectl first:

# kubectl
curl -LO https://storage.googleapis.com/kubernetes-release/release/1.6.2/bin/linux/amd64/kubectl
chmod +x kubectl && sudo mv kubectl /usr/local/bin/
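
For helm, a hedged approach is to pull a client binary from the official release bucket (the version below is an assumption; pick the one you need):

# helm client; the version is an assumption
export HELM_VERSION=v2.4.1
wget https://storage.googleapis.com/kubernetes-helm/helm-${HELM_VERSION}-linux-amd64.tar.gz
tar xfz helm-${HELM_VERSION}-linux-amd64.tar.gz
chmod +x linux-amd64/helm && sudo mv linux-amd64/helm /usr/local/bin/
rm -rf linux-amd64 helm-${HELM_VERSION}-linux-amd64.tar.gz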

Clone this repository to access the source documents:

git clone https://github.com/madeden/blogposts.git
cd blogposts/k8s-tensorflow

OK! We’re good to go.

Deploying the cluster

As we want to use GPUs, exactly as on AWS, we need to be careful about the AZ we use. For example, in us-east1, only us-east1-d has GPU enablement. Google provides documentation about the locations on this page.
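
You can also check this from the CLI; a hedged one-liner (on SDKs of this vintage the accelerator-types group sits under the beta track):

gcloud beta compute accelerator-types list --filter="zone ~ ^us-east1"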

GCE has a very specific way of managing AZs and subnets, and it doesn't provide dedicated instance types for GPU-enabled machines: GPUs can be attached to pretty much any instance type, but only at boot time.

Therefore, in order to deploy K8s with GPUs on GCE, you have to:

  1. Bootstrap and deploy the control plane of Kubernetes
  2. Create the GPU instances manually
  3. Add these machines to Juju once they are started
  4. Tell Juju to deploy the worker on them

This rules out using a bundle, so most of the deployment will be manual. Let's see how that works:

juju bootstrap google/us-east1 
juju add-model k8s

Manually deploy the control plane and a first, CPU-only worker:

# control plane: etcd shares machine 0 with the master,
# and the CA runs in a LXD container on the same machine
juju deploy cs:~containers/kubernetes-master-17
juju deploy cs:~containers/etcd-29 --to 0
juju deploy cs:~containers/easyrsa-8 --to lxd:0
juju deploy cs:~containers/flannel-13
# a first, CPU-only worker
juju deploy cs:~containers/kubernetes-worker-22
# open the API server and worker ports to the world
juju expose kubernetes-master
juju expose kubernetes-worker

Add the Juju SSH key to the project:

gcloud compute project-info add-metadata \
--metadata-from-file sshKeys=~/.local/share/juju/ssh/juju_id_rsa_gce.pub

Note: this .pub file is a copy of the default juju_id_rsa.pub, adapted for GCE, and looks like:

ubuntu:ssh-rsa [KEY VALUE] ubuntu
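
You can generate it with something like this hedged one-liner (paths are Juju's defaults):

# prefix the key with the target login and force the comment to "ubuntu",
# matching the sshKeys metadata format GCE expects
awk '{print "ubuntu:" $1 " " $2 " ubuntu"}' ~/.local/share/juju/ssh/juju_id_rsa.pub \
  > ~/.local/share/juju/ssh/juju_id_rsa_gce.pub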

See the official documentation about this here. Now create all machines with:

# Note: GPU instances cannot live-migrate, hence the TERMINATE maintenance policy
for i in $(seq 1 1 3)
do
  gcloud beta compute instances create kubernetes-worker-gpu-${i} \
    --machine-type n1-standard-2 \
    --zone us-east1-d \
    --accelerator type=nvidia-tesla-k80,count=1 \
    --image-family ubuntu-1604-lts \
    --image-project ubuntu-os-cloud \
    --maintenance-policy TERMINATE \
    --metadata block-project-ssh-keys=FALSE \
    --restart-on-failure
  sleep 5
done

For each machine, we get an answer like:

Created [https://www.googleapis.com/compute/beta/projects/jaas-151616/zones/us-east1-d/instances/kubernetes-worker-gpu-2].
NAME                     ZONE        MACHINE_TYPE   PREEMPTIBLE  INTERNAL_IP  EXTERNAL_IP   STATUS
kubernetes-worker-gpu-2  us-east1-d  n1-standard-2               10.142.0.5   35.185.74.56  RUNNING

Take good note of the public IP of each instance; then, for each machine:

juju add-machine ssh:ubuntu@35.185.74.56 # use the Public IP
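
If you would rather not copy IPs by hand, here is a hedged sketch that scripts the same step, reusing the instance names and zone from the creation loop above:

for i in $(seq 1 1 3)
do
  ip=$(gcloud compute instances describe kubernetes-worker-gpu-${i} \
    --zone us-east1-d \
    --format 'value(networkInterfaces[0].accessConfigs[0].natIP)')
  juju add-machine ssh:ubuntu@${ip}
done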

You'll get answers like:

WARNING Skipping CA server verification, using Insecure option
created machine 2

At this stage, the status of Juju will look like:

$ juju status
Model  Controller  Cloud/Region     Version      SLA
k8s    k8s         google/us-east1  2.2-beta4.1  unsupported

App                Version  Status   Scale  Charm              Store       Rev  OS      Notes
easyrsa            3.0.1    active   1      easyrsa            jujucharms  8    ubuntu
etcd               2.3.8    blocked  1      etcd               jujucharms  29   ubuntu
kubernetes-master  1.6.1    blocked  1      kubernetes-master  jujucharms  17   ubuntu
kubernetes-worker  1.6.1    blocked  1      kubernetes-worker  jujucharms  22   ubuntu

Unit                  Workload  Agent  Machine  Public address  Ports  Message
easyrsa/0*            active    idle   0/lxd/0  10.0.19.96             Certificate Authority ready.
etcd/0*               blocked   idle   0        35.185.1.113           Missing relation to certificate authority.
kubernetes-master/0*  blocked   idle   0        35.185.1.113           Relate kubernetes-master:kube-control kubernetes-worker:kube-control
kubernetes-worker/0*  blocked   idle   1        35.185.118.158         Relate kubernetes-worker:kube-control kubernetes-master:kube-control

Machine  State    DNS             Inst id                Series  AZ          Message
0        started  35.185.1.113    juju-f1c96a-0          xenial  us-east1-b  RUNNING
0/lxd/0  started  10.0.19.96      juju-f1c96a-0-lxd-0    xenial              Container started
1        started  35.185.118.158  juju-f1c96a-1          xenial  us-east1-c  RUNNING
2        started  35.185.22.159   manual:35.185.22.159   xenial              Manually provisioned machine
3        started  35.185.74.56    manual:35.185.74.56    xenial              Manually provisioned machine
4        started  35.185.112.159  manual:35.185.112.159  xenial              Manually provisioned machine

Relation  Provides  Consumes  Type
cluster   etcd      etcd      peer

Note that machines 2 to 4 have been added as manual instances. Now we can tell Juju to use them for the additional workers:

# Add all the manual machines as workers
for unit in $(seq 2 1 4)
do
  juju add-unit kubernetes-worker --to ${unit}
done
# Add relations between charms
juju add-relation kubernetes-master:kube-api-endpoint kubernetes-worker:kube-api-endpoint
juju add-relation kubernetes-master:kube-control kubernetes-worker:kube-control
juju add-relation kubernetes-master:certificates easyrsa:client
juju add-relation kubernetes-master:etcd etcd:db
juju add-relation kubernetes-worker:certificates easyrsa:client
juju add-relation etcd:certificates easyrsa:client
juju add-relation flannel:etcd etcd:db
juju add-relation flannel:cni kubernetes-master:cni
juju add-relation flannel:cni kubernetes-worker:cni
# Watch the results
watch -c juju status --color

(Screenshot: CUDA being deployed on the manually provisioned, GPU-enabled GCP instances)
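
Once everything settles, you can confirm that all units report active with a hedged jq one-liner (it assumes the JSON layout of Juju 2.x status output, and uses the jq we installed earlier):

# list each unit with its workload status; expect "active" everywhere
juju status --format json | \
  jq -r '.applications[] | (.units // {}) | to_entries[] | .key + ": " + .value["workload-status"].current'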

Now we can test that CUDA is installed with:

juju ssh 2 "sudo nvidia-smi"
Thu May 4 07:20:37 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51 Driver Version: 375.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 0000:00:04.0 Off | 0 |
| N/A 74C P0 77W / 149W | 0MiB / 11439MiB | 100% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Connection to 35.185.22.159 closed.

Now let’s see how our cluster looks:

juju scp kubernetes-master/0:config ~/.kube/config
kubectl get nodes --show-labels
NAME                     STATUS  AGE  VERSION  LABELS
juju-f1c96a-1            Ready   28m  v1.6.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,kubernetes.io/hostname=juju-f1c96a-1
kubernetes-worker-gpu-1  Ready   17m  v1.6.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cuda=true,gpu=true,kubernetes.io/hostname=kubernetes-worker-gpu-1
kubernetes-worker-gpu-2  Ready   17m  v1.6.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cuda=true,gpu=true,kubernetes.io/hostname=kubernetes-worker-gpu-2
kubernetes-worker-gpu-3  Ready   17m  v1.6.1   beta.kubernetes.io/arch=amd64,beta.kubernetes.io/os=linux,cuda=true,gpu=true,kubernetes.io/hostname=kubernetes-worker-gpu-3

As you can see, the GPU nodes have been labeled with cuda=true and gpu=true.
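
Since these are plain node labels, you can select on them directly with kubectl; for instance:

kubectl get nodes -l gpu=true

With that confirmed, we can deploy our nvidia-smi job as usual: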

kubectl create -f ./src/nvidia-smi.yaml

This, after some time, gives us the following logs:

(Screenshot: a GPU workload running on Kubernetes on GCP)
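
The exact manifest lives in the repo as ./src/nvidia-smi.yaml. For context, here is a hedged, simplified sketch of what such a job can look like on Kubernetes 1.6; the real file also wires the host's NVIDIA driver libraries into the container, which I omit here:

# hedged, simplified sketch; the real manifest is ./src/nvidia-smi.yaml
cat <<EOF | kubectl create -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi
spec:
  template:
    metadata:
      name: nvidia-smi
    spec:
      nodeSelector:
        gpu: "true"
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvidia/cuda
        command: [ "nvidia-smi" ]
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
EOF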

Conclusion

Now that we have a proper cluster with GPUs on Kubernetes, the rest is similar to my other experiments. Kubernetes provides a proper abstraction of the cloud layer, so you can safely use Helm to deploy your favorite GPU workload on top!

Enjoy! If you liked this or found it useful, don’t hesitate to push the little heart button, it always helps :)
