Kubernetes + GPUs 💙 Tensorflow

A little while ago I wrote a series of blog posts about Deep Learning and Kubernetes, using the Canonical Distribution of Kubernetes (CDK) on AWS and Bare Metal to deploy Tensorflow. These posts were highly technical, fairly long and difficult to replicate.

Yesterday I published another post, explaining how the recent addition of GPUs as first class citizens in CDK changed the game. This last post has had a lot of success. Thank you all for that, it is always pleasant to see that content reaches its audience and that people are interested in your work.

The wide audience it reached also led me to conclude that I should revisit my previous work in light of all the comments and feedback I got, and explain how having trivialized GPUs access in Kubernetes can help scientists to deploy scalable, long running Tensorflow jobs (or any other deep learning afaiac).

This post is the result of that thought process.

The plan

So today, we will:

  1. Deploy a GPU enabled Kubernetes cluster, made of 1 CPU node and 3 GPU-enabled nodes;
  2. Add Storage to the cluster, via EFS;
  3. Deploy a Tensorflow workload using Helm;
  4. Learn how you can adapt this for your own Tensorflow code.

For the sake of clarity, there is nothing new here, it is an updated, condensed version of my series about Tensorflow on Kubernetes Part 1, Part 2 and Part 3 that benefits from the latest and greatest additions to the Canonical Distribution of Kubernetes.


To replicate this post, you will need:

  • Understanding of the tooling Canonical develops and uses: Ubuntu and Juju;
  • An admin account (Key and Secret) on AWS and enough credits to run p2 instances for a little while;
  • Understanding of the tooling for Kubernetes: kubectl and helm;
  • An Ubuntu 16.04 or higher, CentOS 6+, MacOS X, or Windows 7+ machine. This blog will focus on Ubuntu, but you can follow guidelines for other OSes via this presentation.

If you experience any issue in deploying this, or if you have specific requirements (non default VPCs, subnets…), connect with us on IRC. I am SaMnCo on Freenode #juju, and the rest of the CDK team is also available there to help.

Preparing your environment

First of all let’s deploy Juju on your machine as well as a couple of useful tools:

sudo add-apt-repository -y ppa:juju/devel
sudo apt update
sudo apt install -yqq juju jq git python-pip
sudo pip install --upgrade --user awscli

Now let’s add credentials for AWS so you can deploy there:

juju add-credential aws

The above line is interactive and will guide you through the process. Finally, let’s download kubectl and helm

# kubectl
curl -LO https://storage.googleapis.com/kubernetes-release/release/1.6.2/bin/linux/amd64/kubectl
chmod +x kubectl && sudo mv kubectl /usr/local/bin/
# helm
wget https://kubernetes-helm.storage.googleapis.com/helm-v2.3.1-linux-amd64.tar.gz
tar xfz helm-v2.3.1-linux-amd64.tar.gz
chmod +x linux-amd64/helm && sudo mv linux-amd64/helm /usr/local/bin/
rm -rf helm-v2.3.1-linux-amd64.tar.gz linux-amd64

Clone the charts repository

git clone https://github.com/madeden/charts.git

Clone the Docker Images repository for inspiration:

git clone https://github.com/madeden/docker-images.git

Clone this repository to access the source documents:

git clone https://github.com/madeden/blogposts.git
cd blogposts/k8s-tensorflow

OK! We’re good to go.

Deploying the cluster

First of all, we need to bootstrap in a region that is GPU-enabled, such as us-east-1

juju bootstrap aws/us-east-1

Now deploy with :

juju deploy src/bundles/k8s-tensorflow.yaml

This will take some time. You can track how the deployment converges with :

watch -c juju status --color

At the end, you should see all units “idle”, all machines “started” and it should look green.

Juju status with nice green colors

At this point in time, as CUDA enablement is now fully automated, we can safely assume that our cluster is ready to operate GPU workloads. We can download the configuration:

mkdir -p ~/.kube
juju scp kubernetes-master/0:config ~/.kube/config

and query the cluster state:

kubectl cluster-info
Kubernetes master is running at
Heapster is running at
KubeDNS is running at
kubernetes-dashboard is running at
Grafana is running at
InfluxDB is running at

The default username:password for the dashboard or Graphana is “admin:admin”.

Let’s have a look:

Fresh Kube UI after deployment

Connecting an EFS drive

First of all you need to know the VPC ID where you are deploying. You can access it from the AWS UI, or with

export VPC_ID=$(aws ec2 describe-vpcs | jq -r '.[][].VpcId')

Now we need to know the subnets where our instances are deployed:

export SUBNET_IDS=$(aws ec2 describe-subnets --filter Name=vpc-id,Values=vpc-b4ce2bd1 | jq -r '.[][].SubnetId')

And now we need the list of Security Groups to add to our EFS access:

export SG_ID=$(aws ec2 describe-instances --filter "Name=tag:Name,Values=juju-default-machine-0" | jq -r '.[][].Instances[].SecurityGroups[0].GroupId')

At this point, if have an EFS already deployed, just store its ID in EFS_ID, and make sure you delete all its mount points. If you do not have an EFS, create one with :

export EFS_ID=$(aws efs create-file-system --creation-token $(uuid)  | jq --raw-output '.FileSystemId')

And finally create the mount endpoint with :

for subnet in ${SUBNET_IDS}
aws --profile canonical --region us-east-1 efs create-mount-target \
--file-system-id ${EFS_ID} \
--subnet-id ${subnet} \
--security-groups ${SG_ID}

From the UI, you can now check that the EFS has mount points available from the Juju Security Group, which should look like juju-3d57d67a-8603–4161–8fc2-dc7e1ee08eef.

Note: users of this Tensorflow use case have reported that EFS is a little slow, and that other methods such as SSD EBS are faster. Consider this the easy demo path. If you have an advanced, IO intensive training, then let me know, and we’ll sort you out. This is something we are already working on.

Connecting EFS as a Persistent Volume in Kubernetes

We are now done with the Juju and AWS side, and can focus on deploying applications via the Helm Charts. Let’s switch to our charts folder:

cd ../../charts

Now let’s create our configuration file for the EFS volume. Copy efs/values.yaml to ./efs.yaml

cp efs/values.yaml ./efs.yaml

and update the id of the EFS volume. It shall look like:

region: us-east-1
id: fs-47cd610e
name: tensorflow-fs
accessMode: ReadWriteMany
capacity: "900Gi"
request: "750Gi"

Now deploy it with Helm

helm install efs --name efs --values ./efs.yaml

It will output something like:

NAME:   efs
LAST DEPLOYED: Thu Apr 20 13:57:37 2017
NAMESPACE: default
==> v1/PersistentVolumeClaim
tensorflow-fs Bound tensorflow-fs 900Gi RWX 1s
==> v1/PersistentVolume
tensorflow-fs 900Gi RWX Retain Bound default/tensorflow-fs 1s
This chart has created
* a PV tensorflow-fs of 900Gi capacity in ReadWriteMany mode
* a PVC tensorflow-fs of 750Gi in ReadWriteMany mode
They consume the EFS volume fs-47cd610e.efs.us-east-1.amazonaws.com

In the KubeUI, you can see that the PVC is “bound”, which means it is connected and available for consumption by other pods.

KubeUI to check PVC is bound correctly

Running an example Distributed CNN

There is a non GPU workload available in the charts repo called “distributed-cnn”. It is an example I took from a Google Tensorflow Workshop and adapted to port it to Helm.

Let’s copy the values.yaml file and adapt it:

cp distributed-cnn/values.yaml ./dcnn.yaml

and update the id of the EFS volume. The different sections are:

  • Data Initialization: this section configures a job to download the data into our EFS drive and uncompress it. If you already have run this example, you can set the “isRequired” to false, and it won’t attempt to run it again. If it’s your first time, keep it to “true”
imagePullPolicy: IfNotPresent
name: tensorflow-fs
isRequired: true
name: load-data
url: https://storage.googleapis.com/oscon-tf-workshop-materials/processed_reddit_data/news_aww/prepared_data.tar.gz
repo: gcr.io/google-samples
name: tf-workshop
dockerTag: v2
  • Cluster Configuration: This tells if we want to use GPUs, and how many Parameter Servers and Workers we want to use. Note that the code doesn’t support GPU, this is really an example of what would happen from a configuration standpoint.
name: tensorflow-cluster
internalPort: 8080
externalPort: 8080
type: ClusterIP
repo: gcr.io/google-samples
name: tf-worker-example
dockerTag: latest
isGpu: false
ps: 8
worker: 16
  • And the Tensorboard ingress endpoint to monitor the results. Here there is a DNS required. Set it to the result of
echo $(juju status kubernetes-worker-cpu --format json | jq -r '.machines[]."dns-name"').xip.io

which will leverage the Xip.io service to connect later on. The section of the file itself:

replicaCount: 1
repo: gcr.io/tensorflow
name: tensorflow
dockerTag: 1.1.0-rc2
name: tensorboard
type: ClusterIP
externalPort: 6006
internalPort: 6006
command: '["tensorboard", "--logdir", "/var/tensorflow/output"]'
cpu: 1000m
memory: 1Gi
cpu: 4000m
memory: 8Gi

Now deploy it with Helm

helm install distributed-cnn --name dcnn --values ./dcnn.yaml

depending on the number of workers and parameter servers, the output will be more or less extensive. At the end, you can see the indications:

This chart has deployed a cluster a Distributed CNN. It has:
* 8 parameter servers
* 16 workers
Enjoy Tensorflow on when the service is fully deployed, which takes a few minutes.
You can monitor the status with "kubectl get pods"

That’s it, you can connect to your tensorboard and see the output.

Tensorboard after several trainings and re trainings with different setups

If you want to compare results, just create several values.yaml files and upgrade your cluster:

helm upgrade dcnn distributed-cnn --values ./new-values-files.yaml

When you are happy and this and ready to get rid of it, you can delete it with

helm delete dcnn --purge

Running your own code

It’s always nice to understand things by copying what others have already done. That’s open source. It’s also the fastest way to get things done in 90% of the time. Don’t reinvent the wheel.

Then the next step is the DIY, and, if you are reading this post now, you probably want to run your own Tensorflow code.

So start a new Dockerfile, and fill it with:

FROM tensorflow/tensorflow:1.1.0-rc2-gpu
ADD worker.py worker.py
CMD ["python", "worker.py"]

Adapt this to your need (version…), and write this worker.py file that you’ll need. Make sure of:

  1. Following the guidelines for distributed tensorflow
  2. Expect that the environment variable POD_NAME will contain your job name and id (worker or ps, then the task index with the format worker-0)
  3. Expect that the environment variable CLUSTER_CONFIG will contain your cluster configuration

The example Google gives looks like:

import tensorflow as tf
import os
import logging
import sys
import ast
root = logging.getLogger()
ch = logging.StreamHandler(sys.stdout)
POD_NAME = os.environ.get('POD_NAME')
def main(job_name, task_id, cluster_def):
server = tf.train.Server(
if __name__ == '__main__':
this_job_name, this_task_id, _ = POD_NAME.split('-', 2)
cluster_def = ast.literal_eval(CLUSTER_CONFIG)
main(this_job_name, int(this_task_id), cluster_def)

Once you have that ready, build and publish your images. You can have a CPU only image for the Parameter Servers, and a GPU one for the workers, or just one of them. Up to you to decide.

Note: Example Dockerfiles are available in the docker-image repository for this, with several version of Tensorflow

Then copy the values.yaml file from the tensorflow chart, and adapt it by pointing the cluster to your images. Then follow the same method as before

helm install tensorflow --name fs --values ./tf.yaml

Tearing down

To tear down the Juju cluster once you are done:

juju destroy-controller aws-us-east-1 --destroy-all-models


This blog shows the progress that has been made in 3 months in the Canonical Distribution of Kubernetes to deploy GPU workloads. What used to take 3 posts now takes only 1 😁

The Tensorflow chart is being use by various teams right now, and they are all giving very useful feedback. We’re now looking into

  • Optionally converting the deployments into jobs, so that training can be a one shot instead of a continuously running action
  • Improving placement / scheduling to maximize performance
  • Adding more storage backends

Let me how that goes for you, and if you have a use case for Tensorflow and would like to use this, let me know, so we collectively end up with a useful chart for everyone.

And again, don’t hesitate to come to us on Freenode #juju to discuss and onboard the Kubernetes+GPU+Tensorflow train!