Kubernetes + GPUs + Tensorflow
A little while ago I wrote a series of blog posts about Deep Learning and Kubernetes, using the Canonical Distribution of Kubernetes (CDK) on AWS and Bare Metal to deploy Tensorflow. These posts were highly technical, fairly long and difficult to replicate.
Yesterday I published another post, explaining how the recent addition of GPUs as first class citizens in CDK changed the game. This last post has had a lot of success. Thank you all for that, it is always pleasant to see that content reaches its audience and that people are interested in your work.
The wide audience it reached also led me to conclude that I should revisit my previous work in light of all the comments and feedback I got, and explain how making GPU access in Kubernetes trivial can help scientists deploy scalable, long-running Tensorflow jobs (or any other deep learning framework, as far as I'm concerned).
This post is the result of that thought process.
The plan
So today, we will:
- Deploy a GPU enabled Kubernetes cluster, made of 1 CPU node and 3 GPU-enabled nodes;
- Add Storage to the cluster, via EFS;
- Deploy a Tensorflow workload using Helm;
- Learn how you can adapt this for your own Tensorflow code.
To be clear, there is nothing fundamentally new here: this is an updated, condensed version of my series about Tensorflow on Kubernetes (Part 1, Part 2 and Part 3) that benefits from the latest and greatest additions to the Canonical Distribution of Kubernetes.
Requirements
To replicate this post, you will need:
- Understanding of the tooling Canonical develops and uses: Ubuntu and Juju;
- An admin account (Key and Secret) on AWS and enough credits to run p2 instances for a little while;
- Understanding of the tooling for Kubernetes: kubectl and helm;
- An Ubuntu 16.04 or higher, CentOS 6+, MacOS X, or Windows 7+ machine. This blog will focus on Ubuntu, but you can follow guidelines for other OSes via this presentation.
If you experience any issues deploying this, or if you have specific requirements (non-default VPCs, subnets…), connect with us on IRC. I am SaMnCo on Freenode #juju, and the rest of the CDK team is also available there to help.
Preparing your environment
First of all, let's install Juju on your machine, along with a couple of useful tools:
sudo add-apt-repository -y ppa:juju/devel
sudo apt update
sudo apt install -yqq juju jq git python-pip
sudo pip install --upgrade --user awscli
Now let's add credentials for AWS so you can deploy there:
juju add-credential aws
The above command is interactive and will guide you through the process. Finally, let's download kubectl and helm:
# kubectl
curl -LO https://storage.googleapis.com/kubernetes-release/release/1.6.2/bin/linux/amd64/kubectl
chmod +x kubectl && sudo mv kubectl /usr/local/bin/
# helm
wget https://kubernetes-helm.storage.googleapis.com/helm-v2.3.1-linux-amd64.tar.gz
tar xfz helm-v2.3.1-linux-amd64.tar.gz
chmod +x linux-amd64/helm && sudo mv linux-amd64/helm /usr/local/bin/
rm -rf helm-v2.3.1-linux-amd64.tar.gz linux-amd64
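Before going further, it is worth checking that the binaries actually landed on your PATH. A small sketch (the `check_tools` helper is mine, not part of any of the tools above):

```shell
# check_tools: report whether each required CLI is available on PATH
check_tools() {
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: ok"
    else
      echo "$tool: missing"
    fi
  done
}

check_tools juju kubectl helm
```

If anything comes back `missing`, re-run the corresponding install step above before continuing.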
Clone the charts repository:
git clone https://github.com/madeden/charts.git
Clone the Docker Images repository for inspiration:
git clone https://github.com/madeden/docker-images.git
Clone this repository to access the source documents:
git clone https://github.com/madeden/blogposts.git
cd blogposts/k8s-tensorflow
OK! We're good to go.
Deploying the cluster
First of all, we need to bootstrap in a region that is GPU-enabled, such as us-east-1:
juju bootstrap aws/us-east-1
Now deploy with:
juju deploy src/bundles/k8s-tensorflow.yaml
This will take some time. You can track how the deployment converges with:
watch -c juju status --color
At the end, you should see all units "idle", all machines "started", and everything green.
At this point in time, as CUDA enablement is now fully automated, we can safely assume that our cluster is ready to operate GPU workloads. We can download the configuration:
mkdir -p ~/.kube
juju scp kubernetes-master/0:config ~/.kube/config
and query the cluster state:
kubectl cluster-info
Kubernetes master is running at https://34.201.171.159:6443
Heapster is running at https://34.201.171.159:6443/api/v1/proxy/namespaces/kube-system/services/heapster
KubeDNS is running at https://34.201.171.159:6443/api/v1/proxy/namespaces/kube-system/services/kube-dns
kubernetes-dashboard is running at https://34.201.171.159:6443/api/v1/proxy/namespaces/kube-system/services/kubernetes-dashboard
Grafana is running at https://34.201.171.159:6443/api/v1/proxy/namespaces/kube-system/services/monitoring-grafana
InfluxDB is running at https://34.201.171.159:6443/api/v1/proxy/namespaces/kube-system/services/monitoring-influxdb
The default username:password for the dashboard and Grafana is "admin:admin".
Let's have a look:
Connecting an EFS drive
First of all you need to know the VPC ID where you are deploying. You can access it from the AWS UI, or with:
export VPC_ID=$(aws ec2 describe-vpcs | jq -r '.[][].VpcId')
Now we need to know the subnets where our instances are deployed:
export SUBNET_IDS=$(aws ec2 describe-subnets --filter Name=vpc-id,Values=${VPC_ID} | jq -r '.[][].SubnetId')
And now we need the list of Security Groups to add to our EFS access:
export SG_ID=$(aws ec2 describe-instances --filter "Name=tag:Name,Values=juju-default-machine-0" | jq -r '.[][].Instances[].SecurityGroups[0].GroupId')
At this point, if you already have an EFS deployed, just store its ID in EFS_ID, and make sure you delete all its mount points. If you do not have an EFS, create one with:
export EFS_ID=$(aws efs create-file-system --creation-token $(uuid) | jq --raw-output '.FileSystemId')
And finally create the mount endpoints with:
for subnet in ${SUBNET_IDS}
do
aws --region us-east-1 efs create-mount-target \
--file-system-id ${EFS_ID} \
--subnet-id ${subnet} \
--security-groups ${SG_ID}
done
From the UI, you can now check that the EFS has mount points available from the Juju Security Group, which should look like juju-3d57d67a-8603-4161-8fc2-dc7e1ee08eef.
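If you prefer the CLI to the UI, `aws efs describe-mount-targets --file-system-id ${EFS_ID}` lists one entry per mount target. A quick sketch of checking that every target has reached the `available` state; the JSON below is a made-up sample of that output, not real data:

```python
import json

# Made-up sample of `aws efs describe-mount-targets` output,
# standing in for the real CLI response
sample = json.loads("""
{"MountTargets": [
  {"SubnetId": "subnet-aaaa1111", "LifeCycleState": "available"},
  {"SubnetId": "subnet-bbbb2222", "LifeCycleState": "creating"}
]}
""")

# Mount targets that are not yet usable
pending = [t["SubnetId"] for t in sample["MountTargets"]
           if t["LifeCycleState"] != "available"]
print(pending)  # ['subnet-bbbb2222']
```

Wait until that list is empty before mounting the volume from the cluster.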
Note: users of this Tensorflow use case have reported that EFS is a little slow, and that other methods such as SSD-backed EBS are faster. Consider this the easy demo path. If you have an advanced, IO-intensive training workload, let me know and we'll sort you out. This is something we are already working on.
Connecting EFS as a Persistent Volume in Kubernetes
We are now done with the Juju and AWS side, and can focus on deploying applications via the Helm Charts. Let's switch to our charts folder:
cd ../../charts
Now let's create our configuration file for the EFS volume. Copy efs/values.yaml to ./efs.yaml:
cp efs/values.yaml ./efs.yaml
and update the id of the EFS volume. It should look like:
global:
  services:
    aws:
      region: us-east-1
      efs:
        id: fs-47cd610e
storage:
  name: tensorflow-fs
  accessMode: ReadWriteMany
  pv:
    capacity: "900Gi"
  pvc:
    request: "750Gi"
Now deploy it with Helm:
helm install efs --name efs --values ./efs.yaml
It will output something like:
NAME: efs
LAST DEPLOYED: Thu Apr 20 13:57:37 2017
NAMESPACE: default
STATUS: DEPLOYED

RESOURCES:
==> v1/PersistentVolumeClaim
NAME           STATUS  VOLUME         CAPACITY  ACCESSMODES  AGE
tensorflow-fs  Bound   tensorflow-fs  900Gi     RWX          1s

==> v1/PersistentVolume
NAME           CAPACITY  ACCESSMODES  RECLAIMPOLICY  STATUS  CLAIM                  REASON  AGE
tensorflow-fs  900Gi     RWX          Retain         Bound   default/tensorflow-fs          1s

NOTES:
This chart has created
* a PV tensorflow-fs of 900Gi capacity in ReadWriteMany mode
* a PVC tensorflow-fs of 750Gi in ReadWriteMany mode
They consume the EFS volume fs-47cd610e.efs.us-east-1.amazonaws.com
In the Kubernetes UI, you can see that the PVC is "Bound", which means it is connected and available for consumption by other pods.
Running an example Distributed CNN
There is a non-GPU workload available in the charts repo called "distributed-cnn". It is an example I took from a Google Tensorflow Workshop and adapted to port it to Helm.
Let's copy the values.yaml file and adapt it:
cp distributed-cnn/values.yaml ./dcnn.yaml
and adapt it to your setup. The different sections are:
- Data Initialization: this section configures a job to download the data onto our EFS drive and uncompress it. If you have already run this example, you can set "isRequired" to false and it won't attempt to run again. If it's your first time, keep it set to "true":
global:
  imagePullPolicy: IfNotPresent
storage:
  name: tensorflow-fs
dataImport:
  isRequired: true
  service:
    name: load-data
  settings:
    url: https://storage.googleapis.com/oscon-tf-workshop-materials/processed_reddit_data/news_aww/prepared_data.tar.gz
  image:
    repo: gcr.io/google-samples
    name: tf-workshop
    dockerTag: v2
- Cluster Configuration: this section sets whether we want to use GPUs, and how many Parameter Servers and Workers we want to run. Note that this example's code doesn't support GPU; it just shows what would happen from a configuration standpoint:
tfCluster:
  service:
    name: tensorflow-cluster
    internalPort: 8080
    externalPort: 8080
    type: ClusterIP
  image:
    repo: gcr.io/google-samples
    name: tf-worker-example
    dockerTag: latest
  settings:
    isGpu: false
    jobs:
      ps: 8
      worker: 16
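Under the hood, a chart like this has to turn the `ps` and `worker` counts into the cluster specification that distributed Tensorflow expects (the same structure that later lands in the workers' environment). A minimal sketch of that mapping; the `build_cluster_def` helper and the `job-index.service:port` naming convention are mine for illustration, not necessarily what the chart emits:

```python
def build_cluster_def(service_name, n_ps, n_workers, port=8080):
    """Map job counts to a distributed-Tensorflow cluster spec.

    Each entry is a hypothetical in-cluster address of the form
    <job>-<index>.<service>:<port>.
    """
    return {
        "ps": ["ps-%d.%s:%d" % (i, service_name, port) for i in range(n_ps)],
        "worker": ["worker-%d.%s:%d" % (i, service_name, port)
                   for i in range(n_workers)],
    }

cluster = build_cluster_def("tensorflow-cluster", n_ps=8, n_workers=16)
print(len(cluster["ps"]), len(cluster["worker"]))  # 8 16
print(cluster["worker"][0])  # worker-0.tensorflow-cluster:8080
```

Scaling the deployment is then just a matter of changing the two numbers in the YAML above.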
- And the Tensorboard ingress endpoint to monitor the results. Here a DNS name is required. Set it to the result of:
echo $(juju status kubernetes-worker-cpu --format json | jq -r '.machines[]."dns-name"').xip.io
which leverages the xip.io service to connect later on. The section of the file itself:
tensorboard:
  replicaCount: 1
  image:
    repo: gcr.io/tensorflow
    name: tensorflow
    dockerTag: 1.1.0-rc2
  service:
    name: tensorboard
    dns: 34.204.92.163.xip.io
    type: ClusterIP
    externalPort: 6006
    internalPort: 6006
    command: '["tensorboard", "--logdir", "/var/tensorflow/output"]'
  settings:
    resources:
      requests:
        cpu: 1000m
        memory: 1Gi
      limits:
        cpu: 4000m
        memory: 8Gi
Now deploy it with Helm:
helm install distributed-cnn --name dcnn --values ./dcnn.yaml
Depending on the number of workers and parameter servers, the output will be more or less extensive. At the end, you will see the notes:
NOTES:
This chart has deployed a Distributed CNN cluster. It has:
* 8 parameter servers
* 16 workers
Enjoy Tensorflow on 34.204.92.163.xip.io when the service is fully deployed, which takes a few minutes.
You can monitor the status with "kubectl get pods"
That's it! You can connect to your Tensorboard and see the output.
If you want to compare results, just create several values.yaml files and upgrade your cluster:
helm upgrade dcnn distributed-cnn --values ./new-values-files.yaml
When you are happy with this and ready to get rid of it, delete it with:
helm delete dcnn --purge
Running your own code
It's always nice to understand things by copying what others have already done. That's open source. It's also the fastest way to get things done 90% of the time. Don't reinvent the wheel.
The next step is DIY: if you are still reading this post, you probably want to run your own Tensorflow code.
So start a new Dockerfile, and fill it with:
FROM tensorflow/tensorflow:1.1.0-rc2-gpu
ADD worker.py worker.py
CMD ["python", "worker.py"]
Adapt this to your needs (version…), then write the worker.py file you'll need. Make sure to:
- Follow the guidelines for distributed Tensorflow;
- Expect that the environment variable POD_NAME will contain your job name and id (worker or ps, then the task index, in the format worker-0);
- Expect that the environment variable CLUSTER_CONFIG will contain your cluster configuration.
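For illustration, CLUSTER_CONFIG is a string holding something like the Python literal below (the hostnames here are hypothetical), which `ast.literal_eval` can turn back into a dict:

```python
import ast

# Hypothetical example of what CLUSTER_CONFIG could contain
CLUSTER_CONFIG = ("{'ps': ['ps-0.tensorflow-cluster:8080'], "
                  "'worker': ['worker-0.tensorflow-cluster:8080', "
                  "'worker-1.tensorflow-cluster:8080']}")

# Parse the string back into the cluster definition dict
cluster_def = ast.literal_eval(CLUSTER_CONFIG)
print(sorted(cluster_def))         # ['ps', 'worker']
print(len(cluster_def['worker']))  # 2
```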
The example Google gives looks like:
import tensorflow as tf
import os
import logging
import sys
import ast

root = logging.getLogger()
root.setLevel(logging.INFO)
ch = logging.StreamHandler(sys.stdout)
root.addHandler(ch)

POD_NAME = os.environ.get('POD_NAME')
CLUSTER_CONFIG = os.environ.get('CLUSTER_CONFIG')
logging.info(POD_NAME)

def main(job_name, task_id, cluster_def):
    server = tf.train.Server(
        cluster_def,
        job_name=job_name,
        task_index=task_id
    )
    server.join()

if __name__ == '__main__':
    this_job_name, this_task_id, _ = POD_NAME.split('-', 2)
    cluster_def = ast.literal_eval(CLUSTER_CONFIG)
    main(this_job_name, int(this_task_id), cluster_def)
Once you have that ready, build and publish your images. You can have a CPU only image for the Parameter Servers, and a GPU one for the workers, or just one of them. Up to you to decide.
Note: Example Dockerfiles for this are available in the docker-images repository, with several versions of Tensorflow.
Then copy the values.yaml file from the tensorflow chart, and adapt it by pointing the cluster to your images. Then follow the same method as before:
helm install tensorflow --name fs --values ./tf.yaml
Tearing down
To tear down the Juju cluster once you are done:
juju destroy-controller aws-us-east-1 --destroy-all-models
Conclusion
This blog post shows the progress made in 3 months on the Canonical Distribution of Kubernetes for deploying GPU workloads. What used to take 3 posts now takes only one!
The Tensorflow chart is being used by various teams right now, and they are all giving very useful feedback. We're now looking into:
- Optionally converting the deployments into jobs, so that training can be a one-shot run instead of a continuously running action
- Improving placement / scheduling to maximize performance
- Adding more storage backends
Let me know how that goes for you; if you have a use case for Tensorflow and would like to use this, let me know, so we collectively end up with a chart that is useful for everyone.
And again, don't hesitate to come to us on Freenode #juju to discuss and board the Kubernetes+GPU+Tensorflow train!