How we commoditized GPUs for Kubernetes

Samuel Cozannet
Published in Intuition Machine
Apr 18, 2017

[Edit] A careful reader (thanks, HN user puzzle) informed me that running in privileged mode is no longer required to access the GPUs in K8s. I have therefore removed a note that stated otherwise, and I am updating my Helm charts to remove that requirement.

Over the last 4 months I have blogged 4 times about the enablement of GPUs in Kubernetes. Each time I did so, I spent several days building and destroying clusters until it was just right, making the experience as fluid as possible for adventurous readers.

It was not the easiest task: the environments were different (cloud, bare metal), and so was the hardware (g2.2xlarge instances have old GRID K520s, p2 instances have K80s, and at home I had 1060GTX cards on consumer-grade Intel NUCs…). As a result, I also spent several hours helping people set up their clusters. Usually with success, but I must admit some environments were challenging.

Thankfully the team at Canonical in charge of developing the Canonical Distribution of Kubernetes have productized GPU integration and made it so easy to use that it would just be a shame not to talk about it.

And since happiness never comes alone, I was lucky enough to be allocated 3 brand-new, production-grade Pascal P5000 cards by our nVidia friends.

I could have installed these in my playful rig to replace the 1060GTX boards. But this would have shown little gratitude for the exceptional gift I received from nVidia. Instead, I decided to go for a full-blown “production grade” bare metal cluster, which will allow me to replicate most of the environments customers and partners have. I chose to go for 3x Dell T630 servers, which can be GPU enabled and are very capable machines. I received them a couple of weeks ago, and…

Please don’t mind the cables, I don’t have a rack…

There we are! Ready for some awesomeness?

What it was in the past

If you remember the other posts, the sequence was:

  1. Deploy a “normal” K8s cluster with Juju;
  2. Add a CUDA charm and relate it to the right group of Kubernetes workers;
  3. Connect to each node, enable privileged containers, add the experimental-nvidia-gpu flag to the kubelet, and restart the kubelet;
  4. Connect to the API server, add the experimental-nvidia-gpu flag, and restart the API server;
  5. Test that the drivers were installed OK and made available in k8s with Juju and Kubernetes commands.

Overall, on top of the Kubernetes installation itself, and even with all the scripting in the world, GPU enablement cost no less than 30 to 45 minutes of extra maintenance work.

It is better than having no GPUs, but it is often too much for the operators of the clusters who want an instant solution.

How is it now?

I am happy to say that the requests of the community have been heard loud and clear.

As of Kubernetes 1.6.1, and the matching GA release of the Canonical Distribution of Kubernetes, the new experience is:

  1. Deploy a normal K8s cluster with Juju

Yes, you read that right: single-command deployment of a GPU-enabled Kubernetes cluster.

Since 1.6.1, the charms:

  • Watch for GPU availability every 5 minutes. For clouds like GCE, where GPUs can be added to instances on the fly, this makes sure that no GPU is ever forgotten;
  • If one or more GPUs are detected on a worker, install the latest and greatest CUDA drivers on the node, then reconfigure and restart the kubelet automagically;
  • Have the worker communicate its new state to the master, which in turn reconfigures the API server to accept GPU workloads;
  • In a mixed cluster, where only some nodes have GPUs, make sure only the right nodes attempt to install CUDA and accept privileged containers.
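To make the detection step more concrete, here is a minimal sketch of how a node could check for NVIDIA GPUs. It is an assumption for illustration, not the charm's actual code: the function name count_nvidia_gpus is hypothetical, and it simply counts display-class PCI devices carrying NVIDIA's vendor ID (10de) in the output of lspci -nn.

```shell
# Count NVIDIA GPUs in `lspci -nn` output read from stdin.
# Only display-class devices are counted, so the HDMI audio function
# that consumer cards expose alongside the GPU is ignored.
count_nvidia_gpus() {
  # grep -c prints 0 but exits non-zero when nothing matches;
  # `|| true` keeps callers running under `set -e` alive.
  grep -Eic '(VGA compatible controller|3D controller).*\[10de:[0-9a-f]{4}\]' || true
}

# On a real worker (not executed here):
#   lspci -nn | count_nvidia_gpus
```

A periodic job could run this and trigger the driver installation only when the count is greater than zero, which is roughly the behavior described in the list above.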

You don’t believe me? Fair enough. Watch me…

Requirements

For the following, you’ll need:

  • Basic understanding of the Canonical toolbox: Ubuntu, Juju, MAAS…
  • Basic understanding of Kubernetes
  • A little bit of Helm at the end

and for the files, cloning the repo:

git clone https://github.com/madeden/blogposts
cd blogposts/k8s-ethereum

Putting it to the test

In the cloud

Deploying in the cloud is trivial. Once Juju is installed and your credentials are added,

juju bootstrap aws/us-east-1 
juju deploy src/bundles/k8s-1cpu-3gpu-aws.yaml
watch -c juju status --color

Now wait…

Model    Controller     Cloud/Region   Version
default  aws-us-east-1  aws/us-east-1  2.2-beta2

App                    Version  Status       Scale  Charm              Store       Rev  OS      Notes
easyrsa                3.0.1    active       1      easyrsa            jujucharms  8    ubuntu
etcd                   2.3.8    active       1      etcd               jujucharms  29   ubuntu
flannel                0.7.0    active       2      flannel            jujucharms  13   ubuntu
kubernetes-master      1.6.1    waiting      1      kubernetes-master  jujucharms  17   ubuntu  exposed
kubernetes-worker-cpu  1.6.1    active       1      kubernetes-worker  jujucharms  22   ubuntu  exposed
kubernetes-worker-gpu           maintenance  3      kubernetes-worker  jujucharms  22   ubuntu  exposed

Unit                      Workload     Agent      Machine  Public address  Ports           Message
easyrsa/0*                active       idle       0/lxd/0  10.0.201.114                    Certificate Authority connected.
etcd/0*                   active       idle       0        52.91.177.229   2379/tcp        Healthy with 1 known peer
kubernetes-master/0*      waiting      idle       0        52.91.177.229   6443/tcp        Waiting for kube-system pods to start
  flannel/0*              active       idle                52.91.177.229                   Flannel subnet 10.1.4.1/24
kubernetes-worker-cpu/0*  active       idle       1        34.207.180.182  80/tcp,443/tcp  Kubernetes worker running.
  flannel/1               active       idle                34.207.180.182                  Flannel subnet 10.1.29.1/24
kubernetes-worker-gpu/0   maintenance  executing  2        54.146.144.181                  (install) Installing CUDA
kubernetes-worker-gpu/1   maintenance  executing  3        54.211.83.217                   (install) Installing CUDA
kubernetes-worker-gpu/2*  maintenance  executing  4        54.237.248.219                  (install) Installing CUDA

Machine  State    DNS             Inst id              Series  AZ          Message
0        started  52.91.177.229   i-0d71d98b872d201f5  xenial  us-east-1a  running
0/lxd/0  started  10.0.201.114    juju-29e858-0-lxd-0  xenial              Container started
1        started  34.207.180.182  i-04f2b75f3ab88f842  xenial  us-east-1a  running
2        started  54.146.144.181  i-0113e8a722778330c  xenial  us-east-1a  running
3        started  54.211.83.217   i-07c8c81f5e4cad6be  xenial  us-east-1a  running
4        started  54.237.248.219  i-00ae437291c88210f  xenial  us-east-1a  running

Relation      Provides               Consumes               Type
certificates  easyrsa                etcd                   regular
certificates  easyrsa                kubernetes-master      regular
certificates  easyrsa                kubernetes-worker-cpu  regular
certificates  easyrsa                kubernetes-worker-gpu  regular
cluster       etcd                   etcd                   peer
etcd          etcd                   flannel                regular
etcd          etcd                   kubernetes-master      regular
cni           flannel                kubernetes-master      regular
cni           flannel                kubernetes-worker-cpu  regular
cni           flannel                kubernetes-worker-gpu  regular
cni           kubernetes-master      flannel                subordinate
kube-dns      kubernetes-master      kubernetes-worker-cpu  regular
kube-dns      kubernetes-master      kubernetes-worker-gpu  regular
cni           kubernetes-worker-cpu  flannel                subordinate
cni           kubernetes-worker-gpu  flannel                subordinate

Same same, but same. I was able to capture the moment where it is installing CUDA so you can see it… When it’s done:

juju ssh kubernetes-worker-gpu/0 "sudo nvidia-smi"
Tue Apr 18 08:50:23 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   52C    P0    67W / 149W |      0MiB / 11439MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Connection to 54.146.144.181 closed.

That’s it, you can see the K80 from the p2.xlarge instance. I didn’t do anything about it, it was completely automated. This is Kubernetes on GPU steroids.
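If you also want to confirm that Kubernetes itself sees the GPUs, and not only the node, a minimal sketch follows. It assumes the alpha resource name used by Kubernetes 1.6 (alpha.kubernetes.io/nvidia-gpu) and a kubectl that accepts escaped dots in custom-columns paths; the helper name total_gpus is hypothetical.

```shell
# Sum the GPU column of a two-column "NAME GPUS" table read from stdin.
# Nodes without GPUs show "<none>" in that column and are skipped.
total_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}

# On a real cluster (not executed here), the table could come from:
#   kubectl get nodes \
#     -o 'custom-columns=NAME:.metadata.name,GPUS:.status.capacity.alpha\.kubernetes\.io/nvidia-gpu' \
#     | total_gpus
```

With the three p2.xlarge workers above, I would expect the total to be 3 once the kubelets have been reconfigured.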

On Bare Metal

Obviously there is a little more to do on Bare Metal, and I will refer you to my previous posts to understand how to get MAAS up and running. This assumes it is already working.

Adding the T630s to MAAS is a breeze. If you don’t change the default iDRAC username/password (root/calvin), the only thing you have to do is connect them to a network (a specific VLAN for management is preferred, of course), set the IP address, and add them to MAAS with an IPMI power type.

Adding the nodes into MAAS

Then commission the nodes as you would with any other. This time, you won’t need to press the power button like I had to with the NUC cluster: MAAS will trigger via the IPMI card directly, request a PXE boot, and register the node, all fully automagically.

Once that is done, tag them “gpu” so they are easy to recognize.

details about the T630 in MAAS

Then

juju bootstrap maas
juju deploy src/bundles/k8s-1cpu-3gpu.yaml
watch -c juju status --color

Wait for a few minutes… You will see at some point that the charm is now installing CUDA drivers. At the end,

Model    Controller  Cloud/Region  Version
default  k8s         maas          2.1.2.1

App                    Version  Status  Scale  Charm              Store       Rev  OS      Notes
easyrsa                3.0.1    active  1      easyrsa            jujucharms  8    ubuntu
etcd                   2.3.8    active  1      etcd               jujucharms  29   ubuntu
flannel                0.7.0    active  5      flannel            jujucharms  13   ubuntu
kubernetes-master      1.6.1    active  1      kubernetes-master  jujucharms  17   ubuntu  exposed
kubernetes-worker-cpu  1.6.1    active  1      kubernetes-worker  jujucharms  22   ubuntu  exposed
kubernetes-worker-gpu  1.6.1    active  3      kubernetes-worker  jujucharms  22   ubuntu  exposed

Unit                      Workload  Agent  Machine  Public address  Ports           Message
easyrsa/0*                active    idle   0/lxd/0  172.16.0.8                      Certificate Authority connected.
etcd/0*                   active    idle   0        172.16.0.4      2379/tcp        Healthy with 1 known peer
kubernetes-master/0*      active    idle   0        172.16.0.4      6443/tcp        Kubernetes master running.
  flannel/1               active    idle            172.16.0.4                      Flannel subnet 10.1.9.1/24
kubernetes-worker-cpu/0*  active    idle   1        172.16.0.5      80/tcp,443/tcp  Kubernetes worker running.
  flannel/0*              active    idle            172.16.0.5                      Flannel subnet 10.1.20.1/24
kubernetes-worker-gpu/0   active    idle   2        172.16.0.6      80/tcp,443/tcp  Kubernetes worker running.
  flannel/2               active    idle            172.16.0.6                      Flannel subnet 10.1.91.1/24
kubernetes-worker-gpu/1   active    idle   3        172.16.0.7      80/tcp,443/tcp  Kubernetes worker running.
  flannel/4               active    idle            172.16.0.7                      Flannel subnet 10.1.19.1/24
kubernetes-worker-gpu/2*  active    idle   4        172.16.0.3      80/tcp,443/tcp  Kubernetes worker running.
  flannel/3               active    idle            172.16.0.3                      Flannel subnet 10.1.15.1/24

Machine  State    DNS         Inst id              Series  AZ
0        started  172.16.0.4  br68gs               xenial  default
0/lxd/0  started  172.16.0.8  juju-5a80fa-0-lxd-0  xenial
1        started  172.16.0.5  qkrh4t               xenial  default
2        started  172.16.0.6  4y74eg               xenial  default
3        started  172.16.0.7  w3pgw7               xenial  default
4        started  172.16.0.3  se8wy7               xenial  default

Relation      Provides               Consumes               Type
certificates  easyrsa                etcd                   regular
certificates  easyrsa                kubernetes-master      regular
certificates  easyrsa                kubernetes-worker-cpu  regular
certificates  easyrsa                kubernetes-worker-gpu  regular
cluster       etcd                   etcd                   peer
etcd          etcd                   flannel                regular
etcd          etcd                   kubernetes-master      regular
cni           flannel                kubernetes-master      regular
cni           flannel                kubernetes-worker-cpu  regular
cni           flannel                kubernetes-worker-gpu  regular
cni           kubernetes-master      flannel                subordinate
kube-dns      kubernetes-master      kubernetes-worker-cpu  regular
kube-dns      kubernetes-master      kubernetes-worker-gpu  regular
cni           kubernetes-worker-cpu  flannel                subordinate
cni           kubernetes-worker-gpu  flannel                subordinate

And now:

juju ssh kubernetes-worker-gpu/0 "sudo nvidia-smi"
Tue Apr 18 06:08:35 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 0000:04:00.0     Off |                  N/A |
| 28%   37C    P0    28W / 120W |      0MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P5000        Off  | 0000:83:00.0     Off |                  Off |
|  0%   43C    P0    39W / 180W |      0MiB / 16273MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

That’s it, my 2 cards are in there: 1060GTX and P5000. Again, no user interaction. How awesome is this?
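To take the same inventory across all GPU workers without reading three tables by eye, here is a minimal sketch. It relies on nvidia-smi's CSV query mode (--query-gpu with --format=csv,noheader); the helper name summarize_gpus is hypothetical, and the loop over units 0 to 2 simply assumes the three workers shown above.

```shell
# Turn "name, memory" lines (one per GPU) read from stdin into
# "<count> x <name>, <memory>" lines, one per distinct card model.
summarize_gpus() {
  # juju ssh allocates a terminal, so lines may end in CR; strip it,
  # drop blank lines, then count identical entries.
  tr -d '\r' | awk 'NF' | sort | uniq -c |
    awk '{ count = $1; $1 = ""; sub(/^ +/, ""); print count " x " $0 }'
}

# On a real cluster (not executed here):
#   for i in 0 1 2; do
#     juju ssh "kubernetes-worker-gpu/$i" \
#       "nvidia-smi --query-gpu=name,memory.total --format=csv,noheader"
#   done | summarize_gpus
```

On my rig, this should report three of each card if all workers were detected properly.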

Note that the interesting part is not only that GPU enablement is automated, but also that the bundle files (the YAML content) are essentially the same in both environments, except for the machine constraints we set.
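To illustrate what such a difference in constraints can look like, here is a small sketch. The values are assumptions for illustration and are not copied from the bundles in the repo: on AWS a constraint can pin the instance type, while on MAAS it can select machines by the “gpu” tag set earlier. The helper name gpu_worker_constraints is hypothetical.

```shell
# Print a Juju constraint string for GPU workers on a given substrate.
# Both values are illustrative assumptions, not the bundles' actual content:
#   aws  -> pin an instance type that carries a K80
#   maas -> select machines by the "gpu" tag set in MAAS
gpu_worker_constraints() {
  case "$1" in
    aws)  echo "instance-type=p2.xlarge" ;;
    maas) echo "tags=gpu" ;;
    *)    echo "unknown substrate: $1" >&2; return 1 ;;
  esac
}

# Example (not executed here):
#   echo "constraints: \"$(gpu_worker_constraints maas)\""
```

In a bundle, a string like this would sit in the constraints key of the GPU worker application, and everything else could stay identical.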

Having some fun with GPUs

If you follow me, you know I’ve been playing with TensorFlow, so that would be a use case, but I actually wanted to have some raw fun with them! One of my readers once mentioned bitcoin mining, so I decided to go for it.

I made a quick and dirty Helm Chart for an Ethereum Miner, along with a simple rig monitoring system called ethmon.

This chart lets you configure how many nodes and how many GPUs per node you want to use. You can also tweak the miner. For now, it only works in ETH-only mode.
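For readers wondering how a “GPUs per node” setting ends up in Kubernetes, here is a minimal sketch of a pod manifest that requests GPUs through the alpha resource name used by Kubernetes 1.6. It does not reproduce the chart's actual templates: the function name gpu_pod_manifest and the image name are hypothetical, and a real workload would also need the NVIDIA driver libraries made available inside the container.

```shell
# Print a pod manifest that asks the scheduler for a number of GPUs.
#   $1 = pod name, $2 = GPU count (defaults to 1)
gpu_pod_manifest() {
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${1}
spec:
  restartPolicy: Never
  containers:
  - name: ${1}
    image: example/cuda-workload:latest  # hypothetical image
    resources:
      limits:
        alpha.kubernetes.io/nvidia-gpu: ${2:-1}
EOF
}

# Example (not executed here):
#   gpu_pod_manifest miner 2 | kubectl create -f -
```

The scheduler will only place such a pod on a node that advertises at least that many free GPUs, which is why CPU-only workers in a mixed cluster are left alone.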

Don’t forget to create a values.yaml file to:

  • add your own wallet (if you keep the default you’ll actually pay me, which is fine :) but not necessarily your purpose);
  • update the ingress xip.io endpoint to match the public IP of one of your workers, or use your own DNS;
  • adjust the number of workers and GPUs per worker.

then

cd ~
git clone https://github.com/madeden/charts.git
cd charts
helm init
helm install claymore --name claymore --values /path/to/yourvalues.yaml

By default, you’ll get 3 worker nodes with 2 GPUs each (this matches my rig at home).

KubeUI with the miners deployed
Monitoring interface (ethmon)

What did I learn from it? Well,

  • I really need to work on my per-card tuning! The P5000 and the 1060GTX show the same performance, which is also what I get from my Quadro M4000. This is not right (or there is a cap somewhere). But I’m a newbie, I’ll get better.
  • It’s probably not worth it money-wise. This cluster would make me less than $100/month, which is less than the electricity bill to run it.
  • There is a LOT of room for Monero mining on the CPU! The 6 workers together use less than one core.
  • I’ll probably update it to run fewer workers, but with all the GPUs allocated to them.
  • But it was very fun to make. And now apparently I need to do “monero”, which is supposedly ASIC-resistant and should be more profitable. Stay tuned ;)

If you’re interested you can track the evolution of my tuning.

Conclusion

Three months ago, I admit, running Kubernetes with GPUs wasn’t a trivial job. It was possible, but you needed to really want it.

Today, if you are looking for CUDA workloads, I challenge you to find anything easier than the Canonical Distribution of Kubernetes to run that, on Bare Metal or in the cloud. It is literally so trivial to make it work that it’s boring. Exactly what you want from infrastructure.

GPUs are the new normal. Get used to it.

So, let me know of your use cases, and I will put this cluster to work on something a little more useful for mankind than a couple of ETH!

I am always happy to do some skunk work, and if you combine GPUs and Kubernetes, you’ll just be targeting my 2 favorite things in the compute world. Shoot me a message @SaMnCo_23!
