How we commoditized GPUs for Kubernetes

Published in Intuition Machine · 11 min read · Apr 18, 2017

[Edit] A careful reader (thank you, HN user puzzle) pointed out that running in privileged mode is no longer required to access GPUs in K8s. I have therefore removed the note that said so, and I am updating my Helm charts to drop that requirement.

Over the last 4 months I have blogged 4 times about the enablement of GPUs in Kubernetes. Each time I did so, I spent several days building and destroying clusters until it was just right, making the experience as fluid as possible for adventurous readers.

It was not the easiest task: the environments were different (cloud, bare metal) and so was the hardware (g2.2xlarge instances have older GRID K520s, p2 instances have K80s, and at home I had 1060GTX cards on consumer-grade Intel NUCs…). As a result, I also spent several hours helping people set up their clusters. Usually with success, but I must admit some environments were challenging.

Thankfully the team at Canonical in charge of developing the Canonical Distribution of Kubernetes have productized GPU integration and made it so easy to use that it would just be a shame not to talk about it.

And since good news never comes alone, I was lucky enough to be allocated 3 brand new, production-grade Pascal Quadro P5000 cards by our nVidia friends.

I could have installed these in my playful rig to replace the 1060GTX boards. But this would have shown little gratitude for the exceptional gift I received from nVidia. Instead, I decided to go for a full-blown “production grade” bare metal cluster, which will allow me to replicate most of the environments customers and partners have. I chose 3x Dell T630 servers, which can be GPU enabled and are very capable machines. I received them a couple of weeks ago, and…

Please don’t mind the cables, I don’t have a rack…

There we are! Ready for some awesomeness?

What it was in the past

If you remember the other posts, the sequence was:

Deploy a “normal” K8s cluster with Juju;
Add a CUDA charm and relate it to the right group of Kubernetes workers;
Connect to each node, activate privileged containers, add the experimental-nvidia-gpu flag to the kubelet, and restart the kubelet;
Connect to the API server, add the experimental-nvidia-gpu flag, and restart the API server;
Test that the drivers were installed OK and made available in k8s with Juju and Kubernetes commands.

Overall, on top of the Kubernetes installation itself, and even with all the scripting in the world, GPU enablement cost no less than 30 to 45 minutes of extra maintenance.

It is better than having no GPUs, but it is often too much for the operators of the clusters who want an instant solution.

How is it now?

I am happy to say that the requests of the community have been heard loud and clear.

As of Kubernetes 1.6.1, and the matching GA release of the Canonical Distribution of Kubernetes, the new experience is:

Deploy a normal K8s cluster with Juju

Yes, you read that right: a single command deploys a GPU-enabled Kubernetes cluster.

Since 1.6.1, the charms will now:

Watch for GPU availability every 5 minutes. For clouds like GCE, where GPUs can be added on the fly to instances, this makes sure that no GPU will ever be forgotten;
If one or more GPUs are detected on a worker, install the latest and greatest CUDA drivers on the node, then reconfigure and restart the kubelet automagically;
Have the worker communicate its new state to the master, which will in turn reconfigure the API server and accept GPU workloads;
In case you have a mixed cluster with some nodes with GPUs and others without, only the right nodes will attempt to install CUDA and accept privileged containers.
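To make the detection step concrete, here is a minimal sketch of how a node could decide whether it needs CUDA at all. This is not the charm's actual code: it assumes the check is done by scanning `lspci -nn` output for NVIDIA's PCI vendor ID (10de), and the function names and sample lines are hypothetical.

```python
import re
from typing import List

# Assumption: GPUs are spotted from `lspci -nn` output. PCI vendor ID 10de is
# NVIDIA; class codes 0300 (VGA) and 0302 (3D controller) cover display and
# compute cards, and exclude the HDMI audio function (class 0403).
NVIDIA_GPU = re.compile(r"\[(0300|0302)\]:.*\[10de:[0-9a-f]{4}\]", re.IGNORECASE)


def nvidia_gpus(lspci_output: str) -> List[str]:
    """Return the lspci lines that describe NVIDIA GPUs."""
    return [line for line in lspci_output.splitlines() if NVIDIA_GPU.search(line)]


def needs_cuda(lspci_output: str) -> bool:
    """A worker only needs the CUDA install when at least one GPU is present."""
    return len(nvidia_gpus(lspci_output)) > 0


# Illustrative sample lines, not captured from the clusters in this post.
SAMPLE = """\
00:02.0 VGA compatible controller [0300]: Cirrus Logic GD 5446 [1013:00b8]
00:1e.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
"""

print(needs_cuda(SAMPLE))
```

Run on a timer, a check like this would also cover the mixed-cluster case: nodes without a matching device simply never trigger the install.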

You don’t believe me? Fair enough. Watch me…

Requirements

For the following, you’ll need:

Basic understanding of the Canonical toolbox: Ubuntu, Juju, MAAS…
Basic understanding of Kubernetes
A little bit of Helm at the end

and for the files, cloning the repo:

git clone https://github.com/madeden/blogposts
cd blogposts/k8s-ethereum

Putting it to the test

In the cloud

Deploying in the cloud is trivial. Once Juju is installed and your credentials are added,

juju bootstrap aws/us-east-1 
juju deploy src/bundles/k8s-1cpu-3gpu-aws.yaml
watch -c juju status --color

Now wait…

Model    Controller     Cloud/Region   Version
default  aws-us-east-1  aws/us-east-1  2.2-beta2

App                    Version  Status       Scale  Charm              Store       Rev  OS      Notes
easyrsa                3.0.1    active           1  easyrsa            jujucharms    8  ubuntu
etcd                   2.3.8    active           1  etcd               jujucharms   29  ubuntu
flannel                0.7.0    active           2  flannel            jujucharms   13  ubuntu
kubernetes-master      1.6.1    waiting          1  kubernetes-master  jujucharms   17  ubuntu  exposed
kubernetes-worker-cpu  1.6.1    active           1  kubernetes-worker  jujucharms   22  ubuntu  exposed
kubernetes-worker-gpu           maintenance      3  kubernetes-worker  jujucharms   22  ubuntu  exposed

Unit                      Workload     Agent      Machine  Public address  Ports           Message
easyrsa/0*                active       idle       0/lxd/0  10.0.201.114                    Certificate Authority connected.
etcd/0*                   active       idle       0        52.91.177.229   2379/tcp        Healthy with 1 known peer
kubernetes-master/0*      waiting      idle       0        52.91.177.229   6443/tcp        Waiting for kube-system pods to start
  flannel/0*              active       idle                52.91.177.229                   Flannel subnet 10.1.4.1/24
kubernetes-worker-cpu/0*  active       idle       1        34.207.180.182  80/tcp,443/tcp  Kubernetes worker running.
  flannel/1               active       idle                34.207.180.182                  Flannel subnet 10.1.29.1/24
kubernetes-worker-gpu/0   maintenance  executing  2        54.146.144.181                  (install) Installing CUDA
kubernetes-worker-gpu/1   maintenance  executing  3        54.211.83.217                   (install) Installing CUDA
kubernetes-worker-gpu/2*  maintenance  executing  4        54.237.248.219                  (install) Installing CUDA

Machine  State    DNS             Inst id              Series  AZ          Message
0        started  52.91.177.229   i-0d71d98b872d201f5  xenial  us-east-1a  running
0/lxd/0  started  10.0.201.114    juju-29e858-0-lxd-0  xenial              Container started
1        started  34.207.180.182  i-04f2b75f3ab88f842  xenial  us-east-1a  running
2        started  54.146.144.181  i-0113e8a722778330c  xenial  us-east-1a  running
3        started  54.211.83.217   i-07c8c81f5e4cad6be  xenial  us-east-1a  running
4        started  54.237.248.219  i-00ae437291c88210f  xenial  us-east-1a  running

Relation      Provides               Consumes               Type
certificates  easyrsa                etcd                   regular
certificates  easyrsa                kubernetes-master      regular
certificates  easyrsa                kubernetes-worker-cpu  regular
certificates  easyrsa                kubernetes-worker-gpu  regular
cluster       etcd                   etcd                   peer
etcd          etcd                   flannel                regular
etcd          etcd                   kubernetes-master      regular
cni           flannel                kubernetes-master      regular
cni           flannel                kubernetes-worker-cpu  regular
cni           flannel                kubernetes-worker-gpu  regular
cni           kubernetes-master      flannel                subordinate
kube-dns      kubernetes-master      kubernetes-worker-cpu  regular
kube-dns      kubernetes-master      kubernetes-worker-gpu  regular
cni           kubernetes-worker-cpu  flannel                subordinate
cni           kubernetes-worker-gpu  flannel                subordinate
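If you would rather script the wait than watch the screen, a minimal sketch could look like the following. It assumes the layout printed by `juju status --format=json` (an `applications` map whose units carry a `workload-status` with a `current` field); the helper names and the trimmed sample are hypothetical.

```python
import json
from typing import Dict


def workload_states(status_json: str) -> Dict[str, str]:
    """Map each principal unit name to its current workload status."""
    status = json.loads(status_json)
    states = {}
    for app in status.get("applications", {}).values():
        # Subordinate applications such as flannel have no "units" key here.
        for unit_name, unit in app.get("units", {}).items():
            states[unit_name] = unit.get("workload-status", {}).get("current", "unknown")
    return states


def cluster_ready(status_json: str) -> bool:
    """True once every unit reports 'active' (and there is at least one unit)."""
    states = workload_states(status_json)
    return bool(states) and all(state == "active" for state in states.values())


# Trimmed, hand-written sample that mirrors the status shown above.
SAMPLE = json.dumps({
    "applications": {
        "kubernetes-master": {
            "units": {"kubernetes-master/0": {"workload-status": {"current": "waiting"}}}
        },
        "kubernetes-worker-gpu": {
            "units": {"kubernetes-worker-gpu/0": {"workload-status": {"current": "maintenance"}}}
        },
        "flannel": {"subordinate-to": ["kubernetes-master", "kubernetes-worker-gpu"]},
    }
})

print(workload_states(SAMPLE))
print(cluster_ready(SAMPLE))
```

Feeding it the real output of `juju status --format=json` in a loop with a sleep would give a simple "block until CUDA is installed" gate.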

Same same, but same. I was able to capture the moment where it is installing CUDA so you can see it… When it’s done:

juju ssh kubernetes-worker-gpu/0 "sudo nvidia-smi"
Tue Apr 18 08:50:23 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   52C    P0    67W / 149W |      0MiB / 11439MiB |     98%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Connection to 54.146.144.181 closed.

That’s it, you can see the K80 from the p2.xlarge instance. I didn’t have to do anything: it was completely automated. This is Kubernetes on GPU steroids.

On Bare Metal

Obviously there is a little more to do on Bare Metal, and I will refer you to my previous posts to understand how to get MAAS up and running. This assumes it is already working.

Adding the T630 to MAAS is a breeze. If you don’t change the default iDRAC username and password (root/calvin), the only thing you have to do is connect the servers to a network (a specific VLAN for management is preferred of course), set the IP address, and add them to MAAS with an IPMI power type.

Then commission the nodes as you would any other. This time, you won’t need to press the power button like I had to with the NUC cluster: MAAS will power the node on through the IPMI card directly, request a PXE boot, and register the node, all fully automagically.

Once that is done, tag the nodes “gpu” so they are easy to recognize and target.

Then

juju bootstrap maas
juju deploy src/bundles/k8s-1cpu-3gpu.yaml
watch -c juju status --color

Wait for a few minutes… You will see at some point that the charm is now installing CUDA drivers. At the end,

Model    Controller  Cloud/Region  Version
default  k8s         maas          2.1.2.1

App                    Version  Status  Scale  Charm              Store       Rev  OS      Notes
easyrsa                3.0.1    active      1  easyrsa            jujucharms    8  ubuntu
etcd                   2.3.8    active      1  etcd               jujucharms   29  ubuntu
flannel                0.7.0    active      5  flannel            jujucharms   13  ubuntu
kubernetes-master      1.6.1    active      1  kubernetes-master  jujucharms   17  ubuntu  exposed
kubernetes-worker-cpu  1.6.1    active      1  kubernetes-worker  jujucharms   22  ubuntu  exposed
kubernetes-worker-gpu  1.6.1    active      3  kubernetes-worker  jujucharms   22  ubuntu  exposed

Unit                      Workload  Agent  Machine  Public address  Ports           Message
easyrsa/0*                active    idle   0/lxd/0  172.16.0.8                      Certificate Authority connected.
etcd/0*                   active    idle   0        172.16.0.4      2379/tcp        Healthy with 1 known peer
kubernetes-master/0*      active    idle   0        172.16.0.4      6443/tcp        Kubernetes master running.
  flannel/1               active    idle            172.16.0.4                      Flannel subnet 10.1.9.1/24
kubernetes-worker-cpu/0*  active    idle   1        172.16.0.5      80/tcp,443/tcp  Kubernetes worker running.
  flannel/0*              active    idle            172.16.0.5                      Flannel subnet 10.1.20.1/24
kubernetes-worker-gpu/0   active    idle   2        172.16.0.6      80/tcp,443/tcp  Kubernetes worker running.
  flannel/2               active    idle            172.16.0.6                      Flannel subnet 10.1.91.1/24
kubernetes-worker-gpu/1   active    idle   3        172.16.0.7      80/tcp,443/tcp  Kubernetes worker running.
  flannel/4               active    idle            172.16.0.7                      Flannel subnet 10.1.19.1/24
kubernetes-worker-gpu/2*  active    idle   4        172.16.0.3      80/tcp,443/tcp  Kubernetes worker running.
  flannel/3               active    idle            172.16.0.3                      Flannel subnet 10.1.15.1/24

Machine  State    DNS         Inst id              Series  AZ
0        started  172.16.0.4  br68gs               xenial  default
0/lxd/0  started  172.16.0.8  juju-5a80fa-0-lxd-0  xenial
1        started  172.16.0.5  qkrh4t               xenial  default
2        started  172.16.0.6  4y74eg               xenial  default
3        started  172.16.0.7  w3pgw7               xenial  default
4        started  172.16.0.3  se8wy7               xenial  default

Relation      Provides               Consumes               Type
certificates  easyrsa                etcd                   regular
certificates  easyrsa                kubernetes-master      regular
certificates  easyrsa                kubernetes-worker-cpu  regular
certificates  easyrsa                kubernetes-worker-gpu  regular
cluster       etcd                   etcd                   peer
etcd          etcd                   flannel                regular
etcd          etcd                   kubernetes-master      regular
cni           flannel                kubernetes-master      regular
cni           flannel                kubernetes-worker-cpu  regular
cni           flannel                kubernetes-worker-gpu  regular
cni           kubernetes-master      flannel                subordinate
kube-dns      kubernetes-master      kubernetes-worker-cpu  regular
kube-dns      kubernetes-master      kubernetes-worker-gpu  regular
cni           kubernetes-worker-cpu  flannel                subordinate
cni           kubernetes-worker-gpu  flannel                subordinate

And now:

juju ssh kubernetes-worker-gpu/0 "sudo nvidia-smi"
Tue Apr 18 06:08:35 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 0000:04:00.0     Off |                  N/A |
| 28%   37C    P0    28W / 120W |      0MiB /  6072MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P5000        Off  | 0000:83:00.0     Off |                  Off |
|  0%   43C    P0    39W / 180W |      0MiB / 16273MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

That’s it, my 2 cards are in there: 1060GTX and P5000. Again, no user interaction. How awesome is this?
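For anything beyond eyeballing the table, `nvidia-smi` can also print machine-readable output with `--query-gpu=index,name,memory.total --format=csv,noheader,nounits`. Here is a minimal sketch for parsing it; the `Gpu` type and the sample text are hypothetical (the memory figures are taken from the table above, and the full card name is a guess since the table truncates it).

```python
import csv
from typing import List, NamedTuple


class Gpu(NamedTuple):
    index: int
    name: str
    memory_mib: int


def parse_gpus(csv_text: str) -> List[Gpu]:
    """Parse `index, name, memory.total` rows printed without header or units."""
    rows = csv.reader(csv_text.strip().splitlines(), skipinitialspace=True)
    return [Gpu(int(index), name.strip(), int(memory)) for index, name, memory in rows]


# Illustrative sample shaped like the CSV output, not a captured log.
SAMPLE = """\
0, GeForce GTX 1060 6GB, 6072
1, Quadro P5000, 16273
"""

for gpu in parse_gpus(SAMPLE):
    print(gpu.index, gpu.name, gpu.memory_mib)
```

On a real worker, the text would come from `juju ssh kubernetes-worker-gpu/0` running that query, which makes it easy to assert that every expected card showed up.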

Note that the interesting part is not only that GPU enablement is automated, but also that the bundle files (the yaml content) are essentially the same for cloud and bare metal, apart from the machine constraints we set.
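To illustrate the point, here is a minimal sketch that models the GPU worker entry of both bundles as plain dictionaries and compares them. The field values are assumptions rather than a copy of the files in the repo: the charm URL is a guess, and the constraints use the instance type and the MAAS tag mentioned in this post.

```python
from typing import Dict, List


def gpu_worker(constraints: str) -> Dict[str, object]:
    """Hypothetical bundle entry for the GPU workers; only constraints vary."""
    return {
        "charm": "cs:~containers/kubernetes-worker",  # assumed charm URL
        "num_units": 3,
        "expose": True,
        "constraints": constraints,
    }


def differing_keys(a: Dict[str, object], b: Dict[str, object]) -> List[str]:
    """List the keys whose values differ between two bundle entries."""
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


aws_worker = gpu_worker("instance-type=p2.xlarge")  # cloud: pick a GPU instance
maas_worker = gpu_worker("tags=gpu")                # bare metal: pick tagged nodes

print(differing_keys(aws_worker, maas_worker))
```

Everything else, from the charm to the relations, stays identical, which is what makes the same deployment portable across substrates.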

Having some fun with GPUs

If you follow me you know I’ve been playing with TensorFlow, so that would be an obvious use case, but I actually wanted to have some raw fun with these cards! One of my readers mentioned bitcoin mining once, so I decided to go for it.

I made a quick and dirty Helm Chart for an Ethereum Miner, along with a simple rig monitoring system called ethmon.

This chart lets you configure how many nodes, and how many GPUs per node, you want to use. You can also tweak the miner. For now, it only works in ETH-only mode.

Don’t forget to create a values.yaml file to

add your own wallet (if you keep the default you’ll actually pay me, which is fine :) but not necessarily your purpose),
update the ingress xip.io endpoint to match the public IP of one of your workers, or use your own DNS,
adjust the number of workers and GPUs per worker
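For the ingress endpoint, xip.io resolves any name of the form `<something>.<ip>.xip.io` to that IP, so the hostname can be derived from a worker's public address. A minimal sketch follows; the function name and the `ethmon` prefix are hypothetical, and the chart's actual value keys are not shown here.

```python
import ipaddress


def xip_host(name: str, public_ip: str) -> str:
    """Build a wildcard-DNS hostname that resolves to public_ip via xip.io."""
    ip = ipaddress.IPv4Address(public_ip)  # raises ValueError on a malformed address
    return f"{name}.{ip}.xip.io"


# Public address of kubernetes-worker-cpu/0 in the AWS status shown earlier.
print(xip_host("ethmon", "34.207.180.182"))
```

The resulting string is what goes into the ingress host field of your values.yaml.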

then

cd ~
git clone https://github.com/madeden/charts.git
cd charts
helm init
helm install claymore --name claymore --values /path/to/yourvalues.yaml

By default, you’ll get 3 worker nodes with 2 GPUs each (this matches my rig at home).

What did I learn from it? Well,

I really need to work on my tuning per card here! The P5000 and the 1060GTX perform the same, and they also match my Quadro M4000. This is not right (or there is a cap somewhere). But I’m a newbie, I’ll get better.
It’s probably not worth it money-wise. This cluster would make me less than $100/month, which is less than the electricity bill to run it.
There is a LOT of room for Monero mining on the CPU! I run at less than a core for the 6 workers.
I’ll probably update it to run fewer workers, but with all the GPUs allocated to them.
But it was very fun to make. And now apparently I need to do “monero”, which is supposedly ASIC resistant and should be more profitable. Stay tuned ;)
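The money question boils down to simple arithmetic. Here is a back-of-the-envelope sketch with loudly hypothetical inputs: it assumes all three nodes look like worker 0 (GPU power caps of 120 W and 180 W as reported by nvidia-smi), run at the cap around the clock, and that electricity costs 0.20 $/kWh. It ignores the power drawn by the rest of each server, so the real bill would be higher.

```python
def monthly_energy_cost(watts: float, price_per_kwh: float, hours: float = 24 * 30) -> float:
    """Cost of running a constant load for a month (30 days by default)."""
    return watts / 1000.0 * hours * price_per_kwh


def monthly_margin(revenue: float, watts: float, price_per_kwh: float) -> float:
    """Mining revenue minus the electricity needed to earn it."""
    return revenue - monthly_energy_cost(watts, price_per_kwh)


# Hypothetical inputs: 3 nodes x (120 W + 180 W) GPU power caps,
# 0.20 $/kWh, and the 100 $/month upper bound on revenue quoted above.
GPU_WATTS = 3 * (120 + 180)
print(round(monthly_margin(100.0, GPU_WATTS, 0.20), 2))
```

Plug in your own tariff and measured draw before drawing any conclusion; the point is only that the margin goes negative quickly.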

If you’re interested you can track the evolution of my tuning.

Conclusion

I’ll admit that 3 months ago, running Kubernetes with GPUs wasn’t a trivial job. It was possible, but you needed to really want it.

Today, if you are looking for CUDA workloads, I challenge you to find anything easier than the Canonical Distribution of Kubernetes to run that, on Bare Metal or in the cloud. It is literally so trivial to make it work that it’s boring. Exactly what you want from infrastructure.

GPUs are the new normal. Get used to it.

So, let me know of your use cases, and I will put this cluster to work on something a little more useful for mankind than a couple of ETH!

I am always happy to do some skunk work, and if you combine GPUs and Kubernetes, you’ll just be targeting my 2 favorite things in the compute world. Shoot me a message @SaMnCo_23!