VMWare vSphere, Kubernetes… And GPUs of course

Samuel Cozannet
HackerNoon.com
8 min read · May 17, 2017


What I am about to say may seem obvious, but a LOT of people out there are using VMWare vSphere to virtualize all kinds of workloads. Of course, that means I get a LOT of questions about the integration of the Canonical Distribution of Kubernetes with vSphere.

Until very recently, to be fair, it was not so easy. You could do it, but you had to spend time on manual tweaks here and there, adjusting hostnames on each VM… Most of the road bumps came down to a simple thing: VMWare does not support cloud-init, the de-facto standard for bootstrapping VMs in pretty much every other cloud solution.

The team has spent a fair amount of time improving the UX of Juju for vSphere, and I am pleased to say that it now works pretty well, including activating GPUs (what else?)!!!

Let’s see what the UX looks like now!

Requirements

To reproduce this post, you’ll need:

  • Basic understanding of the Canonical toolbox: Ubuntu and Juju;
  • Basic understanding of Kubernetes;
  • a VMWare vSphere cluster that can access the Internet (at least through a proxy) and has at least one public (routable) network for the VMs, with working DNS for all created nodes (or you’ll have some edits to make in /etc/hosts);
  • the will to live on the edge: Juju 2.2rc1

and, for the files, a clone of the repo:
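As a sketch (the repository URL below is a placeholder; use the companion repo for this post):

    git clone <companion-repo-url> blogpost
    cd blogpost
    ls src/    # the Kubernetes bundle used later in this post lives in this folder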

vSphere setup

I am no expert in VMWare, so I didn’t change anything from the default setup:

  • Installed ESXi 6.5 from the latest ISO on 3 Dell T630 servers with 12 cores / 32GB RAM each;
  • Installed the vCenter Appliance on the first host;
  • For each host, I activated GPU passthrough using this guide;
  • Then I created a datacenter in the vCenter, which I called “Region1”

That’s it: everything else is the default setup; I didn’t touch networking or storage. If you have a specific setup, I’d be happy to talk and review production systems to validate the integration.

Juju experience

Connecting to vSphere

Once you have vSphere installed, you need to let Juju know about it:
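A minimal sketch of the cloud definition, assuming a vCenter reachable at 192.168.1.10 and the “Region1” datacenter created above (adjust both to your environment):

    # vsphere-cloud.yaml
    clouds:
      vsphere-dc:
        type: vsphere
        auth-types: [userpass]
        endpoint: 192.168.1.10        # vCenter IP or hostname (example value)
        regions:
          Region1:                    # must match the datacenter name in vCenter
            endpoint: 192.168.1.10

    juju add-cloud vsphere-dc vsphere-cloud.yaml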

Now you need to configure the credentials for this cloud:
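The interactive command prompts for a credential name plus the vCenter username and password (the cloud name matches the sketch above):

    juju add-credential vsphere-dc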

Bootstrapping

A classic by now, the bootstrap command for Juju:
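Assuming the cloud and region names from the sketches above:

    juju bootstrap vsphere-dc/Region1 vsphere-controller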

I prepared a small bundle in the src folder, which you can install with:
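The exact bundle file isn’t shown here; assuming it sits in the src folder of the cloned repo, the deployment is a one-liner:

    juju deploy ./src/k8s-vsphere.yaml    # bundle file name is an assumption; check the repo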

then you can wait for the model to converge to a stable state:
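A convenient way to watch it converge:

    watch -c juju status --color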

juju status fully converged

In vSphere, this will translate into something like:

vSphere UI after bootstrap and deployment

Then you can download the credentials and query the cluster:
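For CDK, the kubeconfig is exposed by the master unit; something along these lines (the unit number may differ in your model):

    mkdir -p ~/.kube
    juju scp kubernetes-master/0:config ~/.kube/config
    kubectl get nodes
    kubectl cluster-info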

OK!! You now have a Kubernetes cluster up & running on VMWare vSphere. Wasn’t too complicated, was it? Should we say it was boring?

Adding GPUs

On vSphere

OK, now for the cool stuff. Using the same guide as before, add GPUs to the VMs running the Kubernetes workers.

You’ll first need to stop them, then add the PCI device, and restart them.

In Kubernetes

At this point, Juju should discover the nVidia board and install the CUDA drivers all by itself. For some reason it did not in my case, and we are investigating.

But we won’t stop at a small glitch. Let’s install the drivers manually, which will also give me the occasion to answer questions I’ve received about managing CDK now that the control plane has been fully snapped.

Google has this simple script to install the drivers:
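That script isn’t reproduced here; as a rough sketch of what such an installer does on Ubuntu 16.04 (the repo URL and package versions are assumptions to check against NVIDIA’s current instructions):

    #!/bin/bash
    # install-drivers.sh — sketch of a CUDA 8 driver install on Ubuntu 16.04
    set -e
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
    sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
    sudo apt-get update
    sudo apt-get install -y cuda-drivers
    sudo reboot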

Just run it on the two workers (making use of “juju scp” and “juju ssh” if needed).
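Something like this, assuming the worker units are kubernetes-worker/0 and kubernetes-worker/1 in your model, and install-drivers.sh is the script sketched above:

    for unit in kubernetes-worker/0 kubernetes-worker/1; do
      juju scp install-drivers.sh $unit:
      juju ssh $unit "chmod +x install-drivers.sh && sudo ./install-drivers.sh"
    done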

Now on each worker, you need to activate a couple of flags. There is a new procedure to do so as GPUs are now “Accelerators” in K8s:
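On Kubernetes 1.6 that means turning on the Accelerators feature gate on the kubelet. A sketch, assuming the kubelet snap exposes its flags as snap configuration keys (check snap get kubelet first) and runs as a single daemon:

    # on each worker
    sudo snap set kubelet feature-gates=Accelerators=true
    sudo systemctl restart snap.kubelet.daemon   # service name assumes the CDK snap layout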

Note: If you read my previous posts, maybe you remember we were editing files with sed, awk and other nice text edition foo. Now it is a set command for the snap. It DOES matter. Suddenly as the admin, you’re not in charge of managing idempotency of your code as you delegate that to snapd. It is a game changer and makes things a lot more trivial than they used to. And that is even without mentioning the new upgrade path made soooo simple.

Now on the master:
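Same idea for the control-plane snaps, under the same assumption about the snap keys and service names:

    sudo snap set kube-apiserver feature-gates=Accelerators=true
    sudo snap set kube-controller-manager feature-gates=Accelerators=true
    sudo snap set kube-scheduler feature-gates=Accelerators=true
    sudo systemctl restart snap.kube-apiserver.daemon \
      snap.kube-controller-manager.daemon snap.kube-scheduler.daemon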

OK, you’re good to go: you now have GPUs activated in K8s.

Testing the results

A classic to start with:
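A sketch of an nvidia-smi pod requesting one GPU via the alpha resource name used by the Accelerators feature gate; the hostPath and driver version are assumptions that depend on how the drivers were installed on the workers:

    # gpu-test.yaml
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-smoke-test
    spec:
      restartPolicy: Never
      containers:
      - name: nvidia-smi
        image: nvidia/cuda:8.0-runtime-ubuntu16.04
        command: ["/usr/local/nvidia/bin/nvidia-smi"]
        resources:
          limits:
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        - name: nvidia
          mountPath: /usr/local/nvidia
      volumes:
      - name: nvidia
        hostPath:
          path: /usr/lib/nvidia-375    # adjust to the driver version on the host

    kubectl create -f gpu-test.yaml
    kubectl logs -f gpu-smoke-test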

There you go, it just works :)

nVidia P5000, passthrough in vSphere

More cryptocurrencies?

I noticed you guys LOVE cryptocurrencies, so I wrote a new chart for Minergate, which you’ll find at https://github.com/madeden/charts.

It’s not the fastest miner ever, but it’s OK, and it can do CUDA mining out of the box, making it very cool for testing.

Have a look at the config file, create an account on https://minergate.com and start playing:
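For example (the chart directory name inside the repo is an assumption; check the repo layout):

    git clone https://github.com/madeden/charts.git
    cd charts
    cat minergate/values.yaml    # defaults described below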

You can configure the following values (an example install overriding some of them follows this list):

  • clusterName: (ob) just a way to run the same chart several times and still identify workers easily
  • pool: (minergate) also a differentiator
  • coin: (-qcn) the cryptocurrency you want to mine, with a “-” before it
  • nodes: (2) how many nodes in the cluster will host miners
  • workersPerNode: (1) how many workers you want to deploy per node
  • cpusPerWorker: (1) how many CPU cores you want to allocate per miner
  • gpuComplexity: (4) if using GPUs, how much stress to put on the GPU; use 0 for CPU-only mining
  • username: (samnco@gmail.com) your Minergate ID
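With Helm 2.x, and assuming Tiller is already deployed in the cluster, an install overriding a few of those values could look like this (release name and values are illustrative):

    helm install minergate --name miner \
      --set username=you@example.com \
      --set coin=-qcn \
      --set nodes=2 \
      --set workersPerNode=1 \
      --set gpuComplexity=4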

The miner logs should show it connecting to the pool and submitting shares:
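Assuming the pod names include the release name used above (check the chart templates if not):

    kubectl get pods
    kubectl logs -f $(kubectl get pods -o name | grep miner | head -1)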

Enjoy! Of course this is Helm, so you are only limited by Kubernetes, not by being on VMWare or any other substrate.

More seriously, any GPU workload you have (deep learning, physics computation, password cracking, video transcoding…) will be drastically accelerated by such a setup.

That’s right, we have Kubernetes AND GPUs

Conclusion

The Juju experience on VMWare has drastically improved over the last few weeks. It is now particularly easy to operate big software on vSphere.

The Canonical Distribution of Kubernetes is one example, but Spicule, a long-time partner of Canonical, does Big Data consulting and Pentaho integration with it, and can now leverage VMWare as a target as well.

It is also good to know that MAAS can integrate VMWare as a “bare metal layer”: you can essentially register VMs from VMWare in MAAS and use it to start and stop them.

We’re about to complete our tour of activating nVidia GPUs on all clouds, bare metal and so on. Next stop: Microsoft Azure, and the loop will be closed.

Any questions? I am @SaMnCo_23 on Twitter, #SaMnCo on Freenode and GitHub. Feel free to ping me!

And of course, if you liked this, found it useful, or just want to help, click the little heart! Thanks for reading!
