Creating a Kubernetes Cluster with GPU Support on Azure for Machine Learning

In this article, we are going to see how to create a Kubernetes cluster suited for Machine Learning. Our cluster should have the following extra capabilities:

  • Support for GPUs
  • Automated NVIDIA driver installation

In a follow-up article, we are also going to talk about autoscaling the cluster.

Creating a Kubernetes cluster using acs-engine

First, we need to create a Kubernetes cluster that supports GPUs.
We will use acs-engine, a tool that will generate the ARM template we need to deploy our cluster with everything already configured.

One small caveat: the NVIDIA drivers are not automatically installed for you with acs-engine (associated issue here), so in the meantime we are going to use my fork of acs-engine, which has everything we need and makes our life much easier, especially if you intend to use autoscaling or provision a large number of VMs.

First, clone the repo and make sure you are on the k8s-gpu branch (that should be the default branch):

> git clone https://github.com/wbuchwalter/acs-engine
> cd acs-engine
> git checkout k8s-gpu

Now we need to specify what we want our cluster to look like. Edit examples/kubernetes.json and fill in the different parameters.

The interesting part is the agentPoolProfiles section.
There you can define a number of different pools. Each pool can have a different VM size, and can scale up to 100 nodes. 
You should define a separate pool for every different VM size you intend to use.

For example, if you plan to use your cluster both for training with GPUs and inference with CPUs only, you should specify at least two pools, since you don't want to pay for a GPU you are not using.

The number of agents isn’t really important because we are going to enable autoscaling, so you can leave every count at 1.

At the time of this writing, Azure has 6 different VM sizes with GPU support; you can see the details here.

Here is what your kubernetes.json should approximately look like:
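The example file was embedded in the original post; here is a sketch of what a two-pool apiModel could look like. The pool names, VM sizes, and placeholder values are illustrative, and the exact schema may differ between acs-engine versions, so check the acs-engine documentation before deploying:

```json
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "mygpucluster",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 1,
        "vmSize": "Standard_NC6",
        "availabilityProfile": "AvailabilitySet"
      },
      {
        "name": "agentpool2",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          { "keyData": "<your ssh public key>" }
        ]
      }
    },
    "servicePrincipalProfile": {
      "servicePrincipalClientID": "<your service principal app id>",
      "servicePrincipalClientSecret": "<your service principal password>"
    }
  }
}
```

Note that agentpool1 uses a GPU size (NC6) for training while agentpool2 uses a CPU-only size for inference, matching the two-pool setup described above.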

Let’s generate the ARM template; you’ll need Docker installed on your machine. Check out the official documentation to understand how acs-engine works.

> ./scripts/devenv.sh
> make prereqs
> make build
> ./acs-engine generate examples/kubernetes.json

This should have generated a bunch of files under the _output/mygpucluster directory, including the ARM template and parameters that we want.

To deploy them, you can create a new template deployment in the Azure portal and copy-paste azuredeploy.json and azuredeploy.parameters.json.
Make sure you choose a region that has N-Series VMs available! South Central US is one of them.
Or use Azure CLI:

> cd _output/mygpucluster
> az group create --location southcentralus --name mygpucluster
> az group deployment create --template-file azuredeploy.json --parameters @azuredeploy.parameters.json --resource-group mygpucluster

This should take between 5 and 10 minutes to deploy.
Do not delete the generated azuredeploy.json and azuredeploy.parameters.json, as you will need them later if you want to set up autoscaling.

Once the deployment is completed, grab the Kubernetes config file of your cluster to be able to use kubectl locally.

> scp azureuser@<dnsname>.<regionname>.cloudapp.azure.com:.kube/config ~/.kube/config

If you don’t have kubectl installed, now is the time: Installing and Setting Up kubectl

Testing the cluster

Let’s check our new cluster:

> kubectl get nodes
NAME                        STATUS                     AGE
k8s-agentpool1-19661165-0   Ready                      1m
k8s-agentpool2-19661165-0   Ready                      23s
k8s-master-19661165-0       Ready,SchedulingDisabled   1m

One master and two agents, one for each pool: exactly what we requested.
Let’s describe one of our agents:

> kubectl describe node k8s-agentpool1-19661165-0
[...]
Capacity:
  alpha.kubernetes.io/nvidia-gpu:  1
  cpu:                             6
  memory:                          57703024Ki
  pods:                            110
[...]

We can see that the drivers have been installed correctly, since Kubernetes was able to find our GPU device.
Let’s run nvidia-smi to prove that GPU support is working correctly:

Download this Kubernetes template somewhere and deploy it on your cluster (don’t worry about the details of the .yaml file, we will get into this later on):
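The template itself was embedded in the original post; a minimal sketch of what such a manifest could look like follows. The image, mount targets, and LD_LIBRARY_PATH value are assumptions; the hostPath sources are the /usr/bin and /usr/lib/x86_64-linux-gnu directories where the drivers live on the Ubuntu 16.04 hosts:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nvidia-smi
spec:
  template:
    metadata:
      name: nvidia-smi
    spec:
      restartPolicy: Never
      containers:
        - name: nvidia-smi
          image: ubuntu:16.04
          # nvidia-smi comes from the host's /usr/bin, mounted below
          command: ["/usr/local/nvidia/bin/nvidia-smi"]
          env:
            - name: LD_LIBRARY_PATH
              value: /usr/local/nvidia/lib64
          resources:
            limits:
              alpha.kubernetes.io/nvidia-gpu: 1
          volumeMounts:
            - name: nvidia-bin
              mountPath: /usr/local/nvidia/bin
            - name: nvidia-lib
              mountPath: /usr/local/nvidia/lib64
      volumes:
        - name: nvidia-bin
          hostPath:
            path: /usr/bin
        - name: nvidia-lib
          hostPath:
            path: /usr/lib/x86_64-linux-gnu
```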

> kubectl create -f nvidia-smi.yaml

The deployment will take some time to finish, but you should eventually see:

> kubectl get pods --show-all
NAME               READY     STATUS      RESTARTS   AGE
nvidia-smi-fcg8j   0/1       Completed   0          1m

And if we check the logs (kubectl logs nvidia-smi-fcg8j), we should see the familiar nvidia-smi output table listing our GPU.

Creating an image to run on our cluster

You might already have a Docker image that you want to use on your cluster for training or inference; in that case you can skip ahead.

If not, you can use this example repository: wbuchwalter/tf-app-container-sample.

Here are the things I wanted my container to be able to do (you might have different requirements, that’s fine):

  • Train our model or serve predictions using the same container, depending on a flag.
  • Save model checkpoints to Azure Blob Storage when training is done, and restore the model before serving predictions.
  • Run on either CPU or GPU.

It is a very simple model in TensorFlow based on the MNIST sample. You can specify --train to start the container in training mode, otherwise a Flask application will serve (random) predictions on the /predict route to demonstrate serving.

Training our model on the cluster

To train our model on our Kubernetes cluster, we are going to define a job template.
Because we want to train using GPUs, we have to specify how many of them we need:

resources:
  limits:
    alpha.kubernetes.io/nvidia-gpu: 1

We also need to expose the NVIDIA drivers from the host into the container. This is a bit tricky for now: the correct mount paths depend on many things, such as how you installed the drivers and which host OS you are using.

Because the official TensorFlow image is based on Ubuntu 16.04, just like our host VMs in ACS/acs-engine, we can simply map the bin and lib directories right away without too many issues.

Here is what the template looks like if you used the tf-app-container-sample container on acs-engine (note the use of --train flag):
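The full template was embedded in the original post; a sketch follows. The Job name is illustrative and the image name is an assumption based on the sample repository; the GPU limit, the --train flag, and the host mounts are the ones described in the surrounding text:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-trainer
spec:
  template:
    metadata:
      name: tensorflow-trainer
    spec:
      restartPolicy: Never
      containers:
        - name: tensorflow-trainer
          image: wbuchwalter/tf-app-container-sample
          # start the container in training mode
          args: ["--train"]
          resources:
            limits:
              alpha.kubernetes.io/nvidia-gpu: 1
          volumeMounts:
            # mount the host's driver binaries and libraries
            # at the same paths inside the container
            - name: bin
              mountPath: /usr/bin
            - name: lib
              mountPath: /usr/lib/x86_64-linux-gnu
      volumes:
        - name: bin
          hostPath:
            path: /usr/bin
        - name: lib
          hostPath:
            path: /usr/lib/x86_64-linux-gnu
```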

As you can see, we are directly mounting /usr/bin/ and /usr/lib/x86_64-linux-gnu into the container. These two directories contain everything needed to communicate with the GPU.

Update: If libcudnn.5 isn’t found with this template, take a look at this fix: https://github.com/wbuchwalter/blog-posts/issues/1

To create this job, simply kubectl create -f tensorflow-trainer.yaml. If we look at the logs (kubectl logs -f <pod-id>), sure enough:

[...]
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: b25e:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: b8c3:00:00.0)
[...]
Extracting /app/MNIST_data/train-images-idx3-ubyte.gz
Extracting /app/MNIST_data/train-labels-idx1-ubyte.gz
Extracting /app/MNIST_data/t10k-images-idx3-ubyte.gz
Extracting /app/MNIST_data/t10k-labels-idx1-ubyte.gz
0.9169
Model saved in file: /tmp/ckp/model

Next: Autoscaling

What happens if we want to train multiple models in parallel?
Currently, once our cluster is busy, any subsequent training job would be scheduled sequentially. This isn’t great when trying to compare multiple hypotheses, or when a team of data scientists is working on different projects on the same cluster.

We could just create a cluster with many more nodes instead, but the price of a GPU-capable VM is pretty steep, so this isn’t an option for many companies.

The solution is autoscaling. By creating VMs when we need them, and deleting them when we are done, we can strike a good balance between efficiency and cost.

Autoscaling is explored in this follow up article:
Autoscaling a Kubernetes cluster created with acs-engine on Azure

If you see any mistake in this post, or have any question, feel free to open an issue on the dedicated GitHub repo.