Creating a Kubernetes Cluster with GPU Support on Azure for Machine Learning
In this article, we are going to see how to create a Kubernetes cluster that is suited for Machine Learning. Our cluster should have the following extra capabilities:
- Support for GPUs
- Automated NVIDIA driver installation
In follow-up articles, we are also going to talk about:
- Autoscaling (because GPUs are expensive 😃)
- Visualizing the different training runs happening in your cluster in real time with TensorBoard
Creating a Kubernetes cluster using acs-engine
First, we need to create a Kubernetes cluster that supports GPUs.
We will use acs-engine, a tool that will generate the ARM template we need to deploy our cluster with everything already configured.
One small caveat: acs-engine does not install the NVIDIA drivers for you automatically (associated issue here), so in the meantime we are going to use my fork of acs-engine, which has everything we need. This makes life much easier, especially if you intend to use autoscaling or provision a large number of VMs.
First, clone the repo and make sure you are on the k8s-gpu branch (that should be the default branch):
> git clone https://github.com/wbuchwalter/acs-engine
> cd acs-engine
> git checkout k8s-gpu
Now we need to specify what we want our cluster to look like. Edit examples/kubernetes.json and fill in the different parameters.
The interesting part is the agentPoolProfiles section.
There you can define a number of different pools. Each pool can have a different VM size, and can scale up to 100 nodes.
You should define a separate pool for every different VM size you intend to use.
For example, if you plan to use your cluster both for training with GPUs and for inference with CPUs only, you should specify at least two pools, since you don't want to pay for a GPU you are not using.
The number of agents isn’t really important because we are going to enable autoscaling, so you can leave every count at 1.
At the time of this writing, Azure has 6 different VM sizes with GPU support; you can see the details here.
Here is what your kubernetes.json should approximately look like:
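As a rough guide, the block below writes a minimal sketch of such a file with one GPU pool and one CPU pool. It assumes the "vlabs" API model that acs-engine used at the time, so field names may differ in other versions. The VM sizes (Standard_NC6 for the GPU pool, Standard_D2_v2 for the CPU pool and the master) are illustrative choices, and the file name kubernetes.gpu-sketch.json is made up so that nothing in the repository gets overwritten. The SSH key and service principal values are left empty on purpose.

```shell
# Write a sketch of an acs-engine cluster definition with one GPU pool and one CPU pool.
# Assumption: field names follow the "vlabs" API model; check them against your acs-engine version.
cat > kubernetes.gpu-sketch.json <<'EOF'
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "mygpucluster",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 1,
        "vmSize": "Standard_NC6",
        "availabilityProfile": "AvailabilitySet"
      },
      {
        "name": "agentpool2",
        "count": 1,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          { "keyData": "" }
        ]
      }
    },
    "servicePrincipalProfile": {
      "servicePrincipalClientID": "",
      "servicePrincipalClientSecret": ""
    }
  }
}
EOF
echo "wrote kubernetes.gpu-sketch.json"
```

The dnsPrefix value is chosen to match the _output/mygpucluster directory used in the commands that follow, and the pool names match the node names shown later in this article. Copy the values you need into examples/kubernetes.json rather than using the sketch file directly.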
Let’s generate the ARM template. You’ll need Docker installed on your machine. Check out the official documentation to understand how acs-engine works.
> make prereqs
> make build
> ./acs-engine generate examples/kubernetes.json
This should have generated a bunch of files under the _output/mygpucluster directory, including the ARM template and parameters that we want.
To deploy them, you can create a new template deployment in the Azure portal and copy and paste the generated template and parameters into it.
Make sure you choose a region that has N-Series VMs available! South Central US is one of them.
Or use Azure CLI:
> cd _output/mygpucluster
> az group create --location southcentralus --name mygpucluster
> az group deployment create --template-file azuredeploy.json --parameters @azuredeploy.parameters.json --resource-group mygpucluster
This should take between 5 and 10 minutes to deploy.
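If the deployment is rejected because the requested VM size is not offered in the region you picked, the Azure CLI can list which N-Series sizes a region has. The helper below is only a sketch: list_gpu_sizes is a made-up name that wraps az vm list-skus, and it assumes you are already logged in with az login.

```shell
# List the N-Series (GPU) VM sizes offered in a region.
# list_gpu_sizes is a hypothetical helper name; it only wraps `az vm list-skus`.
# --size accepts a partial name, so "Standard_N" matches the N-Series families.
list_gpu_sizes() {
  region="${1:-southcentralus}"
  az vm list-skus --location "$region" --size Standard_N --output table
}

# Example (requires a logged-in Azure CLI):
#   list_gpu_sizes southcentralus
```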
Do not delete the generated files, including azuredeploy.parameters.json, as you will need them later if you want to set up autoscaling.
Once the deployment is completed, grab the Kubernetes config file of your cluster to be able to use kubectl:
> scp azureuser@<dnsname>.<regionname>.cloudapp.azure.com:.kube/config ~/.kube/config
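Note that the scp command above overwrites any file already present at ~/.kube/config. If you already work with other clusters, a small helper along these lines can keep a copy first. backup_kubeconfig is a made-up name, not part of kubectl or acs-engine.

```shell
# Copy an existing kubeconfig to <path>.bak so that it is not lost when overwritten.
# backup_kubeconfig is a hypothetical helper; the default path is kubectl's usual location.
backup_kubeconfig() {
  config_path="${1:-$HOME/.kube/config}"
  if [ -f "$config_path" ]; then
    cp "$config_path" "$config_path.bak"
    echo "backed up $config_path to $config_path.bak"
  else
    echo "no existing config at $config_path"
  fi
}

# Example, to run before the scp command:
#   backup_kubeconfig
```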
If you don’t have kubectl installed, now is the time: Installing and Setting Up kubectl
Testing the cluster
Let’s check our new cluster:
> kubectl get nodes
NAME                        STATUS                     AGE
k8s-agentpool1-19661165-0   Ready                      1m
k8s-agentpool2-19661165-0   Ready                      23s
k8s-master-19661165-0       Ready,SchedulingDisabled   1m
One master, and two agents, one for each pool. That's what I requested.
Let’s describe one of our agents:
> kubectl describe node k8s-agentpool1-19661165-0
We can see that the drivers have been correctly installed, since Kubernetes has been able to find our GPU device.
Let’s also run nvidia-smi to prove that GPU support is working correctly:
Download this Kubernetes template somewhere and deploy it on your cluster (don’t worry about the details of the .yaml file, we will get into this later on):
> kubectl create -f nvidia-smi.yaml
The deployment will take some time to finish, but you should eventually see:
> kubectl get pods --show-all
NAME               READY   STATUS      RESTARTS   AGE
nvidia-smi-fcg8j   0/1     Completed   0          1m
And if we check the logs:
Creating an image to run on our cluster
You might already have a Docker image that you want to use on your cluster for training/inference; in that case you can skip ahead.
If not, you can use this example repository: wbuchwalter/tf-app-container-sample.
Here are the things I wanted my container to be able to do (you might have different requirements, that’s fine):
- Ability to train our model, or serve predictions, using the same container depending on a flag.
- The app should save model checkpoints to Azure Blob Storage when the training is done, and restore the model before serving predictions.
- Ability to run on CPU or GPU.
It is a very simple model in TensorFlow based on the MNIST sample. You can specify --train to start the container in training mode; otherwise a Flask application will serve (random) predictions on the /predict route to demonstrate serving.
Training our model on the cluster
To train our model on our Kubernetes cluster, we are going to define a job template.
Because we want to train using GPUs, we have to specify how many of them we need:
We also need to expose the NVIDIA drivers from the host into the container. This is a bit tricky for now. The correct mount paths will depend on many things: how you installed the drivers, which host OS you are using, etc.
Because the official TensorFlow image is based on Ubuntu 16.04, just like our host VMs in ACS/acs-engine, we can simply map the lib directories right away without too many issues.
Here is what the template looks like if you used the tf-app-container-sample container on acs-engine:
As you can see, we are directly mounting two host directories, one of them being /usr/lib/x86_64-linux-gnu, into the container. These two directories contain everything needed to communicate with the GPU.
If libcudnn.5 isn’t found with this template, take a look at this fix: https://github.com/wbuchwalter/blog-posts/issues/1
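Putting the two requirements together (a GPU count and the driver mounts), a job template could be sketched as follows. This is a sketch under assumptions, not the exact template from the sample repository. NVIDIA_DRIVER_DIR and TRAINER_IMAGE are hypothetical variables with placeholder defaults that you must replace, the alpha.kubernetes.io/nvidia-gpu resource name is the one used by Kubernetes versions of that era (newer clusters with device plugins use a different name), and passing --train through args assumes the image's entrypoint accepts that flag.

```shell
# Host directory holding the NVIDIA driver libraries.
# Placeholder default: the real path depends on how the drivers were installed.
NVIDIA_DRIVER_DIR="${NVIDIA_DRIVER_DIR:-/path/to/nvidia/driver}"
# Hypothetical image name: point it at wherever you pushed your own build.
TRAINER_IMAGE="${TRAINER_IMAGE:-myregistry/tf-app-container-sample:gpu}"

# Write a sketch of a training Job that requests one GPU and mounts the host driver directories.
cat > tensorflow-trainer.yaml <<EOF
apiVersion: batch/v1
kind: Job
metadata:
  name: tensorflow-trainer
spec:
  template:
    metadata:
      name: tensorflow-trainer
    spec:
      restartPolicy: Never
      containers:
      - name: tensorflow-trainer
        image: ${TRAINER_IMAGE}
        args: ["--train"]
        resources:
          limits:
            # Number of GPUs requested; raise it if your VM size has more than one.
            alpha.kubernetes.io/nvidia-gpu: 1
        volumeMounts:
        # Both host directories are mapped to the same path inside the container.
        - name: nvidia-driver
          mountPath: ${NVIDIA_DRIVER_DIR}
        - name: host-libs
          mountPath: /usr/lib/x86_64-linux-gnu
      volumes:
      - name: nvidia-driver
        hostPath:
          path: ${NVIDIA_DRIVER_DIR}
      - name: host-libs
        hostPath:
          path: /usr/lib/x86_64-linux-gnu
EOF
echo "wrote tensorflow-trainer.yaml"
```

Mounting the host's /usr/lib/x86_64-linux-gnu over the container's own copy only works here because host and image share the same Ubuntu release, as explained above. With a different base image, you would likely need to mount individual libraries instead.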
To create this job, simply run kubectl create -f tensorflow-trainer.yaml. If we look at the logs (kubectl logs -f <pod-id>), sure enough:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: b25e:00:00.0)
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:1) -> (device: 1, name: Tesla K80, pci bus id: b8c3:00:00.0)
Model saved in file: /tmp/ckp/model
What happens if we want to train multiple models in parallel?
Currently, once our cluster is busy, any subsequent training would be scheduled sequentially. This isn’t great when trying to compare multiple hypotheses, or when a team of data scientists is working on different things on the same cluster.
We could just create a cluster with a lot of nodes instead, but the price of a GPU-capable VM is pretty steep, so this isn’t a possibility for many companies.
The solution is autoscaling. By creating VMs when we need them, and deleting them when we are done, we can strike a good balance between efficiency and cost.
Autoscaling is explored in this follow-up article:
Autoscaling a Kubernetes cluster created with acs-engine on Azure
If you see any mistakes in this post, or have any questions, feel free to open an issue on the dedicated GitHub repo.