Monitoring your Kubernetes training cluster with TensorBoard on Azure
In this article, we are going to look at how you can deploy TensorBoard on Kubernetes to visualize the different trainings happening on your cluster in real time.
If you don't already have a Kubernetes cluster up and running, check out my previous story: Creating a Kubernetes Cluster with GPU Support on Azure for Machine Learning.
TensorBoard will need a way to access all the log files capturing the summaries from the different trainings happening in the cluster.
To do so, we are going to save the log files in an Azure File share, which can be accessed from multiple VMs at the same time. In our case, we will have a number of containers writing to this share, and a single one (TensorBoard) reading from it.
So here is the plan:
- Create a new storage account on Azure
- Save TensorFlow summaries as Azure Files in the new storage account
- Create a new deployment running TensorBoard, which will read log files from the storage account
- Create a new service to expose the TensorBoard deployment
Creating the storage account
Simply create a new storage account in the Azure Portal.
This storage account must be in the same region as your cluster.
Once the storage is created, grab the account name and a key, encode them in base64 and create a new Kubernetes secret to store them.
You should end up with something like this:
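As a minimal sketch of the encoding step, the Python below base64-encodes the two values and prints a Secret manifest as JSON, which `kubectl apply -f` accepts just like YAML. The key names `azurestorageaccountname` and `azurestorageaccountkey` are the ones the Kubernetes Azure File volume looks for; the secret name `azure-secret` and the sample credentials are placeholders I made up, not values from this article.

```python
import base64
import json


def b64(value: str) -> str:
    """Base64-encode a string, as Kubernetes expects for a Secret's `data` field."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def azure_secret(account_name: str, account_key: str,
                 secret_name: str = "azure-secret") -> dict:
    """Build a Secret manifest holding the storage account credentials.

    `secret_name` is a hypothetical default; pick whatever fits your cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": secret_name},
        "type": "Opaque",
        "data": {
            "azurestorageaccountname": b64(account_name),
            "azurestorageaccountkey": b64(account_key),
        },
    }


if __name__ == "__main__":
    # Placeholder credentials: substitute the real account name and key.
    manifest = azure_secret("mystorageaccount", "replace-with-your-key")
    print(json.dumps(manifest, indent=2))
```

Redirect the output to a file and pass it to `kubectl apply -f`, and keep that file out of source control since base64 is an encoding, not encryption.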
In your storage account, click on Files, and create a new File Share named
Saving TensorFlow summaries in an Azure File storage
Now that we have access to a storage account, we can modify our training job template, and mount the storage account as a volume in our pod.
Here is how to do it (only the relevant parts are shown for clarity):
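A rough sketch of that change, written in Python so the fragment can be generated and checked before it goes into a job template. The `azureFile` volume fields (`secretName`, `shareName`, `readOnly`) are the ones Kubernetes defines; the volume name, secret name, share name, image and mount path below are all placeholders of mine, not the values used in this article.

```python
import copy
import json

VOLUME_NAME = "azurefile"  # hypothetical volume name


def with_azure_file_volume(pod_spec: dict, secret_name: str,
                           share_name: str, mount_path: str) -> dict:
    """Return a copy of `pod_spec` with the Azure File share mounted in every container."""
    spec = copy.deepcopy(pod_spec)
    # Declare the volume once at the pod level...
    spec.setdefault("volumes", []).append({
        "name": VOLUME_NAME,
        "azureFile": {
            "secretName": secret_name,
            "shareName": share_name,
            "readOnly": False,  # training pods need to write their logs
        },
    })
    # ...then mount it in each container of the pod.
    for container in spec.get("containers", []):
        container.setdefault("volumeMounts", []).append(
            {"name": VOLUME_NAME, "mountPath": mount_path}
        )
    return spec


if __name__ == "__main__":
    # Placeholder pod spec: image, secret and share names are assumptions.
    training_pod = {"containers": [{"name": "train", "image": "example/my-training-image"}]}
    patched = with_azure_file_volume(
        training_pod, "azure-secret", "<your-file-share>", "/tmp/tensorflow"
    )
    print(json.dumps(patched, indent=2))
```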
Of course, you need to make sure that TensorFlow is saving the log files in a new directory under the mountPath. We don't want multiple instances saving their logs under the same directory.
So you can either create a new file share in the storage account and change the mountPath itself on every new deployment, which isn't very handy, or add a new argument to your TensorFlow application specifying where to save summaries, and then simply save in this folder.
Make sure you don't save your session too often, though. Azure File uses SMB behind the scenes, so saving on every step will significantly slow down your training.
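The second option could look roughly like the sketch below: the training script takes the mounted directory as an argument and derives a per-run subfolder under it. The flag names `--log-dir` and `--run-name`, the default path, and the naming scheme are my own assumptions, not part of any TensorFlow API.

```python
import argparse
import posixpath
import socket
import time


def parse_args(argv=None):
    """Parse the (hypothetical) command-line flags of the training script."""
    parser = argparse.ArgumentParser(description="Training job")
    parser.add_argument("--log-dir", default="/tmp/tensorflow",
                        help="Directory where the Azure File share is mounted")
    parser.add_argument("--run-name", default=None,
                        help="Subfolder for this run (default: hostname + timestamp)")
    return parser.parse_args(argv)


def run_log_dir(log_dir, run_name=None):
    """Return a per-run directory under `log_dir` so runs never share a folder."""
    if run_name is None:
        # Inside a pod the hostname defaults to the pod name, which keeps
        # concurrent runs apart even if they start in the same second.
        run_name = "{}-{}".format(socket.gethostname(), time.strftime("%Y%m%d-%H%M%S"))
    # Container paths are POSIX paths regardless of where this code is tested.
    return posixpath.join(log_dir, run_name)


if __name__ == "__main__":
    args = parse_args()
    # Hand this path to whatever writes your summaries.
    print(run_log_dir(args.log_dir, args.run_name))
```

TensorBoard lists each subfolder of its log directory as a separate run, so this layout also keeps the runs distinguishable in the UI.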
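One simple way to keep writes infrequent is to gate them on the step counter. The sketch below is framework-agnostic: `should_save` and the interval of 100 steps are illustrative choices of mine, and the list append stands in for whatever call actually writes your summaries.

```python
def should_save(step, every_n_steps):
    """Return True only on steps where summaries should be written."""
    if every_n_steps < 1:
        raise ValueError("every_n_steps must be >= 1")
    return step % every_n_steps == 0


if __name__ == "__main__":
    saved_steps = []
    for step in range(1000):
        # ... run one training step here ...
        if should_save(step, every_n_steps=100):
            # Stand-in for the real summary write to the mounted share.
            saved_steps.append(step)
    # 10 writes to the share instead of 1000.
    print(len(saved_steps), "summary writes")
```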
Creating a new deployment running TensorBoard and exposing it with a service
Deploying TensorBoard is pretty straightforward. It is already included in the official TensorFlow Docker image, so we can simply reuse it without any modifications.
We need to mount the same Azure File share as above, so that TensorBoard can pick up the logs from our trainings. Finally, we create a service that exposes port 6006 on the container.
Then we can simply run TensorBoard by pointing it to the
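Putting these pieces together, here is a hedged sketch that generates a Deployment and a Service as one JSON list for `kubectl apply -f`. Port 6006 on the container and port 80 on the service come from this article; the `tensorflow/tensorflow` image name, the labels, the secret and share names, the mount path, and binding to `0.0.0.0` via `--host` are my assumptions and may need adjusting for your TensorFlow version.

```python
import json

LABELS = {"app": "tensorboard"}   # hypothetical label used to link service and pods
VOLUME_NAME = "azurefile"         # hypothetical volume name


def tensorboard_deployment(secret_name, share_name, log_dir="/tmp/tensorflow"):
    """Deployment running TensorBoard against the mounted Azure File share."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "tensorboard"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": LABELS},
            "template": {
                "metadata": {"labels": LABELS},
                "spec": {
                    "containers": [{
                        "name": "tensorboard",
                        "image": "tensorflow/tensorflow",  # assumed official image
                        "command": ["tensorboard", "--logdir=" + log_dir, "--host=0.0.0.0"],
                        "ports": [{"containerPort": 6006}],
                        "volumeMounts": [{"name": VOLUME_NAME, "mountPath": log_dir}],
                    }],
                    "volumes": [{
                        "name": VOLUME_NAME,
                        "azureFile": {
                            "secretName": secret_name,
                            "shareName": share_name,
                            "readOnly": True,  # TensorBoard only reads the logs
                        },
                    }],
                },
            },
        },
    }


def tensorboard_service():
    """LoadBalancer service forwarding port 80 to TensorBoard's port 6006."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "tensorboard"},
        "spec": {
            "type": "LoadBalancer",
            "selector": LABELS,
            "ports": [{"port": 80, "targetPort": 6006}],
        },
    }


if __name__ == "__main__":
    bundle = {
        "apiVersion": "v1",
        "kind": "List",
        "items": [tensorboard_deployment("azure-secret", "<your-file-share>"),
                  tensorboard_service()],
    }
    print(json.dumps(bundle, indent=2))
```

The log directory passed to `--logdir` is the same path the share is mounted on, so every per-run subfolder written by the training pods shows up as a separate run.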
Visualizing our trainings
Once everything is deployed correctly, and you have some training jobs running, grab the external IP of the TensorBoard service:
> kubectl get services
NAME          CLUSTER-IP    EXTERNAL-IP     PORT(S)        AGE
kubernetes    10.0.0.1      <none>          443/TCP        11d
tensorboard   10.0.219.75   18.104.22.168   80:32745/TCP   1m
Go to this IP, and you should see your trainings happening live.
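If you would rather script the lookup than read the table, the external IP also appears under `status.loadBalancer.ingress` in the output of `kubectl get service tensorboard -o json`. The helper below is a small sketch that extracts it from such a document; the function name and the trimmed-down sample are illustrative, with the IP copied from the listing above.

```python
import json


def external_ip(service):
    """Return the first load balancer IP of a Service object, or None if still pending."""
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        if entry.get("ip"):
            return entry["ip"]
    return None


if __name__ == "__main__":
    # Trimmed-down stand-in for `kubectl get service tensorboard -o json`.
    sample = json.loads(
        '{"kind": "Service",'
        ' "status": {"loadBalancer": {"ingress": [{"ip": "18.104.22.168"}]}}}'
    )
    ip = external_ip(sample)
    print("http://{}".format(ip) if ip else "external IP still pending")
```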
If you see any mistakes in this post, or have any questions, feel free to open an issue on the dedicated GitHub repo.