Operationalizing TensorFlow Object Detection on Azure — Part 2: Using Kubernetes to train distributed TensorFlow Object Detection API model

Sertaç Özercan
7 min read · Nov 20, 2017


[Cover photo by Axel Ahoi]

In this series of blog posts, we are going to be learning about operationalizing TensorFlow Object Detection API on Microsoft Azure.

Part 1 covered the TensorFlow Object Detection API and how to set up our training and evaluation workflow using Docker containers and virtual machines.

This part, Part 2, will cover how to train and scale using Kubernetes and distributed TensorFlow.

Finally, Part 3 will cover how we can serve our trained model using TensorFlow Serving as a web service, and we will be deploying a simple client to get results from our service.

You can find the project repository at https://github.com/sozercan/tensorflow-object-detection/

Using Kubernetes to run distributed TensorFlow Object Detection API

In this part, we are going to learn how to use Kubernetes to schedule our TensorFlow containers in a distributed way. To make things much easier, we are going to be utilizing the new tensorflow/k8s TFJob custom resource definition (CRD).

Using acs-engine to create a GPU cluster

We will be using a Kubernetes cluster, so we'll start by creating a GPU-enabled cluster in Microsoft Azure. If you already have a GPU-enabled cluster, you can skip this step.

We are going to be using acs-engine to deploy a custom GPU cluster. Neither ACS (Azure Container Service) nor AKS (the managed Kubernetes service) supports GPUs at the time of this post (see the update below), so acs-engine is required for its GPU support and automatic NVIDIA GPU driver installation.

Update: AKS now supports GPUs using the N-series GPU VMs. See the AKS documentation for how to get started.

Download the acs-engine prebuilt binary for your platform of choice from https://github.com/Azure/acs-engine/releases and make sure you already have Azure CLI 2.0 installed.

Let’s deploy our cluster:

# gpu_example is detailed below
cd /path/to/your/acs-engine/binary/
./acs-engine generate gpu_example.json
# this will generate the necessary Azure templates to deploy your cluster

SUBSCRIPTION_ID=[your subscription id]
RESOURCE_GROUP=[your resource group name]
LOCATION=[Azure region that includes GPUs, you can check here]
DNS_PREFIX=[your DNS prefix]

az login
az account set --subscription $SUBSCRIPTION_ID
az group create \
  --name $RESOURCE_GROUP \
  --location $LOCATION
az group deployment create \
  --resource-group $RESOURCE_GROUP \
  --template-file "./_output/${DNS_PREFIX}/azuredeploy.json" \
  --parameters "./_output/${DNS_PREFIX}/azuredeploy.parameters.json"
# deploying will take some time

This is our gpu_example.json (also found in the project repo; make sure to replace [REPLACEME] with your values). For generating Service Principal credentials, please see here.
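
In case you don't have the repo handy, here is a minimal sketch of what such an acs-engine API model looks like, with one master and three NC-series (GPU) agents. Field names can vary between acs-engine releases, so treat this as illustrative and prefer the gpu_example.json from the project repo:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "[REPLACEME]",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 3,
        "vmSize": "Standard_NC6",
        "availabilityProfile": "AvailabilitySet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          { "keyData": "[REPLACEME]" }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "[REPLACEME]",
      "secret": "[REPLACEME]"
    }
  }
}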

Using the above configuration, there will be 1 master and 3 GPU agents created. We will be utilizing the 3 GPU agents when we do our distributed TensorFlow training, using 1 TensorFlow master node and 2 TensorFlow worker nodes.

Let’s export our Kubernetes configuration (kubeconfig) file to be able to use our cluster:

export KUBECONFIG=/path/to/your/acs-engine/_output/${DNS_PREFIX}/kubeconfig/kubeconfig.${LOCATION}.json

After the deployment is finished (plus an additional 10–15 minutes for the GPU drivers to be installed), verify that your cluster includes GPU support by using:

kubectl describe nodes

This should return agent nodes with one or more GPUs, so the output should include something like:

alpha.kubernetes.io/nvidia-gpu: 1
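
If you just want a quick check rather than scrolling through the full node descriptions, filtering the same output works as well:

kubectl describe nodes | grep nvidia-gpu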

We need to have Helm installed so we can deploy our charts. You can install Helm from the GitHub releases page for your platform of choice.

Make sure we have the latest version by running helm init --upgrade
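
To confirm that the Helm client and the Tiller running in the cluster ended up on matching versions after the upgrade, you can run:

helm version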

Please note that the Tiller that comes bundled with acs-engine (as well as ACS and AKS) already includes the necessary service account and role bindings for cluster access. If you are working with a different cloud provider, make sure to install the necessary service account and role bindings.

Storage Account

In this step, we are going to create a new storage account so we can save our training data to a storage bucket. To do this, we need to create an Azure File share, which can be mounted on different nodes at the same time:

STORAGE_ACCOUNT_RESOURCE_GROUP=[resource group for the storage account]
STORAGE_ACCOUNT_NAME=[your storage account name]
az storage account create --resource-group $STORAGE_ACCOUNT_RESOURCE_GROUP --sku Standard_LRS --name $STORAGE_ACCOUNT_NAME

Let's grab the access keys; we will need them in the next step:

az storage account keys list -g $STORAGE_ACCOUNT_RESOURCE_GROUP -n $STORAGE_ACCOUNT_NAME

Creating a share:

az storage share create --name data --account-name $STORAGE_ACCOUNT_NAME --account-key [key from above step]

To make things easier, we'll copy some of the assets we downloaded and created in Part 1, so let's SSH back into our virtual machine:

ssh $USER@$NAME.$LOCATION.cloudapp.azure.com

# in the virtual machine here
sudo mkdir -p /fileshare
sudo mount -t cifs //$STORAGE_ACCOUNT_NAME.file.core.windows.net/data /fileshare -o vers=3.0,username=$STORAGE_ACCOUNT_NAME,password=$STORAGE_ACCOUNT_KEY,dir_mode=0777,file_mode=0777,sec=ntlmssp
sudo cp -rf /data/tensorflow/export /data/tensorflow/faster_rcnn_resnet101_coco_11_06_2017 /data/tensorflow/faster_rcnn_resnet101_voc07.config /data/tensorflow/pascal_trainval.record /fileshare

# versioning the model for tensorflow serving (for part 3)
MODELVERSION=1
sudo mkdir -p /fileshare/export/saved_model/${MODELVERSION}
sudo mv /fileshare/export/saved_model/* /fileshare/export/saved_model/${MODELVERSION}

# saving our label map. in this case, it is the VOC Pascal dataset. if you are using your own dataset, it should be your own label map
wget https://raw.githubusercontent.com/tensorflow/models/master/research/object_detection/data/pascal_label_map.pbtxt -O /fileshare/pascal_label_map.pbtxt

exit
# exited our virtual machine here

Deploying TensorFlow/k8s CRD

In this part, we are going to be deploying the Custom Resource Definition (CRD) for TensorFlow support. This will give us a new resource type called TFJob. You don't need TFJob to run TensorFlow jobs on Kubernetes; however, TFJob makes it easier to run both distributed and non-distributed TensorFlow jobs, for example by conveniently mounting the GPU drivers from the host, and it includes built-in support for TensorBoard.

If you are interested in learning more, please check out the tensorflow/k8s project on GitHub (https://github.com/tensorflow/k8s).

We'll start by installing the TFJob CRD with Helm:

CHART=https://storage.googleapis.com/tf-on-k8s-dogfood-releases/latest/tf-job-operator-chart-latest.tgz
helm install ${CHART} -n tf-job --wait --replace --set rbac.install=true,cloud=azure
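
Once the chart is installed, it is worth checking that the custom resource was registered and that the operator pod is running. The exact CRD and pod names depend on the chart version, but something along these lines should show them:

kubectl get crd
kubectl get pods | grep tf-job-operator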

Deploying TFJob

The next step is to deploy our distributed TFJob, consisting of masters, workers, and parameter servers.

Let's look briefly at a sample TFJob deployment:

(example from https://github.com/tensorflow/k8s#creating-a-job)
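
That example looks roughly like the following. This is a sketch based on the v1alpha1 examples in the tensorflow/k8s README; field spellings may differ in the version you install, and the image placeholder should point at your own training image (for example, the one built in Part 1):

apiVersion: "tensorflow.org/v1alpha1"
kind: "TfJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    # one master coordinates the training session
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - name: tensorflow
              image: [REPLACEME with your training image]
          restartPolicy: OnFailure
    # workers run the training steps
    - replicas: 2
      tfReplicaType: WORKER
      template:
        spec:
          containers:
            - name: tensorflow
              image: [REPLACEME with your training image]
          restartPolicy: OnFailure
    # parameter servers hold the shared model variables
    # (template omitted here for brevity)
    - replicas: 2
      tfReplicaType: PS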

In each replicaSpec, we define the replica type, which is either a master, worker, or parameter server (ps), along with the number of replicas and the container definition. In our chart, we also have TensorBoard and Azure File storage defined.

Now, we can deploy our chart:

helm install tensorflow-object-detection/chart --name tensorflow-object-detection --set azure_storage_account_name=$STORAGE_ACCOUNT_NAME,azure_storage_account_key=[your storage account key]

Let's look into all the options (an example of setting them via a values file follows the list):

  • train_dir_path: path where training checkpoints are saved (/data/train by default)
  • pipeline_config_path: config file path (/data/faster_rcnn_nas_voc07.config by default)
  • eval_dir_path: path where evaluation events are saved (/data/eval by default)
  • log_dir_path: where TensorBoard events are stored (/data by default)
  • azure_storage_account_name: your Azure storage account name (no default)
  • azure_storage_account_key: your Azure storage account key (no default)
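
For example, instead of passing everything with --set, the same options could live in a small values file. This is a hypothetical my-values.yaml using the option names above and the config file we copied to the file share earlier:

# my-values.yaml
train_dir_path: /data/train
pipeline_config_path: /data/faster_rcnn_resnet101_voc07.config
eval_dir_path: /data/eval
log_dir_path: /data
azure_storage_account_name: [your storage account name]
azure_storage_account_key: [your storage account key]

The chart can then be installed with:

helm install tensorflow-object-detection/chart --name tensorflow-object-detection -f my-values.yaml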

After deploying our TFJob from the Helm chart, we can execute kubectl get tfjob to see the deployed TFJob.

[screenshot: tfjob custom resource definition (crd)]

And you should see masters, workers and parameter server pods starting to run:

[screenshot: deployed tfjob]

Note that if pods are stuck with a Pending status, make sure you have enough GPU nodes available.
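
A quick way to find out why a pod is stuck in Pending is to look at its events; if GPUs are the problem, the scheduler message will typically mention insufficient GPU resources:

kubectl describe pod $PODNAME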

Once the pods start, the masters and workers will train the model in a distributed way. If you execute kubectl logs $PODNAME, it should look like this:

[screenshot: output from the master]
[screenshot: output from a worker]
[screenshot: output from a parameter server discovering 1 master, 2 workers and 2 ps]

As mentioned before, the TFJob CRD comes with TensorBoard integrated, which makes it really easy to deploy TensorBoard. Let's check our TensorBoard service and grab the external IP once it's available:

kubectl get svc -l app=tensorboard
[screenshot: tensorboard service]

Let’s navigate to our external IP listed above for TensorBoard:

[screenshot: precision over time]
[screenshot: evaluation]

If you would like more information about distributed TensorFlow, please check out the TensorFlow documentation on distributed training.

Autoscaling

If you need to scale your cluster on demand, you can use the acs-engine autoscaler to scale your cluster. Please note that this only works on acs-engine, not ACS or AKS.

It is easy to install with Helm:

helm install stable/acs-engine-autoscaler -f values.yaml

where values.yaml contains:

acsenginecluster:
  resourcegroup: [your resource group]
  azurespappid: [application id]
  azurespsecret: [application secret]
  azuresptenantid: [tenant id]
  kubeconfigprivatekey: [kubeconfig private key]
  clientprivatekey: [client private key]
  caprivatekey: [ca private key]
  acsdeployment: [acs deployment name]
You can find the above information, such as kubeconfigprivatekey, clientprivatekey, and caprivatekey, in the generated acs-engine _output folder.

For more details, check out the acs-engine-autoscaler chart documentation.

Conclusion

In this part, we learned how to create a GPU-enabled Kubernetes cluster on Azure using acs-engine, how to deploy the tensorflow/k8s TFJob CRD, and how to train and evaluate a TensorFlow Object Detection API model using distributed TensorFlow.

In the next part, we are going to look at how to build a web service on Kubernetes to serve this model using TensorFlow Serving.

If you have any questions or comments, please leave a comment below or reach out to me on Twitter @sozercan
