Operationalizing TensorFlow Object Detection on Azure — Part 1: Using Docker and Deep Learning VMs
In this series of blog posts, we will learn how to operationalize the TensorFlow Object Detection API on Microsoft Azure.
This part, Part 1, covers the TensorFlow Object Detection API and how to set up our training and evaluation workflow using Docker containers and virtual machines.
Part 2 will cover how to train and scale using Kubernetes and distributed TensorFlow.
Finally, Part 3 will cover how to serve our trained model as a web service using TensorFlow Serving, and we will deploy a simple client to get results from our service.
You can find the project repository at https://github.com/sozercan/tensorflow-object-detection/
TensorFlow Object Detection API
Recently, Google released the TensorFlow Object Detection API, an open-source framework built on top of TensorFlow that makes it easy to build, train and deploy object detection models.
In this guide, we will learn how to use the TensorFlow Object Detection API to build and train our model on a single virtual machine, and then how to use distributed TensorFlow to train on a Kubernetes cluster.
Using Docker and Deep Learning VMs
In this part of the tutorial, we are going to use Deep Learning VMs in Microsoft Azure to train, evaluate and export our model, but the steps should work on any system with an NVIDIA GPU and with docker and nvidia-docker installed.
One of the reasons we are using Azure Deep Learning VMs is that they make it straightforward to use GPU instances and come with NVIDIA drivers and nvidia-docker preinstalled, which makes setup much easier.
If you are interested in learning more about Azure Deep Learning VMs, please check out:
To find out which Azure regions include GPUs, please check out:
Let’s start by creating a VM:
NAME=[name of your vm]
RESOURCE_GROUP=[name of your resource group]
SSHKEY=[path to your public key]
LOCATION=[region of your choice. make sure that GPUs are supported in that region, eg. southcentralus]
az group create -n $RESOURCE_GROUP -l $LOCATION
az vm create --name $NAME --resource-group $RESOURCE_GROUP --image microsoft-ads:linux-data-science-vm-ubuntu:linuxdsvmubuntu:latest --size Standard_NC6 --ssh-key-value $SSHKEY --admin-username $USER --public-ip-address-dns-name $NAME
Opening ports in the network security group (NSG) for TensorBoard and Jupyter notebook:
az network nsg rule create --resource-group $RESOURCE_GROUP --name Port_6006 --nsg-name ${NAME}NSG --priority 100 --destination-port-ranges 6006
az network nsg rule create --resource-group $RESOURCE_GROUP --name Port_8888 --nsg-name ${NAME}NSG --priority 200 --destination-port-ranges 8888
After this is finished deploying, let’s ssh into our newly created VM:
ssh $USER@$NAME.$LOCATION.cloudapp.azure.com
and then clone our project repo:
git clone https://github.com/sozercan/tensorflow-object-detection
Step 1 — Creating Dockerfile
First, let’s build our Docker container:
cd tensorflow-object-detection
nvidia-docker build -f tensorflow-object-detection/docker/Dockerfile -t $USER/tensorflow-object-detection .
Dockerfile
Here is the Dockerfile:
Note that this uses a clone of tensorflow/models for train_eval.py, so it can run training and evaluation at the same time.
You can also pull the image from sozercan/tensorflow-object-detection. The tags are gpu and cpu, for GPU and CPU support.
Step 2 — Downloading pre-trained model
To speed up the training process, we will start from a pre-trained model: the Faster R-CNN ResNet-101 model pre-trained on COCO.
sudo mkdir -p /data/tensorflow
sudo chown -R $USER /data/tensorflow/
wget http://storage.googleapis.com/download.tensorflow.org/models/object_detection/faster_rcnn_resnet101_coco_11_06_2017.tar.gz -O /data/tensorflow/faster_rcnn_resnet101_coco_11_06_2017.tar.gz
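Once it is done downloading, let’s extract it into /data/tensorflow (the -C flag sets the destination directory, so the files do not end up in the current directory):
tar -xvf /data/tensorflow/faster_rcnn_resnet101_coco_11_06_2017.tar.gz -C /data/tensorflow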
Step 3a — Downloading dataset
We will be using the Pascal VOC dataset. It includes images along with their bounding boxes and classifications. You can access the raw dataset at The PASCAL Visual Object Classes Homepage.
Downloading and extracting:
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar -O /data/tensorflow/VOCtrainval_06-Nov-2007.tar
tar -xvf /data/tensorflow/VOCtrainval_06-Nov-2007.tar -C /data/tensorflow
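To make the "images, bounding boxes and classifications" description concrete, here is a minimal sketch of reading one Pascal VOC style XML annotation and scaling its box to the 0 to 1 range, which is roughly what a TFRecord conversion needs to do per image. The sample annotation and the parse_voc_annotation helper are hypothetical and written for illustration; they are not taken from the dataset or from the Object Detection API.

```python
import xml.etree.ElementTree as ET

# Hand-written annotation in the Pascal VOC layout (illustrative values only).
SAMPLE_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>500</width><height>250</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>300</xmax><ymax>200</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc_annotation(xml_text):
    """Return (filename, boxes); each box is (label, xmin, ymin, xmax, ymax) scaled to [0, 1]."""
    root = ET.fromstring(xml_text)
    width = float(root.find("size/width").text)
    height = float(root.find("size/height").text)
    boxes = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        boxes.append((
            obj.find("name").text,
            float(box.find("xmin").text) / width,   # x coordinates are divided by image width
            float(box.find("ymin").text) / height,  # y coordinates are divided by image height
            float(box.find("xmax").text) / width,
            float(box.find("ymax").text) / height,
        ))
    return root.find("filename").text, boxes


if __name__ == "__main__":
    print(parse_voc_annotation(SAMPLE_XML))
```

Normalizing by image size keeps the boxes valid even if the image is resized later in the pipeline.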
Converting dataset to TF Record
In the first step, we built our container with TensorFlow and the TensorFlow Object Detection API installed. Let’s jump inside and start configuring.
PATH_TO_YOUR_VOC_DATASET=[path to where you saved and extracted the files above. eg, /data/tensorflow]
nvidia-docker run -it -d -p 0.0.0.0:6006:6006 -p 0.0.0.0:8888:8888 -v ${PATH_TO_YOUR_VOC_DATASET}:/data/ $USER/tensorflow-object-detection
The last command prints your container id. Use it to open a shell inside the container:
nvidia-docker exec -it [YOUR_CONTAINER_ID] bash
TensorFlow Object Detection expects our data to be in TFRecords format, so we will have to convert it using the create_pascal_tf_record.py utility.
As arguments, we are providing:
label_map_path: the path to the label map file, which lists the labels of the objects we are detecting
data_dir: where we downloaded the VOC dataset
year: which year of the dataset we are using (in this case 2007, so it will use the VOC2007 subdirectory)
set: whether we are using the training (train), validation (val), combined (trainval) or test (test) split
output_path: where the resulting .record file will be saved
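To illustrate what the label map passed as label_map_path looks like, here is a minimal sketch that parses label map text of the `item { id: ... name: '...' }` form into a name-to-id dictionary. The two sample entries are illustrative and parse_label_map is a hypothetical helper; inside the container you would normally rely on the Object Detection API's own label map utilities rather than a regular expression.

```python
import re

# Two entries written in the label map text layout (illustrative sample only).
SAMPLE_LABEL_MAP = """
item {
  id: 1
  name: 'aeroplane'
}
item {
  id: 2
  name: 'bicycle'
}
"""


def parse_label_map(text):
    """Return a {name: id} dict for every item block that has both fields."""
    labels = {}
    # Each item body is the text between "item {" and the next closing brace.
    for body in re.findall(r"item\s*\{(.*?)\}", text, flags=re.DOTALL):
        item_id = re.search(r"\bid:\s*(\d+)", body)
        name = re.search(r"\bname:\s*['\"]([^'\"]+)['\"]", body)
        if item_id and name:
            labels[name.group(1)] = int(item_id.group(1))
    return labels


if __name__ == "__main__":
    print(parse_label_map(SAMPLE_LABEL_MAP))
```

The ids in the label map are the integer class labels stored in the TFRecord, so the same file has to be used for conversion, training and evaluation.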
Let’s export our training and validation set:
PATH_TO_LABEL_MAP_DIR=/tensorflow/models/research/object_detection/data/pascal_label_map.pbtxt
PATH_TO_TFRECORD_OUTPUT=/data/VOCdevkit
# from /tensorflow/models/research
python object_detection/create_pascal_tf_record.py \
--label_map_path=${PATH_TO_LABEL_MAP_DIR} \
--data_dir=${PATH_TO_TFRECORD_OUTPUT} \
--year=VOC2007 \
--set=trainval \
--output_path=/data/pascal_trainval.record
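After the conversion finishes, a quick sanity check is to count how many examples ended up in the .record file. The sketch below is a dependency-free way to do that, assuming the standard TFRecord framing (an 8-byte little-endian length, a 4-byte length checksum, the payload, then a 4-byte payload checksum). count_tfrecords is a hypothetical helper and it does not verify the checksums; with TensorFlow available you would normally use its own record reader instead.

```python
import struct


def count_tfrecords(path):
    """Count records in a TFRecord file by walking its length-prefixed frames.

    Assumed framing per record: uint64 length (little-endian), 4-byte CRC of
    the length, `length` payload bytes, 4-byte CRC of the payload.
    The CRCs are skipped, not validated.
    """
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated record header")
            (length,) = struct.unpack("<Q", header)
            expected = 4 + length + 4  # length CRC + payload + payload CRC
            if len(f.read(expected)) < expected:
                raise ValueError("truncated record body")
            count += 1
    return count
```

For example, calling count_tfrecords on /data/pascal_trainval.record inside the container should return one record per image in the chosen set.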
Step 3b — Make your own dataset
Instead of the Pascal VOC dataset, you can also bring your own images and construct your own dataset. To do this, we’ll have to tag and label the images, export them from VoTT in TensorFlow format and finally convert them to TFRecords format.
One labeling utility I would recommend is the Visual Object Tagging Tool (VoTT). Using VoTT, we can easily tag and label images and videos. You can download VoTT for Windows and macOS from here.
After you download it, you can open the image folder and start labeling.
Once you are done labeling, export it in TensorFlow format.
Just like in the step above, we will have to convert it to TFRecords format for TensorFlow Object Detection.
The process is the same as in step 3a, but because the exported dataset is structured a little differently, you have to use a more generic exporter instead of create_pascal_tf_record.py. You can find generic_create_pascal_tf_record.py in the project repo to convert your own dataset exported with VoTT.
Step 4 — Configuring our environment
The next step is to tweak any parameters and set up the input and label paths. You can download an example from my repo (faster_rcnn_resnet101_voc07.config).
At a minimum, make sure to replace every PATH_TO_BE_CONFIGURED placeholder with your relevant paths for fine_tune_checkpoint, input_path and label_map_path, for both training and evaluation.
If you are following the guide as is, with /data/tensorflow on the host mounted at /data in the container, the paths as seen from inside the container are:
fine_tune_checkpoint should be /data/faster_rcnn_resnet101_coco_11_06_2017/model.ckpt
input_path should be /data/pascal_trainval.record for both the train_input_reader and eval_input_reader sections
label_map_path should be /tensorflow/models/research/object_detection/data/pascal_label_map.pbtxt
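A forgotten placeholder is an easy mistake to make here, so it can help to scan the config before starting a long training run. The following is a minimal sketch, assuming the config is plain text and that unset paths still contain the literal string PATH_TO_BE_CONFIGURED. find_unconfigured_lines is a hypothetical helper, and the sample config fragment and its paths are made up for illustration.

```python
PLACEHOLDER = "PATH_TO_BE_CONFIGURED"

# Made-up config fragment: one path is set, two still contain the placeholder.
SAMPLE_CONFIG = """\
fine_tune_checkpoint: "/path/to/model.ckpt"
train_input_reader {
  tf_record_input_reader {
    input_path: "PATH_TO_BE_CONFIGURED/pascal_trainval.record"
  }
  label_map_path: "PATH_TO_BE_CONFIGURED/pascal_label_map.pbtxt"
}
"""


def find_unconfigured_lines(config_text, placeholder=PLACEHOLDER):
    """Return (line_number, stripped_line) for every line still holding the placeholder."""
    return [
        (number, line.strip())
        for number, line in enumerate(config_text.splitlines(), start=1)
        if placeholder in line
    ]


if __name__ == "__main__":
    for number, line in find_unconfigured_lines(SAMPLE_CONFIG):
        print("line %d still needs a path: %s" % (number, line))
```

An empty result means every placeholder has been replaced; it does not check that the paths actually exist.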
You can also configure other options such as data_augmentation_options, which augments your training data with transformations like random horizontal flip or rotation. You can find all available options at models/preprocessor.proto
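To give a feel for what an augmentation like random horizontal flip has to do beyond mirroring the pixels, here is a small sketch of how a bounding box changes when an image is flipped left to right. It assumes boxes are stored as (ymin, xmin, ymax, xmax) with coordinates normalized to the 0 to 1 range; flip_box_horizontally is a hypothetical helper for illustration, not the API's implementation.

```python
def flip_box_horizontally(box):
    """Mirror a normalized (ymin, xmin, ymax, xmax) box around the vertical center line.

    The vertical extent is unchanged. The old right edge becomes the new left
    edge (1 - xmax) and the old left edge becomes the new right edge (1 - xmin).
    """
    ymin, xmin, ymax, xmax = box
    return (ymin, 1.0 - xmax, ymax, 1.0 - xmin)


if __name__ == "__main__":
    original = (0.2, 0.2, 0.8, 0.6)
    print(flip_box_horizontally(original))
```

Flipping the same box twice returns the original box, which is a handy property to check when writing your own augmentations.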
Step 5 — Train and Evaluation
In this step, we will start our training and evaluation. This process will take a while. The sample configuration in faster_rcnn_resnet101_voc07.config trains the model for 1000 steps, pausing after 200 steps to run evaluation before continuing to 1000 steps.
PATH_TO_YOUR_PIPELINE_CONFIG=/data/faster_rcnn_resnet101_voc07.config
PATH_TO_TRAIN_DIR=/data/train
PATH_TO_EVAL_DIR=/data/eval
# from /tensorflow/models/research
python object_detection/train_eval.py \
--logtostderr \
--pipeline_config_path=${PATH_TO_YOUR_PIPELINE_CONFIG} \
--train_dir=${PATH_TO_TRAIN_DIR} \
--eval_dir=${PATH_TO_EVAL_DIR}
Run Tensorboard
Either during the above process (in a new tab or window) or after it, we can check the progress of training and evaluation using TensorBoard.
Run the following anywhere inside our container:
tensorboard --logdir=/data/
Run Jupyter Notebook
We can also run the sample Jupyter notebook to check if everything is working correctly.
Run the following anywhere inside our container:
jupyter notebook --allow-root
Export
To prepare to serve our model later, let’s export our inference graph as a frozen model.
Run the following inside our container:
# from tensorflow/models/research/
python object_detection/export_inference_graph.py \
--input_type encoded_image_string_tensor \
--pipeline_config_path ${PATH_TO_YOUR_PIPELINE_CONFIG} \
--trained_checkpoint_prefix /data/train/model.ckpt-##### \
--output_directory /data/export
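The ##### in the checkpoint prefix above stands for the step number of the checkpoint you want to export. As a minimal sketch of how to find the most recent one, the helper below scans a training directory for files named model.ckpt-<step>.index and returns the prefix with the highest step. The file naming is an assumption about how checkpoints are written to the train directory, and latest_checkpoint_prefix is a hypothetical helper, not part of the API.

```python
import os
import re

# Assumed checkpoint index file naming: model.ckpt-<step>.index
_INDEX_PATTERN = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_prefix(train_dir):
    """Return '<train_dir>/model.ckpt-<highest step>', or None if no checkpoint is found."""
    steps = []
    for filename in os.listdir(train_dir):
        match = _INDEX_PATTERN.match(filename)
        if match:
            steps.append(int(match.group(1)))
    if not steps:
        return None
    # Compare steps as integers: a plain string sort would rank 200 above 1000.
    return os.path.join(train_dir, "model.ckpt-%d" % max(steps))


if __name__ == "__main__":
    print(latest_checkpoint_prefix("/data/train"))
```

The returned prefix is what you would pass as --trained_checkpoint_prefix in the export command.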
Conclusion
Even though Docker containers made the training process much simpler, this was still fairly manual work, and we only used 1 GPU inside 1 VM, so our training could be much faster and more efficient.
In Part 2, we will not only look into automating this using Kubernetes, but also learn how to train and scale with distributed TensorFlow.
If you have any questions or comments, please leave a comment below or reach out to me on Twitter @sozercan