Object Detection with Custom Vision

[Part 2] A Baseline for an Object Detection Task, using Azure Cognitive Services

Julián Gutiérrez Ostrovsky
Hexacta Engineering
6 min read · Jun 5, 2020

--

Let’s catch up

This post follows this one, in which we tackled a specific object detection task based on Kaggle’s Global Wheat Detection competition and tried several frameworks, weighing pros and cons to help decide which one best fits our needs. But we left Microsoft’s part of the game out of that post, because it works quite differently from the rest. We’ll cover it now, and then you’ll probably have a bigger picture.

We are going to review the main concepts of the Custom Vision module of Azure Cognitive Services, and we will give it a shot by training it on our wheat dataset.

Custom Vision is a SaaS (software as a service) offering that trains and deploys a model as a REST API from a user-provided training set. The basic steps are image upload, tag annotation, model training, and deployment.

You can perform these tasks on the customvision.ai website by hand, or with the Python SDK. Because of the size of our dataset, and since we already have tag annotations in an .xlsx file, we will take the latter route.
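As a rough sketch of that first SDK step, creating an object-detection project looks something like this. The endpoint, key, and project name are placeholders you would get from the Azure portal, and exact calls may vary between SDK versions:

```python
def create_wheat_project(endpoint, training_key, name="wheat-detection"):
    """Create a Custom Vision object-detection project (sketch only).

    `endpoint`, `training_key` and `name` are placeholders taken
    from your own Azure resource.
    """
    # Imports kept local so the rest of the file stays importable
    # even where the azure SDK is not installed.
    from azure.cognitiveservices.vision.customvision.training import (
        CustomVisionTrainingClient,
    )
    from msrest.authentication import ApiKeyCredentials

    credentials = ApiKeyCredentials(in_headers={"Training-key": training_key})
    trainer = CustomVisionTrainingClient(endpoint, credentials)

    # Pick the object-detection domain (rather than classification).
    domain = next(d for d in trainer.get_domains() if d.type == "ObjectDetection")
    project = trainer.create_project(name, domain_id=domain.id)
    return trainer, project
```

From here on, `trainer` and `project.id` are what every later call (upload, training, publishing) hangs off of.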

What’s underneath the Custom Vision service remains hidden (and therefore out of reach) from us. Searching around, we found claims here that it is based on deep neural networks and uses the ResNet, AlexNet, and ResNet-152 architectures for image classification. We didn’t find any official source to back up this information, so we are not quite sure it’s entirely accurate.

Getting Started on Model Training

Specifically for this competition, cloud computing is off the table, and there’s no way to download the final trained model from the Azure portal. But we will train it anyway, just to learn.

As we mentioned before, we have over 3k images with more than 14k boxes in total. The web portal has no option to bulk-upload images, nor to import a file with tag boxes in any format.

The Custom Vision website feels more like a toy tool for a small project: upload a modest number of images, annotate them by hand, and get quick results to run and show your boss.

Of course, if that is your case, this could be a great tool for you. It has everything you need to create your dataset, train a model, measure it, and use it.

With other frameworks, generating the images with boxes and adapting them not only to your model but also to what the framework expects can be very time-consuming. That problem is solved here.

In order to do anything, you should sign up for Azure Custom Vision. This is a paid service; nevertheless, we chose the free tier, which looks more than enough for trying it out. You can upload up to 100k images, create at most 500 tags, and keep 10 model iterations alive at the same time.

Azure Custom Vision provides incomplete but sufficient documentation to write a Python file (or a few notebook cells) that creates a new project from scratch, uploads images with their normalized boxes, trains the model, and predicts on new images. You won’t be able to blindly paste the suggested code and have it fulfill every need, but it’s clear enough to identify all these steps and make the required changes. After training, a prediction URL becomes available that you can hit with an image to get prediction scores.
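The train-then-publish step from the docs can be sketched roughly as follows. `publish_name` and `prediction_resource_id` are placeholders from your own Azure resource, and the polling interval is arbitrary:

```python
import time


def train_and_publish(trainer, project_id, publish_name, prediction_resource_id):
    """Kick off a training iteration and publish it behind the prediction URL.

    `trainer` is a CustomVisionTrainingClient; the other arguments are
    placeholders from your Azure resource. Sketch only.
    """
    iteration = trainer.train_project(project_id)

    # Training runs asynchronously: poll until the iteration completes.
    while iteration.status != "Completed":
        time.sleep(10)
        iteration = trainer.get_iteration(project_id, iteration.id)

    # Publishing is what exposes this iteration at the prediction URL.
    trainer.publish_iteration(
        project_id, iteration.id, publish_name, prediction_resource_id
    )
    return iteration
```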

A sample of the dataset
The image uploader. Keys are the image filenames

So far we have seen what our dataset looks like, and a function to upload images. We don’t upload all the images at once: Azure currently supports a maximum of 64 images per upload. So, for every image name we collect its boxes, and once we reach the batch size we upload those images to the Azure server and keep iterating.
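A minimal sketch of that batching logic, assuming 1024×1024 images as in the wheat dataset and pixel boxes stored as (xmin, ymin, width, height); the SDK model classes come from the `training.models` module:

```python
BATCH_LIMIT = 64  # Azure's current cap on images per upload request


def to_region(xmin, ymin, box_w, box_h, img_w=1024, img_h=1024):
    """Normalize a pixel-space box to the [0, 1] coordinates the SDK expects."""
    return {"left": xmin / img_w, "top": ymin / img_h,
            "width": box_w / img_w, "height": box_h / img_h}


def batches(items, size=BATCH_LIMIT):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def upload_all(trainer, project_id, tag_id, boxes_by_filename):
    """Upload images batch by batch (sketch only).

    `boxes_by_filename` maps an image filename to its list of
    (xmin, ymin, w, h) pixel boxes, mirroring our dataset.
    """
    from azure.cognitiveservices.vision.customvision.training.models import (
        ImageFileCreateBatch, ImageFileCreateEntry, Region,
    )

    for chunk in batches(list(boxes_by_filename.items())):
        entries = []
        for filename, boxes in chunk:
            regions = [Region(tag_id=tag_id, **to_region(*b)) for b in boxes]
            with open(filename, "rb") as f:
                entries.append(ImageFileCreateEntry(
                    name=filename, contents=f.read(), regions=regions))
        trainer.create_images_from_files(
            project_id, ImageFileCreateBatch(images=entries))
```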

Model Score, Fine Tuning and Iteration

After about 2 hours uploading images and training those 4 iterations, we hit F5 on the Custom Vision website, go to the Performance tab and…

Scores after each train iteration for a 25% overlap

Given the little effort we made, we got some really good scores, at least for an overlap threshold of 0.25, which is low. At 0.75 it isn’t that accurate: we got a mAP of 7.5%, and at a 0.5 threshold a mAP of 69.5%. There are some other problems too, such as more than one prediction for the same tag and several false positives (which is reflected in the precision score).
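For reference, the “overlap” behind those thresholds is the intersection over union (IoU) between a predicted box and a ground-truth box; a prediction counts as a hit only when its IoU exceeds the threshold. A quick sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Intersection rectangle (empty if the boxes don't overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)
```

A box shifted halfway across an identical one scores an IoU of only 1/3, already failing a 0.5 threshold, which helps explain why the numbers drop so sharply as the threshold rises.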

Red boxes are Azure predictions. Light blue are False Positive; Purple are missed.

Even so, compared against faster_rcnn_inception_resnet_v2_atrous_coco_2018_01_28 from the TensorFlow OD API trained for 5175 steps, we got similar mAP results for a 0.5 threshold, and as you may know if you read Part 1 of this post, it wasn’t easy to get there.

mAP results from the TensorFlow model after 5175 steps

At this point we fancy ourselves object detection experts, ready to start fine-tuning those complex deep neural nets to improve our results. We are sorry: you cannot.

According to the How to Improve Your Classifier section of the documentation, all you can do is: add more images and balance the data, then retrain; add images with varying background, lighting, object size, camera angle, and style, then retrain; use new images to test predictions; and modify the existing training data according to the prediction results… OK. We are truly in the presence of a black box.

Summary

As a final conclusion on this Azure service: if you quickly need a PoC, or a demo for a potential client, this is a very powerful tool. But it doesn’t seem to offer much more than that, so for a big project you will need to think twice about whether Custom Vision is the best option for you.

Every alternative described in the previous post is much more robust, flexible, and malleable. But, to be fair, each also comes with a steeper learning curve and probably more demanding hardware requirements.

PROS

  • With very little effort, and very quickly, we got a powerful trained model for object detection
  • It’s a highly visual tool where you can browse your whole dataset and get metrics and tests for your model
  • After training, it automatically generates a prediction URL so you can quickly build an API and predict on new, unseen images (there’s official documentation for this too).
  • The learning curve is really shallow, and you barely need knowledge of the subject.
  • You don’t need any special hardware at all.
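For instance, hitting that prediction endpoint from Python can be sketched like this; the endpoint, key, and published iteration name are all placeholders:

```python
def detect_wheat(endpoint, prediction_key, project_id, publish_name,
                 image_path, min_probability=0.5):
    """Send an image to the published iteration and keep confident boxes.

    All credentials and names are placeholders; sketch only.
    """
    # Imports kept local so the file stays importable without the SDK.
    from azure.cognitiveservices.vision.customvision.prediction import (
        CustomVisionPredictionClient,
    )
    from msrest.authentication import ApiKeyCredentials

    credentials = ApiKeyCredentials(in_headers={"Prediction-key": prediction_key})
    predictor = CustomVisionPredictionClient(endpoint, credentials)

    with open(image_path, "rb") as f:
        results = predictor.detect_image(project_id, publish_name, f.read())

    # Each prediction carries a tag name, a probability and a normalized box.
    return [
        (p.tag_name, p.probability, p.bounding_box)
        for p in results.predictions
        if p.probability >= min_probability
    ]
```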

DRAWBACKS

  • There’s no way to fine-tune the model. The possibility of adapting training to our data is almost nonexistent.
  • We couldn’t find enough documentation for all the Custom Vision classes we used in this project, so it’s hard to explore possibilities or paths other than the one described in this post.
  • There’s no way to download a frozen trained model for local use. It all stays in the cloud.
  • It’s a service that looks more suitable for PoCs or playing with toy data than for serious professional development, despite being a paid tool and an Azure service.

Here’s the link to the complete notebook.
