Azure AutoML for Images: Baseline and beyond for Computer Vision models

Mercy Prasanna · Published in Microsoft Azure · 10 min read · Nov 29, 2021

Establish a baseline performance for State-of-the-Art (SoTA) computer vision models with ease and use tuning to take them a step further.

Co-authored by YiYou Lin.

Introduction

Today’s world of Computer Vision offers different models for tasks like Image Classification, Object Detection, and Instance Segmentation. Each of these models has its strengths: some are faster, some are more accurate, and some are good at specifics like identifying small objects or detecting anomalies. The optimal model depends on the user scenario at hand.

In this blog post, we will cover:

  1. How data scientists can quickly explore different models for computer vision tasks using automated machine learning capabilities for computer vision in Azure Machine Learning. For this post, we will use object detection as an example.
  2. How one can tune the hyperparameters of these models using specific insights from the training data and the baseline model performance obtained in step 1.

What is Azure AutoML for Images

With Azure AutoML for Images, users can explore and select from a variety of SoTA algorithms for a computer vision task and optionally tune the hyperparameters to optimize model performance with ease.

Automated ML solutions for computer vision use pretrained versions of state-of-the-art models and finetune them on an incoming dataset, providing a trained model with less effort required from the user. Some solutions are more black-box and closed in nature, while others provide more transparency and flexibility by putting a powerful toolbox at the user’s disposal.

Azure AutoML for Images falls under the latter category; it supports Image Classification, Object Detection, and Instance Segmentation tasks and allows users to try out different models for these tasks without the need to write any training code. It also helps them to tune the model hyperparameters with ease in a cost-effective manner.

Computer Vision Tasks supported by Azure AutoML for Images

Pre-requisites

Before we start with the workflow steps of model building with Azure AutoML for Images, we will need to set up an Azure Machine Learning workspace where all the machine learning artifacts for the experiments are maintained. We will also need to set up an automl client environment and a compute target for executing the runs. These pre-requisites are covered in the AutoML Object Detection tutorial.

Workflow

Below is a high-level summary of the steps in the model building workflow using Azure AutoML for Images, which we will discuss in detail in the subsequent sections.
1. Data Preparation — In this step, we annotate the images with labels to set up ground truth information and create a dataset from the annotations for training the model.
2. Model Sweep — We start by setting up performance goals that need to be satisfied by our model in solving the business problem at hand. We then try different models at one go and choose the best models. In this step, we try the models with their default settings for the hyperparameters.
3. Hyperparameter Sweep — In this optional step, users can further tune the model hyperparameters based on insights derived from the model sweep runs in the previous step and dataset characteristics that can be derived by analyzing the training dataset.

Model Building Pipeline — Azure AutoML for Images

We will walk through the workflow steps with a sample dataset, Kitti.

Kitti: Object Detection in Autonomous Driving scenario

The Kitti dataset is one of the most well-known datasets in the field of autonomous driving, consisting of real-world, high-resolution images for computer vision tasks such as 2D/3D object detection. We will be using the 2D object detection dataset for this blog.

2D Object Detection with Kitti Dataset

Data Preparation

To use an image dataset with Azure AutoML for Images, the data preparation steps can be as simple as:

1. Create a JSONL annotation file locally following this schema (Or label the images with Azure ML Labeling Tool).
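For illustration, a single record in the annotation JSONL could look like the following. It is shown pretty-printed here, but each record occupies one line in the actual file; the field values are made up for this sketch, and the box coordinates are normalized to [0, 1]:

```json
{
  "image_url": "AmlDatastore://kitti_datastore/images/000045.png",
  "image_details": {"format": "png", "width": 1242, "height": 375},
  "label": [
    {
      "label": "Car",
      "topX": 0.31, "topY": 0.52,
      "bottomX": 0.42, "bottomY": 0.71,
      "isCrowd": 0
    }
  ]
}
```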

2. Upload the annotations and images to Azure Storage and register the dataset using the below lines of Python code.

Register Annotations as Tabular Dataset
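A sketch of that registration code, assuming the Azure ML Python SDK v1 (as used by the AutoML for Images preview) and that the images and a train_annotations.jsonl file have already been uploaded to the workspace’s default datastore; the paths and names here are illustrative:

```python
from azureml.core import Dataset, Workspace
from azureml.data import DataType

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Parse the JSONL annotations into a tabular dataset; typing image_url as a
# stream lets AutoML mount/download the images during training.
training_dataset = Dataset.Tabular.from_json_lines_files(
    path=datastore.path("kitti/train_annotations.jsonl"),
    set_column_types={"image_url": DataType.to_stream(ws)},
)
training_dataset = training_dataset.register(
    workspace=ws, name="kitti-object-detection-train"
)
```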

Model Sweep for Baseline Performance

Before we look at how we can explore the models supported by Azure AutoML for Images, we will set up model performance goals to guide the choice of models we want to explore.

Setup model performance goals

While there are many models available for a computer vision task, there is always a trade-off between accuracy and model size that impacts inference speed and memory requirements. The mAP vs. FLOPs (floating point operations) trade-off plot shown below, generated using model performance reported in papers on the COCO dataset, illustrates this behavior. The choice of model depends on the user scenario. For example, if you want to deploy a model on a mobile device, you might have to sacrifice some accuracy by choosing a more lightweight model.

mAP Vs FLOPs Trade-off — COCO

For this Kitti demo, we will need to understand the usage scenario to set up model performance goals. Let us do that by asking some questions:

  • Do we want to compete on the leaderboard? If so, we may use the Yolov5 extra-large model or the largest Faster R-CNN model, which yield the best results.
  • Do we want to build a small model for a road surveillance scenario to count road traffic? If so, we want a smaller model like Yolov5 small, as we do not care about precision that much. For example, it does not matter if a Van is misidentified as a Car.
  • Do we want a model optimized to detect small objects, like cyclists or vehicles captured at long range in high-resolution images? If so, enabling techniques like tiling can help, at the cost of some speed. Refer to the small object detection tutorial to understand how tiling works.

For this blog, we will keep it simple — we will just maximize the mAP (Mean Average Precision) for all vehicle and people-related classes at a reasonable speed.

Model Sweep (With Default Settings)

With Azure AutoML for Images, the baseline performance of models can be quickly established by using grid sampling to sweep over a choice of models with default hyperparameter settings. The Configure algorithms documentation lists the available models for different computer vision tasks.

For example, we can quickly sweep over Yolov5, Faster R-CNN with ResNet50-FPN, and RetinaNet with ResNet50-FPN on Kitti with the below lines of code:

Model Sweep with Default Settings for Hyperparameters — Azure AutoML for Images
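A sketch of such a sweep with the SDK v1 preview API; the experiment name is illustrative, compute_target and the datasets are assumed to come from the pre-requisites and data preparation steps, and the exact model_name values are listed in the Configure algorithms documentation:

```python
from azureml.core import Experiment
from azureml.train.automl import AutoMLImageConfig
from azureml.automl.core.shared.constants import ImageTask
from azureml.train.hyperdrive import GridParameterSampling, choice

# Grid sampling over model_name gives one run per architecture,
# each trained with default hyperparameter settings.
image_config = AutoMLImageConfig(
    task=ImageTask.IMAGE_OBJECT_DETECTION,
    compute_target=compute_target,        # GPU compute from the pre-requisites
    training_data=training_dataset,
    validation_data=validation_dataset,
    hyperparameter_sampling=GridParameterSampling(
        {"model_name": choice("yolov5", "fasterrcnn_resnet50_fpn",
                              "retinanet_resnet50_fpn")}
    ),
    iterations=3,                         # one run per model
)

automl_run = Experiment(ws, "kitti-model-sweep").submit(image_config)
automl_run.wait_for_completion(wait_post_processing=True)
```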

Compare the Results from Models

It is also easy to compare the results of the different runs in the model sweep experiment by retrieving their metrics.

mAP Performance — Model Sweep Runs with Default Settings for Hyperparameters [default no. of epochs: 30 for Yolov5, 15 for others]
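The comparison itself is plain Python once the metrics have been pulled from each child run of the sweep (for example via run.get_metrics() on the children). A minimal sketch, with illustrative per-epoch mAP curves rather than the actual run results:

```python
def best_baseline(map_curves):
    """Pick the model with the highest final mAP from per-epoch metric curves.

    map_curves maps model name -> list of per-epoch mAP values, e.g. as
    retrieved from each child run's "mean_average_precision" metric.
    """
    finals = {name: curve[-1] for name, curve in map_curves.items()}
    best = max(finals, key=finals.get)
    return best, finals[best]

# Illustrative curves (not the actual Kitti sweep results):
curves = {
    "yolov5": [0.61, 0.85, 0.92],
    "fasterrcnn_resnet50_fpn": [0.55, 0.78, 0.88],
    "retinanet_resnet50_fpn": [0.48, 0.72, 0.84],
}
print(best_baseline(curves))  # ('yolov5', 0.92)
```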

Just by looking at the above mAP plot from the model sweep experiment, we see that Yolov5 attained the best performance on this dataset, reaching an mAP of 0.922. The mAP trend for Yolov5 appears to have saturated at the default number of epochs, while RetinaNet and Faster R-CNN may still need a few more epochs to saturate. We will choose Yolov5 and Faster R-CNN for tuning.

Hyperparameter Tuning

Though the selected models gave us a great baseline with default hyperparameter settings, we may still choose to tune them further to squeeze out the last bit of performance on our dataset. To do this, we will run some quick dataset analysis to gather insights that can help while tuning.

Kitti Dataset Analysis

The Image Resolution — In the Kitti dataset, the image resolutions are almost constant at 375 × 1242 with an aspect ratio of roughly 1:3. Images are resized while being passed through the model, and this downsizing can impact performance negatively. For example, as the default input image size of Yolov5 is 640 × 640 with an aspect ratio of 1:1 (refer to the default value of the img_size hyperparameter for Yolov5 in the documentation), Kitti images will be downsized to 256 × 640 to fit the model.

Downsizing images to fit the model can impact model performance

Yolov5 may perform better when the image size is set to higher resolutions that match the dataset resolution. Hence, we will tune Yolov5 and Faster R-CNN with image resolutions larger than the default: the img_size hyperparameter for Yolov5 and the min_size hyperparameter for Faster R-CNN.

The Object Sizes — Scanning through the Kitti images, we find that many objects are easily identifiable by the human eye. However, about a quarter of the bounding boxes are small. The image below is a good demonstration of small bounding boxes for the Car class. If we shrink the images further, as described in the previous section, the objects become even smaller.

Small Objects in Kitti Dataset

The bounding box size distribution plot below, which shows bounding box width vs. height for all classes in Kitti, confirms that the object sizes are concentrated toward zero.

Bounding box size distribution plot — bbox width Vs bbox height for classes in Kitti Dataset

We can optimize model performance for small objects by tuning hyperparameters for tiling, a technique used in small object detection scenarios. But as optimizing for small objects is not our current goal, we will exclude the tiling hyperparameters from tuning.

Select the Most Promising Hyperparameters

Based on the insights we got from dataset analysis and based on the model sweep run outcomes, we make the below choices for the hyperparameter tuning experiment. Please refer to the documentation for the complete list of hyperparameters and their default values.

  • As we noted from the mAP plot of the model sweep runs, we can train Faster R-CNN for a greater number of epochs.
  • As we noted from the dataset analysis of Kitti, we will train Yolov5 with larger image resolutions apart from the default values by adjusting the img_size hyperparameter. We will also try with different values for min_size hyperparameter for Faster R-CNN.
  • We will use smaller batch sizes for the training and validation datasets, as we are using larger image resolutions, to prevent OOM (out-of-memory) issues. We will also adjust the grad_accumulation_step hyperparameter, which mimics larger batch sizes by accumulating gradients over several steps and performing a weight update only at the frequency set by this parameter.
  • Apart from the default value for the learning rate, we will also sweep over other learning rate values for both Yolov5 and Faster R-CNN.
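The interaction between batch size and grad_accumulation_step mentioned above boils down to simple arithmetic; a sketch, with illustrative numbers:

```python
def effective_batch_size(batch_size: int, grad_accumulation_step: int) -> int:
    """Gradients are accumulated over grad_accumulation_step forward/backward
    passes before a single weight update, mimicking a larger batch."""
    return batch_size * grad_accumulation_step

# With a larger img_size we may only fit 4 images per GPU batch, but
# accumulating over 4 steps updates weights as if the batch size were 16:
print(effective_batch_size(4, 4))  # 16
```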

Considering the above, let us see an example of a hyperparameter tuning configuration for the Kitti dataset where we simultaneously tune the hyperparameters of both Faster R-CNN and Yolov5:

Hyperparameter Tuning Experiment — Azure AutoML for Images
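A sketch of such a configuration with the SDK v1 preview API is below. The value ranges are illustrative choices for this walkthrough, not recommendations; refer to the documentation for the supported hyperparameters and their defaults:

```python
from azureml.train.automl import AutoMLImageConfig
from azureml.automl.core.shared.constants import ImageTask
from azureml.train.hyperdrive import RandomParameterSampling, choice, uniform

# Conditional search space: each inner dict only applies to its model_name.
parameter_space = {
    "model": choice(
        {
            "model_name": choice("yolov5"),
            "learning_rate": uniform(0.0001, 0.01),
            "img_size": choice(640, 960),        # larger input to match Kitti
            "training_batch_size": choice(8, 16),
        },
        {
            "model_name": choice("fasterrcnn_resnet50_fpn"),
            "learning_rate": uniform(0.0001, 0.001),
            "min_size": choice(600, 800, 960),   # larger input than the default
            "number_of_epochs": choice(15, 30),  # allow longer training
            "training_batch_size": choice(2, 4),
            "grad_accumulation_step": choice(2, 4),
        },
    )
}

image_config = AutoMLImageConfig(
    task=ImageTask.IMAGE_OBJECT_DETECTION,
    compute_target=compute_target,
    training_data=training_dataset,
    validation_data=validation_dataset,
    hyperparameter_sampling=RandomParameterSampling(parameter_space),
    iterations=20,
    max_concurrent_iterations=4,
)
```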

The above configuration makes use of conditional blocks to specify hyperparameter configurations specific for a model. Refer to the documentation for model specific and model agnostic hyperparameters.

Setting up a Tuning Budget

While performing hyperparameter tuning, it is important to control the cost of the tuning exercise. Setting a budget for the hyperparameter sweep with parameters like max_duration_minutes or max_total_runs constrains the duration of the run or the maximum number of configurations that will be tried.
In addition, it is also recommended to use early termination policies to terminate the worst-performing runs. Refer to the Termination policy documentation for the supported policies.
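For example, a Bandit early-termination policy can be attached to the sweep; the thresholds here are illustrative:

```python
from azureml.train.hyperdrive import BanditPolicy

# Stop runs whose primary metric trails the best run by more than 20%,
# evaluated every 2 intervals after an initial grace period of 6 intervals.
early_termination = BanditPolicy(
    evaluation_interval=2, slack_factor=0.2, delay_evaluation=6
)
```

The policy is passed to the sweep via the early_termination_policy argument of AutoMLImageConfig, alongside budget parameters such as iterations.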

Results Analysis

Ground Truth Image — Kitti
Predictions from the Model

The performance leaderboard of the various configurations tried in the hyperparameter tuning runs is shown below; the winning model is Yolov5 with an mAP of 0.945. Default values are used for any hyperparameters not specified in the sweep configuration.

Performance Leaderboard of Hyperparameter Tuning Runs — Azure AutoML for Images

Yolov5’s performance improved by 2.3 points after tuning, from 0.922 to 0.945. The best-tuned Faster R-CNN model uses 30 epochs instead of the default 15, confirming our assumption that its mAP had not saturated in the default runs and that training for more epochs helps. The best-tuned Yolov5 model uses a larger input size of 960, confirming our assumption that the images were being downsized at the default input size and that larger image resolutions improve performance.

Inference Configuration

As we observed from the hyperparameter tuning experiment, larger image resolutions worked best for both models. It is important that the images are resized to this value during prediction time as well to get similar performance. Refer to the documentation for configurations that can be specified during the inference time.

Conclusion

In this blog post, we covered how Azure AutoML for Images can significantly improve the productivity of data scientists by letting them explore multiple SoTA models for computer vision tasks in one go, without writing training code. It also offers plenty of flexibility and control over the choice of models and the hyperparameters to tune, helping you reach optimal model performance for your dataset in a cost-effective manner.

Resources

Sample notebooks:
