Object Detection Tutorial with torchvision

Pei I Chen
Jumio Engineering & Data Science
6 min read · Oct 22, 2020

Torchvision, PyTorch's computer vision library, makes it easy to use pre-trained models in computer vision applications. This is particularly convenient when a basic pre-trained model is used for inference to find the same target objects as those in the dataset used for pre-training. When working with a customized dataset, however, the model often needs to be re-trained and adapted to the more specialized use case.

For example, a use case might require only a single-class detector with a low incidence of objects per image. Modifying Faster R-CNN from multi-class to single-class mode can yield higher performance and shorter training time in this scenario. Our team encountered such use cases and wants to share how we achieved this conversion, including the conceptual framework and the API parameter changes.

Dataset

The first step is building a customized dataset for the torch DataLoader to use in training. The example code in the official torchvision object detection tutorial provides a good reference for this task. To get the best performance from a single-class detector, the training dataset should include challenging negative samples, which helps the model learn to differentiate between true positives and similar-looking false cases. For example, the training dataset for a Pikachu detector might include images of Ditto, Mimikyu, and the real Pikachu.

The negative samples don't need bounding box annotations, so they are easy to prepare; note, however, that torchvision only supports negative samples in version 0.6.0 or newer. The __getitem__() method in the dataset class should resemble this snippet:
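Below is a minimal sketch of such a dataset class. The image_paths and annotations lists are hypothetical placeholders (the latter holding [xmin, ymin, xmax, ymax] boxes, empty for negative samples); the key detail is returning an empty (0, 4) box tensor for negative samples:

```python
import torch
from torch.utils.data import Dataset
from PIL import Image

class SingleClassDataset(Dataset):
    def __init__(self, image_paths, annotations, transforms=None):
        self.image_paths = image_paths    # hypothetical list of image file paths
        self.annotations = annotations    # per-image list of boxes; [] for negatives
        self.transforms = transforms

    def __getitem__(self, idx):
        img = Image.open(self.image_paths[idx]).convert("RGB")
        boxes = self.annotations[idx]
        if len(boxes) == 0:
            # Negative sample: empty (0, 4) box tensor, supported in torchvision >= 0.6.0
            boxes = torch.zeros((0, 4), dtype=torch.float32)
            labels = torch.zeros((0,), dtype=torch.int64)
        else:
            boxes = torch.as_tensor(boxes, dtype=torch.float32)
            labels = torch.ones((boxes.shape[0],), dtype=torch.int64)  # single class = 1
        target = {
            "boxes": boxes,
            "labels": labels,
            "image_id": torch.tensor([idx]),
            "area": (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1]),
            "iscrowd": torch.zeros((boxes.shape[0],), dtype=torch.int64),
        }
        if self.transforms is not None:
            img, target = self.transforms(img, target)
        return img, target

    def __len__(self):
        return len(self.image_paths)
```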

Model

Many existing posts already explain how Faster R-CNN works, so we won't rehash that topic. Instead, following a brief examination of the overall structure of Faster R-CNN, let's focus on the changes necessary to adapt the pre-trained model into a single-class detector.

Structure Overview

Faster R-CNN contains a CNN that extracts feature maps from input images. These feature maps are used both for proposing candidate regions and for predicting the score and refining the bounding box in each proposed region. One of the feature extraction models available in torchvision is based on ResNet-50 and trained on the COCO train2017 dataset, which covers around 90 categories. The extracted feature maps are sent to a Region Proposal Network (RPN) that predicts the regions (bounding boxes) likely to contain target objects, attaching a predicted objectness score to each proposed region. Finally, the top N proposed regions and the feature maps are passed to the classifier, which decides whether each region contains a target object and refines its bounding box.

Overall structure of Faster-RCNN. Source: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
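For orientation, these three stages map directly onto attributes of the torchvision model. A quick sketch, using the pre-trained ResNet-50 FPN model:

```python
import torchvision

# Load the pre-trained detector (COCO train2017 weights) and inspect its stages
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
print(model.backbone)   # CNN that extracts the feature maps (ResNet-50 + FPN)
print(model.rpn)        # Region Proposal Network
print(model.roi_heads)  # Fast R-CNN classifier and box refinement
```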

Region Proposal Network (RPN)

The RPN is a small, specialized network that slides over the feature maps and predicts which regions contain target objects. It assigns each prediction an 'objectness' score, indicating the probability that it contains a target object. Because objects come in different shapes and scales, the RPN must propose regions of different sizes and shapes to capture them. The combinations of scale and aspect ratio for these regions are referred to as anchors.

In torchvision, the initialization of these anchors is defined in the AnchorGenerator. For the proposed use case, the RPN only needs to consider a single object class, so the sizes and aspect ratios of the anchors may be tailored to the target object. For example, a bounding box around Pikachu, tail included, is usually roughly square, with a height-to-width ratio close to 1. The aspect_ratios may therefore be narrowed from the default (0.5, 1.0, 2.0) to (0.75, 1.0, 1.5), or even just (1.0,), which the original paper reports performs comparably.
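As a sketch, the anchors for the FPN-based model could be redefined like this; the generator expects one sizes/aspect_ratios tuple per feature map level, and the values here are illustrative rather than tuned:

```python
from torchvision.models.detection.rpn import AnchorGenerator

# One (sizes, aspect_ratios) pair per FPN feature map level; keeping three
# ratios per location preserves the shape of the pre-trained RPN head.
anchor_generator = AnchorGenerator(
    sizes=((32,), (64,), (128,), (256,), (512,)),
    aspect_ratios=((0.75, 1.0, 1.5),) * 5,
)
```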

During training, not all anchors are used in calculating the RPN loss. Only anchors whose IoU (Intersection-over-Union) with a ground-truth box is larger than fg_iou_thresh, or smaller than bg_iou_thresh, are considered positive or negative anchors, respectively, and may contribute to the loss function. Anchors that are neither positive nor negative have no influence on training.

To prevent negative anchors from being overly dominant, only a certain number of anchors are sampled for the loss calculation. This number is defined by the batch_size_per_image parameter, and the sample pool is further shaped by the positive_fraction parameter, which sets the desired fraction of positive samples in the batch. Per the source code, when no positive anchors are found (as with intentionally added negative samples), the number of proposals used in the loss function is simply batch_size_per_image.

The Faster R-CNN RPN frequently generates overlapping proposals. To reduce redundancy, it applies Non-Maximum Suppression (NMS), discarding any proposal that overlaps a higher-scoring proposal with an IoU above 0.7. After NMS, the top N proposals (2,000 in the original work) are passed to the final classifier, Fast R-CNN. All default values for the number of anchors and proposals are tuned for multi-class detection. The proposed scenario, however, includes only one or a few targets per image, so these parameters may be lowered to reduce the number of proposed regions and speed up training. The code snippet below shows a suggested RPN configuration to accomplish this.
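A sketch of how these knobs might be turned down when building the model; the keyword arguments below are forwarded to the FasterRCNN constructor, and the specific values are illustrative rather than tuned:

```python
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    pretrained=True,
    rpn_anchor_generator=anchor_generator,  # from the snippet above
    rpn_fg_iou_thresh=0.7,                  # IoU above this => positive anchor
    rpn_bg_iou_thresh=0.3,                  # IoU below this => negative anchor
    rpn_batch_size_per_image=128,           # anchors sampled for the RPN loss (default 256)
    rpn_positive_fraction=0.5,              # desired fraction of positive anchors
    rpn_pre_nms_top_n_train=1000,           # proposals kept before NMS (default 2000)
    rpn_post_nms_top_n_train=500,           # proposals kept after NMS (default 2000)
    rpn_pre_nms_top_n_test=500,             # (default 1000)
    rpn_post_nms_top_n_test=250,            # (default 1000)
)
```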

Fast R-CNN

After obtaining the region proposals from the RPN, the classifier (Fast R-CNN) uses the extracted feature maps to inspect the features in these regions. Its goal is to predict the final class scores and to refine the bounding boxes around the target objects. This classifier needs an RoI pooling layer, proposed in the earlier Fast R-CNN work, to take input regions of different sizes and produce fixed-size feature maps. The classifier is called roi_heads in the API.

The following snippet illustrates how to replace the default classifier with one dedicated to our specific scenario. In a single-class scenario the number of regions may also be reduced, and the snippet shows how to accomplish this as well.
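A sketch of swapping in a two-class box predictor (background plus the single target class) and shrinking the region counts; the attribute names follow the torchvision implementation, and the values are illustrative:

```python
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 2  # class 0 is background, class 1 is the single target (e.g., Pikachu)
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Fewer proposals sampled per image for the classification loss (default 512)
model.roi_heads.fg_bg_sampler.batch_size_per_image = 256
# Fewer final detections per image, suiting sparse scenes (default 100)
model.roi_heads.detections_per_img = 10
```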

Training the Model

After building the model and dataset, the next step is to train the model properly. The official tutorial employs helper functions that wrap all the steps of training each epoch, including learning rate warmup (in the first epoch only), updating the model weights, and logging the losses. The helper functions print the loss every print_freq iterations, and the metric_logger returned for each epoch may be used to plot the loss across training iterations, helping visualize whether the model is converging well.

The above figure shows the loss bouncing around rather than decreasing, suggesting the default learning rate of 0.005 might be too large, making it difficult for the model to settle into a minimum of the loss function. Changing this value to 0.0005 in the torch optimizer gives much better convergence, as illustrated below.

The whole training process might resemble the following snippet.
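A sketch of the full loop, assuming train_one_epoch comes from the detection reference scripts (references/detection/engine.py) used by the official tutorial, and that model and dataset were built as in the earlier snippets:

```python
import torch
from engine import train_one_epoch  # from torchvision's references/detection scripts

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

data_loader = torch.utils.data.DataLoader(
    dataset, batch_size=4, shuffle=True, num_workers=2,
    collate_fn=lambda batch: tuple(zip(*batch)),  # targets differ in size per image
)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.0005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

num_epochs = 10
for epoch in range(num_epochs):
    # train_one_epoch handles LR warmup (first epoch), weight updates, and loss logging
    metric_logger = train_one_epoch(model, optimizer, data_loader, device, epoch, print_freq=10)
    lr_scheduler.step()
```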

Conclusion

This article demonstrates one method of converting an off-the-shelf Faster R-CNN model into a single-class detector with torchvision. Simplifying the task objective allows introducing well-defined hard negative data into the training dataset, tuning down the anchor hyper-parameters, and revising the training process for better model performance. This approach will not always work when the single-class targets are small and blurry, as in a pedestrian detector, due to the limitations of Faster R-CNN. Nevertheless, it should help in training other object detectors with as much flexibility as desired, without having to build a new model layer by layer.
