Few Shot Object Detection

Sai Sree Harsha
OffNote Labs
Feb 14, 2021

In this article, we will discuss the task of few-shot object detection from 2D images. We will first look at what exactly few-shot object detection is, and in the remaining two sections discuss two important approaches for the task, namely meta-learning and fine-tuning.

What is few-shot object detection?

Our ability to train machine learning models that generalise to novel concepts without abundant labelled data is still far from satisfactory when compared to the human visual system. Even an infant can easily comprehend new concepts from very few instructions. This ability to generalise from only a few labelled examples is called few-shot learning. Few-shot learning has become a key area of interest in the machine learning community and is an important yet unsolved problem in computer vision.

In the few-shot learning setting, the classes are partitioned into base classes, for which many labelled samples are available, and novel classes, for which only a few labelled samples exist. Specifically, in a K-shot object detection task, only K labelled bounding boxes are available for each of the novel classes. The goal is to transfer the knowledge learned on the base classes, with their abundant samples, to the under-represented novel classes, so that the model can effectively detect objects belonging to the novel classes even though it has seen only K instances of each during training.
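To make the setting concrete, here is a minimal sketch of how a K-shot training set for the novel classes might be assembled. The annotation format and function names are illustrative assumptions, not taken from any particular benchmark's code.

```python
import random
from collections import defaultdict

# Hypothetical annotation format: {"image_id": ..., "category": ..., "bbox": [x, y, w, h]}
def build_k_shot_set(annotations, novel_classes, k, seed=0):
    """Keep only K labelled boxes per novel class (illustrative sketch)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        if ann["category"] in novel_classes:
            by_class[ann["category"]].append(ann)
    k_shot = []
    for cls, anns in by_class.items():
        rng.shuffle(anns)
        k_shot.extend(anns[:k])  # at most K instances of each novel class
    return k_shot
```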

Meta-Learning approaches

A popular solution to the few-shot learning problem is meta-learning, where a meta-learner is designed to parameterise the optimization algorithm or predict the network parameters by “learning to learn”. These meta-learning strategies use simulated few-shot tasks, by sampling from base classes during training, in order to learn the mechanism of how to learn from the few examples in the novel classes. However, much of this work has focused on basic image classification tasks. Unlike image classification, object detection requires the model to not only recognise the object types but also localise the targets among millions of potential regions. This additional task significantly increases the overall complexity.

The general deep-learning models for object detection can be divided into two groups: proposal-based (two-stage) methods and direct (one-stage) methods without proposals. While the R-CNN series and FPN fall into the former line of work, the YOLO series and SSD belong to the latter. In meta-learning approaches, in addition to the base object detection model, which is either single-stage or two-stage, a meta-learner is introduced to acquire class-level meta knowledge and help the model generalise to novel classes. This can be achieved through feature re-weighting, as in FSRW [link] and Meta R-CNN [link], or class-specific weight generation, as in MetaDet [link].

As shown in Figure 1, the training procedure is split into a meta-training stage, where the model is trained only on data from the base classes, and a meta fine-tuning stage, where the support set includes the few examples of the novel classes together with a subset of examples from the base classes.

Figure 1: Abstraction of the meta-learning based few-shot object detectors.

The base object detector and the meta-learner are often jointly trained using episodic training. Each episode is composed of a support set of N objects and a set of query images. The support images and the binary masks of the annotated objects are used as input to the meta-learner, which generates class re-weighting vectors that modulate the feature representation of the query images.
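The re-weighting step can be sketched roughly as follows, assuming a PyTorch-style implementation; the module and tensor names are illustrative rather than taken from the FSRW or Meta R-CNN code. The meta-learner encodes each support image (concatenated with its binary object mask) into a vector with one entry per feature channel, and that vector scales the query feature map channel-wise.

```python
import torch
import torch.nn as nn

class ReweightingModule(nn.Module):
    """Toy meta-learner: maps a support image + object mask to a per-class
    re-weighting vector with one entry per query feature channel."""
    def __init__(self, in_channels=4, feat_channels=256):
        super().__init__()
        # support input = RGB image concatenated with the binary object mask (3 + 1 channels)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, feat_channels, 3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # global pooling -> one vector per support class
        )

    def forward(self, support):            # support: (N_classes, 4, H, W)
        return self.encoder(support)       # (N_classes, C, 1, 1)

def reweight_query_features(query_feats, class_vectors):
    # query_feats: (B, C, H, W); class_vectors: (N_classes, C, 1, 1)
    # Channel-wise modulation of the query features, one modulated copy per class.
    return [query_feats * w for w in class_vectors]
```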

Fine-tuning based approach

A recently proposed method for the task of few-shot object detection, called Frustratingly Simple Few-Shot Object Detection [link], involves a Two-stage Fine-tuning Approach (TFA). The authors adopt the widely used Faster R-CNN, a two-stage object detector, as their base detection model. As shown in Figure 2, the feature learning components of a Faster R-CNN model include the backbone, the Region Proposal Network (RPN), and a two-layer fully-connected sub-network which acts as a proposal-level feature extractor. There is also a box predictor, composed of a box classifier that classifies the object categories and a box regressor that predicts the bounding box coordinates. Intuitively, the backbone features as well as the RPN features are class-agnostic. Therefore, features learned from the base classes are likely to transfer to the novel classes without further parameter updates. A key component of the method is to separate the feature representation learning and the box predictor learning into two stages.

Figure 2: Illustration of the two-stage fine-tuning approach

In the first stage, the whole object detection model is trained only on the base classes, using three losses: one applied to the output of the RPN to distinguish foreground from background and to refine the anchors, a cross-entropy loss for the box classifier, and a smoothed L1 loss for the box regressor.
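A rough sketch of this first-stage objective, assuming PyTorch-style tensors and omitting the RPN's anchor-regression term for brevity (the function and argument names are illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def first_stage_loss(rpn_objectness, rpn_labels,
                     cls_logits, cls_targets,
                     box_deltas, box_targets):
    # RPN loss: foreground/background classification of anchors
    # (rpn_labels are float 0/1 targets; anchor refinement is omitted here).
    l_rpn = F.binary_cross_entropy_with_logits(rpn_objectness, rpn_labels)
    # Box classifier: cross-entropy over base-class labels (plus background).
    l_cls = F.cross_entropy(cls_logits, cls_targets)
    # Box regressor: smoothed L1 loss on the predicted box deltas.
    l_loc = F.smooth_l1_loss(box_deltas, box_targets)
    return l_rpn + l_cls + l_loc
```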

In the second stage, a small balanced training set with K shots per class is created, containing both base and novel classes. The weights of the box prediction networks are randomly initialised, and only the box classification and regression networks, namely the last layers of the detection model, are trained, while the rest of the model is kept fixed. The same losses as in the first stage are used, but with a smaller learning rate.
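In PyTorch-style pseudocode, the second stage might look roughly like the following; the attribute names (`backbone`, `rpn`, `roi_head`, `box_predictor`) are assumptions for illustration rather than the paper's actual code.

```python
import torch

def setup_second_stage(model, lr=0.001):
    """Sketch of TFA's second stage, assuming `model` exposes `.backbone`,
    `.rpn`, `.roi_head` (proposal-level feature extractor) and `.box_predictor`."""
    # Freeze the feature-learning components trained on the base classes.
    for module in (model.backbone, model.rpn, model.roi_head):
        for p in module.parameters():
            p.requires_grad = False

    # Randomly re-initialise the box classifier and regressor ...
    for layer in model.box_predictor.modules():
        if isinstance(layer, torch.nn.Linear):
            torch.nn.init.normal_(layer.weight, std=0.01)
            torch.nn.init.zeros_(layer.bias)

    # ... and fine-tune only those last layers, with a smaller learning rate.
    return torch.optim.SGD(model.box_predictor.parameters(), lr=lr, momentum=0.9)
```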

The paper also considers using a classifier based on cosine similarity in the second fine-tuning stage. It is found empirically that the instance-level feature normalisation used in the cosine similarity based classifier helps reduce the intra-class variance and improves the detection accuracy of novel classes. Additionally, it leads to a smaller decrease in the detection accuracy of base classes than an FC-based classifier, especially when the number of training examples is small.
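A minimal sketch of such a cosine-similarity classification head, assuming PyTorch; the scaling factor and the weight initialisation are illustrative choices, not the paper's exact settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineSimilarityClassifier(nn.Module):
    """Box classifier that scores a proposal by the scaled cosine similarity
    between its L2-normalised feature and per-class weight vectors."""
    def __init__(self, feat_dim, num_classes, scale=20.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim) * 0.01)
        self.scale = scale  # temperature; may be fixed or learned in practice

    def forward(self, x):                      # x: (num_proposals, feat_dim)
        x = F.normalize(x, dim=1)              # instance-level feature normalisation
        w = F.normalize(self.weight, dim=1)    # per-class weight normalisation
        return self.scale * x @ w.t()          # (num_proposals, num_classes) logits
```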

Results and comparison

The few-shot detection performance (mAP50) of different models on the PASCAL VOC dataset is shown in Table 1 below. The performance is evaluated on three different sets of novel classes.

The Two-stage Fine-tuning Approach (TFA) is compared with the meta-learning approaches FSRW, Meta R-CNN and MetaDet, as well as with three baseline fine-tuning approaches: 1) FRCN/YOLO+joint, which jointly trains on the base and novel class examples in a single stage; 2) FRCN/YOLO+ft-full, which fine-tunes the entire model, with both the feature extractor and the box predictor trained until convergence in the second fine-tuning stage; and 3) FRCN/YOLO+ft, which fine-tunes the entire model for a smaller number of iterations.

We see that the Two-stage Fine-tuning Approach (TFA) consistently outperforms the baseline methods by a large margin (about 2∼20 points), especially when the number of shots is low. Here FRCN stands for Faster R-CNN and TFA w/ cos is the TFA with a cosine similarity based box classifier.

Table 1: Few-shot object detection performance for novel classes on the PASCAL VOC dataset

Table 2 shows the average AP and AP75 of the 20 novel classes on the COCO dataset. AP75 means the IoU matching threshold is 0.75, a stricter metric than AP50. Again, TFA consistently outperforms previous methods across all shots on both novel AP and novel AP75.

Table 2: Few-shot object detection performance for novel classes on the COCO dataset
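As a reminder of what these thresholds mean, a predicted box is matched to a ground-truth box only if their intersection-over-union (IoU) exceeds the threshold. A small self-contained sketch (the (x1, y1, x2, y2) box format is an assumption for illustration):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct under AP75 only if IoU >= 0.75 (vs 0.5 for AP50).
print(iou((0, 0, 10, 10), (2, 2, 10, 10)))  # 0.64 -> passes AP50 but not AP75
```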

An added advantage of TFA is that it is more memory efficient. The episodic learning used in meta-learning approaches can become very memory inefficient as the number of classes in the support set increases, whereas the fine-tuning method only fine-tunes the last layers of the network with a normal batch training scheme, which is much more memory efficient. The fine-tuning approach establishes a new state of the art, particularly in the hard 1-shot and 2-shot scenarios, outperforming all the prior meta-learning based approaches. This indicates that feature representations learned from the base classes might transfer to the novel classes, and that simple adjustments to the box predictor can provide a strong performance gain.

Conclusion

In recent work on few-shot object detection, we broadly find two kinds of approaches: meta-learning approaches and the fine-tuning approach. The methods proposed initially were based on meta-learning, modifying existing object detection pipelines with inspiration from meta-learning on image classification. However, the more recently proposed fine-tuning approach is not only more memory efficient, but also outperforms all the prior meta-learning based approaches.
