Few-shot Object Detection in Practice

Alexander Hirner
Moonvision
Published Apr 12, 2019 · 5 min read


Fig. 1. Object detection example for dishtracker and the highest resolution activation maps for background and 10 foreground classes.

Object detection is vital for automating manual tasks, such as checking the completeness of objects and the exact types of their parts. In contrast to segmentation, objects are located and classified as discrete instances. This is achieved by decoding regression and activation maps after a cascade of convolutions. You can read more about the state of the art in object detection in this survey.

However, contemporary issues in object detection are often studied in isolation. In production use cases, though, multiple constraints must be met at once. In this post, we describe the combination of techniques we have developed over time to meet many of these constraints.

Problem

As with any machine learning task, the amount of training data is limited. As we will review below, there are many approaches to conquer low-data scenarios, each with its own remaining problems.

First, raw or weakly supervised data might be available in abundance (think of social media photos or passively recorded video). Such data can be mined by iterative methods: from a few ground-truth samples you generate new training samples and retrain on the expanded set (fully or weakly supervised). The remaining problem is to achieve competitive accuracy with this approach, especially regarding the positions and sizes of the bounding boxes. These so-called IoU metrics are often underreported in benchmarks.
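For readers unfamiliar with the metric: IoU (intersection over union) measures how well a predicted box overlaps a ground-truth box. A minimal, dependency-free sketch:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # corners of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```

An IoU of 1.0 means a perfect match; typical benchmarks count a detection as correct above a threshold such as 0.5.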

Second, if your target domain is similar to a richly annotated source domain, you can transfer detection know-how from that source domain. Localization in particular benefits greatly from such techniques, which can be implemented by different but often complicated means (domain transfer with regularization, meta-learning to transfer). Of course, this approach fails if no properly annotated source domain exists. More exotic approaches try to localize and classify by searching for similarity in latent space. This, however, incurs a high computational cost per class and is limited to single instances or fragile bounding shapes derived from thresholding operations.

Another problem, often neglected in works that focus on low-data regimes, is class imbalance. This means that there are many more examples of some classes than of others (think Catahoula Leopard Dog vs. German Shepherd). The problem cannot be fully solved by simple over- or undersampling, because rare items co-occur with frequent items in the same image.
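One standard mitigation, not necessarily the one we use, is repeat-factor sampling as introduced for the LVIS benchmark: images are repeated according to the rarest class they contain, which respects co-occurrence because the decision is made per image rather than per class. A sketch:

```python
import math
from collections import Counter

def repeat_factors(image_labels, threshold=0.1):
    """LVIS-style repeat-factor sampling.

    image_labels: one set of class labels per image.
    threshold: class frequency below which images are oversampled.
    Returns one repeat factor per image.
    """
    n = len(image_labels)
    freq = Counter()
    for labels in image_labels:
        for c in set(labels):
            freq[c] += 1
    # class-level factor: rare classes (frequency below threshold) get r > 1
    r_class = {c: max(1.0, math.sqrt(threshold / (f / n))) for c, f in freq.items()}
    # image-level factor: max over classes present, so an image containing a
    # rare item is repeated even if it also shows frequent items
    return [max(r_class[c] for c in set(labels)) for labels in image_labels]
```

During training, each image is then duplicated (stochastically, for fractional factors) according to its repeat factor.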

A fourth problem is extensibility. Once you finally have a well-trained model for a set of classes, it is hard to add new ones, even if a new class is just a special case of a common category.

Fig. 2, from left to right: instance ID, trained super class (text color), detailed label in text, visualization of ROI-align features (the leftmost is the actual instance; to its right are its nearest neighbors). If a 10x10 feature map has a light stripe, it belongs to the same instance as the one on the left. Each 10x10 feature is reduced to three channels (RGB) by PCA. The model was trained on 10 super classes out of 76 highly unbalanced classes.

Our experiments in Summer 2018 showed that ROI-align based descriptors collapse the differences between instances of the same class. The visualization in Fig. 2 shows that ROI-align features from an SSD model are insensitive to differences between instances of the same super class (similar color). Thus intra-class variance cannot be inferred from such features without full retraining on the annotated hierarchy, at the risk of catastrophic forgetting. Taken to the extreme, every new example can come from entirely unknown classes.
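The feature visualization in Fig. 2 can be reproduced with a simple trick: treat each spatial cell of the C-channel feature map as a sample, project onto the three main principal components, and rescale to [0, 1] for display. A sketch (function name is ours, not from the original pipeline):

```python
import numpy as np

def features_to_rgb(feat):
    """Project a (C, H, W) feature map onto its 3 principal components
    and rescale to [0, 1] so it can be shown as an RGB image."""
    c, h, w = feat.shape
    x = feat.reshape(c, -1).T             # (H*W, C): one sample per cell
    x = x - x.mean(axis=0, keepdims=True)
    # principal axes via SVD of the centered data
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T                   # (H*W, 3)
    proj -= proj.min(axis=0)
    proj /= proj.max(axis=0) + 1e-8
    return proj.reshape(h, w, 3)
```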

Solution

We address the remaining problems mentioned above by a pipeline that combines object mining, fast annotation and few-shot classification.

Object mining

Our domains are often new-to-the-world. Thus the most generic way to acquire raw data is video. We use a pretrained model to mine objects if a similar domain exists. On an entirely new domain, hand-engineered detectors or motion patterns can be used instead. In one case, we processed 600GB of video and automatically produced over 12,000 object candidates to bootstrap that process. Rather than seeding ground-truth examples only at the beginning, we ask experts on demand for a few examples before each iteration. This provides a way to add details that don't exist anywhere yet and allows us to evaluate models with high rigor. Thanks to our recommendation system, this annotation process is also fast.
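To illustrate the motion-pattern route (our actual mining pipeline is more involved), candidate regions can be proposed from video by frame differencing and connected-component grouping. A dependency-free sketch on grayscale frames:

```python
import numpy as np

def motion_candidates(prev_frame, frame, thresh=25, min_area=50):
    """Propose bounding boxes (x1, y1, x2, y2) of moving regions by
    thresholding the frame difference and grouping connected pixels."""
    diff = np.abs(frame.astype(int) - prev_frame.astype(int)) > thresh
    visited = np.zeros_like(diff, dtype=bool)
    boxes = []
    h, w = diff.shape
    for sy in range(h):
        for sx in range(w):
            if diff[sy, sx] and not visited[sy, sx]:
                # flood fill one 4-connected component
                stack, pix = [(sy, sx)], []
                visited[sy, sx] = True
                while stack:
                    y, x = stack.pop()
                    pix.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and diff[ny, nx] and not visited[ny, nx]:
                            visited[ny, nx] = True
                            stack.append((ny, nx))
                if len(pix) >= min_area:  # drop tiny noise blobs
                    ys, xs = zip(*pix)
                    boxes.append((min(xs), min(ys), max(xs) + 1, max(ys) + 1))
    return boxes
```

In practice one would use an optimized connected-components routine and add background modelling, but the principle is the same: cheap proposals first, expert labels on demand.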

Few-shot detection

As seen above, fast object detectors don't cope well with unbalanced or evolving data. Once we possess the initial training set in the form of shapes and labels, we automatically divide the hierarchy into groups that have enough in common to be recognized at first sight and those that differ only subtly. Instances of the latter kind are then classified by additional layers at higher resolution.

These layers are trained to map instances with the same visual attributes to low-dimensional vectors within a tight volume in Euclidean space, or onto a patch of a hypersphere. Conversely, instances with different attributes occupy a distant volume [Metric learning paper].
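A common way to train such an embedding (one option among several; the post does not name the exact loss) is a triplet loss: pull an anchor toward a positive of the same attributes and push it at least a margin away from a negative. A minimal NumPy sketch of the per-triplet objective:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss that pulls same-attribute embeddings together and pushes
    different-attribute embeddings at least `margin` apart."""
    d_pos = np.linalg.norm(anchor - positive)   # distance to same-class sample
    d_neg = np.linalg.norm(anchor - negative)   # distance to other-class sample
    return max(0.0, d_pos - d_neg + margin)
```

For the hypersphere variant, embeddings are L2-normalized before computing distances.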

With such a metric, you can easily classify unseen instances by nearest-neighbor search, build classifiers with guaranteed convergence like SVMs or determine the novelty of incoming examples. The exact way to create these embedding layers warrants another post. However, the full body of know-how regarding unbalanced classification can be utilized rather easily.
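The nearest-neighbor classification and novelty detection mentioned above can be sketched in a few lines (threshold value and function name are illustrative):

```python
import numpy as np

def knn_classify(embeddings, labels, query, k=3, novelty_thresh=None):
    """Classify a query embedding by majority vote over its k nearest
    neighbors; flag it as novel if even the closest stored embedding
    is farther away than novelty_thresh."""
    dists = np.linalg.norm(embeddings - query, axis=1)
    order = np.argsort(dists)[:k]
    if novelty_thresh is not None and dists[order[0]] > novelty_thresh:
        return "novel"
    votes = {}
    for i in order:
        votes[labels[i]] = votes.get(labels[i], 0) + 1
    return max(votes, key=votes.get)
```

Adding a new class then amounts to inserting a handful of labeled embeddings into the index, with no retraining, which is what makes near-realtime class updates possible.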

As a result, the combination of smart object mining and few-shot detection has the following properties:

  • Effective use of transfer learning, starting from unlabelled video
  • Robustness against class imbalance and overfitting
  • Tractability for realtime use
  • Class updates in near realtime
  • Traceability of each classification result back to a single instance in the training data, for better model introspection

Applications and Outlook

The problem of detecting a quickly rotating set of classes arises in many applications. In the case of counting dishes, meals and some items appear only rarely and change daily. Here, getting by with just a few training examples (as few as one) is paramount to automating the checkout process. Similarly, for our automated checkout solution at Sacher, we were able to accommodate over 50 items from just 2h of raw video data. Finally, many industrial assembly tasks require checking an ever-changing catalogue of items.

The approach outlined above allows us to focus computation on the details that matter whilst learning continuously. Moreover, the pipeline is very flexible, so that we were able to integrate ever more advanced techniques like generative adversarial training (GAN) easily. Many more improvements are on the way.

