Wider Perspective on the Progress in Object Detection

Idan Bassuk
Published in techburst · Oct 23, 2017

TL;DR:

Object Detection is one of the most mature fields in Computer Vision and Deep Learning. In the last year alone we have seen many novel ideas in Object Detection which introduced very significant improvements in detection accuracy.

I’ve gathered what are, in my opinion, the 9 most important and useful papers since October 2016, for a talk I’ve recently given at several conferences. I thought it might help other practitioners as well to get a wider perspective on the amazing progress in the field, and to have the list of papers with links to arXiv.

If you’re interested in getting a more thorough introduction to Object Detection, and a solid intuition for the leading algorithms, you’re welcome to listen to my talk on the subject from the PyData Conference.

The 9 most important and useful papers, since October 2016

I’ve divided them by the module they improve (architecture [feature extractor], meta-architecture [detection algorithm] and post-processing), although the division between architecture and meta-architecture improvements can sometimes be argued.

I would love to hear your thoughts on this list, and of course feel free to tell me if you think something significant is missing. (I deliberately did not include new generations of “feature extractors”, such as “squeeze-and-excitation”, which are not a contribution unique to detection, even though most of them do improve detection accuracy.)

A short paragraph about each of the improvements, for busy people:

  • Detection without Pre-Training — Demonstrated performance comparable to the state of the art in certain cases without pre-training on ImageNet classification. This is a very important direction, since the datasets in many domains are very different from ImageNet and usually don’t benefit from pre-training.
  • Deformable Convolutions and ROI-Pooling — Enable the 3x3 convolution kernel to take any (non-rectangular) shape, and learn the optimal shape from the data. Used in the entry that won 2nd place in COCO detection 2016. In the image below you can see on the right how the receptive field is shaped in a way that is much better suited to extracting relevant features.
  • Focal Loss — A novel loss function that gives a higher weight to hard examples. Until now, easy patches overwhelmed the loss calculation and limited how well the model generalized to hard examples. This contribution enabled relatively fast single-stage detection algorithms (similar to SSD) to demonstrate the best single-model detection performance to date, surpassing all two-stage methods (most importantly Faster R-CNN) for the first time.
Focal Loss outperforms every previous single-model in accuracy and speed
  • Multi-Task Learning — Improves detection accuracy and increases the algorithm’s usability by having a single network learn both detection and instance segmentation, both of which are done in a fully-convolutional (efficient) manner. Won first place in COCO instance segmentation 2016 and 3rd place in detection.
Example Instance Segmentation Results
  • Feature-Pyramid Networks — The most effective way demonstrated to date to combine feature maps of several depths, improving detection of smaller objects (by utilizing the more fine-grained features in shallower layers). Used in the current best single model mentioned above.
  • Detection on 9,000 Classes — The COCO detection dataset contains only 80 object categories, and scaling up the number of classes is very expensive. The authors introduced a clever way to train the detection algorithm on ImageNet classification in parallel with training on detection, enabling detection on 9,000 classes in real time (currently with relatively low accuracy).
  • Soft NMS — Improves the traditional detection post-processing method (NMS) to better detect different objects that partially overlap each other, by decaying the scores of overlapping boxes instead of discarding them.
Traditional vs. Soft NMS
  • Learned NMS — Detection algorithms have benefited greatly from moving towards an end-to-end architecture. NMS is currently one of the last components of detection algorithms that is not learned end-to-end, and this paper proposes a way to change that.
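
To make the Focal Loss idea from the list above a bit more concrete, here is a minimal sketch for a single binary prediction. It assumes the formulation from the Focal Loss paper as I understand it, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), with the commonly quoted defaults gamma = 2 and alpha = 0.25. The function names are my own, and this is an illustration rather than the authors' implementation.

```python
import math


def cross_entropy(p, y):
    """Plain binary cross-entropy for one prediction.

    p: predicted probability of the positive class, strictly in (0, 1)
    y: ground-truth label, 1 (object) or 0 (background)
    """
    p_t = p if y == 1 else 1.0 - p  # probability assigned to the true class
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction (assumed formulation, see text).

    The (1 - p_t) ** gamma factor shrinks the loss of well-classified
    (easy) examples, so hard examples dominate the total loss.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy positive (p = 0.95) versus a hard positive (p = 0.10).
easy_ratio = focal_loss(0.95, 1) / cross_entropy(0.95, 1)
hard_ratio = focal_loss(0.10, 1) / cross_entropy(0.10, 1)
print(f"easy example keeps {easy_ratio:.5f} of its cross-entropy loss")
print(f"hard example keeps {hard_ratio:.5f} of its cross-entropy loss")
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which is a handy sanity check when experimenting with it.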
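
The difference between traditional NMS and Soft NMS mentioned in the list above can be sketched in a few lines of plain Python. This is a simplified illustration under my own assumptions: boxes are (x1, y1, x2, y2) tuples, Soft NMS uses the Gaussian score decay exp(-IoU² / sigma) with sigma = 0.5, and the function names and thresholds are hypothetical rather than taken from any library.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def hard_nms(boxes, scores, iou_thresh=0.5):
    """Traditional NMS: drop every box that overlaps a better one too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Soft NMS (Gaussian variant): decay overlapping scores, don't drop."""
    scores = list(scores)  # work on a copy, scores are rescaled in place
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        keep.append((best, scores[best]))
        survivors = []
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
            if scores[i] >= score_thresh:  # discard only near-zero scores
                survivors.append(i)
        remaining = survivors
    return keep


# Two heavily overlapping boxes (possibly two adjacent objects) and a third
# box far away from both.
boxes = [(0, 0, 10, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(hard_nms(boxes, scores))  # indices of the boxes that survive
print(soft_nms(boxes, scores))  # (index, rescaled score) pairs
```

In this toy example traditional NMS removes the second box entirely, while Soft NMS keeps it with a reduced score, so a later confidence threshold decides whether it is reported.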
