Speed/accuracy trade-offs for modern convolutional object detectors

Huang, Jonathan, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, et al. 2016. “Speed/accuracy Trade-Offs for Modern Convolutional Object Detectors.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1611.10012

Key points:

  • Provides a survey of CNN-based object detectors with consistent naming and diagrams
  • Amazing graphs and analysis of speed vs. accuracy
  • 3 families of detectors: Faster R-CNN, R-FCN (region-based fully convolutional network) and SSD (single-shot detector)
  • Found that using fewer proposals in Faster R-CNN does not result in a big loss in accuracy
  • Ensemble achieves state-of-the-art detection on MSCOCO

Common moving parts

Feature Extractor:

  • VGG, Inception, ResNet etc.
  • Input = image
  • Output = feature map of shape (k, h, w)
  • Can be fine-tuned in the training process

Proposal Generator (Region Proposal Network)

  • Input = feature map of the feature extractor
  • Output = (class-agnostic) bounding box (x, y, w, h) and “objectness” (or “class probability” for SSD)
  • Note that it does not directly predict the bounding box but the “modification” (offsets) necessary to morph the “anchors” into the ground truth.
  • Loss = localization loss (difference between ground truth and prediction, counted only for anchors matched with a ground truth box) + log loss for “objectness”
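
The anchor "modification" above can be sketched as follows. This is a minimal illustration assuming the center-size parameterization popularized by Faster R-CNN (shift scaled by the anchor size, log-scale width and height). The paper's own scaling constants are left out, and the names `encode_box` and `decode_box` are made up for these notes.

```python
import math

def encode_box(gt, anchor):
    """Return the offsets (tx, ty, tw, th) that morph `anchor` into `gt`.

    Boxes are (x_center, y_center, width, height) tuples.
    """
    x, y, w, h = gt
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode_box(offsets, anchor):
    """Invert encode_box: apply predicted offsets to an anchor."""
    tx, ty, tw, th = offsets
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

# Toy numbers, for illustration only.
anchor = (50.0, 50.0, 20.0, 40.0)
gt = (55.0, 46.0, 30.0, 40.0)
offsets = encode_box(gt, anchor)      # regression target for this anchor
restored = decode_box(offsets, anchor)
```

The network regresses `offsets` during training; at inference time `decode_box` turns its predictions back into boxes.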

Architecture 1: SSD

  • Modifies the proposal generator to directly output class probability (instead of objectness)
  • E.g. YOLO, SSD, MultiBox
  • Pros: Very fast
  • Cons: Not good at detecting small objects (YOLO), but using feature maps from different layers can help a lot (SSD)

Architecture 2: Faster R-CNN

  • Proposal generator: input = conv5 of the feature extractor; output = bounding boxes and objectness
  • Box classifier: input = crop of conv5 from the bounding boxes with ROI pooling to get feature maps of fixed size; pass through = fc* ; output = class probability
  • Pro: best performing
  • Con: running time depends on the number of proposals
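
To make the "feature maps of fixed size" step concrete, here is a minimal sketch of ROI max pooling on a single-channel feature map. It assumes integer, end-exclusive ROI coordinates already expressed in feature-map cells; real implementations also handle channels, batching and the image-to-feature-map scale. The name `roi_max_pool` is hypothetical.

```python
def roi_max_pool(feature, roi, out_size):
    """Max-pool the region `roi` of a 2-D feature map into out_size x out_size bins.

    feature: list of rows (h x w); roi: (row0, col0, row1, col1), end-exclusive.
    """
    r0, c0, r1, c1 = roi
    roi_h, roi_w = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_size):
        # Bin edges; every bin is forced to cover at least one cell.
        rs = r0 + (i * roi_h) // out_size
        re = max(rs + 1, r0 + ((i + 1) * roi_h) // out_size)
        row = []
        for j in range(out_size):
            cs = c0 + (j * roi_w) // out_size
            ce = max(cs + 1, c0 + ((j + 1) * roi_w) // out_size)
            row.append(max(feature[r][c]
                           for r in range(rs, re) for c in range(cs, ce)))
        pooled.append(row)
    return pooled

# Toy 4 x 6 feature map whose value at (r, c) is r * 6 + c.
feature = [[r * 6 + c for c in range(6)] for r in range(4)]
square = roi_max_pool(feature, (0, 0, 4, 4), 2)   # 4 x 4 region
wide = roi_max_pool(feature, (1, 2, 4, 6), 2)     # 3 x 4 region
```

Both ROIs have different sizes but come out as 2 x 2 grids, which is what lets the fc layers run on every proposal. It also shows why the cost grows with the number of proposals: this crop and everything after it is repeated per box.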

Evolution

  • R-CNN: Selective Search -> Crop image -> CNN
  • Fast R-CNN: Selective Search -> Crop feature map of CNN
  • Faster R-CNN: CNN -> RPN (region proposal network) -> Crop feature map of CNN
  • Takeaway: end-to-end learning and multi-task loss help

Architecture 3: R-FCN

  • Similar to Faster R-CNN but more efficient.
  • Addresses the dilemma between translation invariance in classification and translation variance in detection. (In other words, you want the classification network to output the same thing if the cat moves from the top left to the bottom right, but the box prediction to change.)
  • Box classifier is given the crop of the last feature layer before prediction instead of conv5, so the computation for each proposal is reduced
  • New position-sensitive score maps: shape = k*k*(C+1), h, w. So this encodes the relative position (which of the k*k bins of the box) into the channel dimension.
  • New position-sensitive ROI pooling: input = k*k*(C+1), roi_h, roi_w ; pool = C+1, k, k ; output = C+1 after averaging (voting) over the k*k bins. In other words, each bin pools only from its own group of channels, e.g. the top-left bin will only pool from the “top-left” filters.
  • Classifier: input = the position-sensitive score maps; the per-proposal work is only pooling and voting
  • Pro: a variation of R-FCN (TA-FCN) won the instance segmentation challenge.
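
The position-sensitive pooling described above can be sketched in plain Python. This is an illustration under assumptions, not the R-FCN reference code: the channel layout (bin index major, class minor), average pooling inside each bin, integer end-exclusive ROI coordinates, and the name `ps_roi_pool` are all choices made for these notes.

```python
def ps_roi_pool(score_maps, roi, k, num_classes):
    """Position-sensitive ROI pooling followed by average voting.

    score_maps: nested list of shape [k*k*(C+1)][h][w], with C = num_classes.
    roi: (row0, col0, row1, col1), end-exclusive, at least k cells per side.
    Returns C+1 scores (the extra one is background).
    """
    n_out = num_classes + 1
    r0, c0, r1, c1 = roi
    roi_h, roi_w = r1 - r0, c1 - c0
    scores = [0.0] * n_out
    for i in range(k):
        rs, re = r0 + (i * roi_h) // k, r0 + ((i + 1) * roi_h) // k
        for j in range(k):
            cs, ce = c0 + (j * roi_w) // k, c0 + ((j + 1) * roi_w) // k
            area = (re - rs) * (ce - cs)
            for cls in range(n_out):
                # Bin (i, j) reads only from its own group of C+1 channels.
                channel = score_maps[(i * k + j) * n_out + cls]
                total = sum(channel[r][c]
                            for r in range(rs, re) for c in range(cs, ce))
                scores[cls] += total / area
    # Voting: average the k*k bin scores for each class.
    return [s / (k * k) for s in scores]

# Toy input: k = 2, one object class plus background, so 8 channels of 4 x 4.
# Channel ch is filled with the constant ch to make the routing visible.
maps = [[[float(ch)] * 4 for _ in range(4)] for ch in range(8)]
scores = ps_roi_pool(maps, (0, 0, 4, 4), k=2, num_classes=1)
```

With this toy input, class 0 only ever sees channels 0, 2, 4, 6 and class 1 only channels 1, 3, 5, 7. There are no learned layers after the pooling, which is where the per-proposal savings come from.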

Speed/accuracy trade-off

  • Sweet spot: “Elbow” part of the mAP vs. GPU time graph (i.e. R-FCN w/ ResNet, 100 proposals and Faster R-CNN w/ ResNet, 50 proposals)
  • Input resolution affects detection accuracy of small objects
  • Good performance on small objects correlates with performance on bigger objects

Ensemble

  • Greedily select models based on performance on a held-out set
  • Explicitly encourage model diversity by skipping models that are too similar, judged by comparing their per-category mAP
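
The two bullets above can be sketched as a greedy loop. This is a minimal sketch under assumptions: similarity is measured as the cosine similarity of the per-category AP vectors, and the threshold, the model names and all numbers are made up for illustration (none of them are results from the paper). The name `select_ensemble` is hypothetical.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_ensemble(models, size, max_similarity):
    """Greedily pick up to `size` models, best held-out mAP first.

    models: dict name -> (held-out mAP, list of per-category AP).
    A candidate is skipped if its per-category AP vector is too similar
    to that of any model already chosen.
    """
    ranked = sorted(models, key=lambda name: models[name][0], reverse=True)
    chosen = []
    for name in ranked:
        if len(chosen) == size:
            break
        ap = models[name][1]
        if all(cosine_similarity(ap, models[c][1]) <= max_similarity
               for c in chosen):
            chosen.append(name)
    return chosen

# Toy numbers: "b" is nearly a copy of "a", while "c" is weaker but different.
models = {
    "a": (0.35, [0.5, 0.2, 0.35]),
    "b": (0.34, [0.5, 0.2, 0.32]),
    "c": (0.30, [0.2, 0.5, 0.2]),
}
picked = select_ensemble(models, size=2, max_similarity=0.99)
```

Here `picked` ends up as `["a", "c"]`: the second-best model is skipped because it makes nearly the same per-category trade-offs as the best one.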