Object Detection Memo from RCNN to the Latest (2020 July)

Neil Wu · Published in LSC PSD · Jul 22, 2020

TL;DR

Object detection is a relatively well-studied task in the machine learning field. However, like every other field, the latest research always builds on tons of previous work.
This article organizes the representative research in object detection and notes the key features that make each work representative.
Feel free to let me know in the responses if there is other important research I haven't mentioned, or important features I skipped in a specific work.

Model Zoo

2014

R-CNN (Ross Girshick et al.)
SPPNet (Kaiming He et al.)

2015

Fast R-CNN (Ross Girshick et al.)
Faster R-CNN (Ren et al.)

2016

YOLO (Redmon et al.)
SSD (Liu et al.)
YOLOv2 (Redmon et al.)

2017

Feature Pyramid Network (Tsung-Yi Lin et al.)
RetinaNet (Tsung-Yi Lin et al.)

2018

YOLOv3 (Redmon et al.)
CornerNet (Hei Law et al.) <Coming soon>

2019

Objects as Points (Xingyi Zhou et al.) <Coming soon>
CenterNet (Kaiwen Duan et al.) <Coming soon>
FCOS (Zhi Tian et al.)

2020

DETR (Nicolas Carion et al.) <Coming soon>
YOLOv4 (Alexey Bochkovskiy et al.) <Coming soon>
EfficientDet (Mingxing Tan et al.) <Coming soon>
YOLOv5 (Glenn Jocher) <Coming soon>

R-CNN

Original paper


R-CNN first extracts region proposals with selective search, then warps each proposed region to a fixed-size image and feeds it to a CNN. An SVM classifier and a bounding-box regression model are trained on the CNN's output features. Since R-CNN feeds every region into the CNN, the CNN has to run ~2k times (the number of region proposals) for a single image.

Selective search

Original paper
Selective search estimates possible object regions by iteratively merging similar neighboring regions.
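
For experimentation, OpenCV's contrib module ships a selective-search implementation (requires the opencv-contrib-python package); a minimal sketch, not the R-CNN authors' code:

```python
# Minimal selective-search sketch using OpenCV's contrib module
# (pip install opencv-contrib-python). Just a quick way to get
# region proposals for experimentation.
import cv2

img = cv2.imread("image.jpg")                      # any test image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()                   # trades recall for speed
rects = ss.process()                               # (x, y, w, h) proposals
print(f"{len(rects)} region proposals, first: {rects[0]}")
```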

SPPNet (Spatial Pyramid Pooling Net)

Original paper

SPPNet originally aimed to solve the issue that, after region proposal by selective search in R-CNN, each proposed region loses its size and aspect ratio due to resizing. Instead of cropping and resizing regions before feeding them to the CNN, SPPNet feeds in the whole image and uses a Spatial Pyramid Pooling layer to extract and represent specific regions from the feature maps. Therefore the CNN in SPPNet only needs to run once per image.

The most important takeaway from SPPNet is that feature maps can represent the original image, so region proposals don't need to be cropped out before the CNN.
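
A minimal PyTorch sketch of the idea, assuming pyramid levels of 1*1, 2*2 and 4*4 bins (adaptive max pooling stands in for the paper's bin arithmetic):

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feats, levels=(1, 2, 4)):
    """Pool a (N, C, H, W) feature map into a fixed-length vector,
    regardless of H and W, by max-pooling over 1x1, 2x2 and 4x4 grids."""
    n = feats.size(0)
    pooled = [F.adaptive_max_pool2d(feats, out).view(n, -1) for out in levels]
    return torch.cat(pooled, dim=1)   # (N, C * sum(l*l for l in levels))

# Two inputs of different sizes yield the same output length:
a = spatial_pyramid_pool(torch.randn(1, 256, 13, 13))
b = spatial_pyramid_pool(torch.randn(1, 256, 20, 27))
assert a.shape == b.shape == (1, 256 * (1 + 4 + 16))
```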

Fast R-CNN

Original paper


Fast R-CNN adopts the concept of SPPNet and creates the Region of Interest (RoI) pooling layer. Instead of cropping and warping regions before the CNN, Fast R-CNN feeds in the image once and extracts regions from the last layer of the CNN. From that last feature map, the RoI pooling layer pools out the region projections from the original image and predicts classes from the extracted regions.

RoI pooling layer

RoI pooling pools each region of interest by projecting it from the original image onto the feature map. Check here for the details: https://deepsense.ai/region-of-interest-pooling-explained/
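
torchvision ships this op as torchvision.ops.roi_pool, so a quick sketch; the feature-map stride (1/16) and the box coordinates below are made-up example values:

```python
import torch
from torchvision.ops import roi_pool

feat = torch.randn(1, 256, 50, 50)          # feature map at 1/16 resolution
# Boxes in the ORIGINAL image's coordinates: (batch_idx, x1, y1, x2, y2)
rois = torch.tensor([[0., 100., 100., 400., 300.],
                     [0.,  50., 200., 250., 600.]])
# spatial_scale projects image coordinates onto the feature map
out = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
print(out.shape)  # torch.Size([2, 256, 7, 7]) -- fixed size per region
```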

Faster R-CNN

Original paper


Instead of using a general region proposal method like selective search, Faster R-CNN proposes the Region Proposal Network (RPN) as its region proposer. RPN adopts anchors of different scales and aspect ratios, and uses a regressor to refine the bounding boxes.

Region Proposal Network

RPN slides a small network over the conv feature map output by the last shared conv layer. At every sliding-window position, RPN predicts k anchor boxes and whether each one contains an object or not. RPN started the era of anchor-box based object detection.
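
A small sketch of how the k anchors at one sliding-window position can be generated, assuming the paper's 3 scales * 3 aspect ratios = 9 anchors:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) anchors centered at (cx, cy),
    as (x1, y1, x2, y2). Each anchor keeps area scale**2 while its width and
    height follow the aspect ratio."""
    anchors = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

print(make_anchors(8, 8).shape)  # (9, 4): 9 anchors per location
```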

YOLO

Original paper

YOLO is the first single-shot object detector. The main concept of YOLO is that it splits the original image into an n*n grid, and for every grid cell there are k bounding boxes in charge of predicting at most one object.
After projecting areas from the feature map back to the original image, YOLO adds a fully connected layer for both classification and bounding-box prediction. By doing so, YOLO achieves detection with a single model, which started one-shot object detection.
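
A tiny sketch of the assignment rule, assuming a 7*7 grid as in the paper: the cell containing the object's center is responsible for predicting it.

```python
def responsible_cell(box_center, img_size, n=7):
    """Return the (row, col) of the grid cell responsible for an object,
    i.e. the cell that contains the box center (YOLO's assignment rule)."""
    cx, cy = box_center
    w, h = img_size
    col = min(int(cx / w * n), n - 1)
    row = min(int(cy / h * n), n - 1)
    return row, col

# An object centered at (300, 200) in a 448x448 image falls in cell (3, 4):
print(responsible_cell((300, 200), (448, 448)))
```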

SSD (Single Shot MultiBox Detector)

SSD is definitely a milestone in object detection. Unlike previous models that take a single feature map for prediction, SSD started extracting multiple layers for prediction. By looking at different feature maps, it is able to detect objects at different resolutions.
This multi-scale feature extraction has been adopted by basically every later object detection model.
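
A condensed sketch of the idea: attach a small conv head to feature maps at several resolutions, so each scale predicts its own boxes (the channel counts and map sizes below are illustrative, not SSD's exact configuration):

```python
import torch
import torch.nn as nn

class MultiScaleHeads(nn.Module):
    """Toy version of SSD's multi-scale prediction: one 3x3 conv head per
    feature map, each emitting (classes + 4 box offsets) per anchor."""
    def __init__(self, channels=(512, 1024, 256), num_anchors=4, num_classes=21):
        super().__init__()
        out = num_anchors * (num_classes + 4)
        self.heads = nn.ModuleList(
            nn.Conv2d(c, out, kernel_size=3, padding=1) for c in channels)

    def forward(self, feature_maps):
        # One prediction tensor per scale; finer maps catch smaller objects.
        return [head(f) for head, f in zip(self.heads, feature_maps)]

heads = MultiScaleHeads()
feats = [torch.randn(1, 512, 38, 38), torch.randn(1, 1024, 19, 19),
         torch.randn(1, 256, 10, 10)]
for p in heads(feats):
    print(p.shape)
```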

YOLOv2

Original paper

YOLOv2 is based on the structure of YOLO and made the changes shown below:

  • Adopted batch normalization and removed dropout.
  • High-resolution classification: fine-tuned the classification network at 448*448 instead of the original 224*224.
  • Anchor boxes replaced the directly predicted bounding boxes, with 5 anchor boxes per grid cell.
  • Dimension clustering: anchor box shapes are no longer handpicked; k-means over the training boxes pre-selects the best ratios (see the clustering sketch after this list).
  • Direct location prediction: restrict the anchor-box offsets with a logistic activation to prevent unstable iterations in the early stage.
  • Fine-grained features: pass earlier-layer features (26*26) down to the deeper layers (13*13).
  • Multi-scale training: resize images between training epochs to increase the backbone's robustness to different sizes.
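
A minimal sketch of dimension clustering, using the 1 - IoU distance from the paper on (width, height) pairs; the boxes array below stands in for your dataset's ground-truth box sizes:

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between boxes and centroids given only (w, h), as if all boxes
    shared the same top-left corner (the YOLOv2 trick)."""
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=5, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)  # min 1 - IoU
        centroids = np.array([boxes[assign == i].mean(axis=0)
                              if np.any(assign == i) else centroids[i]
                              for i in range(k)])
    return centroids

# boxes: (N, 2) array of ground-truth (width, height) pairs from your data
boxes = np.abs(np.random.randn(500, 2)) * 50 + 20   # stand-in data
print(kmeans_anchors(boxes, k=5))
```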

FPN (Feature Pyramid Network)

Original paper

A feature pyramid is a technique for sharing semantic information from deeper feature maps with shallower feature maps. This makes predictions on the shallower feature maps semantically stronger.
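
A condensed PyTorch sketch of one top-down merge step: a 1*1 lateral conv aligns channels, the deeper map is 2x-upsampled and added elementwise (the channel sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNMerge(nn.Module):
    """One FPN merge: lateral 1x1 conv on the shallow map + 2x upsample of
    the deeper (semantically stronger) map, then elementwise sum."""
    def __init__(self, shallow_ch, pyramid_ch=256):
        super().__init__()
        self.lateral = nn.Conv2d(shallow_ch, pyramid_ch, kernel_size=1)
        self.smooth = nn.Conv2d(pyramid_ch, pyramid_ch, 3, padding=1)

    def forward(self, shallow, deeper):
        top_down = F.interpolate(deeper, scale_factor=2, mode="nearest")
        return self.smooth(self.lateral(shallow) + top_down)

merge = FPNMerge(shallow_ch=512)
c4 = torch.randn(1, 512, 28, 28)   # shallower, higher resolution
p5 = torch.randn(1, 256, 14, 14)   # deeper pyramid level
print(merge(c4, p5).shape)         # torch.Size([1, 256, 28, 28])
```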

RetinaNet

Original paper

RetinaNet is the combination of a Feature Pyramid Network and a novel loss function, Focal Loss. It achieved the SoTA of its time.

Focal Loss

Anchor-box based detection models share a common issue. After labeling the anchor boxes, the sample sizes of positive boxes (boxes on objects) and negative boxes (boxes on background) are extremely imbalanced. This is because anchor-box based models use numerous preset boxes (e.g. ~9k in SSD) to detect a fixed, small number of objects per image. The imbalance lets negative samples dominate the back-propagation, which degrades learning and performance.

Focal Loss introduces a weighting factor to counter the imbalance, down-weighting easy (mostly negative) samples so that hard samples affect the model more.
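
A minimal binary focal-loss sketch with the paper's defaults (gamma=2, alpha=0.25); recent torchvision versions also ship sigmoid_focal_loss if you prefer the library version:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: (1 - p_t)**gamma down-weights easy examples so the
    rare positives are not drowned out by the many easy negatives."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(8)
targets = torch.tensor([1., 0., 0., 0., 0., 0., 0., 0.])  # mostly background
print(focal_loss(logits, targets))
```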

YOLOv3

Original paper

YOLOv3 is based on YOLOv2 and made the changes shown below:

  • Added logistic regression for the objectness score during bounding-box prediction.
  • Changed the softmax classifier to independent logistic classifiers, since class labels can overlap (a box may be both "person" and "woman"); see the sketch after this list.
  • Extracts feature maps at multiple scales from the model, obtaining higher-resolution information and enabling prediction across scales.
  • Replaced the DarkNet-19 backbone with the brand-new DarkNet-53, which mixes in ResNet-style residual connections.
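
A tiny sketch of the softmax-to-logistic switch: per-class sigmoids with binary cross-entropy allow overlapping labels, which a softmax cannot express:

```python
import torch
import torch.nn as nn

num_classes = 4
class_logits = torch.randn(2, num_classes)       # 2 boxes, 4 classes

# Multi-label targets: a box can be both "person" and "woman" at once.
targets = torch.tensor([[1., 0., 0., 1.],
                        [0., 1., 0., 0.]])

# Softmax forces class scores to compete; independent sigmoids don't.
loss = nn.BCEWithLogitsLoss()(class_logits, targets)
probs = torch.sigmoid(class_logits)              # each in [0, 1] independently
print(loss, probs.sum(dim=1))                    # rows need not sum to 1
```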

Objects as Points

Original paper

FCOS

Original paper

FCOS is an anchor-free object detection model. It uses a Feature Pyramid Network (FPN) to create feature maps and adds a head after every feature map. The training process also uses a novel loss term called center-ness, besides the classification loss and the bounding-box loss.

Center-ness

Center-ness suppresses low-quality predicted bounding boxes produced by locations far away from the center of an object. It is an index describing how close a location is to the center of its ground-truth box, and it is added as a branch after the feature maps.
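
The center-ness target from the paper, computed from a location's distances (l, t, r, b) to the four sides of its ground-truth box:

```python
import math

def centerness(l, t, r, b):
    """FCOS center-ness target for a location with distances l, t, r, b to
    the left/top/right/bottom sides of its ground-truth box. Equals 1 at
    the box center and decays toward 0 near the edges."""
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

print(centerness(50, 50, 50, 50))   # 1.0   -- exactly at the center
print(centerness(5, 40, 95, 60))    # ~0.19 -- far off-center, suppressed
```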

EfficientDet

Original paper
