Object Detection Memo from RCNN to the Latest (2020 July)
TL;DR
Object detection is a relatively well-studied task in machine learning. However, as in every other field, the latest research always builds on tons of previous work.
This article aims to organize every representative piece of object detection research and note the key features that make each one representative.
Feel free to let me know in a response if there is other important research I haven't mentioned, or important features I skipped in a specific paper.
Model Zoo
2014
R-CNN (Ross Girshick et al.)
SPPNet (Kaiming He et al.)
2015
Fast R-CNN (Ross Girshick et al.)
Faster R-CNN (Ren et al.)
2016
YOLO (Redmon et al.)
SSD (Liu et al.)
YOLOv2 (Redmon et al.)
2017
Feature Pyramid Network (Tsung-Yi Lin et al.)
RetinaNet (Tsung-Yi Lin et al.)
2018
YOLOv3 (Redmon et al.)
2019
Object as Points (Xingyi Zhou et al.) <Coming soon>
CornerNet (Hei Law et al.)<Coming soon>
CenterNet (Kaiwen Duan et al.)<Coming soon>
FCOS (Zhi Tian et al.)
2020
DETR(Nicolas Carion et al.) <Coming soon>
YOLOv4 (Alexey Bochkovskiy et al.) <Coming soon>
EfficientDet (Mingxing Tan et al.)<Coming soon>
YOLOv5 (Glenn Jocher) <Coming soon>
R-CNN
R-CNN first extracts region proposals with selective search, then warps each proposed region to a fixed-size image and feeds it to a CNN. An SVM classifier and a bounding-box regression model are trained on the CNN's output features. Since R-CNN feeds every region into the CNN separately, the CNN has to run ~2k forward passes (one per region proposal) for a single image.
Selective search
Original paper
Selective search estimates possible object regions by iteratively merging the most similar neighboring regions.
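The greedy merging loop can be sketched as below. This is a minimal illustration only: the `similarity` function is a placeholder argument, whereas the real algorithm combines color, texture, size, and fill similarities over an initial over-segmentation.

```python
def merge_regions(regions, similarity):
    """Greedily merge the most similar pair of regions until one remains.
    Every region, initial or merged, becomes a proposal (sketch of
    selective search's hierarchical grouping; `similarity` is a stand-in)."""
    proposals = list(regions)
    regions = list(regions)
    while len(regions) > 1:
        # Find the most similar pair of current regions.
        i, j = max(
            ((a, b) for a in range(len(regions)) for b in range(a + 1, len(regions))),
            key=lambda ab: similarity(regions[ab[0]], regions[ab[1]]),
        )
        merged = regions[i] | regions[j]  # union of the two pixel sets
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged)
    return proposals
```

With regions as pixel sets and, say, a size-based similarity (smaller unions first), every intermediate merge is kept as a candidate object region.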
SPPNet (Spatial Pyramid Pooling Net)
SPPNet originally aimed to solve the issue that, after region proposal by selective search in R-CNN, each proposed region loses its size and aspect ratio due to resizing. Instead of cropping and resizing regions before feeding them to the CNN, SPPNet feeds in the whole image directly and uses a Spatial Pyramid Pooling layer to extract and represent specific regions from the feature maps. Therefore, the CNN in SPPNet only needs to run once per image.
The most important takeaway from SPPNet is that feature maps are able to represent the original image, so region proposals don't need to be cropped out before the CNN.
Fast R-CNN
Fast R-CNN adopts the concept of SPPNet and creates the Region of Interest (RoI) pooling layer. Instead of cropping and warping regions before the CNN, Fast R-CNN feeds the image in once and extracts regions from the last convolutional feature map. From that feature map, the RoI pooling layer pools out each region's projection, and the network predicts classes from the extracted region.
RoI pooling layer
The RoI pooling layer pools each region of interest from the feature map by projecting the region from the original image. Check here for the details: https://deepsense.ai/region-of-interest-pooling-explained/
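A minimal NumPy sketch of RoI max pooling on a single-channel feature map, assuming the RoI is already given in feature-map coordinates (the image-to-feature-map projection via a spatial scale factor is omitted):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size):
    """Max-pool one RoI (x1, y1, x2, y2, end-exclusive, in feature-map
    coordinates) into a fixed output_size = (H, W) grid, so any region
    becomes a fixed-size feature regardless of its shape."""
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into out_h * out_w roughly equal sub-windows.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    out = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = region[h_edges[i]:h_edges[i + 1],
                               w_edges[j]:w_edges[j + 1]].max()
    return out
```

For example, pooling a 4*4 region into a 2*2 output takes the max of each quadrant.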
Faster R-CNN
Instead of using a generic region proposal method like selective search, Faster R-CNN proposes the Region Proposal Network (RPN) as its region proposer. RPN adopts anchors of different scales and aspect ratios, and uses a regressor to refine the bounding boxes.
Region Proposal Network
RPN slides a small network over the convolutional feature map output by the last shared conv layer. At every sliding-window position, RPN predicts k anchor boxes and scores whether each contains an object or not. RPN started the era of anchor-box-based object detection.
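Generating the k anchors for one location can be sketched as below (Faster R-CNN uses 3 scales * 3 ratios = 9 anchors per location; the base size and exact values here are illustrative):

```python
import numpy as np

def make_anchors(base_size, scales, ratios):
    """Generate k = len(scales) * len(ratios) anchors centered at the
    origin as (x1, y1, x2, y2); sliding these over the feature map
    yields the anchors at every location."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the anchor area fixed at (base_size * scale)^2
            # while varying the aspect ratio w / h = ratio.
            area = (base_size * scale) ** 2
            w = np.sqrt(area * ratio)
            h = np.sqrt(area / ratio)
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return np.array(anchors)
```

Each scale contributes one anchor per aspect ratio, all sharing the same area.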
YOLO
YOLO is the first single-shot object detector. The main concept of YOLO is that it splits the original image into an n*n grid, and for every grid cell there are k bounding boxes in charge of predicting at most one object.
On top of the convolutional feature map, YOLO adds fully connected layers that output both the classification and the bounding-box predictions for every grid cell. By doing so, YOLO achieves detection with one single model, which started single-shot object detection.
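The "cell in charge" assignment can be sketched as below: the grid cell containing an object's center is the one responsible for predicting it (the 7*7 grid default follows the YOLO paper; the function name is my own):

```python
def responsible_cell(box_center, image_size, n=7):
    """Return the (row, col) of the n*n grid cell whose area contains
    the box center; in YOLO, that cell's predictors are responsible
    for detecting the object."""
    x, y = box_center
    w, h = image_size
    # Scale the center into grid units and clamp to the last cell.
    col = min(int(x / w * n), n - 1)
    row = min(int(y / h * n), n - 1)
    return row, col
```

A center at exactly the image midpoint of a 448*448 input lands in cell (3, 3) of the 7*7 grid.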
SSD (Single Shot MultiBox Detector)
SSD is definitely a milestone in object detection. Unlike previous models, which take one feature map for prediction, SSD started extracting multiple layers for predictions. By looking at different feature maps, it is able to detect objects at different resolutions.
This multi-scale feature extraction has been adopted by basically every later object detection model.
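As a concrete count, the SSD300 configuration from the paper predicts from six feature maps, and its default boxes sum up as follows:

```python
# SSD300's prediction maps as (grid size, default boxes per location),
# following the configuration in the SSD paper.
feature_maps = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

# Each map contributes size * size * k boxes.
total = sum(size * size * k for size, k in feature_maps)
print(total)  # 8732 default boxes across six resolutions
```

Large shallow maps (38*38) catch small objects, while the 1*1 map covers objects spanning the whole image.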
YOLOv2
YOLOv2 is based on the structure of YOLO with the changes listed below:
- Adopted batch normalization, removed dropout.
- High-resolution classification: the classification backbone is fine-tuned at 448*448 instead of 224*224.
- Anchor boxes replace the original bounding-box prediction, with the priors shared across all cells.
- Dimension clustering. Anchor box shapes aren't handpicked anymore; k-means is run on the training boxes to pre-select the best priors.
- Direct location prediction. The predicted box offsets are constrained with a logistic activation, preventing unstable iterations in the early training stage.
- Fine-grained features. Early-layer features (26*26) are passed through to a deeper layer (13*13).
- Multi-scale training. The input image is resized during training to increase the backbone's robustness to different sizes.
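Dimension clustering can be sketched as below: k-means over box shapes (w, h), but with distance d = 1 - IoU instead of Euclidean distance, so large boxes don't dominate. This is an illustrative reimplementation, not the paper's code; function names are my own.

```python
import numpy as np

def iou_wh(box, clusters):
    """IoU between one (w, h) box and each cluster, with all boxes
    aligned at a common corner (position is irrelevant for shapes)."""
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def kmeans_boxes(boxes, k, iters=100, seed=0):
    """k-means on box shapes with distance d = 1 - IoU;
    returns k prior (w, h) pairs."""
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the cluster with the highest IoU.
        assign = np.array([np.argmax(iou_wh(b, clusters)) for b in boxes])
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else clusters[i] for i in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters
```

On a toy set with two clearly distinct shapes, the two priors recover one small square-ish box and one tall box.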
FPN(Feature Pyramid Network)
A feature pyramid is a technique to share semantic information from deeper feature maps with shallower feature maps. This makes the predictions on shallower feature maps semantically stronger.
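One top-down merge step can be sketched as below, stripped to arrays: the deeper (semantically stronger) map is upsampled 2x and added to the shallower lateral map. The 1x1 lateral convs and 3x3 smoothing convs of the paper are omitted here.

```python
import numpy as np

def top_down_merge(deep, lateral):
    """One FPN merge step on single-channel maps: nearest-neighbor
    upsample the deeper map 2x, then add the lateral map coming
    from the shallower level."""
    upsampled = deep.repeat(2, axis=0).repeat(2, axis=1)
    return upsampled + lateral
```

Chaining this step from the deepest map down through every level produces the pyramid of merged maps that the detection heads run on.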
RetinaNet
RetinaNet is a combination of a Feature Pyramid Network and a novel loss function, Focal Loss. It achieved the SoTA of its time.
Focal Loss
Anchor-box-based detection models share a common issue. After labeling the anchor boxes, the sample sizes of positive boxes (boxes on objects) and negative boxes (boxes on background) are extremely imbalanced. This is because anchor-box-based models use numerous preset boxes (e.g. ~9k in SSD) to detect only a handful of objects per image. The imbalance lets negative samples dominate the back-propagation, degrading learning and final performance.
Focal Loss adds a weighting factor to tune the imbalance, down-weighting easy (mostly negative) samples so that hard and positive samples affect the model more.
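The binary form FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t) from the paper can be written directly:

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    p is the predicted foreground probability, y is the label in {0, 1};
    alpha = 0.25, gamma = 2 are the paper's defaults."""
    p_t = np.where(y == 1, p, 1 - p)          # prob of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)
```

A confidently classified easy negative (p near 0) gets a near-zero loss because of the (1 - p_t)^gamma factor, while a hard misclassified box keeps a large loss; with gamma = 0 and alpha = 0.5 this reduces to scaled cross-entropy.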
YOLOv3
YOLOv3 is based on YOLOv2 with the changes listed below:
- Added logistic regression for an objectness score alongside the bounding-box prediction.
- Changed the softmax classifier to independent logistic classifiers, under the assumption that a box may carry multiple labels instead of one.
- Extracted multiple feature maps from the model to obtain higher-resolution information and enable prediction across scales.
- No more Darknet-19; a brand-new backbone, Darknet-53, with ResNet-style residual connections, is used instead.
Object as Points
CornerNet
CenterNet
FCOS
FCOS is an anchor-free object detection model that uses a Feature Pyramid Network (FPN) to create feature maps and adds a head after every feature map. The training process also uses a novel branch called center-ness, with its own loss, besides the classification loss and bounding-box loss.
Center-ness
Center-ness suppresses the low-quality predicted bounding boxes produced by locations far away from the center of an object. It is an index describing how close a location is to the center of its ground-truth box, and it is predicted by an extra branch added after the feature maps.
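The FCOS paper defines center-ness from a location's distances (l, t, r, b) to the four sides of its ground-truth box, which can be computed as:

```python
import numpy as np

def centerness(l, t, r, b):
    """FCOS center-ness for a location with distances (l, t, r, b) to
    the left/top/right/bottom sides of its ground-truth box:
    sqrt( min(l, r)/max(l, r) * min(t, b)/max(t, b) ), in [0, 1]."""
    return np.sqrt((np.minimum(l, r) / np.maximum(l, r)) *
                   (np.minimum(t, b) / np.maximum(t, b)))
```

The value is 1 exactly at the box center and decays toward 0 near the edges; at test time, the predicted center-ness multiplies the classification score so off-center boxes are down-ranked by NMS.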