RCNN Review [1311.2524]

Sanchit Tanwar
Published in Analytics Vidhya · Apr 6, 2020

I plan to read the major object detection papers (I have already read most of them roughly, but I will be reading them in enough detail to write a blog about each). The papers are all related to deep-learning-based object detection. Feel free to give suggestions or ask questions; I will try my best to help everyone. I will list the arXiv code of each paper below and will give a link to its blog and the paper itself (I will keep updating the links as I write). Anyone starting out in the field can skip a lot of these papers; I will also note the priority/importance of each paper (according to how necessary it is for understanding the topic) once I have read them all.
I have written this blog for readers who, like me, are still learning. If I have made any mistake (I will try to minimize mistakes by understanding the paper in depth from various sources, including blogs, code, and videos), feel free to highlight it or add a comment on the blog. The list of papers I will be covering is at the end of this blog.

Let’s get started :)

The RCNN paper is one of the main papers that triggered research in deep-learning-based object detection. RCNN improved on the previous state of the art by about 30%, which is a significant improvement. Theoretically, this paper is a little easier to understand than some others, such as OverFeat, discussed in the last blog.

The RCNN object detection system is built from three modules: a region-proposal module, a CNN for feature extraction, and a set of class-specific SVM classifiers. Fig. 1 summarizes the system. I will briefly describe each module below.

Fig. 1: RCNN object detection system
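Before going module by module, here is a rough, hypothetical sketch of how the three modules fit together at test time. The helper names (propose_regions, extract_features) and the per-class SVM objects are placeholders for illustration, not code from the paper:

```python
def detect(image, propose_regions, extract_features, svms):
    """Hypothetical R-CNN test-time pipeline: proposals -> CNN features -> per-class SVM scores."""
    detections = []
    for box in propose_regions(image):            # module 1: ~2k selective-search proposals
        feature = extract_features(image, box)    # module 2: warp to 227x227, 4096-d AlexNet features
        for cls, svm in svms.items():             # module 3: one linear SVM per class
            score = svm.decision_function([feature])[0]
            detections.append((box, cls, score))
    return detections                             # per-class greedy NMS is applied afterwards
```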

Region proposals

The first module of RCNN proposes regions (bounding-box candidates) that might contain objects. These regions are generated using selective search; the authors use about 2k proposals per image. At this stage we do not yet know whether a region actually contains an object, but selective search approximates object locations and removes irrelevant background from the proposals. The authors note that other region-proposal techniques could be used, but they ultimately use selective search.

Selective search hierarchically groups similar regions based on color, texture, size, and shape. These grouped regions are then turned into bounding-box proposals (about 2k in this case); a minimal sketch is shown below the figure.

Selective search hierarchical grouping
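To make the proposal step concrete, here is a minimal sketch using the selective search implementation shipped with opencv-contrib-python (the approach covered in reference [3], not the authors' original code); the image path is a placeholder:

```python
import cv2

# Selective search via OpenCV's contrib module (requires opencv-contrib-python).
img = cv2.imread("image.jpg")  # placeholder path

ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # the "quality" mode yields more proposals but is slower

rects = ss.process()               # each rect is (x, y, w, h)
proposals = rects[:2000]           # R-CNN keeps roughly 2k proposals per image
print(f"{len(rects)} regions proposed, keeping {len(proposals)}")
```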

Feature extraction

For each region proposal, a 4096-dimensional feature vector is extracted using the AlexNet model. All regions are resized to 227x227. Since the proposals are not all the same size, all pixels in a tight bounding box around each proposal are warped to the required size; a short sketch follows the figure below.

Warped image regions
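Here is a minimal sketch of this step using torchvision's pretrained AlexNet as a stand-in for the original Caffe model (assuming a recent torchvision with the weights enum API). The 4096-d vector is taken by truncating the classifier at the fc7 layer, and the image tensor and crop coordinates are purely illustrative:

```python
import torch
import torchvision.transforms as T
from torchvision.models import alexnet, AlexNet_Weights

# Pretrained AlexNet; drop the final 1000-way layer so the output is the 4096-d fc7 feature.
model = alexnet(weights=AlexNet_Weights.IMAGENET1K_V1).eval()
model.classifier = model.classifier[:6]

# Warp an arbitrary-sized proposal crop to 227x227, as in the paper.
warp = T.Compose([
    T.ToPILImage(),
    T.Resize((227, 227)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = torch.rand(3, 480, 640)            # placeholder image tensor (C, H, W)
x, y, w, h = 100, 50, 200, 150             # illustrative proposal (x, y, w, h)
crop = image[:, y:y + h, x:x + w]

with torch.no_grad():
    feat = model(warp(crop).unsqueeze(0))  # shape: (1, 4096)
print(feat.shape)
```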

Then, for each class, we score each extracted feature vector using the SVM trained for that class. Greedy non-maximum suppression is applied independently per class: a region is rejected if its intersection-over-union (IoU) overlap with a higher-scoring selected region is larger than a learned threshold.
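As a concrete illustration of the greedy NMS step, here is a minimal, self-contained sketch (not the authors' code; the 0.3 threshold is simply the kind of value one would tune on a validation set):

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(np.asarray(box)) + area(boxes) - inter)

def greedy_nms(boxes, scores, iou_thresh=0.3):
    """Keep the highest-scoring boxes, dropping any box that overlaps a kept one too much."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], float)
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))   # -> [0, 2]; box 1 overlaps box 0 too much
```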

Training

Because of the scarcity of object detection data, the model was first pretrained on the larger auxiliary ImageNet dataset using image-level classification labels (transfer learning). The model was then fine-tuned for the new task of detection and the new domain (warped proposal windows). Only the last layer was changed, from the 1000 ImageNet classes to 21 classes for PASCAL VOC (20 object classes + 1 background). During fine-tuning, all region proposals with ≥0.5 IoU overlap with a ground-truth box are treated as positives for that box's class, and the rest as negatives. Once the model is fine-tuned, a linear SVM is trained for each class.
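A minimal sketch of the layer swap for fine-tuning, again using torchvision's AlexNet as a stand-in for the original Caffe model (the 21-class count comes from PASCAL VOC plus background):

```python
import torch.nn as nn
from torchvision.models import alexnet, AlexNet_Weights

NUM_CLASSES = 21  # 20 PASCAL VOC classes + 1 background

# Start from ImageNet-pretrained weights, then swap only the final 1000-way layer
# for a freshly initialized 21-way layer; every other layer keeps its pretrained
# weights and is fine-tuned on warped proposal windows.
model = alexnet(weights=AlexNet_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(model.classifier[6].in_features, NUM_CLASSES)
```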

References:

  1. https://arxiv.org/abs/1311.2524 [RCNN Paper]
  2. https://www.koen.me/research/selectivesearch
  3. https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
  4. https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e

List of Papers:

  1. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. [Link to blog]
  2. Rich feature hierarchies for accurate object detection and semantic segmentation (RCNN). ← You have just completed this blog.
  3. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition (SPPNet). [Link to blog]
  4. Fast R-CNN [Link to blog]
  5. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. [Link to blog]
  6. You Only Look Once: Unified, Real-Time Object Detection. [Link to blog]
  7. SSD: Single Shot MultiBox Detector. [Link to blog]
  8. R-FCN: Object Detection via Region-based Fully Convolutional Networks. [Link to blog]
  9. Feature Pyramid Networks for Object Detection. [Link to blog]
  10. DSSD: Deconvolutional Single Shot Detector. [Link to blog]
  11. Focal Loss for Dense Object Detection(Retina net). [Link to blog]
  12. YOLOv3: An Incremental Improvement. [Link to blog]
  13. SNIPER: Efficient Multi-Scale Training. [Link to blog]
  14. High-Resolution Representations for Labeling Pixels and Regions. [Link to blog]
  15. FCOS: Fully Convolutional One-Stage Object Detection. [Link to blog]
  16. Objects as Points. [Link to blog]
  17. CornerNet-Lite: Efficient Keypoint Based Object Detection. [Link to blog]
  18. CenterNet: Keypoint Triplets for Object Detection. [Link to blog]
  19. Training-Time-Friendly Network for Real-Time Object Detection. [Link to blog]
  20. CBNet: A Novel Composite Backbone Network Architecture for Object Detection. [Link to blog]
  21. EfficientDet: Scalable and Efficient Object Detection. [Link to blog]

Peace…
