# [Review] 4. Faster R-CNN

# 1. Improvement from Fast R-CNN by introducing a Region Proposal Network

- Introduces a Region Proposal Network (RPN) to replace external region proposal algorithms (e.g., Selective Search, EdgeBoxes), which cost about 2 and 0.2 seconds per image, respectively.
- Region proposal algorithms are slower by an order of magnitude because they are implemented on the CPU, while the detection network runs on the GPU.
- RPN takes full-image convolutional features as input and produces object boundaries and their corresponding confidence scores at each grid point.
- With the RPN, the proposal computation time drops from 2 s (or 0.2 s, depending on the region proposal algorithm) to about 10 ms per image.
- RPN is trained end-to-end to output region proposals, which are fed into Fast R-CNN for detection.
- To merge the RPN and Fast R-CNN and train them end-to-end, the two networks share their convolutional features; the RPN effectively serves as an ‘attention’ mechanism that tells the detector where to look.

# 2. Faster R-CNN Architecture

Faster R-CNN consists of two modules: the first is the RPN, a fully-convolutional network that produces region proposals, and the second is the Fast R-CNN detector, which takes the proposals from the RPN as input and produces the object detection results.
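As a rough sketch of this two-module design (the class and attribute names below are my own illustration, not from the paper or any particular library), the data flow could look like this:

```python
import torch.nn as nn

class FasterRCNN(nn.Module):
    """Hypothetical sketch: a shared backbone, an RPN that proposes boxes,
    and a Fast R-CNN head that classifies and refines those boxes."""
    def __init__(self, backbone, rpn, fast_rcnn_head):
        super().__init__()
        self.backbone = backbone              # shared convolutional layers
        self.rpn = rpn                        # module 1: proposes regions
        self.fast_rcnn_head = fast_rcnn_head  # module 2: detects objects in the proposals

    def forward(self, images):
        features = self.backbone(images)                        # full-image conv features, computed once
        proposals, objectness = self.rpn(features)              # RPN "attends" to candidate regions
        detections = self.fast_rcnn_head(features, proposals)   # classify and refine each proposal
        return detections
```

The key design choice is that both modules read the same `features`, so the backbone computation is never repeated per proposal.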

# 3. Region Proposal Networks (RPN)

The RPN takes an image of arbitrary size and produces a set of object proposals, each consisting of a rectangular location and a confidence score of objectness (foreground vs. background).

*Fig 2* illustrates how the RPN works in a sliding-window fashion. The process of the RPN can be described as follows:

(1) A certain spatial window of the feature maps from the preceding convolutional network in *Fig 2* is extracted and fed into the RPN.

(2) By convolving the window with an n⨯n convolutional layer (shared between classification and regression), intermediate feature maps are obtained. Afterward, two sibling 1⨯1 (not shared) convolutional layers are applied to produce the respective feature maps for objectness classification and box regression.

(3) Lastly, these feature maps are fed into the spatially shared fully-connected layers to make the region proposals. **Note that there are actually two FC layers at the same level (one for classification and one for regression).**
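A minimal PyTorch sketch of steps (1)–(3) might look like the following (channel sizes follow the paper's VGG setting of a 512-d intermediate feature and k anchors per position; the class name and everything else is my own illustration):

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of the RPN head: a shared 3x3 conv followed by two sibling
    1x1 convs for objectness classification and box regression."""
    def __init__(self, in_channels=512, mid_channels=512, k=9):
        super().__init__()
        self.shared = nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1)  # step (2): shared n x n conv
        self.cls = nn.Conv2d(mid_channels, 2 * k, kernel_size=1)  # objectness (fg/bg) per anchor
        self.reg = nn.Conv2d(mid_channels, 4 * k, kernel_size=1)  # 4 box offsets per anchor
        self.relu = nn.ReLU(inplace=True)

    def forward(self, features):
        x = self.relu(self.shared(features))   # intermediate feature at every sliding-window position
        return self.cls(x), self.reg(x)        # step (3): per-position objectness scores and box deltas

# usage: a dummy 512-channel feature map of spatial size 38 x 50
scores, deltas = RPNHead()(torch.randn(1, 512, 38, 50))
print(scores.shape, deltas.shape)  # [1, 18, 38, 50], [1, 36, 38, 50]
```

Because the sibling layers are 1⨯1 convolutions, they behave exactly like fully-connected layers applied independently (and in parallel) at every grid point.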

After taking a look at the paper, one question came to mind:

**3.1 Does the RPN actually avoid enumerating filters of multiple scales or aspect ratios?**

The answer to this question is as follows, and is well illustrated in *Fig 3*.

**(a)** is very time-consuming as it requires an image pyramid.

**(b)** applies multiple filters of different sizes to detect objects at various scales. Suppose we have N filters with different scales (e.g., 3⨯3, 5⨯5, etc.); then this costs T × N, where T is the computation time of convolving the feature map with a single-scale filter.

**(c)** For regression (and the same applies to classification), since the RPN uses a fully-connected layer, bounding-box predictions of many shapes can be made in parallel. Let's say we want *k* proposals for each grid point of the feature map. The flattened *k*⨯256 feature matrix is given from the intermediate layer of the RPN. To produce the RPN predictions, we only need to multiply it by the 256⨯4 weight matrix of the FC layer, yielding *k* predictions in different scales or aspect ratios. This is obvious from the matrix multiplication: [*k*, 256] × [256, 4] = [*k*, 4].
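A tiny NumPy sketch of this shape arithmetic (the tensors are random placeholders, purely to check the shapes described above):

```python
import numpy as np

k = 9                               # anchors per grid point (3 scales x 3 aspect ratios in the paper)
features = np.random.randn(k, 256)  # the flattened k x 256 feature matrix from the text
W_reg = np.random.randn(256, 4)     # FC weights predicting 4 box coordinates

boxes = features @ W_reg            # [k, 256] x [256, 4] -> [k, 4]: all k predictions in one multiply
print(boxes.shape)                  # (9, 4)
```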

# 4. Training RPN

As RPN only predicts the presence of an object (regardless of what class it belongs to), a binary class label is assigned to each anchor whose

- IoU (Intersection over Union) with a ground-truth box is the highest among all anchors, or
- IoU is higher than 0.7 with any ground-truth box.

Anchors meeting either of the above conditions are labeled positive; anchors whose IoU with every ground-truth box is below 0.3 are labeled negative, and anchors that are neither positive nor negative do not contribute to the training objective.
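A sketch of this labeling rule (the 0.7/0.3 thresholds follow the paper; the function and its I/O format are my own illustration):

```python
import numpy as np

def label_anchors(iou, pos_thresh=0.7, neg_thresh=0.3):
    """iou: [num_anchors, num_gt] IoU matrix between anchors and ground-truth boxes.
    Returns labels: 1 = positive, 0 = negative, -1 = ignored."""
    labels = np.full(iou.shape[0], -1, dtype=np.int64)

    max_iou_per_anchor = iou.max(axis=1)
    labels[max_iou_per_anchor < neg_thresh] = 0   # negative: IoU < 0.3 with every ground-truth box
    labels[max_iou_per_anchor >= pos_thresh] = 1  # positive condition 2: IoU > 0.7 with some ground-truth box
    labels[iou.argmax(axis=0)] = 1                # positive condition 1: the highest-IoU anchor for each ground-truth box
    return labels
```

Assigning condition 1 last guarantees that the best anchor for a ground-truth box is kept positive even if its IoU happens to be low.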

Following the paper, the RPN is trained with the multi-task loss

$$
L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)
$$

where $i$ indicates the index of an anchor and $p_i$ is the predicted probability of anchor $i$ being an object. $t_i$ denotes the 4 encoded coordinates representing the location of the predicted BBOX relative to anchor $i$. The same notations with an asterisk ($p_i^*$, $t_i^*$) denote the corresponding ground truth. Remark that *log loss* and *smooth L1* are chosen for the classification and regression losses, respectively.
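A hedged PyTorch sketch of this loss (variable names are mine; note that the paper normalizes the regression term by the number of anchor locations, which is only approximated here by the number of positive anchors):

```python
import torch
import torch.nn.functional as F

def rpn_loss(cls_logits, box_deltas, labels, box_targets, lam=10.0):
    """cls_logits: [N] objectness logits; box_deltas, box_targets: [N, 4];
    labels: [N] with 1 = positive, 0 = negative, -1 = ignored (not sampled)."""
    sampled = labels >= 0
    # classification term: log loss averaged over the sampled anchors (N_cls)
    cls_loss = F.binary_cross_entropy_with_logits(
        cls_logits[sampled], labels[sampled].float(), reduction='sum') / sampled.sum().clamp(min=1)

    # regression term: smooth L1, active only for positive anchors (p_i* = 1)
    pos = labels == 1
    reg_loss = F.smooth_l1_loss(
        box_deltas[pos], box_targets[pos], reduction='sum') / pos.sum().clamp(min=1)

    # lam plays the role of lambda, balancing the two terms
    return cls_loss + lam * reg_loss
```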

# 5. ROI warping

Faster R-CNN replaces *ROI pooling* with *ROI warping*, mainly for the following two reasons:

1. **Due to quantization, *ROI pooling* loses a lot of spatial information, which especially hurts bounding-box regression.**
2. **Because of this quantization, *ROI pooling* is also not differentiable with respect to the box coordinates,** and therefore the error w.r.t. the predicted BBOX coordinates cannot be propagated back to the RPN. To address this, Faster R-CNN applies *ROI warping*, which uses bilinear interpolation to ensure the differentiability of the pooling operation.

Please see the above article and the paper for further detail on how *ROI warping* works.
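As a minimal illustration of the bilinear-interpolation idea behind *ROI warping* (this is just the core sampling operation, my own sketch, not the full layer), notice that gradients flow back to the continuous sampling coordinates:

```python
import torch

def bilinear_sample(feat, x, y):
    """feat: [C, H, W] feature map; x, y: fractional coordinates (0-dim tensors)
    assumed to lie strictly inside the map. Returns the interpolated [C] vector."""
    x0, y0 = int(torch.floor(x)), int(torch.floor(y))  # surrounding integer grid corners
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0                            # fractional offsets, differentiable w.r.t. x and y
    return (feat[:, y0, x0] * (1 - wx) * (1 - wy) +
            feat[:, y0, x1] * wx * (1 - wy) +
            feat[:, y1, x0] * (1 - wx) * wy +
            feat[:, y1, x1] * wx * wy)

# usage: gradients reach the (continuous) sampling location
feat = torch.randn(256, 14, 14)
x = torch.tensor(3.6, requires_grad=True)
y = torch.tensor(7.2, requires_grad=True)
bilinear_sample(feat, x, y).sum().backward()
print(x.grad, y.grad)  # non-None: the sampling coordinates receive gradients
```

In contrast, the max operation over a quantized bin in *ROI pooling* has no gradient with respect to where the bin boundaries lie.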

# 6. Training Faster R-CNN

# Summary

# Reference

[1] Ren, S., He, K., Girshick, R., & Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NeurIPS 2015.

[2] Ross Girshick

[3] https://www.youtube.com/watch?v=nDPWywWRIRo

[4] https://www.youtube.com/watch?v=Jo32zrxr6l8

[6] Aridian Uman

**Any corrections, suggestions, and comments are welcome**