Paper Summary: Fast R-CNN

Mike Plotz Sage
Nov 23, 2018


Part of the series A Month of Machine Learning Paper Summaries. Originally posted here on 2018/11/05.

Fast R-CNN (2015) Ross Girshick

With this paper we’re going to take a little step back and look at object detection, which is all about drawing bounding boxes around objects (and maybe also classifying them). This is in some ways an easier problem than the image segmentation problem that the U-Net paper (from a couple days ago) was concerned with. All we’re looking for here is an object class (so, classification) and a bounding box, which can be represented by four numbers (so, regression).

The paper uses Selective Search (Uijlings 2013) to generate a number of regions of interest (RoIs) (the specific method for generating RoIs is not the focus of this paper). Each RoI is then evaluated for whether there’s an object there, and if so what its bounding box is. This paper is based on the author’s prior work (the R-CNN from Girshick et al 2014), and results in a pretty impressive speedup: roughly 9x faster training and over 200x faster inference than R-CNN. The innovation here is to share expensive early-layer computation across RoIs with the help of an RoI pooling layer (more on this in a moment), which also means that R-CNN’s heavy use of caching features to disk is no longer necessary.

The architecture takes in an image and a set of RoIs. For each RoI we output softmax probabilities over K classes (plus a background class) and x, y, w, h adjustments to the region that determine a bounding box prediction; the exact parameterization of the bounding box numbers is somewhat involved and is laid out in the earlier R-CNN paper. For the loss function, the paper follows the now-standard strategy of putting as much as possible into one network and combining multiple terms into a single loss. In this case there are two terms: a class loss (a log loss, −log p_u on the true class u, which is just cross entropy against a one-hot target) and a location loss (a smoothed L1 loss summed over the 4 outputs; they went with this because L2 loss caused exploding gradients).
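
For concreteness, here’s a minimal NumPy sketch of the smooth L1 function (my own, not code from the paper): it’s quadratic near zero and linear beyond, so large regression errors don’t blow up the gradients the way L2 does.

```python
import numpy as np

def smooth_l1(x):
    # Quadratic for |x| < 1 (gentle near the target), linear otherwise
    # (gradient magnitude capped at 1, unlike L2's unbounded 2x).
    absx = np.abs(x)
    return np.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

print(smooth_l1(np.array([-3.0, 0.5, 2.0])))  # [2.5, 0.125, 1.5]
```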

There’s an interesting detail here where regions labeled as background don’t include a location loss, which makes sense considering that backgrounds don’t really have bounding boxes. (Note: if you enjoy mathematical notation, as I do, you might appreciate how they exclude location loss:

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)

The square brackets are called Iverson brackets, a generalization of the Kronecker delta, which was new to me. In this case the second term, [u ≥ 1], is 1 when the true class u is a real object class and 0 when it’s the background class, so the location loss drops out for background regions. They used λ = 1.)
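
In code the Iverson bracket is just a boolean mask. Here’s a hedged PyTorch sketch of the combined loss (names are mine, and I’m assuming the box deltas have already been gathered for each RoI’s true class):

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_scores, box_deltas, labels, box_targets, lam=1.0):
    """class_scores: (N, K+1) logits; labels: (N,) ints with 0 = background;
    box_deltas, box_targets: (N, 4), already selected for the true class."""
    cls_loss = F.cross_entropy(class_scores, labels)  # the log loss term
    fg = labels >= 1  # the Iverson bracket [u >= 1] as a mask
    if fg.any():
        loc_loss = F.smooth_l1_loss(box_deltas[fg], box_targets[fg])
    else:  # all-background batch: no location term at all
        loc_loss = class_scores.new_zeros(())
    return cls_loss + lam * loc_loss
```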

RoI Pooling. This is what the paper is all about, but it’s actually quite simple and is equivalent to what is now referred to as adaptive pooling. The idea is to have a max pooling layer that takes an arbitrary input region and maps it down to a fixed H×W feature map (e.g. 7×7; H and W are hyperparameters). You just divide the region into an H×W grid of subregions and max pool each one. There’s even a section on how to back-prop through the RoI pooling layer, which I found rather odd since your automatic differentiation framework of choice should be able to handle this just fine. Maybe that wasn’t true in early 2015? This is also a special case of the spatial pyramid pooling layer from He et al 2014, with a single pyramid level.
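
Here’s a minimal sketch of RoI pooling in PyTorch (the per-box loop and names are mine; real implementations batch this and are careful about rounding box coordinates to the feature-map grid):

```python
import torch
import torch.nn.functional as F

def roi_pool(feature_map, rois, out_size=(7, 7)):
    """feature_map: (C, H, W) conv features, computed once and shared.
    rois: iterable of integer (x0, y0, x1, y1) boxes in feature-map
    coordinates. Each region, whatever its size, is max-pooled down
    to a fixed out_size grid."""
    pooled = []
    for x0, y0, x1, y1 in rois:
        region = feature_map[:, y0:y1, x0:x1]  # (C, h_i, w_i), varies per RoI
        pooled.append(F.adaptive_max_pool2d(region, out_size))
    return torch.stack(pooled)  # (num_rois, C, 7, 7), ready for the FC head
```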

Choosing RoIs during training. This is an interesting detail that I’d like to get a better intuition for. For training they chose some regions (25% of them) to be high quality object proposals (at least 0.5 IoU against the ground truth bounding box), while the remaining 75% had IoU between 0.1 and 0.5. The idea, it seems, is to get a good selection of difficult edge cases to train on; the paper relates this to hard example mining. I wonder if you could further improve this strategy by doing curriculum learning, choosing better region proposals early in training and moving to more edge cases later on.
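
Here’s my reading of that sampling scheme as a sketch (the paper draws 64 RoIs from each of 2 images per mini-batch; this helper handles one image’s proposals):

```python
import numpy as np

def sample_rois(max_ious, batch_size=64, fg_fraction=0.25, seed=None):
    """max_ious: each proposal's best IoU against any ground-truth box.
    Returns indices: ~25% foreground (IoU >= 0.5) and the rest 'hard'
    background examples with IoU in [0.1, 0.5)."""
    rng = np.random.default_rng(seed)
    fg = np.flatnonzero(max_ious >= 0.5)
    bg = np.flatnonzero((max_ious >= 0.1) & (max_ious < 0.5))
    n_fg = min(int(batch_size * fg_fraction), len(fg))
    n_bg = min(batch_size - n_fg, len(bg))
    return np.concatenate([rng.choice(fg, n_fg, replace=False),
                           rng.choice(bg, n_bg, replace=False)])
```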

Some other details:

  • Horizontal flipping was the only data augmentation used
  • The learning rate for biases was twice as high as for weights (I don’t think I’d seen this before)
  • They used non-maximum suppression as a post-processing step
  • Since the fully connected layers were a performance bottleneck at high R (number of RoIs), they were able to speed things up by compressing these layers with a truncated SVD (!) (I think OpenAI’s block-sparse kernels are a similar idea; see the sketch after this list)
  • An investigation into whether it’s necessary to do anything special (“image pyramids”) to encourage the network to learn scale invariance (single-scale training turned out to work nearly as well)
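
On that truncated SVD trick: the idea is to factor one big fully connected weight matrix W (u×v) into two thinner layers that keep only the top t singular values, cutting the cost of a forward pass from uv to t(u + v) multiply-adds. A NumPy sketch (the shapes are illustrative, not the paper’s exact layers):

```python
import numpy as np

def compress_fc(W, t):
    """Approximate W ~= (U_t S_t) @ Vt_t: one u x v layer becomes a
    t-dimensional bottleneck, x -> Vt_t @ x -> (U_t S_t) @ (Vt_t @ x)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :t] * s[:t], Vt[:t]  # (second layer, first layer)

W = np.random.randn(4096, 4096)   # a VGG16 fc7-sized layer, say
A, B = compress_fc(W, 256)        # params: ~16.8M -> ~2.1M (8x fewer)
x = np.random.randn(4096)
err = np.linalg.norm(W @ x - A @ (B @ x)) / np.linalg.norm(W @ x)
# A random matrix compresses poorly; trained FC layers have much
# faster-decaying spectra, which is why this worked in practice.
print(f"rank-256 relative error: {err:.2f}")
```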

There was also quite a bit of interesting discussion on ablation studies, particularly around which layers were worth fine-tuning (more than just the last layers, but leave the early ones alone), and whether more region proposals is always good (not after a certain point).

Honestly overall I found this paper comparatively disorganized and hard to follow, and many of the details, while interesting, felt extraneous. Part of the problem is likely that I’m less familiar with the work this paper builds on. I may at some point go back to the earlier R-CNN paper.

Incidentally, this Stanford CS231n convnets course lecture is a good overview of object detection methods. R-CNN is at 32:48, Fast R-CNN is at 41:21, Faster R-CNN (which I’ll cover next) is at 46:39.

Ross Girshick et al 2014 “Rich feature hierarchies for accurate object detection and semantic segmentation” https://arxiv.org/abs/1311.2524

Kaiming He et al 2014 “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition” https://arxiv.org/abs/1406.4729

J. R. R. Uijlings et al 2013 “Selective Search for Object Recognition” https://ivi.fnwi.uva.nl/isis/publications/bibtexbrowser.php?key=UijlingsIJCV2013&bib=all.bib
