Learning Day 63: Object detection 2— SPP-Net
Jun 17, 2021
SPP-Net
- Addresses the main drawback of R-CNN: it is slow, because every region proposal (~2000 per image) goes through the CNN separately, i.e. image → crop/warp → conv layers → fc layers → output
- Feeds the entire image through the CNN only once and extracts region features from the Conv5 feature maps
- Uses Spatial Pyramid Pooling (SPP) to extract fixed-length features from regions of different sizes
- SPP-Net flow: image → conv layers → spatial pyramid pooling → fc layers → output
- So it takes advantage of shared computation in the conv layers and can adapt to different input sizes. The input size is constrained by the fc layers, not the conv layers: in R-CNN each crop has to be warped to a fixed size, whereas SPP in SPP-Net removes this size constraint
- How to get the Region of Interest (ROI) features here: region proposals are still generated on the original image (selective search, as in R-CNN), and each box is projected onto the Conv5 feature maps, where SPP pools its features. That works because objects in the feature maps and in the original image have the same relative position
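To make the SPP idea concrete, here is a minimal pure-Python sketch of my own, not the paper's implementation. It uses the correspondence between image and feature-map positions to map a box onto feature-map cells, then max-pools that region into 1×1, 2×2 and 4×4 grids. The helper names `project_roi` and `spp`, the stride of 16, the pyramid levels and the toy feature map are all assumptions for illustration.

```python
import math

def project_roi(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cells.
    stride=16 is an assumed total downsampling factor up to Conv5."""
    x1, y1, x2, y2 = box
    fx1, fy1 = x1 // stride, y1 // stride
    fx2 = max(fx1 + 1, math.ceil(x2 / stride))  # keep at least one cell
    fy2 = max(fy1 + 1, math.ceil(y2 / stride))
    return fx1, fy1, fx2, fy2

def spp(fmap, roi, levels=(1, 2, 4)):
    """Max-pool the ROI of every channel into n x n bins for each level
    and concatenate, so the output length does not depend on ROI size."""
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    features = []
    for n in levels:
        for channel in fmap:
            for i in range(n):
                r0 = y1 + (i * h) // n
                r1 = y1 + math.ceil((i + 1) * h / n)
                for j in range(n):
                    c0 = x1 + (j * w) // n
                    c1 = x1 + math.ceil((j + 1) * w / n)
                    features.append(max(channel[r][c]
                                        for r in range(r0, r1)
                                        for c in range(c0, c1)))
    return features

# Toy "Conv5" output: 2 channels, 15 x 20 cells (a 240 x 320 image at stride 16).
C, H, W = 2, 15, 20
fmap = [[[ch * 1000 + y * W + x for x in range(W)] for y in range(H)]
        for ch in range(C)]

roi_a = project_roi((32, 48, 200, 120))  # a small box
roi_b = project_roi((0, 0, 320, 240))    # the whole image
feat_a = spp(fmap, roi_a)
feat_b = spp(fmap, roi_b)
print(len(feat_a), len(feat_b))  # both are C * (1 + 4 + 16) = 42
```

Both ROIs produce a vector of the same length even though they cover different numbers of cells, which is what lets the fc layers accept regions of any size.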
SPP-Net fine-tuning procedure
- With the above SPP-Net flow as the basis
- Load a pre-trained model and compute the SPP features F for all ROIs
- Use F to fine-tune only the fc layers, fc6 → fc7 → fc8 (different from R-CNN, which fine-tunes the conv layers as well)
- Compute the new fc7 features and use them to train the SVM classifier
- Use F for bounding box regression
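A rough sketch of what "fine-tune only the fc layers on F" means in practice. A single logistic layer stands in for fc6 → fc7 → fc8, and the feature values and labels are made up for illustration. The point is that F is computed once and treated as fixed input, so gradient updates only ever touch the fc parameters.

```python
import math

# Hypothetical cached SPP features F (one row per ROI) and object/background labels.
F = [[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 1.0]]
labels = [1, 1, 0, 0]

# Stand-in for the fc layers: one logistic unit. Only these are trained.
fc_w = [0.0, 0.0]
fc_b = 0.0

def predict(x):
    z = sum(w * xi for w, xi in zip(fc_w, x)) + fc_b
    return 1.0 / (1.0 + math.exp(-z))

def loss():
    # Mean binary cross-entropy over all cached ROIs.
    return -sum(y * math.log(predict(x)) + (1 - y) * math.log(1 - predict(x))
                for x, y in zip(F, labels)) / len(F)

loss_before = loss()
lr = 0.5
for _ in range(200):
    for x, y in zip(F, labels):
        err = predict(x) - y  # gradient of the loss w.r.t. z
        for k in range(len(fc_w)):
            fc_w[k] -= lr * err * x[k]
        fc_b -= lr * err
        # F itself is never updated: no gradient flows back to the conv layers.
loss_after = loss()
print(loss_before > loss_after)
```

Because the conv layers only ever produce F up front, nothing in this loop can change them.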
Remaining drawbacks inherited from R-CNN and a new drawback
- Need to store a large amount of features
- Multi-phase training
- Faster than R-CNN but still quite slow
- New drawback: cannot fine-tune the conv layers before the SPP layer
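To get a feel for the storage drawback, here is a back-of-the-envelope estimate. Apart from the ~2000 proposals per image, every number here is an assumption of mine: 256 Conv5 channels, a pyramid with 1×1, 2×2, 3×3 and 6×6 bins, and float32 values.

```python
num_rois = 2000          # ~2000 proposals per image
channels = 256           # assumed number of Conv5 channels
levels = (1, 2, 3, 6)    # assumed pyramid levels (n x n bins each)
bytes_per_value = 4      # assumed float32 storage

bins = sum(n * n for n in levels)       # bins per channel
dim = bins * channels                   # length of one SPP feature vector
per_image_mb = num_rois * dim * bytes_per_value / 1e6

print(bins, dim, per_image_mb)
```

Under these assumptions each image needs on the order of 100 MB of cached features, so caching F for a whole training set adds up quickly.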