Review: Spatial Pyramid Pooling [1406.4729]

Passing variable size input to CNN

Sanchit Tanwar
Analytics Vidhya
4 min read · Apr 19, 2020


Credits: https://en.wikipedia.org/wiki/Object_detection

I have planned to read the major object detection papers (I have skimmed most of them already, but I will now read each in enough detail to write a blog about it). The papers are all related to deep-learning-based object detection. Feel free to give suggestions or ask questions; I will try my best to help everyone. I will list the arXiv code of each paper below and give a link to its blog (I will keep updating the links as I write) and to the paper itself. Anyone starting out in the field can skip a lot of these papers; I will also note the priority/importance of each paper (according to how necessary it is for understanding the topic) once I have read them all.
I have written this blog with readers like me, who are still learning, in mind. I will try to minimize mistakes by understanding each paper in depth from various sources, including blogs, code, and videos, but if you find any, feel free to highlight them in a comment on the blog. The list of papers I will be covering is at the end of the blog.

Let’s get started :)

CNNs extract features from images, and fully connected layers then perform classification. Because convolution is applied in a sliding-window fashion, the convolutional layers can accept inputs of varied size, producing outputs whose size varies accordingly. The fully connected layers that follow, however, only accept input of a fixed size, and this makes the network as a whole incapable of handling varied input sizes. Images are therefore reshaped to a specific dimension before being fed to the CNN, which creates another issue: warping and reduced resolution. Spatial pyramid pooling comes as a counter to this problem.

Spatial Pyramid Pooling

Before spatial pyramid pooling, the extracted feature map was usually either flattened (fully connected layers accept input as a 1-d vector) or pooled in a sliding-window fashion, both of which give an output whose size varies with the input.

Spatial pyramid pooling maintains spatial information in local spatial bins. The number of bins is fixed, while their size scales with the input. In each spatial bin, the responses of each filter are pooled. In the example image shown below, three-level pooling is used; in the paper, the authors use max-pooling throughout.

Spatial Pyramid Pooling (Credits: Paper)

The feature map from the last convolutional layer has 256 channels and an arbitrary spatial size (it depends on the input size).

  1. The first pooling level (the gray one in the figure) has a single bin covering the complete feature map, similar to a global pooling operation. Its output is 256-d.
  2. The second level pools the feature map into 4 bins, giving an output of size 4*256.
  3. The third level pools the feature map into 16 bins, giving an output of size 16*256.

The outputs of all the pooling levels are flattened and concatenated to give a fixed-dimensional output ((1 + 4 + 16) * 256 = 5376-d here), irrespective of the input size.
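The bin arithmetic above can be sketched as a short NumPy function. This is a minimal illustration, not the authors' code; the floor/ceil bin boundaries are one common way to realize the adaptive pooling the paper describes.

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) feature map into a fixed set of spatial bins.

    For each pyramid level n, the map is divided into an n x n grid of bins
    (boundaries via floor/ceil so the bins always cover the whole map), and
    each filter's responses are max-pooled within every bin. The result is
    a 1-d vector of length C * sum(n*n for n in levels), independent of H, W.
    """
    C, H, W = fmap.shape
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                r0, r1 = i * H // n, -(-(i + 1) * H // n)  # -(-x // n) = ceil
                c0, c1 = j * W // n, -(-(j + 1) * W // n)
                pooled.append(fmap[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

# Feature maps of different spatial sizes yield the same output length:
a = spatial_pyramid_pool(np.random.rand(256, 13, 13))
b = spatial_pyramid_pool(np.random.rand(256, 10, 8))
assert a.shape == b.shape == (256 * 21,)   # 5376-d either way
```

Note that the fixed quantity is the number of bins, not their size: the bin windows grow and shrink with the feature map, which is exactly what makes the output dimension constant.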

Multi Size Training

Now that the CNN can handle varied input sizes, the authors trained the network on multiple input sizes (they chose 224*224 and 180*180), switching between the two from one epoch to the next while sharing all weights. The main reason for multi-size training was to simulate varying input sizes. This technique worked well and showed an improvement over single-size training.
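As a sketch of why this size-switching is possible, the snippet below (my own NumPy illustration; the 13*13 and 10*10 conv5 map sizes for 224- and 180-pixel inputs are taken from the paper) alternates feature-map sizes across epochs while reusing a single fully connected weight matrix:

```python
import numpy as np

def spp(fmap, levels=(1, 2, 4)):
    # Max-pool (C, H, W) into fixed spatial bins; length C * 21 for these levels.
    C, H, W = fmap.shape
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                r0, r1 = i * H // n, -(-(i + 1) * H // n)  # ceil division
                c0, c1 = j * W // n, -(-(j + 1) * W // n)
                out.append(fmap[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(out)

# One shared classifier weight matrix serves both training sizes, because
# the SPP output length never changes across epochs.
rng = np.random.default_rng(0)
W_fc = rng.standard_normal((1000, 256 * 21))         # shared FC weights
for epoch, fmap_hw in enumerate([13, 10, 13, 10]):   # alternate size per epoch
    feats = spp(rng.random((256, fmap_hw, fmap_hw))) # stand-in conv5 output
    logits = W_fc @ feats                            # same shapes every epoch
    assert logits.shape == (1000,)
```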

The paper contains other experiments and training strategies that I will skip for the sake of simplicity and length; the purpose of this blog is to explain the techniques, not every experiment performed in the paper.

SPPNet for Object Detection

Object detection with spatial pyramid pooling builds on the R-CNN architecture, which I hope you are familiar with. In R-CNN, around 2000 region proposals are generated, and each of the 2000 cropped regions is passed through the CNN separately. In SPPNet, the convolutional feature map is extracted only once per image; each candidate region is projected onto the feature map, and spatial pyramid pooling is applied to that window to generate a fixed-size representation. Since the convolutional layers are the most time-consuming part, SPPNet is much faster than R-CNN.

The fully connected layers are then executed once per proposal (about 2000 times per image) to generate the final predictions.
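A rough sketch of the per-proposal pooling follows. This is my own simplification: I assume an overall feature stride of 16 and a plain divide-and-round projection, whereas the paper's window mapping also accounts for padding.

```python
import numpy as np

def spp(fmap, levels=(1, 2, 4)):
    # Max-pool a (C, H, W) window into a fixed set of spatial bins.
    C, H, W = fmap.shape
    out = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                r0, r1 = i * H // n, -(-(i + 1) * H // n)  # ceil division
                c0, c1 = j * W // n, -(-(j + 1) * W // n)
                out.append(fmap[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(out)

def pool_proposal(conv_map, box, stride=16):
    # Project an image-space box (x0, y0, x1, y1) onto the conv feature map
    # (assumed overall stride: 16) and pool only that window, so the
    # convolutional layers run just once per image.
    x0, y0, x1, y1 = box
    c0, r0 = x0 // stride, y0 // stride
    c1 = max(c0 + 1, -(-x1 // stride))
    r1 = max(r0 + 1, -(-y1 // stride))
    return spp(conv_map[:, r0:r1, c0:c1])

conv_map = np.random.rand(256, 40, 60)         # one forward pass per image
proposals = [(0, 0, 100, 100), (200, 50, 640, 300)]
feats = [pool_proposal(conv_map, b) for b in proposals]
# Every region yields the same 21 * 256 = 5376-d vector for the FC layers.
assert all(f.shape == (256 * 21,) for f in feats)
```

Only `pool_proposal` runs per region; the expensive convolutions are shared across all 2000 proposals, which is the source of SPPNet's speedup over R-CNN.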

Peace …

List of Papers:

  1. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. [Link to blog]
  2. Rich feature hierarchies for accurate object detection and semantic segmentation(RCNN). [Link to blog]
  3. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition (SPPNet). ← You completed this blog.
  4. Fast R-CNN [Link to blog]
  5. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. [Link to blog]
  6. You Only Look Once: Unified, Real-Time Object Detection. [Link to blog]
  7. SSD: Single Shot MultiBox Detector. [Link to blog]
  8. R-FCN: Object Detection via Region-based Fully Convolutional Networks. [Link to blog]
  9. Feature Pyramid Networks for Object Detection. [Link to blog]
  10. DSSD: Deconvolutional Single Shot Detector. [Link to blog]
  11. Focal Loss for Dense Object Detection(Retina net). [Link to blog]
  12. YOLOv3: An Incremental Improvement. [Link to blog]
  13. SNIPER: Efficient Multi-Scale Training. [Link to blog]
  14. High-Resolution Representations for Labeling Pixels and Regions. [Link to blog]
  15. FCOS: Fully Convolutional One-Stage Object Detection. [Link to blog]
  16. Objects as Points. [Link to blog]
  17. CornerNet-Lite: Efficient Keypoint Based Object Detection. [Link to blog]
  18. CenterNet: Keypoint Triplets for Object Detection. [Link to blog]
  19. Training-Time-Friendly Network for Real-Time Object Detection. [Link to blog]
  20. CBNet: A Novel Composite Backbone Network Architecture for Object Detection. [Link to blog]
  21. EfficientDet: Scalable and Efficient Object Detection. [Link to blog]

