Review: OverFeat — Winner of ILSVRC 2013 Localization Task (Object Detection)

There are 3 tasks in ILSVRC 2013: classification, localization and detection.

OverFeat [1] handles all 3 tasks with a single CNN. It won the localization task in ILSVRC (ImageNet Large Scale Visual Recognition Competition) 2013 [2], ranked 4th in the classification task at that time, and ranked 1st in the detection task in post-competition work.

By the way, this work is from Prof. LeCun's group; Prof. LeCun is the inventor of LeNet [9], a classic deep learning work. It is a 2014 ICLR paper with more than 2000 citations at the time of writing. (Sik-Ho Tsang @ Medium)

Classification: Classify the object within the image.
Localization: Classify the objects and localize the objects by bounding boxes within the image.
Detection: Similar to localization in that we also need to classify the objects and localize them by bounding boxes, but the images can contain small objects, and the evaluation metric is different from that of localization. A background class also needs to be predicted when there are no objects.

Localization (Top) and Detection (Bottom)

What will be covered:

  1. The CNN models modified from AlexNet (Fast and Accurate models)
  2. Fine Stride Max Pooling
  3. Multi-Scale Classification
  4. Classification Results
  5. Regression Network for Localization/Detection

1. The CNN models modified from AlexNet (Fast and Accurate models)

The Original AlexNet (Single-GPU Version)

The above is the single-GPU version of AlexNet [3]. (Please visit my review [4] if interested.) The authors modified it into a fast model and an accurate model, as below:

Fast Model
Accurate Model

In simple words, these 2 models make some modifications to AlexNet, but there are no big changes to the overall architecture.

For example, no local response normalization (LRN) is used, the pooling layers are non-overlapping, and the feature maps at the 1st and 2nd layers are larger due to a smaller stride of 2. ZFNet [5] showed, by visualizing the CNN layers, that a smaller stride can help to increase accuracy. (If interested, please visit my review [6].)

2. Fine Stride Max Pooling

Fine stride max pooling is applied at the 5th layer of the modified network.

(b) At the 5th layer, 3×3 max pooling is performed multiple times, each time with a different pixel offset, Δx and Δy, taken from {0, 1, 2}.

(c) Pooling is repeated for each of the 3×3 = 9 (Δx, Δy) offset combinations, giving 9 pooled feature maps in total.

(d) Each pooled feature map goes through FC layers 6, 7 and 8 to obtain an output probability vector.

(e) All the vectors are then reshaped into a 3D output map.

Fine Stride Max Pooling

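A single prediction is then obtained by aggregating the 3D output map over its spatial positions. (In the paper, the spatial maximum is taken for each class, and the resulting vectors are averaged across scales and flips.)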
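The offset pooling in steps (b) and (c) can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function name `offset_max_pool`, the single-channel 20×20 toy feature map and the cropping of leftover border pixels are all assumptions made for the example.

```python
import numpy as np

def offset_max_pool(feature_map, size=3, offsets=(0, 1, 2)):
    """Non-overlapping size x size max pooling, repeated for every (dy, dx) offset.

    feature_map: 2D array of shape (H, W), a single channel for simplicity.
    Returns a dict mapping (dy, dx) to the pooled map for that offset.
    """
    pooled = {}
    for dy in offsets:
        for dx in offsets:
            # Start the pooling grid dy rows down and dx columns to the right.
            shifted = feature_map[dy:, dx:]
            # Assumption: crop leftover rows/columns that do not fill a window.
            h = (shifted.shape[0] // size) * size
            w = (shifted.shape[1] // size) * size
            blocks = shifted[:h, :w].reshape(h // size, size, w // size, size)
            # Take the max inside each size x size block.
            pooled[(dy, dx)] = blocks.max(axis=(1, 3))
    return pooled

# Toy 20x20 feature map (hypothetical values, only for illustration).
fmap = np.arange(20 * 20, dtype=float).reshape(20, 20)
pooled_maps = offset_max_pool(fmap)

print(len(pooled_maps))            # 9 pooled maps, one per (dy, dx) offset
print(pooled_maps[(0, 0)].shape)   # (6, 6)
```

Each of the 9 pooled maps would then be passed through the classifier layers, as in step (d).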

3. Multi-Scale Classification

Instead of using 10-view prediction as in AlexNet, OverFeat feeds the entire image into the network for prediction at 6 scales, as below:

6 Different Scales

6 different sizes of input images are used, resulting in unpooled layer 5 feature maps of differing spatial resolution, which increases the accuracy.

At test time, the FC layers become 1×1 conv layers. The whole image goes into the network, which outputs a class map, just like VGGNet [7]. (Please visit my VGGNet review [8] if interested.)

Test Time
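To see why an FC layer can be treated as a 1×1 conv layer at test time, here is a small NumPy sketch. It assumes a single FC layer with toy sizes (8 input channels, 5 classes) and random weights; these numbers are hypothetical and are not the sizes used in OverFeat.

```python
import numpy as np

rng = np.random.default_rng(0)
in_ch, n_classes = 8, 5                       # toy sizes, not OverFeat's
weights = rng.standard_normal((n_classes, in_ch))
bias = rng.standard_normal(n_classes)

def fc(x):
    """Fully connected layer applied to one feature vector of length in_ch."""
    return weights @ x + bias

def conv1x1(feature_map):
    """The same weights applied as a 1x1 convolution.

    feature_map: (in_ch, H, W) -> class map of shape (n_classes, H, W).
    """
    return np.einsum("oc,chw->ohw", weights, feature_map) + bias[:, None, None]

# A larger input image gives a spatial feature map instead of a single vector.
features = rng.standard_normal((in_ch, 2, 3))
class_map = conv1x1(features)

# Every position of the class map equals the FC output at that position.
print(class_map.shape)                                            # (5, 2, 3)
print(np.allclose(class_map[:, 1, 2], fc(features[:, 1, 2])))     # True
```

The class map is computed in one pass over the whole feature map, instead of running the FC layer separately at each position.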

4. Classification Results

Ablation Study of OverFeat

The ablation study is shown above.

AlexNet obtains an 18.2% Top-5 error rate. The fast model does slightly better, with a 17.12% error rate, thanks to the modifications mentioned above. (Coarse stride means using conventional max pooling.)

With 7 fast models + 4 scales + fine stride max pooling, the error rate is greatly reduced to 13.86%. Using 7 models is an ensemble approach, which is also commonly used in VGGNet, ZFNet, AlexNet and LeNet [3,5,7,9]. (If interested, please visit my reviews about these models. [4,6,8,10])

With 7 accurate models + 4 scales + fine stride max pooling, the error rate is further reduced to 13.24%.
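As a rough illustration of how several models can be combined, the sketch below averages the class probability vectors of the models and takes the top-k classes. Simple averaging is an assumption here; the exact combination rule used by the authors may differ. The function name `ensemble_top_k` and the probability values are made up for the example.

```python
import numpy as np

def ensemble_top_k(prob_vectors, k=5):
    """Average per-model class probabilities and return the top-k class indices.

    prob_vectors: array of shape (n_models, n_classes).
    """
    mean_probs = np.mean(prob_vectors, axis=0)
    # Sort classes from the highest to the lowest averaged probability.
    return np.argsort(mean_probs)[::-1][:k]

# Hypothetical outputs of 3 models over 6 classes (each row sums to 1).
model_probs = np.array([
    [0.50, 0.20, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.60, 0.10, 0.10, 0.05, 0.05],
    [0.30, 0.40, 0.10, 0.05, 0.10, 0.05],
])

top2 = ensemble_top_k(model_probs, k=2)
print(top2.tolist())   # [1, 0]
```

A prediction counts as correct under the Top-5 metric when the true class is among the 5 returned indices.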

Comparison with State-of-the-art Approaches

At ILSVRC 2013, ZFNet obtains the best result with an 11.2% error rate.
OverFeat obtains a 13.6% error rate, the same as Andrew Howard's entry at rank 4, but this is a post-competition result. OverFeat also performs much better than AlexNet, the winner of ILSVRC 2012.

5. Regression Network for Localization/Detection

There is a regression network connected at the 5th layer of the CNN. 2 FC layers (of sizes 4096 and 1024) are used to regress the coordinates of the bounding box edges. An example is as follows:

Application of the regression network to layer 5 features, at scale 2, for example.

(a) The input to the regressor at this scale is 6×7 pixels spatially by 256 channels, for each of the 3×3 (Δx, Δy) shifts.
(b) Each unit in the 1st layer of the regression net is connected to a 5×5 spatial neighborhood in the layer 5 maps, as well as to all 256 channels.
(c) The 2nd regression layer has 1024 units and is fully connected.
(d) The output of the regression network is a 4-vector.
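The sizes in (a) to (d) can be checked with a little arithmetic. The sketch below assumes the 5×5 neighborhood slides with stride 1 and no padding, and that one 4-vector is produced for every position and every shift; the variable names are only for illustration.

```python
def valid_positions(height, width, window):
    """Number of positions a window x window neighborhood can take (stride 1, no padding)."""
    return height - window + 1, width - window + 1

layer5_h, layer5_w, channels = 6, 7, 256   # input to the regressor, item (a)
window = 5                                 # 5x5 spatial neighborhood, item (b)
shifts = 3 * 3                             # (dx, dy) offsets from fine stride pooling

out_h, out_w = valid_positions(layer5_h, layer5_w, window)
weights_per_unit = window * window * channels    # inputs seen by one 1st-layer unit
boxes_at_this_scale = out_h * out_w * shifts     # one 4-vector per position and shift

print(out_h, out_w)            # 2 3
print(weights_per_unit)        # 6400
print(boxes_at_this_scale)     # 54
```

So at this scale the regressor outputs a 2×3 map of 4-vectors for each of the 9 shifts, under the assumptions stated above.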

Ablation Study for Localization in ILSVRC 2012
Comparison with State-of-the-art Approaches for Localization Task

The authors tried many combinations. Per-class regression (PCR) does not perform well, with a 44.1% error rate. Single-class regression (SCR) with 4 scales achieves a 30.0% error rate on the validation set and a 29.9% error rate on the test set, and won the localization task in ILSVRC 2013.

Comparison with State-of-the-art Approaches for Detection Task

For detection, the main differences are the evaluation metric and the need to predict a background class when there is no object. OverFeat obtained 24.3% mAP (mean average precision) for the detection task, which outperformed other approaches in post-competition work.

Some Bounding boxes Prediction Examples

Some bounding box prediction examples are shown here. As we can see, there are many overlapping bounding boxes, which wastes computation.

Nevertheless, this paper inspired many other new deep learning approaches in the fields of image classification, object detection, etc.
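One common way to quantify how much two bounding boxes overlap is intersection over union (IoU). The sketch below is a generic illustration and is not OverFeat's box-merging procedure; the `(x_min, y_min, x_max, y_max)` box layout is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height are clamped to 0 when the boxes do not intersect.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))   # 1/7, about 0.143
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))   # 0.0, no overlap
```

Boxes with a high IoU largely describe the same region, which is why many near-duplicate predictions are redundant.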

Later on, I will present other state-of-the-art object detection approaches. Please stay tuned.