# Review: OverFeat — Winner of ILSVRC 2013 Localization Task (Object Detection)

There are 3 tasks in ILSVRC 2013: classification, localization and detection.

**OverFeat [1] handles all 3 tasks with a single CNN. It won the localization task in ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2013 [2], ranked 4th in the classification task at that time, and ranked 1st in the detection task at that time in the post-competition work.**

By the way, this work comes from Prof. LeCun's group; Prof. LeCun is the inventor of LeNet [9], the classic deep learning work. It is a **2014 ICLR** paper with **more than 2000 citations** when I was writing this story. (Sik-Ho Tsang @ Medium)

- **Classification**: Classify the object within the image.
- **Localization**: Classify the objects and localize them with bounding boxes within the image.
- **Detection**: Similar to localization in that we also need to classify the objects and localize them with bounding boxes, but the images can contain small objects, and the evaluation metric is different from localization. It also needs to predict the background class when there are no objects.

### What will be covered:

- The CNN models modified from AlexNet (Fast and Accurate models)
- Fine Stride Max Pooling
- Multi-Scale Classification
- Classification Results
- Regression Network for Localization/Detection

### 1. The CNN models modified from AlexNet (Fast and Accurate models)

The above is the single-GPU version of AlexNet [3]. (Please visit my review [4] if interested.) The authors modified it into **fast** and **accurate** models as below:

In simple words, these 2 models have some modifications on AlexNet but there are no big changes on the overall architecture.

For example, there is no local response normalization (LRN). The pooling layers are non-overlapping. The feature maps at the 1st and 2nd layers are larger due to the use of a smaller stride of 2. ZFNet [5] showed, by visualizing the CNN layers, that a smaller stride can help to increase accuracy. (If interested, please visit my review [6].)
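To see why a smaller stride gives larger feature maps, here is a minimal sketch of the usual output-size arithmetic for a layer without padding. The input size of 224 and the kernel size of 11 are hypothetical values that I picked only to show the trend; they are not the exact OverFeat settings.

```python
def conv_output_size(input_size: int, kernel_size: int, stride: int) -> int:
    """Spatial output size of a conv or pooling layer with no padding."""
    return (input_size - kernel_size) // stride + 1


# Hypothetical sizes, chosen only to illustrate the effect of the stride.
input_size, kernel_size = 224, 11
sizes = {stride: conv_output_size(input_size, kernel_size, stride)
         for stride in (4, 2)}

for stride, out in sizes.items():
    print(f"stride {stride}: {out}x{out} feature map")
```

Halving the stride roughly doubles the feature map size along each side, so the later layers see a denser grid of features.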

### 2. Fine Stride Max Pooling

There is a fine stride max pooling added to the 5th layer of the modified network.

(b) At the 5th layer, **3×3 max pooling is performed multiple times**, each time with a different pixel offset **(Δx, Δy), where Δx and Δy are taken from {0, 1, 2}.**

(c) **3×3 = 9 max pooling passes are performed**, giving 9 pooled feature maps in total.

(d) Each pooled feature map **goes through FC layers 6, 7, 8** to obtain the **output probability vector**.

(e) All vectors are then reshaped into a **3D output map**.

We can get the prediction by averaging the 3D output map.
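Steps (b) and (c) above can be sketched in NumPy as follows. This is only a minimal illustration on a single-channel map, not the authors' implementation. The function names and the 20×20 input size are my own assumptions, and the FC layers of step (d) are left out.

```python
import numpy as np


def max_pool_with_offset(fmap, dy, dx, size=3):
    """Non-overlapping size x size max pooling that starts at offset (dy, dx)."""
    shifted = fmap[dy:, dx:]
    # Drop the trailing rows/columns that do not fill a whole pooling window.
    h = (shifted.shape[0] // size) * size
    w = (shifted.shape[1] // size) * size
    blocks = shifted[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


def fine_stride_max_pool(fmap, size=3):
    """Run the pooling once per (dy, dx) offset: size * size pooled maps."""
    return {(dy, dx): max_pool_with_offset(fmap, dy, dx, size)
            for dy in range(size) for dx in range(size)}


# Hypothetical 20x20 unpooled layer-5 map with a single channel.
fmap = np.arange(20 * 20, dtype=float).reshape(20, 20)
pooled = fine_stride_max_pool(fmap)
print(len(pooled), pooled[(0, 0)].shape)
```

Each of the 9 pooled maps sees the same features through a slightly shifted pooling grid, which is what recovers the resolution lost by conventional (coarse stride) pooling.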

### 3. Multi-Scale Classification

Instead of using 10-view prediction as in AlexNet, OverFeat inputs entire images for prediction at 6 scales, as below:

6 different input image sizes are used, resulting in layer 5 unpooled feature maps of differing spatial resolution, which increases the accuracy.

At test time, the FC layers become 1×1 conv layers. The whole image goes into the network, which outputs a class map, just like VGGNet [7]. (Please visit my VGGNet review [8] if interested.)
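To make the "FC layer as 1×1 conv" idea concrete, here is a small NumPy sketch. It assumes a single FC layer with hypothetical sizes (8 input channels, 5 classes) and random weights, and only checks that sliding the FC layer over every spatial position gives the same result as a 1×1 convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
in_ch, n_classes = 8, 5                      # hypothetical small sizes
weight = rng.standard_normal((n_classes, in_ch))
bias = rng.standard_normal(n_classes)


def fc(x):
    """Ordinary fully connected layer on one feature vector of shape (in_ch,)."""
    return weight @ x + bias


def conv1x1(fmap):
    """The same weights applied at every position of an (in_ch, H, W) map."""
    return np.einsum("oc,chw->ohw", weight, fmap) + bias[:, None, None]


fmap = rng.standard_normal((in_ch, 2, 3))    # a larger input gives a 2x3 map
class_map = conv1x1(fmap)
print(class_map.shape)                       # one class score vector per position
```

With a bigger input image the feature map grows, and the output becomes a spatial class map instead of a single vector, without any change to the weights.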

### 4. Classification Results

Ablation study is done as above.

**AlexNet** obtains an **18.2%** top-5 error rate. The **fast model** does a little better, with a **17.12%** error rate, thanks to the modifications mentioned above. (Coarse stride means using conventional max pooling.)

With **7 fast models + 4 scales + fine stride max pooling**, the error rate is greatly reduced to **13.86%**. The 7 models are actually a boosting or ensemble approach, which is already commonly used in VGGNet, ZFNet, AlexNet and LeNet [3,5,7,9]. (If interested, please visit my reviews about these models. [4,6,8,10])

With **7 accurate models + 4 scales + fine stride max pooling**, the error rate is greatly reduced to **13.24%**.
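The multi-model results above combine the outputs of several networks. A minimal sketch of such an ensemble, assuming the models are combined by simply averaging their class probabilities, could look like this. The function name and the toy probabilities (3 models, 6 classes) are hypothetical.

```python
import numpy as np


def ensemble_top5(prob_list):
    """Average class probabilities over models and return the top-5 class indices."""
    mean_probs = np.mean(np.stack(prob_list), axis=0)
    return np.argsort(mean_probs)[::-1][:5]


# Toy outputs of 3 hypothetical models over 6 classes.
probs = [
    np.array([0.10, 0.50, 0.10, 0.10, 0.10, 0.10]),
    np.array([0.30, 0.30, 0.10, 0.10, 0.10, 0.10]),
    np.array([0.05, 0.60, 0.20, 0.05, 0.05, 0.05]),
]
top5 = ensemble_top5(probs)
print(top5)
```

A prediction counts as correct for the top-5 error rate if the true class is among the 5 returned indices.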

At ILSVRC 2013, **ZFNet** obtained the best result with an **11.2% error rate**. **OverFeat** obtains a **13.6% error rate**, which is the same as Andrew Howard's at **Rank 4**, but it is a post-competition result. OverFeat also has much better results than AlexNet, the winner of ILSVRC 2012.

### 5. Regression Network for Localization/Detection

There is a **regression network** connected at the 5th layer of the CNN. 2 FC layers (of sizes 4096 and 1024) are used to regress the coordinates of the bounding box edges. An example is as follows:

(a) The input to the regressor at this scale is 6×7 pixels spatially by 256 channels for each shift.

(b) The 1st layer of the regression net is connected to a 5×5 spatial neighborhood in the layer 5 maps, as well as all 256 channels.

(c) The 2nd regression layer has 1024 units and is fully connected.

(d) The output of the regression network is a 4-vector.
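The shapes in steps (a) to (d) can be traced with a small NumPy sketch. This is only an illustration with random weights: the hidden sizes are shrunk from 4096 and 1024 to 32 and 16 to keep it light, biases are omitted, and the ReLU activations are my assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
channels, height, width = 256, 6, 7   # layer-5 input for one shift, as in (a)
window = 5                            # 5x5 spatial neighborhood, as in (b)
hidden1, hidden2 = 32, 16             # reduced stand-ins for 4096 and 1024

feat = rng.standard_normal((channels, height, width))
w1 = rng.standard_normal((hidden1, channels * window * window)) * 0.01
w2 = rng.standard_normal((hidden2, hidden1)) * 0.01
w3 = rng.standard_normal((4, hidden2)) * 0.01   # 4 bounding box edges, as in (d)

# Sliding the 5x5 window over the 6x7 map gives a 2x3 grid of positions.
out_h, out_w = height - window + 1, width - window + 1
boxes = np.zeros((out_h, out_w, 4))
for i in range(out_h):
    for j in range(out_w):
        patch = feat[:, i:i + window, j:j + window].reshape(-1)
        h1 = np.maximum(w1 @ patch, 0.0)   # 1st regression layer
        h2 = np.maximum(w2 @ h1, 0.0)      # 2nd regression layer
        boxes[i, j] = w3 @ h2              # 4-vector for this position

print(boxes.shape)
```

So for one shift at this scale, the regressor outputs one 4-vector per position of a 2×3 map.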

The authors tried many combinations. Per-class regression (PCR) does not perform well, with a 44.1% error rate. **Single-class regression (SCR) with 4 scales achieves** a 30.0% error rate on the validation set and a **29.9% error rate on the test set**, and **won the localization task in ILSVRC 2013**.

For detection, the main differences are the evaluation metric and the need to predict a background class when there is no object. OverFeat also got **24.3% mAP (mean average precision) on the detection task**, which outperformed other approaches at the time of the post-competition work.

Some bounding box prediction examples are shown here. As we can see, there are many overlapping bounding boxes, which wastes computation.

Nevertheless, this paper inspires many other new deep learning approaches in the fields of image classification and object detection, etc.

Later on, I will present other state-of-the-art object detection approaches. Please stay tuned.

### References

1. [2014 ICLR] [OverFeat] OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
2. ILSVRC 2013 Results: http://www.image-net.org/challenges/LSVRC/2013/results.php
3. [2012 NIPS] [AlexNet] ImageNet Classification with Deep Convolutional Neural Networks
4. Review of AlexNet, CaffeNet: Winner of ILSVRC 2012 (Image Classification)
5. [2014 ECCV] [ZFNet] Visualizing and Understanding Convolutional Networks
6. Review of ZFNet: Winner of ILSVRC 2013 (Image Classification)
7. [2015 ICLR] [VGGNet] Very Deep Convolutional Networks for Large-Scale Image Recognition
8. Review of VGGNet: 1st Runner-Up of ILSVRC 2014 (Image Classification)
9. [1998 Proc. IEEE] [LeNet-1, LeNet-4, LeNet-5, Boosted LeNet-4] Gradient-Based Learning Applied to Document Recognition
10. Review of LeNet-1, LeNet-4, LeNet-5, Boosted LeNet-4 (Image Classification)