OverFeat paper — Summary

Manish Chablani
5 min read · Aug 22, 2017


Original paper: https://arxiv.org/pdf/1312.6229v4.pdf

This is an older paper (2014) that has since been superseded by SSD and YOLO for real-time detection, but it still offers an interesting perspective. In the ILSVRC 2013 competition, OverFeat ranked 4th in classification, 1st in localization, and 1st in detection.

The main point of the paper is to show that training a convolutional network to simultaneously classify, localize, and detect objects in images can boost the accuracy of all three tasks. The paper proposes an integrated approach to object detection, recognition, and localization with a single ConvNet, and introduces a novel method for localization and detection that accumulates predicted bounding boxes.

The paper addresses three computer vision tasks in increasing order of difficulty: (i) classification, (ii) localization, and (iii) detection.

In the classification task of this challenge, each image is assigned a single label corresponding to the main object in the image. Five guesses are allowed to find the correct answer (because images can also contain multiple unlabeled objects). The localization task is similar in that five guesses are allowed per image, but in addition a bounding box for the predicted object must be returned with each guess. To be considered correct, the predicted box must match the ground truth by at least 50% (using the PASCAL criterion of intersection over union), as well as be labeled with the correct class (i.e. each prediction is a label and a bounding box that are associated together). The detection task differs from localization in that there can be any number of objects in each image (including zero), and false positives are penalized by the mean average precision (mAP) measure.
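
To make the 50% matching criterion concrete, here is a minimal sketch (my own, not from the paper) of an intersection-over-union check between a predicted box and a ground-truth box, both given as (x1, y1, x2, y2) corners:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def localization_hit(pred_box, pred_label, gt_box, gt_label, thresh=0.5):
    """A localization guess counts only if the label matches and IoU >= 50%."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= thresh
```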

ConvNets and Sliding Window Efficiency: In contrast to many sliding-window approaches that compute an entire pipeline for each window of the input one at a time, ConvNets are inherently efficient when applied in a sliding fashion because they naturally share computations common to overlapping regions. When applying our network to larger images at test time, we simply apply each convolution over the extent of the full image. This extends the output of each layer to cover the new image size, eventually producing a map of output class predictions, with one spatial location for each “window” (field of view) of input.
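
This is easiest to see with a toy fully convolutional network. In the sketch below (PyTorch, with made-up layer sizes rather than the OverFeat architecture), the "fully connected" layers are expressed as convolutions, so applying the same weights to a larger image yields a spatial grid of class-score vectors, one per window, in a single forward pass:

```python
import torch
import torch.nn as nn

# Toy classifier whose "fully connected" layers are convolutions, so it can
# slide over images larger than its 32x32 training resolution for free.
net = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2),  # feature extraction
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 64, kernel_size=8),   # acts like an FC layer on 8x8 feature maps
    nn.ReLU(),
    nn.Conv2d(64, 10, kernel_size=1),   # 10-class "classifier" as a 1x1 conv
)

small = torch.randn(1, 3, 32, 32)   # training-sized input -> single prediction
large = torch.randn(1, 3, 64, 96)   # larger test image    -> grid of windows

print(net(small).shape)  # torch.Size([1, 10, 1, 1])
print(net(large).shape)  # torch.Size([1, 10, 9, 17]): one score vector per window
```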

Localization

Starting from our classification-trained network, we replace the classifier layers with a regression network and train it to predict object bounding boxes at each spatial location and scale. We then combine the regression predictions together with the classification results at each location.
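
Schematically, this amounts to keeping the shared feature extractor and attaching a small regression head that outputs four box coordinates at every spatial location. The sketch below is a hypothetical PyTorch stand-in with toy layer sizes, not the paper's layers 1–5; the paper's final regressor layer is additionally class-specific (one set of box outputs per class), which is omitted here for brevity:

```python
import torch.nn as nn

class LocalizationNet(nn.Module):
    """Shared features + classification head + box-regression head (toy sizes)."""
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(          # stands in for the shared layers 1-5
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Conv2d(128, num_classes, 1)  # per-location class scores
        self.regressor = nn.Sequential(                   # per-location box (x1, y1, x2, y2)
            nn.Conv2d(128, 256, 1), nn.ReLU(),
            nn.Conv2d(256, 4, 1),
        )

    def forward(self, x):
        f = self.features(x)
        return self.classifier(f), self.regressor(f)
```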

Generating Predictions

To generate object bounding box predictions, we simultaneously run the classifier and regressor networks across all locations and scales. Since these share the same feature extraction layers, only the final regression layers need to be recomputed after computing the classification network. The output of the final softmax layer for a class c at each location provides a score of confidence that an object of class c is present (though not necessarily fully contained) in the corresponding field of view. Thus we can assign a confidence to each bounding box.
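
A rough sketch of that inference step, assuming per-scale output maps `class_probs` of shape (C, H, W) from the softmax and `boxes` of shape (4, H, W) from the regressor (hypothetical names and shapes): every spatial location contributes a candidate box whose confidence is the softmax probability of class c at that location.

```python
def candidates_for_class(class_probs, boxes, c, min_conf=0.1):
    """Collect (box, confidence) candidates for class c from one scale's output maps.

    class_probs: (C, H, W) softmax outputs; boxes: (4, H, W) regressed coordinates.
    """
    cands = []
    _, h, w = class_probs.shape
    for y in range(h):
        for x in range(w):
            conf = float(class_probs[c, y, x])
            if conf >= min_conf:
                cands.append((boxes[:, y, x].tolist(), conf))
    return cands
```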

We fix the feature extraction layers (1–5) from the classification network and train the regression network using an ℓ2 loss between the predicted and true bounding box for each example. The final regressor layer is class-specific, having 1000 different versions, one for each class. We train this network using the same set of scales as described in Section 3 of the paper. We compare the prediction of the regressor net at each spatial location with the ground-truth bounding box, shifted into the frame of reference of the regressor's translation offset within the convolution (see Fig. 8 of the paper). However, we do not train the regressor on bounding boxes with less than 50% overlap with the input field of view: since the object is mostly outside these locations, it will be better handled by regression windows that do contain the object.

Training the regressors in a multi-scale manner is important for the across-scale prediction combination. Training on a single scale will perform well on that scale and still perform reasonably on other scales. However, training multi-scale will make predictions match correctly across scales and exponentially increase the confidence of the merged predictions. In turn, this allows us to perform well with only a few scales, rather than the many scales typically used in detection. The typical ratio from one scale to another in pedestrian detection [25] is about 1.05 to 1.1; here, however, we use a large ratio of approximately 1.4 (this number differs for each scale since dimensions are adjusted to fit exactly the stride of our network), which allows us to run the system faster.
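
A hedged sketch of that training step, reusing the toy LocalizationNet and the iou helper from the earlier snippets (the real network, scale handling, and offset bookkeeping are more involved): the feature layers are kept fixed, an ℓ2 loss is applied per spatial location, and locations whose field of view overlaps the ground truth by less than 50% are skipped.

```python
import torch
import torch.nn.functional as F

def regressor_loss(net, image, gt_box, fields_of_view):
    """L2 loss over spatial locations whose field of view overlaps gt_box >= 50%.

    fields_of_view[y][x] is the input window (x1, y1, x2, y2) seen by location (y, x).
    """
    with torch.no_grad():                 # feature extraction layers are kept fixed
        feats = net.features(image)
    pred = net.regressor(feats)           # (1, 4, H, W) predicted boxes
    target = torch.tensor(gt_box, dtype=pred.dtype)
    loss, count = pred.new_zeros(()), 0
    for y in range(pred.shape[2]):
        for x in range(pred.shape[3]):
            if iou(fields_of_view[y][x], gt_box) < 0.5:
                continue                  # object mostly outside this window: skip it
            loss = loss + F.mse_loss(pred[0, :, y, x], target)
            count += 1
    return loss / max(count, 1)
```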

The final prediction is given by taking the merged bounding boxes with maximum class scores. This is computed by cumulatively adding the detection class outputs associated with the input windows from which each bounding box was predicted.
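
A simplified sketch of that greedy merging, with hypothetical match_score and box_merge choices in the spirit of the paper (centre distance minus overlap for matching, coordinate averaging for merging), and with confidences summed as boxes are fused so the final prediction is the merged box with the highest accumulated score:

```python
def match_score(a, b):
    """Lower is better: distance between box centres minus their intersection area."""
    (ax1, ay1, ax2, ay2), (bx1, by1, bx2, by2) = a, b
    ca = ((ax1 + ax2) / 2, (ay1 + ay2) / 2)
    cb = ((bx1 + bx2) / 2, (by1 + by2) / 2)
    dist = abs(ca[0] - cb[0]) + abs(ca[1] - cb[1])
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    return dist - inter

def merge_boxes(candidates, stop_threshold=0.0):
    """Greedily fuse the closest pair of (box, confidence) candidates until none match."""
    boxes = [list(b) + [conf] for b, conf in candidates]  # [x1, y1, x2, y2, conf]
    while len(boxes) > 1:
        best = None
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                s = match_score(boxes[i][:4], boxes[j][:4])
                if best is None or s < best[0]:
                    best = (s, i, j)
        if best[0] > stop_threshold:
            break                                  # remaining boxes belong to different objects
        _, i, j = best
        a, b = boxes[i], boxes[j]
        fused = [(a[k] + b[k]) / 2 for k in range(4)] + [a[4] + b[4]]  # average coords, sum scores
        boxes = [bx for k, bx in enumerate(boxes) if k not in (i, j)] + [fused]
    return max(boxes, key=lambda bx: bx[4]) if boxes else None
```

The candidates fed into merge_boxes could come, for example, from candidates_for_class above, pooled over all scales.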

Detection

Detection training is similar to classification training, but in a spatial manner: multiple locations of an image may be trained simultaneously. Since the model is convolutional, all weights are shared among all locations. The main difference from the localization task is the necessity to predict a background class when no object is present.

Traditionally, negative examples are initially taken at random for training, and the most offending negative errors are then added to the training set in bootstrapping passes. Independent bootstrapping passes make training complicated and risk potential mismatches between the negative-example collection and training times. Additionally, the size of bootstrapping passes needs to be tuned to make sure training does not overfit on a small set. To circumvent all these problems, we perform negative training on the fly, by selecting a few interesting negative examples per image, such as random ones or the most offending ones. This approach is more computationally expensive, but makes the procedure much simpler. And since the feature extraction is initially trained with the classification task, the detection fine-tuning does not take long anyway.
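
A hedged sketch of that on-the-fly selection (hypothetical helper; `losses` would be the current background-class errors of candidate windows that contain no ground-truth object): a few of the most offending negatives are mixed with a few random ones for each image.

```python
import random

def select_negatives(candidate_windows, losses, n_random=4, n_hard=4):
    """Pick a small set of negative windows per image: some random, some hardest.

    candidate_windows: windows with no ground-truth object; losses: their current
    background-class errors (higher = more offending).
    """
    hard = [w for w, _ in sorted(zip(candidate_windows, losses),
                                 key=lambda p: p[1], reverse=True)[:n_hard]]
    remaining = [w for w in candidate_windows if w not in hard]
    rand = random.sample(remaining, min(n_random, len(remaining)))
    return hard + rand
```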


Manish Chablani

Head of AI @EightSleep, Marathoner. (Past: AI in healthcare @curaiHQ, DL for self-driving cars @cruise, ML @Uber, early engineer @MicrosoftAzure cloud.)