Weakly Supervised Object Detection — An End to End Training Pipeline

Understanding the paper “Towards Precise End-to-end Weakly Supervised Object Detection Network.”

Yesha R Shastri
VisionWizard
7 min read · Jul 8, 2020

--

Object detection is a well-known computer vision problem on which a tremendous amount of research has been done, and fully supervised methods currently define the state of the art. However, because gathering a large amount of data with accurate object-level annotations is inconvenient, weakly supervised object detection, which relies only on image-level labels, has recently been attracting a lot of attention.

Table of Contents

  1. Introduction
  2. Motivation
  3. Understanding the Basic Building Blocks
  4. The Method
  5. Experiments and Results
  6. Conclusion
  7. References

1. Introduction

  • In weakly supervised object detection [1], only image-level annotations are available; they indicate whether an object class is present in the image or not.
  • This differs from fully supervised object detection, which relies on instance-level (bounding-box) annotations.
  • Usually, this is a two-phase learning procedure: 1. a multiple instance learning (MIL) detector and 2. a fully supervised detector with bounding-box regression (explained in detail in Section 3).
  • In [1], a single end-to-end network combining the MIL detector and bounding-box regression is designed to avoid the local minima problem of the two-phase approach (explained in Section 2).
Figure 1: Comparison of the learning strategies of existing weakly supervised object detection methods (above the solid blue line) and the proposed method (below the solid blue line). [Source: [1]]

2. Motivation

  • The earlier two-phase approach first trains a multiple instance learning (MIL) detector, using a CNN as the feature extractor.
  • The second phase trains a fully supervised detector (Fast R-CNN) to further refine (regress) the object locations, using the region proposals output by the first phase as pseudo ground truth (pseudo GT) for supervision.
  • This two-phase approach can lead to the following local minima problem.

2.1 Local Minima Problem

  • The MIL detector in the first phase sometimes starts from inaccurate bounding boxes: it may focus on a very discriminative part of an object, e.g., the head of a cat.
  • This produces wrong region proposals, which are then used as pseudo ground truth for the next phase (pseudo GT, since no instance-level annotations are available).
  • Ultimately, the accurate location of the object cannot be learned in the second phase, since the supervision is already overfitted to the wrong region.
Figure 2: (1) Detection results of the MIL detector; (2) Fast R-CNN trained with pseudo GT from the MIL detector. [Source: [1]]

Therefore, in [1] the MIL detector and the bounding-box regressor are trained jointly, so that the regressor can start adjusting the predicted boxes before the MIL detector locks onto small discriminative parts and produces inaccurate results.

3. Understanding the Basic Building Blocks

3.1 Multiple Instance Learning (MIL)

  • MIL is a variant of supervised learning that assigns a single label to a set (bag) of instances instead of labeling each instance individually.
  • A bag is labeled negative if all the instances in it are negative.
  • If at least one positive instance is present, the bag is labeled positive.
Figure 3: An example of MIL bags [Source: Link]
  • MIL is a weakly supervised learning process that selects object predictions from region proposals generated by an external method, which in [1] is the Selective Search Windows (SSW) method (Section 3.3). A minimal sketch of the bag-labeling rule follows this list.
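The following is a minimal PyTorch sketch of that bag-labeling idea; the max aggregation in `bag_score` is one common choice for turning instance scores into a bag score, not the specific rule used in [1].

```python
import torch

def bag_label(instance_labels: torch.Tensor) -> int:
    """MIL labeling rule: a bag is positive if at least one instance is
    positive, and negative only if every instance is negative.

    instance_labels: (num_instances,) tensor of 0/1 instance labels.
    """
    return int(instance_labels.any())

def bag_score(instance_scores: torch.Tensor) -> torch.Tensor:
    """Soft version used when training a MIL detector: aggregate per-proposal
    (instance) scores into an image-level (bag) score, here by taking the max.

    instance_scores: (num_instances,) tensor of scores in [0, 1].
    """
    return instance_scores.max()
```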

3.2 Fully Supervised Learning Detector (Fast R-CNN)

  • The architecture of Fast R-CNN comprises a CNN, pre-trained on ImageNet, that is used for feature extraction.
  • The final pooling layer is replaced by an RoI pooling layer, which extracts a fixed-size feature from the feature map for each region proposal.
  • The final fully connected layer is replaced by two sibling branches: 1. a classification branch and 2. a bounding-box regression branch.
  • The classification branch predicts the class to which the object belongs, and the regression branch refines the coordinates of the bounding box (a minimal sketch of these two heads follows the figure below).
Figure 4: Fast R-CNN [Source: Link]
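Below is a minimal PyTorch sketch of these two sibling heads on top of RoI pooling. The layer sizes and the 1/16 spatial scale are assumptions (typical for a VGG16-style backbone), not values taken from [1] or the original Fast R-CNN configuration.

```python
import torch.nn as nn
from torchvision.ops import roi_pool

class FastRCNNHead(nn.Module):
    """Sketch of the Fast R-CNN detection head: RoI pooling followed by two
    sibling branches (classification and box regression)."""

    def __init__(self, in_channels: int = 512, num_classes: int = 20):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_channels * 7 * 7, 4096), nn.ReLU(inplace=True),
        )
        self.cls_branch = nn.Linear(4096, num_classes + 1)   # +1 for background
        self.reg_branch = nn.Linear(4096, 4 * num_classes)   # per-class box offsets

    def forward(self, feature_map, rois):
        # rois: (K, 5) tensor [batch_idx, x1, y1, x2, y2] in image coordinates
        pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
        x = self.fc(pooled)
        return self.cls_branch(x), self.reg_branch(x)
```

At inference, the class scores are softmaxed and the per-class offsets are applied to the proposal coordinates to obtain the final boxes.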

3.3 Selective Search Windows

  • Selective search is a region proposal algorithm used in object detection.
  • It is a method that hierarchically groups similar regions based on color, texture, size, and shape compatibility.
  • It begins by over-segmenting the image, adds the bounding boxes of the segmented parts to the list of region proposals, then merges adjacent segments based on similarity and repeats the procedure (a usage sketch follows the figure below).
Figure 5: Iterative method used by SSW [Source: Link]
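For reference, here is a minimal sketch of generating proposals with OpenCV's selective search implementation, assuming the opencv-contrib-python package (which provides the ximgproc module); the image path is hypothetical.

```python
import cv2

# Load an input image (hypothetical path).
img = cv2.imread("input.jpg")

# Selective search from opencv-contrib (cv2.ximgproc.segmentation).
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()   # or switchToSelectiveSearchQuality()

rects = ss.process()               # array of [x, y, w, h] region proposals
print(f"{len(rects)} region proposals generated")
```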

4. The Method

  • The proposed weakly supervised object detection network [1] has three major components: a guided attention module (GAM), a MIL branch, and a regression branch.
  • First, an enhanced feature map is extracted from the input image by the CNN backbone equipped with the GAM.
  • An RoI pooling layer then generates the region features, which are sent to both the MIL branch and the regression branch.
  • The MIL branch proposes object locations and categories, which are taken as pseudo GT by the regression branch for location regression and classification (a schematic of this forward pass follows the architecture figure).
Figure 6: Architecture of the proposed network from [1]. (1) Generate discriminative features using the attention mechanism. (2) Generate the RoI features from the enhanced feature map. (3) MIL branch: feed the extracted RoI features into a MIL network to initialize the pseudo GT box annotations. (4) Regression branch: feed the extracted RoI features and the generated pseudo GT to the regression branch for RoI classification and regression. [Source: [1]]
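As a rough schematic (my paraphrase of the figure, not the authors' code), the joint forward pass can be written as below; `backbone`, `gam`, `pool_rois`, `mil_branch`, and `reg_branch` are hypothetical callables standing in for the modules sketched in the following subsections.

```python
# Schematic of the joint forward pass for one image; `proposals` come from
# selective search and `image_labels` are the image-level class labels.
def joint_forward(image, proposals, image_labels,
                  backbone, gam, pool_rois, mil_branch, reg_branch):
    features = backbone(image)                        # base feature map
    enhanced = gam(features)                          # (1) attention-enhanced map
    roi_feats = pool_rois(enhanced, proposals)        # (2) per-proposal features
    pseudo_gt = mil_branch(roi_feats, image_labels)   # (3) pseudo GT boxes + classes
    cls_scores, box_deltas = reg_branch(roi_feats)    # (4) classification + regression
    return cls_scores, box_deltas, pseudo_gt
```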

4.1 Guided Attention Module

  • GAM builds on the conventional spatial neural attention structure.
  • The attention module takes the feature map X extracted from the ConvNet as input and outputs a spatially normalized attention weight map.
  • The attention map is multiplied with the original feature map X to get the attended feature, which is then added to X to obtain an enhanced feature map.
  • This gives more weight to relevant features and suppresses irrelevant ones.
  • To guide the learning of the attention weights, a classification loss is added.
  • To obtain the classification score vector, the attention map is fed into another convolutional layer followed by a Global Average Pooling (GAP) layer.

For the detailed mathematical formulation, refer to Section 3.1 of [1]. A rough sketch of such an attention block is given below.
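This is a minimal PyTorch sketch of a spatially normalized attention block in that spirit, with an auxiliary classification head; the 1×1 convolutions and the per-channel spatial softmax are my assumptions, not the exact design of [1].

```python
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttentionModule(nn.Module):
    """Sketch of a spatial attention block with a classification head that
    guides the attention weights via an image-level loss."""

    def __init__(self, channels: int, num_classes: int = 20):
        super().__init__()
        self.attn_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.cls_conv = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, x):
        b, c, h, w = x.shape
        # Attention weights, spatially normalized with a softmax over H*W.
        attn = F.softmax(self.attn_conv(x).view(b, c, -1), dim=-1).view(b, c, h, w)
        # Attended feature added back to X -> enhanced feature map.
        enhanced = x + attn * x
        # Classification head on the attention map: conv + global average pooling,
        # trained with an image-level classification loss to guide the attention.
        cls_logits = F.adaptive_avg_pool2d(self.cls_conv(attn), 1).flatten(1)
        return enhanced, cls_logits
```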

4.2 MIL Branch

  • The MIL branch is introduced to initialize the pseudo GT annotations.
  • The network adopted here is Online Instance Classifier Refinement (OICR), which builds on WSDDN and is chosen for its effectiveness and end-to-end trainability.
  • WSDDN employs two streams, classification and detection; combining them yields instance-level predictions.
  • WSDDN alone has its limitations, so to further improve the quality of the generated bounding boxes, OICR and its upgraded version, Proposal Cluster Learning (PCL), are used (a sketch of the WSDDN two-stream head follows this list).
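Below is a minimal PyTorch sketch of the WSDDN two-stream head that the MIL branch builds on; the feature dimension and class count are illustrative, and the OICR/PCL refinement layers on top of it are omitted.

```python
import torch.nn as nn
import torch.nn.functional as F

class WSDDNHead(nn.Module):
    """Two-stream WSDDN head: a classification stream (softmax over classes)
    and a detection stream (softmax over proposals), combined elementwise."""

    def __init__(self, feat_dim: int = 4096, num_classes: int = 20):
        super().__init__()
        self.cls_stream = nn.Linear(feat_dim, num_classes)
        self.det_stream = nn.Linear(feat_dim, num_classes)

    def forward(self, roi_feats):                             # (R, feat_dim)
        cls = F.softmax(self.cls_stream(roi_feats), dim=1)    # softmax over classes
        det = F.softmax(self.det_stream(roi_feats), dim=0)    # softmax over proposals
        proposal_scores = cls * det                           # (R, C) instance-level scores
        image_scores = proposal_scores.sum(dim=0)             # (C,) image-level scores
        return proposal_scores, image_scores
```

The image-level scores are trained with a multi-label classification loss against the image-level labels, and the highest-scoring proposals per class seed the pseudo GT that OICR/PCL then refine.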

4.3 Multi-Task Branch

  • A multi-task branch performs fully supervised-style classification and regression once the pseudo GT annotations have been generated.
  • The detection branch has two sibling branches: the first predicts a discrete probability distribution over classes, computed by a softmax over the outputs of a fully connected layer.
  • The second sibling branch outputs bounding-box regression offsets for each of the object classes.
  • This works similarly to the Fast R-CNN head, except that the supervision comes from the pseudo GT (a sketch of the corresponding loss follows this list).
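A minimal sketch of such a Fast R-CNN style multi-task loss driven by pseudo GT; the label convention (0 = background, 1..C = foreground) and the per-class offset layout are my assumptions, not necessarily those of [1].

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_logits, box_deltas, pseudo_labels, reg_targets, pos_mask):
    """Classification + box regression loss against pseudo GT.

    cls_logits:    (R, C+1) class scores per RoI (index 0 = background)
    box_deltas:    (R, 4*C) per-class box regression offsets
    pseudo_labels: (R,)     class index assigned from the matched pseudo GT
    reg_targets:   (R, 4)   regression targets w.r.t. the matched pseudo GT box
    pos_mask:      (R,)     boolean mask of RoIs matched to a pseudo GT box
    """
    loss_cls = F.cross_entropy(cls_logits, pseudo_labels)
    if pos_mask.any():
        # Pick the 4 offsets corresponding to each positive RoI's class.
        labels = pseudo_labels[pos_mask]
        deltas = box_deltas[pos_mask].view(-1, box_deltas.shape[1] // 4, 4)
        rows = torch.arange(deltas.shape[0], device=deltas.device)
        deltas = deltas[rows, labels - 1]           # -1: skip the background class
        loss_reg = F.smooth_l1_loss(deltas, reg_targets[pos_mask])
    else:
        loss_reg = box_deltas.sum() * 0.0           # keep the graph connected
    return loss_cls + loss_reg
```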

5. Experiments and Results

5.1 Datasets and Evaluation Metrics

  • The datasets used for evaluation are PASCAL VOC 2007 and 2012, which comprise 9,963 and 22,531 images over 20 classes, respectively. Training uses the trainval sets: 5,011 images for VOC 2007 and 11,540 for VOC 2012.
  • Average Precision (AP) and its mean over classes (mAP) are used to evaluate the model on the test sets. Correct Localization (CorLoc) is also reported to measure localization accuracy.
  • PASCAL criterion: a predicted box is counted as correct if its IoU with a ground-truth box exceeds 0.5 (a minimal IoU check is sketched below).
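For completeness, a minimal Python sketch of the IoU check behind this criterion:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A detection counts as correct under the PASCAL criterion when IoU > 0.5:
print(iou([0, 0, 10, 10], [5, 0, 15, 10]))  # 0.333..., so this one would not count
```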

5.2 Comparison with State-of-the-Art

  • The proposed method achieves 48.6% mAP on the PASCAL VOC 2007 test set, improving over the other compared methods.
Figure 7: Comparison of AP performance (%) on the PASCAL VOC 2007 test set. The upper part shows results of single end-to-end models; the lower part shows results of multi-phase approaches or ensemble models. [Source: [1]]
  • The proposed method achieves 46.8% mAP on the PASCAL VOC 2012 test set, improving over the other compared methods.
Figure 8: Comparison of AP performance (%) on the PASCAL VOC 2012 test set. The upper part shows results of single end-to-end models; the lower part shows results of multi-phase approaches or ensemble models. [Source: [1]]
  • The proposed method achieves 66.8% correct localization (CorLoc) on the PASCAL VOC 2007 trainval set, improving over the other compared methods.
Figure 9: Comparison of correct localization (CorLoc) (%) on the PASCAL VOC 2007 trainval set. The upper part shows results of single end-to-end models; the lower part shows results of multi-phase approaches or ensemble models. [Source: [1]]
  • The proposed method achieves 69.5% CorLoc on the PASCAL VOC 2012 trainval set, improving over the other compared methods.
Figure 10: Comparison of correct localization (CorLoc) (%) on the PASCAL VOC 2012 trainval set. The upper part shows results of single end-to-end models; the lower part shows results of multi-phase approaches or ensemble models. [Source: [1]]

5.3 Improvement with the Proposed Method

Figure 11: Detection results of the MIL detector (left), Fast R-CNN with pseudo GT from the MIL detector (middle), and the proposed jointly trained network (right) at different training iterations. [Source: [1]]

For implementation details and full results, refer to Section 4 of [1].

6. Conclusion

  • A novel framework [1] is presented for weakly supervised object detection that outperforms the traditional approaches in this field.
  • Jointly optimizing the MIL detector and the bounding-box regressor in an end-to-end manner mitigates the local minima problem and achieves higher accuracy on the standard PASCAL VOC 2007 and 2012 benchmarks.
  • For better feature learning, a guided attention module (GAM) is introduced. The proposed framework could also prove useful for future visual learning tasks.

7. References

[1] Ke Yang, Dongsheng Li, and Yong Dou. “Towards Precise End-to-end Weakly Supervised Object Detection Network”. ICCV, 2019.

[2] Maximilian Ilse, Jakub M. Tomczak, and Max Welling. “Attention-based Deep Multiple Instance Learning.” ICML, 2018.

[3] Jyoti G. Wadmare, Sunita R. Patil. “Improvising Weakly Supervised Object Detection (WSOD) using Deep Learning Technique.” International Journal of Engineering and Advanced Technology (IJEAT), 2020.

[4] https://towardsdatascience.com/fast-r-cnn-for-object-detection-a-technical-summary-a0ff94faa022

[5] https://www.learnopencv.com/selective-search-for-object-detection-cpp-python/
