Knowledge Distillation for Object Detection 3: (Survey) “Distilling Object Detectors with Fine-grained Feature Imitation”
This paper was very impressive, because it largely improves the performance of knowledge distillation for object detectors with a very simple idea: instead of using the entire feature map region, the authors select near-object areas for hint learning. Let me briefly review this main idea and introduce their impressive experimental results!
Overall Structure
The overall structure of this method is very similar to conventional knowledge distillation methods for object detectors [2]. However, the student is not guided over the entire feature map; only the masked region is used for imitation.
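To make this concrete, here is a minimal PyTorch-style sketch of what a masked hint (imitation) loss could look like. The function name and tensor layout are my own assumptions, not the authors' code; it also assumes the student feature map has already been mapped to the teacher's channel dimension (e.g. by a 1x1 adaptation layer), as is typical in hint learning.

```python
import torch

def masked_imitation_loss(student_feat, teacher_feat, mask):
    """L2 imitation loss restricted to the masked (near-object) region.

    student_feat, teacher_feat: (N, C, H, W) feature maps of the same shape
        (the student is assumed to be adapted to the teacher's channels).
    mask: (N, H, W) binary mask marking near-object pixels.
    """
    mask = mask.unsqueeze(1).float()              # (N, 1, H, W), broadcast over channels
    num_pos = mask.sum().clamp(min=1.0)           # avoid division by zero
    diff = (student_feat - teacher_feat) ** 2     # per-element squared error
    loss = (diff * mask).sum() / (2.0 * num_pos)  # average over masked positions only
    return loss
```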
Then how can we generate the masks? For each anchor box at every pixel, they compute the IoU with a ground-truth box to get an IoU map. (Here, IoU means intersection over union between two boxes; please refer to the Faster R-CNN paper for more details.) When there are K different anchor boxes assigned to each pixel and the feature map consists of W x H pixels, the shape of the IoU map is W x H x K. After that, they find the maximum IoU value and multiply it by a constant (they used 0.5) to get a threshold value. To generate the mask, they first filter out the pixels whose IoU values are lower than this threshold. Then they merge the remaining pixels across all K channels (different anchors) and across all ground-truth boxes (there can be more than one box in an input image).
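Here is a minimal NumPy sketch of this mask-generation procedure, under my own assumptions: anchors are given as corner-format boxes laid out pixel by pixel (K anchors per feature-map location), and `psi` corresponds to the constant 0.5 mentioned above. Function names are illustrative, not taken from the authors' repository.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def fine_grained_mask(anchors, gt_boxes, H, W, psi=0.5):
    """anchors: (H*W*K, 4) anchor boxes, K anchors per feature-map pixel.
    gt_boxes: (G, 4) ground-truth boxes.
    Returns an (H, W) binary mask of near-object pixels."""
    K = anchors.shape[0] // (H * W)
    mask = np.zeros((H, W), dtype=bool)
    for gt in gt_boxes:
        iou_map = iou(gt, anchors).reshape(H, W, K)  # H x W x K IoU map for this GT box
        thresh = psi * iou_map.max()                 # per-GT threshold = psi * max IoU
        mask |= (iou_map > thresh).any(axis=-1)      # merge over the K anchors, OR over GT boxes
    return mask
```

The resulting boolean mask can then be broadcast over the channel dimension when computing the masked imitation loss sketched earlier.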
Experiments
The table above shows the results of their experiments. They designed a one-stage toy detector based on the Shuffle-Det network, and then reduced the number of output channels in each layer to get smaller student models: 0.5x means the number of channels is reduced to half, and 0.25x means it becomes a quarter. Here, -I indicates their fine-grained feature imitation, and -F indicates conventional full-feature imitation. The mAP decreased significantly (-8.9%) when they used the full region of the feature map for hint learning. On the other hand, by applying the mask to the feature map, the mAP increased dramatically (+6.7%)!
They also showed that their method works very well for Faster R-CNN. They selected ResNet-101 as the teacher and its halved model as the student, and the student's mAP improved by about 4% on the Pascal VOC 2007 dataset. They also got an 8% mAP improvement when selecting VGG16 as the teacher and VGG11 as the student. Finally, for ResNet-101 (teacher) with ResNet-50 (student), they got about a 3% improvement in mAP.
Conclusion
They showed that spatial attention is very important for distilling knowledge into object detectors, although there might be more sophisticated ways to provide spatial attention than their mask generation algorithm. (I am planning to survey more on this!) They also uploaded their code on GitHub: https://github.com/twangnh/Distilling-Object-Detectors, so you can try the experiments yourself.
References
[1] Wang et al. "Distilling Object Detectors with Fine-Grained Feature Imitation". CVPR 2019.
[2] Chen et al. "Learning Efficient Object Detection Models with Knowledge Distillation". Advances in Neural Information Processing Systems. 2017.