Knowledge Distillation for Object Detection 2: (Survey) “Learning Efficient Object Detection Models with Knowledge Distillation”
In 2017, Chen et al. proposed a framework for training object detection networks with knowledge distillation [1]. Their work is significant because it was the first successful demonstration of knowledge distillation for the multi-class object detection problem.
Overall Structure
In this work, they adopt Faster R-CNN [3] as the object detection framework. The key ideas are as follows.
- Weighted Cross Entropy Loss
- Regression with Teacher Bounds
- Hint Learning with Feature Adaptation
In particular, the third idea is widely used as a basic building block in recent papers.
Learning Objective
Weighted Cross Entropy Loss
Knowledge distillation was originally proposed for training classification networks. The student network is trained to optimize the following loss function [1]:
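$$
L_{cls} = \mu\, L_{hard}(P_s, y) + (1 - \mu)\, L_{soft}(P_s, P_t)
$$

where $P_s$ and $P_t$ are the class predictions of the student and the teacher, $y$ is the ground-truth label, $L_{hard}$ is the usual cross entropy against the ground truth, $L_{soft}$ is the cross entropy against the teacher's soft output, and $\mu$ balances the two terms (notation follows [1]).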
However, unlike simpler classification problems, detection has to deal with a severe imbalance across categories: there is far more background data than data of the target objects. To resolve this, they assign a larger weight to the background class in the loss computation. They call this class-weighted cross entropy:
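$$
L_{soft}(P_s, P_t) = -\sum_{c} w_c\, P_t(c) \log P_s(c)
$$

where $w_c$ is the weight assigned to class $c$.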
In their experiments, they use 1.5 for the background class and 1.0 for all other classes.
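As a concrete illustration, here is a minimal PyTorch-style sketch of this class-weighted soft cross entropy. The function name, the tensor layout, and the assumption that class index 0 is background (as in Faster R-CNN) are illustrative choices, not details prescribed by the paper; temperature scaling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def weighted_soft_cross_entropy(student_logits, teacher_logits,
                                background_weight=1.5, foreground_weight=1.0):
    """Class-weighted soft cross entropy between student and teacher."""
    # Teacher's soft targets and student's log-probabilities, shape (N, C).
    p_t = F.softmax(teacher_logits, dim=1)
    log_p_s = F.log_softmax(student_logits, dim=1)

    # Per-class weights: the background class (index 0, by assumption)
    # gets the larger weight, 1.5 vs. 1.0 in the paper's experiments.
    num_classes = student_logits.size(1)
    w = torch.full((num_classes,), foreground_weight,
                   device=student_logits.device)
    w[0] = background_weight

    # -sum_c w_c * P_t(c) * log P_s(c), averaged over the batch.
    return -(w * p_t * log_p_s).sum(dim=1).mean()
```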
Regression with Teacher Bounds
Since the teacher's regression outputs can provide very wrong guidance, they propose to use the teacher's regression output as an upper bound: an additional loss is applied to the student only when the student's regression error is worse than the teacher's.
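Restated from [1], the teacher-bounded regression loss takes roughly the following form, where $m$ is a margin, $\nu$ is a weight, and $L_{sL1}$ is the standard smooth L1 regression loss:

$$
L_b(R_s, R_t, y) =
\begin{cases}
\lVert R_s - y \rVert_2^2, & \text{if } \lVert R_s - y \rVert_2^2 + m > \lVert R_t - y \rVert_2^2 \\
0, & \text{otherwise}
\end{cases}
$$

$$
L_{reg} = L_{sL1}(R_s, y_{reg}) + \nu\, L_b(R_s, R_t, y_{reg})
$$

Here $R_s$ and $R_t$ are the regression outputs of the student and the teacher, and $y_{reg}$ is the ground-truth regression target. The teacher acts only as an upper bound: once the student's error is sufficiently below the teacher's, the extra term vanishes.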
Hint Learning with Feature Adaptation
Romero et al. proposed a variant of knowledge distillation called hint learning [2]. In their method, an intermediate feature map of the teacher is provided as a hint to guide the student's feature map. The loss term can be written as follows:
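$$
L_{Hint}(V, Z) = \lVert V - Z \rVert_2^2
$$

where $Z$ is the teacher's intermediate feature map (the hint) and $V$ is the corresponding feature map of the student's guided layer; an L1 distance can be used in place of the L2 distance.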
However, in many cases the shapes of the feature maps differ between the teacher and the student. To resolve this, Chen et al. proposed inserting an additional adaptation layer that transforms the student's feature map into the shape of the teacher's [1]. For example, when the feature layers are convolutional, the adaptation layer can be a 1x1 convolution that matches the number of channels.
There is another advantage to inserting an adaptation layer: they found that it is important for effective knowledge transfer even when the hint and guided layers already have the same number of channels.
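To make the structure concrete, below is a minimal PyTorch sketch of hint learning with a 1x1-convolution adaptation layer. The class name, the interpolation fallback for mismatched spatial sizes, and the channel counts in the usage lines are illustrative assumptions, not details from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintAdapter(nn.Module):
    """Maps the student's feature map to the teacher's channel dimension
    with a 1x1 convolution, then computes the L2 hint loss."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.adapt = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Match the channel count before comparing feature maps.
        v = self.adapt(student_feat)
        # Illustrative fallback: resize if spatial shapes still differ.
        if v.shape[-2:] != teacher_feat.shape[-2:]:
            v = F.interpolate(v, size=teacher_feat.shape[-2:],
                              mode='bilinear', align_corners=False)
        # L2 hint loss; the teacher is frozen, so detach its features.
        return F.mse_loss(v, teacher_feat.detach())

# Hypothetical usage: a 256-channel student guided by a 512-channel teacher.
# adapter = HintAdapter(student_channels=256, teacher_channels=512)
# hint_loss = adapter(student_feature_map, teacher_feature_map)
```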
Experiments
The overall performance is as follows. The accuracy (mAP) of all student networks (Tucker, AlexNet, VGGM) improved on all test datasets (PASCAL, COCO, KITTI, ILSVRC).
They also tested their method on teacher-student pairs with the same architecture, where the teacher receives higher-resolution input. Accuracy improved for all combinations they tested.
Conclusion
They proposed a new framework for distilling the knowledge of object detection networks. By providing the teacher's intermediate feature map as a hint to guide the student's feature extraction, they demonstrated that knowledge distillation works not only for classification problems but also for object detection.
References
[1] Chen et al. “Learning Efficient Object Detection Models with Knowledge Distillation”. Advances in Neural Information Processing Systems. 2017.
[2] Romero et al. “FitNets: Hints for Thin Deep Nets”. arXiv preprint arXiv:1412.6550. 2014.
[3] Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. Advances in Neural Information Processing Systems. 2015.