Knowledge Distillation for Object Detection 2: (Survey) “Learning Efficient Object Detection Models with Knowledge Distillation”
In 2017, Chen et al. proposed a framework for training object detection networks with knowledge distillation [1]. Their work is significant because it was the first successful demonstration of knowledge distillation for the multi-class object detection problem.
Overall Structure
In this work, they adopt Faster R-CNN [3] as the object detection framework. The key ideas are as follows.
- Weighted Cross Entropy Loss
- Regression with Teacher Bounds
- Hint Learning with Feature Adaptation
In particular, the third idea is widely used as a basic building block in recent papers.
Learning Objective
Weighted Cross Entropy Loss
Knowledge distillation was originally proposed for training classification networks. The student network is trained to optimize the following loss function [1]:
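$$
L_{cls} = \mu\, L_{hard}(P_s, y) + (1 - \mu)\, L_{soft}(P_s, P_t)
$$

where $P_s$ and $P_t$ are the class predictions of the student and the teacher, $y$ is the ground-truth label, $L_{hard}$ is the usual cross entropy against the ground truth, $L_{soft}$ is the cross entropy against the teacher's soft output, and $\mu$ balances the two terms (notation follows [1]).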
However, unlike simpler classification problems, detection has to deal with a severe imbalance across categories: there is far more background data than data of the target objects. To resolve this, they assign a larger weight to the background class in the loss computation. They call this class-weighted cross entropy:
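$$
L_{soft}(P_s, P_t) = -\sum_{c} w_c\, P_t(c) \log P_s(c)
$$

where $w_c$ is the weight assigned to class $c$.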
In their experiments, they use 1.5 for the background class and 1.0 for all other classes.
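As a concrete illustration, here is a minimal PyTorch-style sketch of this class-weighted soft cross entropy. The function name, the tensor layout, and the assumption that class index 0 is background (as in Faster R-CNN) are illustrative choices, not details prescribed by the paper; temperature scaling is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def weighted_soft_cross_entropy(student_logits, teacher_logits,
                                background_weight=1.5, foreground_weight=1.0):
    """Class-weighted soft cross entropy between student and teacher."""
    # Teacher's soft targets and student's log-probabilities, shape (N, C).
    p_t = F.softmax(teacher_logits, dim=1)
    log_p_s = F.log_softmax(student_logits, dim=1)

    # Per-class weights: the background class (index 0, by assumption)
    # gets the larger weight, 1.5 vs. 1.0 in the paper's experiments.
    num_classes = student_logits.size(1)
    w = torch.full((num_classes,), foreground_weight,
                   device=student_logits.device)
    w[0] = background_weight

    # -sum_c w_c * P_t(c) * log P_s(c), averaged over the batch.
    return -(w * p_t * log_p_s).sum(dim=1).mean()
```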
Regression with Teacher Bounds
Since the teacher's regression outputs can provide very wrong guidance, they propose to use the teacher's regression output as an upper bound: an additional loss is applied to the student only when the student's regression error is worse than the teacher's.
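Restated from [1], the teacher-bounded regression loss takes roughly the following form, where $m$ is a margin, $\nu$ is a weight, and $L_{sL1}$ is the standard smooth L1 regression loss:

$$
L_b(R_s, R_t, y) =
\begin{cases}
\lVert R_s - y \rVert_2^2, & \text{if } \lVert R_s - y \rVert_2^2 + m > \lVert R_t - y \rVert_2^2 \\
0, & \text{otherwise}
\end{cases}
$$

$$
L_{reg} = L_{sL1}(R_s, y_{reg}) + \nu\, L_b(R_s, R_t, y_{reg})
$$

Here $R_s$ and $R_t$ are the regression outputs of the student and the teacher, and $y_{reg}$ is the ground-truth regression target. The teacher acts only as an upper bound: once the student's error is sufficiently below the teacher's, the extra term vanishes.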
Hint Learning with Feature Adaptation
Romero et al. proposed a variant of knowledge distillation called hint learning [2]. In their method, an intermediate feature map of the teacher is provided as a hint to guide the student's feature map. The loss term can be written as follows:
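$$
L_{Hint}(V, Z) = \lVert V - Z \rVert_2^2
$$

where $Z$ is the teacher's intermediate feature map (the hint) and $V$ is the corresponding feature map of the student's guided layer; an L1 distance can be used in place of the L2 distance.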
However, in many cases the shapes of the feature maps differ between the teacher and the student. To resolve this, Chen et al. proposed inserting an additional adaptation layer that transforms the student's feature map into the shape of the teacher's [1]. For example, when the feature layers are convolutional, the adaptation layer can be a 1x1 convolution that matches the number of channels.
There is another advantage to inserting an adaptation layer: they found that it is important for effective knowledge transfer even when the hint and guided layers already have the same number of channels.
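To make the structure concrete, below is a minimal PyTorch sketch of hint learning with a 1x1-convolution adaptation layer. The class name, the interpolation fallback for mismatched spatial sizes, and the channel counts in the usage lines are illustrative assumptions, not details from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class HintAdapter(nn.Module):
    """Maps the student's feature map to the teacher's channel dimension
    with a 1x1 convolution, then computes the L2 hint loss."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.adapt = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Match the channel count before comparing feature maps.
        v = self.adapt(student_feat)
        # Illustrative fallback: resize if spatial shapes still differ.
        if v.shape[-2:] != teacher_feat.shape[-2:]:
            v = F.interpolate(v, size=teacher_feat.shape[-2:],
                              mode='bilinear', align_corners=False)
        # L2 hint loss; the teacher is frozen, so detach its features.
        return F.mse_loss(v, teacher_feat.detach())

# Hypothetical usage: a 256-channel student guided by a 512-channel teacher.
# adapter = HintAdapter(student_channels=256, teacher_channels=512)
# hint_loss = adapter(student_feature_map, teacher_feature_map)
```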
Experiments
The overall performance is as follows. The accuracy (mAP) of all student networks (Tucker, AlexNet, VGGM) improved on all test datasets (PASCAL, COCO, KITTI, ILSVRC).
They also tested their method on teacher-student pairs with the same architecture, where the teacher receives higher-resolution input. Accuracy improved for all combinations they tested.
Conclusion
They proposed a new framework for distilling the knowledge of object detection networks. By providing the teacher's intermediate feature map as a hint to guide the student's feature extraction, they demonstrated that knowledge distillation works not only for classification problems but also for object detection.
References
[1] Chen et al. “Learning Efficient Object Detection Models with Knowledge Distillation”. Advances in Neural Information Processing Systems. 2017.
[2] Romero et al. “FitNets: Hints for Thin Deep Nets”. arXiv preprint arXiv:1412.6550. 2014.
[3] Ren et al. “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”. Advances in Neural Information Processing Systems. 2015.