On Distilling Knowledge from Teachers to Students

Kien Hao Tiet
Aviation Software Innovation
Nov 2, 2020 · 9 min read

Goal: This blog aims to give our readers a better view of how to improve the performance of a network by using knowledge distillation.

In recent years, deep learning models have been getting deeper and wider, which lets them achieve unexpected results on many different datasets. However, this also makes them harder to deploy due to their size, and more impractical when it comes to inference time. There is a line of research that targets reducing the number of parameters of a network before using it in production. There are three main ways to do this: quantization, pruning, and distillation. Each method takes a different approach to reducing the complexity of the model. In this blog, we will focus on distillation methods and their evolution throughout the last five to six years.

I. The Original Motivation for Knowledge Distillation

[1] is one of the first works in the distillation field. Its goal is to deploy a smaller (less deep/wide/dense) model to production so that we can save inference time. The problem with the smaller model is its performance, which is nowhere near that of the deeper or wider model. The reason is that when we train the smaller model from scratch, it does not have enough capacity to learn the complicated patterns in the dataset. The question, therefore, is: is there a way to match the performance of the smaller and bigger models, or at least to close the gap between the two?

The first observation is that when we have a very deep or wide model trained on the dataset, this model can correctly predict the data from the test set. Although there are still errors due to noise in the dataset, it is able to figure out both the complex patterns and the noise in the data. [1] calls this dark knowledge. The question now is how we can transfer the dark knowledge from the bigger model to the smaller model. The elegant answer that [1] proposed is to use the dark knowledge as a guide that helps the smaller model walk through the noise of the dataset.

In the distillation objective (reconstructed below), θs denotes the parameters of the smaller, or student, model, and θt the parameters of the bigger, or teacher, model. λ is a control hyper-parameter that determines how strongly the teacher's signal affects the student's prediction. L(., .) is the loss between two objects: cross-entropy for classification or MSE (mean squared error) for regression. The loss between the teacher and the student, however, is usually measured with KL (Kullback-Leibler) divergence. If you want to know more about KL-divergence, please refer here.
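The equation itself appears as an image in the original post and is not reproduced here, so the following is a hedged LaTeX reconstruction of the usual form of this objective, matching the notation in the paragraph above (a sketch of the standard formulation, not the exact rendering from [1]):

```latex
% A reconstruction of the standard distillation objective; not copied from [1].
% x: input, y: ground-truth label, f(x; \theta): model prediction.
\mathcal{L}(\theta_s) =
    (1 - \lambda)\, \mathcal{L}\bigl(y,\, f(x; \theta_s)\bigr)
    + \lambda\, \mathcal{L}_{\mathrm{KL}}\bigl(f(x; \theta_t),\, f(x; \theta_s)\bigr)
```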

Note: it helps to keep the terminology consistent. In the literature, the bigger model is referred to as the teacher (or teacher model), and the smaller model as the student (or student model).

Training the student model in this way gives a big lift compared to not using any knowledge distillation. The table below summarizes their experiments on the MNIST dataset.

The table is captured from the paper [1]

I suggest readers refer to this paper for further information. Besides the core idea of knowledge distillation, the authors also ran experiments on specialist models with distillation. We will not go through those details because they are beyond the scope of this blog.

II. Leveraging Knowledge Distillation as Regularization

It is worth noting that the original idea of knowledge distillation involves a bigger model and a smaller one; in other words, the process is also called model compression. However, [2] is one of the first papers to use knowledge distillation between two identical models; specifically, the teacher has the same architecture as the student. At this stage, knowledge distillation is no longer simply model compression: the extra term in the equation above acts as a (learned) regularizer, in which the teacher model guides the student to enhance its performance. Say we run k distillation rounds; at the k-th round, the k-th model distills knowledge from the (k-1)-th model. Another key point of [2] is that distillation happens over many rounds instead of just one, as demonstrated in [1]. The benefit is that this creates a mean ensemble over the teacher models. As a result, the researchers in [2] report a large lift in performance compared to the plain supervised learning approach.
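To make the procedure concrete, here is a minimal PyTorch-style sketch of the k-round born-again loop described above. It is an illustration under assumptions rather than the authors' code: make_model and train_loader are placeholders you would supply, and the temperature T and weight lam are hyper-parameters to tune.

```python
import copy

import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, labels, lam=0.5, T=2.0):
    """Supervised cross-entropy plus a KL term toward the teacher's softened outputs."""
    ce = F.cross_entropy(student_logits, labels)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return (1 - lam) * ce + lam * kl


def born_again_rounds(make_model, train_loader, k_rounds=3, epochs=5, device="cpu"):
    """Train k generations; generation k distills from generation k-1, as in [2]."""
    teacher = None
    for _ in range(k_rounds):
        student = make_model().to(device)
        opt = torch.optim.Adam(student.parameters(), lr=1e-3)
        for _ in range(epochs):
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                logits = student(x)
                if teacher is None:
                    # First generation: plain supervised training.
                    loss = F.cross_entropy(logits, y)
                else:
                    with torch.no_grad():
                        t_logits = teacher(x)
                    loss = distillation_loss(logits, t_logits, y)
                opt.zero_grad()
                loss.backward()
                opt.step()
        # The finished student is frozen and teaches the next generation.
        teacher = copy.deepcopy(student).eval()
    return teacher
```

The key design choice is that each finished student is frozen and reused as the teacher for the next generation, which is what produces the ensemble-like effect mentioned above.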

Following this original idea, a new line of research has leveraged it to train models in a multi-task setup, such as BAM [3] and MT-DNN [4]. Both report competitive results on the GLUE benchmark by using knowledge distillation.

One of the more interesting recent papers that leverages "dark knowledge" to enhance the performance of the student model is Circumventing Outliers of AutoAugment with Knowledge Distillation [5]. The main idea is to close the gap left by strong data augmentation methods such as AutoAugment: when we apply aggressive AutoAugment policies to images, the semantics of the augmented images can be destroyed, which confuses the classifier. In [5], the teacher is trained with light augmentation, and the student is trained with strong augmentation under the supervision of the teacher.

III. Distillation as Self-training

The next step in the evolution of distillation is to use it for self-training, as in [6], in combination with semi-supervised learning. If you need a refresher on semi-supervised learning, you can take a look here. In [6], the authors' goal is to improve accuracy on ImageNet by utilizing extra unlabeled data. To achieve this, they first train the teacher model on ImageNet only. Then, the teacher model predicts labels for the unlabeled data, and the student has to match those predictions. In addition, the authors use strong data augmentation as regularization to slow down the rate at which the student overfits while imitating the teacher. At the same time, the student model is also optimized in the usual supervised fashion on ImageNet.

Note: the unlabeled data does not go through any data augmentation before being fed to the teacher model, but it does for the student model. The reason, according to the authors, is that they want the teacher to see a clean, easy picture so that its predictions are as accurate as possible.
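Below is a simplified sketch of one self-training round in the spirit of [6], reflecting the note above: the teacher labels clean unlabeled images, and the student is trained on strongly augmented copies of them plus the original labeled set. The loaders, the strong_augment function, and the optimizer settings are assumptions for illustration; the actual pipeline in the paper also injects noise through dropout and stochastic depth and operates at a much larger scale.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def pseudo_label(teacher, unlabeled_loader, device="cpu"):
    """The teacher sees clean (un-augmented) images and produces soft pseudo-labels."""
    teacher.eval()
    batches = []
    for x in unlabeled_loader:  # assumes the loader yields raw image batches, no labels
        x = x.to(device)
        batches.append((x.cpu(), F.softmax(teacher(x), dim=-1).cpu()))
    return batches


def train_noisy_student(student, labeled_loader, pseudo_batches, strong_augment,
                        epochs=1, device="cpu"):
    """One round: supervised loss on labeled data + soft cross-entropy on noised pseudo-labels."""
    opt = torch.optim.SGD(student.parameters(), lr=0.1, momentum=0.9)
    student.train()
    for _ in range(epochs):
        # Supervised part (e.g. the original ImageNet labels).
        for x, y in labeled_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(student(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        # Pseudo-labeled part: strong augmentation is applied only to the student's inputs.
        for x, soft_y in pseudo_batches:
            x, soft_y = strong_augment(x).to(device), soft_y.to(device)
            log_p = F.log_softmax(student(x), dim=-1)
            loss = -(soft_y * log_p).sum(dim=-1).mean()  # cross-entropy with soft targets
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```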

The idea also follows [2]: the student model acts as the teacher in the next round, and we train a new student model. An interesting detail is that after each round the size of the student model is increased, so that it can obtain better performance and learn more difficult patterns than its teacher. One thing we notice is that the authors keep the data augmentation the same across generations; we might obtain better results by progressively increasing the difficulty of the augmentation.

For automatic speech recognition (ASR), the same group of authors uses a similar technique [7]; in the ASR case, they use SpecAugment as the strong augmentation for the student.

In addition, recent research from Google has shown that self-training [8] increases the performance of the student model under strong augmentation and makes it more robust to noise in large datasets. The observations in the paper also challenge the common belief about pre-training, showing that pre-trained models can actually degrade performance when the labeled dataset is enormous.

IV. A New Type of Distillation

The newest trend in knowledge distillation is layer-wise matching between the teacher and the student model. The traditional way, as we saw above, is to add an extra term to the total loss that forces the student's outputs to match the teacher's. The new way is to close the layer-wise discrepancy between the outputs of the teacher and the student. In Go Wide, Then Narrow [9] and Patient Knowledge Distillation [10], the student models are smaller than the teacher model (less wide or less deep). To achieve the same performance, distillation is applied at the layer level. In Go Wide, Then Narrow, since the student is not as wide as the teacher, the authors squeeze the blocks of the student model between fully connected layers, and the layer-wise losses are set up as KL divergence or mean squared error.

Patient Knowledge Distillation, on the other hand, forces the last layers of the teacher to match the corresponding layers of the student, or uses a skip pattern: every k-th layer of the teacher is matched with a layer of the student. In this case, the authors use MSE on the normalized hidden states of the two models.
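Here is a rough sketch of that skip-pattern, layer-wise matching; it assumes both models expose their per-layer hidden states as lists of tensors, and the exact layer-selection rule in [10] may differ in detail.

```python
import torch.nn.functional as F


def patient_loss(student_hidden, teacher_hidden, skip=2):
    """MSE between normalized student layers and every `skip`-th teacher layer.

    Both arguments are lists of [batch, hidden] tensors, one entry per layer.
    """
    matched = teacher_hidden[::skip][: len(student_hidden)]
    loss = 0.0
    for s, t in zip(student_hidden, matched):
        # Normalize hidden states before matching, as described above.
        loss = loss + F.mse_loss(F.normalize(s, dim=-1), F.normalize(t, dim=-1))
    return loss / len(student_hidden)
```

In practice this term would be added, with a weight, to the usual distillation loss on the output logits.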

It turns out that both Go Wide, Then Narrow and Patient Knowledge Distillation outperform the vanilla knowledge distillation method.

Note: [9] offers a nice mathematical explanation of their method.

The table is from [9]. KD — knowledge distillation; WIN — wide and thin setup.
The table is from [10]. FT — fine-tuning; KD — knowledge distillation; PKD — patient knowledge distillation.

Q&A

A question one may ask on the side: how small can the student be before the performance gets worse?

This question is addressed in [11]. In the paper, the researchers tried many different distillation paths, from very aggressive (a 10-layer CNN distilled directly into a 2-layer CNN) to less aggressive (10 to 8 to 6 ... to 2 layers). They showed that with a less aggressive decrease in size, the student model achieves better accuracy than in the aggressive case (see the table below).
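To illustrate what a less aggressive distillation path looks like, here is a toy sketch that distills along a chain of progressively shallower CNNs. The architecture, hyper-parameters, and single-pass training loop are invented for illustration and are not the setup used in [11].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_cnn(n_conv_layers, channels=16, n_classes=10):
    """A toy CNN whose depth can be varied along the distillation path."""
    layers, in_ch = [], 3
    for _ in range(n_conv_layers):
        layers += [nn.Conv2d(in_ch, channels, 3, padding=1), nn.ReLU()]
        in_ch = channels
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, n_classes)]
    return nn.Sequential(*layers)


def distill_path(train_loader, path=(10, 8, 6, 4, 2), lam=0.5, T=2.0, device="cpu"):
    """Distill step by step along a chain of shrinking CNNs (the less aggressive setting)."""
    teacher = None
    for depth in path:
        student = make_cnn(depth).to(device)
        opt = torch.optim.Adam(student.parameters(), lr=1e-3)
        for x, y in train_loader:  # single pass per model, just to sketch the idea
            x, y = x.to(device), y.to(device)
            logits = student(x)
            loss = F.cross_entropy(logits, y)
            if teacher is not None:
                with torch.no_grad():
                    t_logits = teacher(x)
                kl = F.kl_div(F.log_softmax(logits / T, dim=-1),
                              F.softmax(t_logits / T, dim=-1),
                              reduction="batchmean") * (T * T)
                loss = (1 - lam) * loss + lam * kl
            opt.zero_grad()
            loss.backward()
            opt.step()
        teacher = student.eval()  # the current student teaches the next, smaller model
    return teacher
```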

Reference:

[1] Geoffrey Hinton, et al. Distilling the Knowledge in a Neural Network. https://arxiv.org/pdf/1503.02531.pdf

[2] Tommaso Furlanello et al. Born-Again Neural Networks. https://arxiv.org/pdf/1805.04770.pdf

[3] Kevin Clark et al. BAM! Born-Again Multi-Task Networks for Natural Language Understanding. https://arxiv.org/pdf/1907.04829.pdf

[4] Xiaodong Liu et al. Multi-Task Deep Neural Networks for Natural Language Understanding. https://arxiv.org/pdf/1901.11504.pdf

[5] Longhui Wei et al. Circumventing Outliers of AutoAugment with Knowledge Distillation. https://arxiv.org/pdf/2003.11342.pdf

[6] Qizhe Xie et al. Self-training with Noisy Student improves ImageNet classification. https://arxiv.org/pdf/1911.04252.pdf

[7] Daniel S. Park et al. Improved Noisy Student Training for Automatic Speech Recognition. https://arxiv.org/pdf/2005.09629.pdf

[8] Barret Zoph et al. Rethinking Pre-training and Self-training. https://arxiv.org/pdf/2006.06882.pdf

[9] Denny Zhou et al. Go Wide, Then Narrow: Efficient Training of Deep Thin Networks. https://arxiv.org/pdf/2007.00811.pdf

[10] Siqi Sun et al. Patient Knowledge Distillation for BERT Model Compression. https://arxiv.org/pdf/1908.09355.pdf

[11] Seyed Iman Mirzadeh et al. Improved Knowledge Distillation via Teacher Assistant. https://arxiv.org/pdf/1902.03393.pdf
