Knowledge Distillation — Make your neural networks smaller
Deploying a huge model with millions of parameters is not easy. What if we could transfer its knowledge to a smaller model and use that for inference? By the end of this article, you will understand how this can be done.
Introduction
Knowledge Distillation means transferring the knowledge of a bigger model to a smaller one with the minimum loss of information. It can also refer to transferring the knowledge of multiple models (ensemble) into a single one.
Motivation
In most cases, we use the same model for training and inference. It is fine to spend a lot of computation on training, but for deployment we need a model that is cheap to store and makes predictions quickly with far fewer resources.
Knowledge
Knowledge here refers to the mapping of input images to the output vectors and that is exactly what we intend to transfer. The output probabilities not only contain the information about which class is the correct one but also contain the relative probabilities of incorrect classes.
For example:
The probability of a BMW being mistaken for a garbage truck is very low, but the probability of the same image being mistaken for a carrot is even lower. This is valuable information about how the model generalizes.
Method
We have two models: the teacher (bigger and more cumbersome) and the student (smaller and simpler). The teacher can also be an ensemble of multiple models. The teacher is trained separately on a dataset, and the trained model is then used to supervise the learning process of the student network.
Training the student model mainly involves 4 steps:
1. Forward pass a data sample through the teacher model
2. Forward pass the same data sample through the student model
3. Calculate the loss, which measures the distance between the two outputs
4. Minimize this loss using backpropagation
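The four steps above can be sketched in a few lines of plain Python. This is only a toy illustration under several assumptions that are not from the article: both networks are bias-free linear classifiers of the same size, the teacher weights are hand-picked rather than trained, the "distance" is a plain KL divergence without temperature, and the helper names (`linear`, `distill_step`) are hypothetical.

```python
import math
import random

def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    """Logits of a bias-free linear classifier: one weight row per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_step(teacher_w, student_w, x, lr=0.5):
    """One training step on a single sample; returns the loss before the update."""
    p = softmax(linear(teacher_w, x))   # 1. forward pass through the teacher
    q = softmax(linear(student_w, x))   # 2. forward pass through the student
    loss = kl_divergence(p, q)          # 3. distance between the two outputs
    # 4. backpropagation: for softmax outputs, dKL/dlogit_i = q_i - p_i,
    #    and dlogit_i/dW[i][j] = x[j]. Only the student is updated.
    for i, row in enumerate(student_w):
        for j in range(len(row)):
            row[j] -= lr * (q[i] - p[i]) * x[j]
    return loss

random.seed(0)
teacher_w = [[2.0, -1.0], [-1.0, 2.0], [0.5, 0.5]]  # stands in for a trained teacher
student_w = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]    # untrained student
data = [[random.uniform(-1, 1), random.uniform(-1, 1)] for _ in range(20)]

losses = []
for epoch in range(50):
    epoch_loss = sum(distill_step(teacher_w, student_w, x) for x in data)
    losses.append(epoch_loss / len(data))

print(f"mean loss, first epoch: {losses[0]:.4f}, last epoch: {losses[-1]:.4f}")
```

In a real setting the student would be a smaller architecture than the teacher and the gradient step would be handled by a deep learning framework; the structure of the loop stays the same.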
Let P(xᵢ) and Q(xᵢ) represent the output probabilities of the teacher and student networks respectively for a data sample xᵢ. Z and Y are the unscaled output logits of the networks.
P = [p₁, p₂, …, pₖ], where k is the number of classes; the same goes for Q.
Temperature
The models usually produce output probabilities by using a softmax layer which converts the logits Z into probabilities Q. For distillation, we use a different variant of softmax with a parameter called Temperature.
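For reference, the temperature-scaled softmax used by Hinton et al. (2015) can be written as follows, with zᵢ the logit for class i, qᵢ the resulting probability, k the number of classes, and T the temperature. Setting T = 1 recovers the standard softmax.

```latex
q_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{k} \exp(z_j / T)}
```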
Using a higher temperature softens the probabilities, spreading them more evenly across the classes.
For example, if [0.01, 0.98, 0.01] is the hard probability, increasing the temperature can give us [0.1, 0.8, 0.1] which is a softer version and further increasing the temperature can give us something like [0.2, 0.6, 0.2].
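The softening effect can be reproduced with a small sketch. The logits below are an assumption, chosen so that the standard softmax (T = 1) gives exactly [0.01, 0.98, 0.01]; the temperatures 2.2 and 4.2 are likewise picked only because they land close to the softer vectors quoted above.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by the temperature T (numerically stable)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: exp(log(98)) = 98, so T = 1 gives 98/100 for the middle class.
logits = [0.0, math.log(98.0), 0.0]

for T in (1.0, 2.2, 4.2):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 2) for p in probs])
# 1.0 [0.01, 0.98, 0.01]
# 2.2 [0.1, 0.8, 0.1]
# 4.2 [0.2, 0.6, 0.2]
```

The ranking of the classes never changes; only the gap between them shrinks as T grows.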
But why temperature?
The probability of mistaking a 2 for a 3 might be 0.000001, while the probability of mistaking it for a 7 might be 0.000000001. The first is 1000 times larger than the second, but this information is hardly transferred through cross-entropy because both values are so close to 0. Softening the probabilities, however, enables us to preserve this information and transfer it to the student.
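A quick numerical check of this argument, as a sketch only: it assumes a three-class problem (the digits 2, 3 and 7) instead of all ten digits, and an arbitrary temperature of T = 5.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (numerically stable)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hard probabilities for an image of a "2": classes are [2, 3, 7].
hard = [1.0 - 1e-6 - 1e-9, 1e-6, 1e-9]
logits = [math.log(p) for p in hard]   # logits that reproduce `hard` at T = 1

soft = softmax(logits, T=5.0)

ratio_hard = hard[1] / hard[2]   # about 1000
ratio_soft = soft[1] / soft[2]   # about 1000 ** (1 / 5), roughly 3.98

print("hard:", hard)
print("soft:", [round(p, 3) for p in soft])   # roughly [0.927, 0.058, 0.015]
print("ratio 3-vs-7, hard:", round(ratio_hard), "soft:", round(ratio_soft, 2))
```

At T = 5 the ratio between the two wrong classes is compressed, but their order is kept and both probabilities are now large enough to contribute noticeably to the loss and its gradient.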
The use of temperature also has a regularizing effect on the student model and the model is able to generalize better.
Loss Function
The loss has two parts. The first part is the Kullback–Leibler divergence, calculated between the outputs of the teacher and student models after applying temperature (T > 1).
The second part of the loss is simply the Cross-Entropy loss. While calculating this, we use the labels and the outputs of the student model without applying temperature (that is, T = 1).
The magnitudes of the gradients produced by the soft targets scale as 1/T². To keep the relative contribution of both losses roughly unchanged as the temperature changes, the distillation loss is multiplied by T².
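The 1/T² scaling follows from the gradient of the soft-target loss C with respect to a student logit, as derived in Hinton et al. (2015). The notation here follows that paper and is not defined elsewhere in this article: zᵢ are the student logits, vᵢ the teacher logits, and qᵢ and pᵢ the corresponding softened probabilities. The final approximation assumes that T is large compared with the logits and that the logits of each network have zero mean.

```latex
\frac{\partial C}{\partial z_i}
  = \frac{1}{T}\,(q_i - p_i)
  = \frac{1}{T}\left(
      \frac{e^{z_i/T}}{\sum_{j} e^{z_j/T}}
      - \frac{e^{v_i/T}}{\sum_{j} e^{v_j/T}}
    \right)
  \approx \frac{1}{k\,T^{2}}\,(z_i - v_i)
```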
α is a hyper-parameter that balances the two losses. In the paper, α was taken as 0.5, giving equal weight to both.
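Putting the pieces together, the combined loss can be sketched as α · T² · KL(P_T ‖ Q_T) + (1 − α) · CE(label, Q₁). Which of the two terms α multiplies is an assumption made here, as are the function name `distillation_loss` and the default T = 4; the code works on a single sample with plain Python lists.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (numerically stable)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    """Weighted sum of the soft-target KL term and the hard-label cross-entropy.

    Assumption: alpha weights the distillation (KL) term, 1 - alpha the
    cross-entropy term.
    """
    # Part 1: KL divergence between softened teacher and student outputs.
    p_soft = softmax(teacher_logits, T)
    q_soft = softmax(student_logits, T)
    kl = sum(p * math.log(p / q) for p, q in zip(p_soft, q_soft))

    # Part 2: cross-entropy of the student (T = 1) against the true label.
    q_hard = softmax(student_logits, 1.0)
    ce = -math.log(q_hard[label])

    # T ** 2 compensates for the 1 / T ** 2 scaling of the soft-target gradients.
    return alpha * (T ** 2) * kl + (1.0 - alpha) * ce

teacher = [0.0, 1.0, 3.0]
good_student = [0.1, 0.9, 2.8]
bad_student = [3.0, 1.0, 0.0]
print(distillation_loss(good_student, teacher, label=2))
print(distillation_loss(bad_student, teacher, label=2))
```

A student whose logits are close to the teacher's and that predicts the correct label gets a much smaller loss than one that disagrees with both.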
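A little about KL Divergence
KL Divergence can be seen as a measure of how different two probability distributions are; it is zero when they are identical. Mathematically it is closely related to cross-entropy: the cross-entropy between P and Q equals the entropy of P plus KL(P ‖ Q), so for a fixed teacher distribution P, minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence.
For discrete distributions P and Q over k classes, the expression is KL(P ‖ Q) = Σᵢ pᵢ log(pᵢ / qᵢ).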
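The relationship between KL divergence and cross-entropy can be checked numerically. The sketch below uses the standard definitions KL(P ‖ Q) = Σ p log(p / q), H(P, Q) = −Σ p log q and H(P) = −Σ p log p; the two example distributions are arbitrary.

```python
import math

def entropy(p):
    """H(P) = -sum p_i log p_i."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum p_i log q_i."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """KL(P || Q) = sum p_i log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.8, 0.1]   # e.g. a softened teacher output
q = [0.2, 0.6, 0.2]   # e.g. a student output

# Cross-entropy decomposes into the entropy of P plus the KL divergence.
gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
print("decomposition error:", abs(gap))

# KL divergence is zero for identical distributions and is not symmetric.
print(kl_divergence(p, p))   # 0.0
print(kl_divergence(p, q), kl_divergence(q, p))
```

Because H(P) does not depend on the student, both objectives give the same gradients for the student; the KL form simply has its minimum at zero.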
Distillation with multiple models
When an ensemble is used for prediction, the computation required is several times that of a single model. KD can be used to transfer the knowledge from multiple models to a single model. The soft targets can be calculated by taking the mean of the outputs of the individual models (averaging either the logits before the softmax or the probabilities after it) and then used in the loss function above.
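Both averaging options can be sketched as follows. The three sets of teacher logits and the temperature T = 2 are made-up values for illustration, and `mean_vectors` is a hypothetical helper.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax (numerically stable)."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vectors(vectors):
    """Element-wise mean of equally sized lists."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

T = 2.0
# Logits of three hypothetical teacher models for the same input sample.
ensemble_logits = [
    [1.0, 5.0, 0.5],
    [0.5, 4.0, 1.5],
    [2.0, 4.5, 0.0],
]

# Option 1: average the logits, then apply the softened softmax once.
targets_from_logits = softmax(mean_vectors(ensemble_logits), T)

# Option 2: apply the softened softmax per model, then average the probabilities.
targets_from_probs = mean_vectors([softmax(z, T) for z in ensemble_logits])

print([round(p, 3) for p in targets_from_logits])
print([round(p, 3) for p in targets_from_probs])
```

Either vector is a valid probability distribution and can take the place of the single teacher's softened output in the distillation term of the loss.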
Conclusion
KD is a great and ready-to-use network compression technique. Many new methods have emerged for transferring knowledge and it is still an evolving field. Compressing the networks can help solve many problems and enable us to deploy models on mobile and edge devices.
References
– Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network.” arXiv preprint arXiv:1503.02531 (2015).
– Gou, Jianping, et al. “Knowledge distillation: A survey.” International Journal of Computer Vision 129.6 (2021): 1789–1819.