Knowledge Distillation in a neural network

Karthik Arvind
May 23, 2020

This blog describes a reproduction of the paper “Distilling the Knowledge in a Neural Network” by G. Hinton, O. Vinyals and J. Dean on the MNIST dataset, carried out as part of the Deep Learning course curriculum at Delft University of Technology.

Knowledge distillation is a process in which an already trained, cumbersome model guides a smaller model on what to do and how to do a particular task. The generalisation ability of the complex model is transferred to a smaller one while retaining similar accuracy. This is much like a teacher transferring his/her knowledge to a student at a school or university: even when the student has no prior experience in executing a task, he/she can execute it using the knowledge the teacher imparted from handling similar situations. Similarly, through knowledge distillation, smaller models are trained using the knowledge acquired by training a large ensemble of models.

Fig. 1: Knowledge Distillation in a Student-Teacher Model. Source: https://medium.com/neuralmachine/knowledge-distillation-dc241d7c2322

Why do we need this concept?

Neural networks are now used in almost every field to process data efficiently. However, as the amount of data to be processed grows, the model handling it tends to become cumbersome, and a larger model needs more memory to store its parameters. Since the technological trend is moving towards smaller devices such as mobile phones and tablets, these models need to fit on such hardware to accomplish tasks like speech recognition and image recognition. Compressing a model without compromising accuracy is therefore very important, and the concept of knowledge distillation comes to our rescue here.

The concept of “knowledge transfer”:

This concept is comparable to the lectures given by a teacher to students at a school or university. To transfer knowledge from a larger model to a smaller one, we train the smaller model on the outputs produced by the trained larger model. In a neural network, these outputs are either hard targets or soft targets.

Fig 2 : Knowledge Distillation

So what output of the neural network should we take and why?

For cumbersome models that learn to discriminate among a large number of classes, the training objective is to maximize the average log probability of the correct answer (the hard target). The trained model also assigns probabilities to all of the incorrect answers, and even when these probabilities are very small, some of them are much larger than others. For example, an image of a superbike may have only a small probability of being classified as a bicycle, but this probability will still be much larger than the probability of a stone being classified as a bicycle.

To facilitate knowledge transfer, we use the “soft targets” of the teacher model as the targets in the objective function used to train the student model.

What neural architecture is used in the reproduction?

The teacher model consists of two hidden layers of 1200 rectified linear units each and is trained on the MNIST dataset. The student model has two hidden layers of 800 rectified linear units each. The author also checked the influence of distillation on a smaller student model with 300 units in each of its two hidden layers. In addition, the influence of the “temperature” hyper-parameter was studied using a student model with 30 units in each of its two hidden layers.

Fig 3 : Architecture of the teacher and student model
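
As a reference, here is a minimal PyTorch sketch of these two fully connected networks. This is an illustrative reconstruction based on the layer sizes described above, not the exact code from our repository:

```python
import torch.nn as nn

def make_mlp(hidden_units: int) -> nn.Sequential:
    """Two ReLU hidden layers on flattened 28x28 MNIST images, 10 output logits."""
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, hidden_units), nn.ReLU(),
        nn.Linear(hidden_units, hidden_units), nn.ReLU(),
        nn.Linear(hidden_units, 10),  # raw logits; the softmax is applied in the loss
    )

teacher = make_mlp(1200)  # cumbersome teacher model
student = make_mlp(800)   # distilled student (300 or 30 units for the smaller variants)
```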

Exactly how do we train the teacher-student model?

  1. First, we train the teacher model on the complete MNIST dataset.
  2. Then we train the student model using a weighted combination of two targets: the softened softmax outputs (soft targets) of the teacher model and the true labels (hard targets). The influence of the hard targets is kept low using a weighting factor (a sketch of this loss follows Fig 4 below).
Fig 4: Training of the Student-Teacher Model. Source: Shen et al., In Teacher We Trust: Learning Compressed Models for Pedestrian Detection
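
A minimal PyTorch sketch of this combined loss is shown below. It is an illustrative reconstruction of the idea rather than our exact implementation; the temperature T and the weighting factor alpha used here are hypothetical values, and the soft-target term is scaled by T² so its gradients stay comparable across temperatures, as the paper recommends:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=20.0, alpha=0.1):
    """Weighted sum of a soft-target term (teacher knowledge) and a hard-target term."""
    # Soft targets: match the teacher's softened distribution at temperature T.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)  # rescale so the soft-target gradients keep the same magnitude
    # Hard targets: ordinary cross-entropy with the true labels, kept small via alpha.
    hard_loss = F.cross_entropy(student_logits, labels)
    return (1.0 - alpha) * soft_loss + alpha * hard_loss
```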

As the soft targets have high entropy, they provide much more information per training case than hard targets and much less variance in the gradient between training cases. The student model can therefore often be trained on much less data than the teacher model while using a much higher learning rate. This entropy is controlled by adjusting the ‘temperature’ of the softmax, as given below:

Fig 5 : The formula for calculating the softmax
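
For reference, the softened softmax from the paper converts the logits z_i into probabilities q_i using a temperature T; T = 1 gives the ordinary softmax, while a higher T produces a softer, higher-entropy distribution:

```latex
q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}
```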

How are we different from the paper?

The paper uses dropout while training the teacher model and, to accommodate the dropout, constrains the L2 norm of the incoming weights of each hidden unit to an upper bound. Instead, we directly used weight decay when training our model.
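
A minimal sketch of attaching weight decay to the optimiser in PyTorch is shown below; the model, learning rate and decay coefficient here are hypothetical stand-ins, not the values from our runs:

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)  # stand-in for the teacher network defined earlier

# weight_decay adds an L2 penalty to every weight update, which we used in place
# of the paper's upper bound on the L2 norm of each unit's incoming weights.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
```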

We used about 100 epochs to train the larger model; we chose this value because a large model is usually trained with more data and for a longer time. We used 25 epochs to train the student models. Since the author does not specify the number of epochs, we tuned these values ourselves.

This could be the reason why our errors deviate from the exact values reported by the author of the paper. Still, we observed the same trend in the errors across the models and were therefore able to confirm the conclusions of the paper.

What did we achieve through knowledge distillation?

Knowledge distillation is a process of model compression where the information from the teacher model helps the student model to classify the input images correctly.

We observed this trend in our reproduction: the student model trained with distillation performs better than the student model trained without distillation, as shown in Table 1.

Table 1 : Tabular results for Models with and without distillation
Fig 6 : Comparison between teacher and student model

In order to understand the power of knowledge distillation, we use a transfer set to train the student model. The transfer set is the MNIST training set with one of the classes removed; in our case, class 3 was removed. The teacher model was trained on the whole MNIST dataset and the student model on the transfer set, so the student never encountered the digit 3 during training. Thanks to knowledge distillation, the student model was nevertheless able to identify the digit 3 with an accuracy of 97%.
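
A minimal sketch of how such a transfer set can be built with torchvision is shown below; it is illustrative only, with the held-out digit hard-coded as 3 as in our experiment:

```python
from torchvision import datasets, transforms
from torch.utils.data import Subset

# Full MNIST training set (used for the teacher).
mnist_train = datasets.MNIST("data", train=True, download=True,
                             transform=transforms.ToTensor())

# Transfer set for the student: every example whose label is not 3.
keep_idx = [i for i, target in enumerate(mnist_train.targets) if target != 3]
transfer_set = Subset(mnist_train, keep_idx)

print(len(mnist_train), len(transfer_set))  # the transfer set is roughly 6k examples smaller
```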

Effects of parameters on the accuracy of the student model

As discussed previously, the temperature controls how much information is transferred through the soft targets. We learnt that if the number of hidden units is above 100, the exact value of the temperature has little influence on the accuracy of the model. But if the number of units is reduced to around 30 or fewer, the temperature has to be lowered in order to compensate for the loss in accuracy. The results are depicted in Table 2.

Table 2 : The effect of parameters on student model
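
To make the role of the temperature concrete, here is a tiny sketch using arbitrary example logits (not values from our experiments) that shows how a higher temperature flattens the softmax distribution:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([5.0, 2.0, 0.5])  # arbitrary example logits
for T in (1.0, 5.0, 20.0):
    # Higher temperature -> flatter, higher-entropy distribution over classes.
    print(T, F.softmax(logits / T, dim=0))
```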

Effects of bias on the accuracy of the student model

If the bias value of the output unit for digit 3 in the student model is set to 3.5, the accuracy of detecting the digit 3 when the student model is trained on the transfer set increases. Hence the bias should be optimised to get better results on transfer sets. The result is depicted in Table 3.

Table 3 : Effect of Bias on the student model
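
A minimal sketch of this bias adjustment in PyTorch is shown below; the layer is a hypothetical stand-in for the student's actual output layer, and, following our reading of the paper, we interpret the adjustment as setting the bias of the logit for the omitted class (digit 3):

```python
import torch
import torch.nn as nn

output_layer = nn.Linear(800, 10)  # stand-in for the student's final layer

with torch.no_grad():
    # Set the bias of the logit for the omitted class (digit 3) to 3.5 so the
    # student becomes more willing to predict a digit it never saw in training.
    output_layer.bias[3] = 3.5
```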

What did we conclude at the end of this reproduction?

We observed the same trend as the author of the paper: knowledge distillation does increase the accuracy of the student model, even when the student model is trained on a more limited dataset than the teacher model.

It can be seen that a really big neural network that has been trained for a very long time can, through knowledge distillation, be used to train smaller models.

On MNIST, distillation works remarkably well even when the transfer set that is used to train the distilled model lacks any examples of one or more classes.

See this link for the code: https://github.com/prakashradhakrish/Distilling-the-knowledge-in-a-Neural-Network
