Understanding Knowledge Distillation

Henry Wang
5 min read · Mar 30, 2024


Knowledge distillation (KD) is a method for training small, compact, data-efficient models by transferring knowledge from a larger model. It was first proposed by Hinton et al. [1] in 2015 and soon evolved into various forms. In the rest of this article, I will first introduce standard KD and its variants from an academic perspective. Afterwards, I will explain them in a more intuitive way.

Standard Knowledge Distillation

For simplicity, I will illustrate knowledge distillation in the setting of image classification, the case studied in the original paper [1]. Assume we have a large, pre-trained model, called the teacher model, that already achieves state-of-the-art performance on a large dataset. Our goal is to train a smaller model, called the student model, that performs nearly as well as the teacher. Training the student model from scratch is tedious and time-consuming, so let’s take advantage of the teacher model instead.

As a classification model, the teacher takes images as input and outputs logits, which, after a softmax function, are converted into a probability for each class. Compared with ground-truth labels (also called hard targets), the targets generated by the teacher model are “soft”, meaning that probability mass is spread over all classes (even the least likely classes receive non-zero values). Such soft targets encode valuable information learned by the teacher model and can be used to supervise the training of the student model. For more details, please refer to the original paper [1].
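To make this concrete, here is a minimal sketch of the standard KD objective in PyTorch. The function name `kd_loss` and the hyperparameter values for the temperature `T` and the weight `alpha` are my own illustrative choices, not something prescribed by the paper; the structure of the loss (temperature-scaled KL divergence toward the teacher’s soft targets plus cross-entropy on hard labels, with the distillation term scaled by T²) follows [1].

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft targets: temperature-scaled teacher probabilities.
    soft_targets = F.softmax(teacher_logits / T, dim=1)
    # Student's temperature-scaled log-probabilities.
    log_probs = F.log_softmax(student_logits / T, dim=1)
    # KL divergence between soft targets and student predictions,
    # scaled by T^2 to keep gradient magnitudes comparable (as suggested in [1]).
    distill = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T ** 2)
    # Ordinary cross-entropy against the hard labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * distill + (1 - alpha) * hard
```

A higher temperature flattens the teacher’s distribution, exposing more of the “dark knowledge” about how classes relate to each other.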

Variants of Knowledge Distillation

A critical problem with standard knowledge distillation was its limitation in training very deep student networks. To solve this problem, Romero et al. [2] suggested a hint-based training strategy that extended KD to the feature level of deep neural networks. Concretely, the student model was first trained to mimic intermediate feature maps of the teacher model, and was then trained with the conventional knowledge distillation loss until convergence. Such feature-level distillation allowed knowledge to be distilled at different levels and made it possible to train deeper, thinner student networks.
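A rough sketch of the hint loss, assuming PyTorch: the class name `HintLoss` and the choice of a 1x1 convolution as the regressor are my own simplifications (the paper uses a convolutional regressor to match the teacher’s hint layer width), and the teacher and student feature maps are assumed to share the same spatial size.

```python
import torch
import torch.nn as nn

class HintLoss(nn.Module):
    """L2 loss between a teacher 'hint' feature map and a regressed
    student 'guided' feature map, in the spirit of FitNets [2]."""

    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # 1x1 conv regressor maps student features to the teacher's channel width.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat, teacher_feat):
        # Assumes matching spatial dimensions between the two feature maps.
        return ((self.regressor(student_feat) - teacher_feat) ** 2).mean()
```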

Zagoruyko et al. [3] further improved feature-field distillation by introducing attention transfer (AT), a strategy that matched the attention maps of student models with those of teacher models. Moreover, they visualized attention maps of different layers and found some correlations between attention peaks and layer depths.
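As a sketch of the activation-based variant of attention transfer (the helper names `attention_map` and `at_loss` are hypothetical, and the exact normalization and reduction are implementation choices), an attention map can be formed by summing squared activations over channels and L2-normalizing per sample:

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # Collapse the channel dimension: sum of squared activations,
    # then flatten spatially and L2-normalize per sample.
    a = feat.pow(2).sum(dim=1).flatten(1)
    return F.normalize(a, dim=1)

def at_loss(student_feat, teacher_feat):
    # Match student and teacher attention maps (equal spatial sizes assumed).
    return (attention_map(student_feat) - attention_map(teacher_feat)).pow(2).mean()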

Shortly afterwards, Yim et al. [4] explored a new method of distilling feature knowledge into student models. As opposed to previous works that distilled the features of single layers, they distilled the “flow” of features from the inputs of the model to its outputs. Specifically, the correlations between features across different layers were measured by flow of solution procedure (FSP) matrices. The student model was first trained to mimic the feature correlations of the teacher model, and subsequently trained with a common supervised-learning loss.
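An FSP matrix is essentially a spatially averaged inner product between the feature maps of two layers with the same spatial resolution. A minimal sketch, with hypothetical helpers `fsp_matrix` and `fsp_loss`:

```python
import torch

def fsp_matrix(feat_a, feat_b):
    # feat_a: (N, C1, H, W), feat_b: (N, C2, H, W) with matching H, W.
    n, c1, h, w = feat_a.shape
    c2 = feat_b.shape[1]
    a = feat_a.reshape(n, c1, h * w)
    b = feat_b.reshape(n, c2, h * w)
    # Inner product over spatial positions, averaged: result is (N, C1, C2).
    return torch.bmm(a, b.transpose(1, 2)) / (h * w)

def fsp_loss(student_pair, teacher_pair):
    # Each pair is (lower-layer feature map, upper-layer feature map).
    g_s = fsp_matrix(*student_pair)
    g_t = fsp_matrix(*teacher_pair)
    return (g_s - g_t).pow(2).mean()
```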

So far, you may have noticed that a large, powerful teacher model is indispensable in knowledge distillation. Is it possible to train a student model without any teacher model? Actually, it is! In 2018, Zhang et al. [5] proposed a novel approach that is free of teacher models. The idea, called deep mutual learning (DML), only required a cohort of untrained student models with similar structures. In the training phase, all student models were trained simultaneously and learned from the ground-truth labels as well as from the predictions of the other students. In this way, we can train student models almost from scratch!
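A sketch of the per-student loss in such a cohort, assuming PyTorch and a hypothetical `dml_loss` helper: each student minimizes its own cross-entropy plus a KL divergence toward each peer’s predicted distribution (detaching the peers is an implementation convenience here, since each network is updated on its own loss).

```python
import torch
import torch.nn.functional as F

def dml_loss(logits_self, peer_logits_list, labels):
    """Loss for one student in a DML cohort [5]: supervised cross-entropy
    plus the average KL divergence toward each peer's predictions."""
    loss = F.cross_entropy(logits_self, labels)
    log_p_self = F.log_softmax(logits_self, dim=1)
    for peer_logits in peer_logits_list:
        # Each peer's output is treated as a fixed target for this student's update.
        p_peer = F.softmax(peer_logits.detach(), dim=1)
        loss = loss + F.kl_div(log_p_self, p_peer,
                               reduction="batchmean") / len(peer_logits_list)
    return loss
```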

Even though all the aforementioned methods yielded promising results, several problems remained. In [6], Cho et al. claimed that “bigger models are not better teachers” after conducting extensive experiments. They found that as the size and depth of the teacher model increased, the student model suffered severe performance degradation, likely because of the capacity gap between the two models. They then suggested two early-stopping strategies to mitigate this problem.

Viewing KD from a Naive Perspective

Now let’s move on to a more naive yet more intuitive perspective. Suppose there is a math teacher A and a rookie student B in the real world. Basically, student B is “trained” to solve a math problem under the guidance of teacher A, and then attempts to tackle other similar problems on his own. Analogously, the math problem = the task (image classification), and the correct solution = the ground-truth label.

In case [1], teacher A solves the problem by himself and only gives student B his final solution. Although his solution is very close to the correct one, no intermediate steps are available. As a consequence, student B merely memorizes the solution while having no idea how to solve the problem.

In case [2], teacher A provides not only his solution but also several intermediate steps, and student B can now mimic the steps to solve similar problems. However, the solution steps are so concise that two extra problems arise: 1. the key steps are hidden; 2. there are no connections between consecutive steps. Interestingly, [3] and [4] address these problems respectively. In case [3], teacher A points out which steps are vital for solving the problem, while in case [4], teacher A explains the logic that connects the steps.

In case [5], student B says, “I don’t need teacher A. I want to study with my friend C.” By coincidence, student C is also a rookie in mathematics and is willing to study with student B. Hence, they attempt to solve the same problem together, and when one of them fails to think of a solution, he looks either at the correct solution or at the other’s solution.

In case [6], teacher A is no longer an ordinary math teacher but an expert in mathematics. He solves the problem in an advanced and abstract way rather than a straightforward and intuitive one. Therefore, student B finds it extremely hard to understand what teacher A conveys in his solution. As a result, student B fails to learn much from teacher A.

[1] Hinton, Geoffrey E., et al. “Distilling the Knowledge in a Neural Network.” arXiv preprint arXiv:1503.02531 (2015).

[2] Romero, Adriana, et al. “FitNets: Hints for Thin Deep Nets.” arXiv preprint arXiv:1412.6550 (2014).

[3] Zagoruyko, Sergey, and Nikos Komodakis. “Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer.” arXiv preprint arXiv:1612.03928 (2016).

[4] Yim, Junho, et al. “A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning.” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017): 7130–7138.

[5] Zhang, Ying, et al. “Deep Mutual Learning.” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2018): 4320–4328.

[6] Cho, Jang Hyun, and Bharath Hariharan. “On the Efficacy of Knowledge Distillation.” 2019 IEEE/CVF International Conference on Computer Vision (ICCV) (2019): 4793–4801.
