[Paper Summary] Knowledge Distillation — A survey


Deep learning models with billions of parameters are difficult to deploy on resource-constrained devices such as phones and embedded systems, or to use for real-time inference and serving, where latency budgets are typically measured in milliseconds (≤500 ms).


  • Knowledge distillation (KD) is a popular model compression and acceleration technique in which knowledge is transferred from a large model (the teacher) to a small model (the student).
  • A KD system has 3 key components: the knowledge, the distillation algorithm, and the teacher-student architecture.
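
To make the teacher-to-student transfer concrete, here is a minimal sketch of a distillation loss built from temperature-softened outputs. It assumes the soft-target formulation of Hinton et al. (2015), which this summary does not spell out; the function names, the temperature of 4.0, and the 50/50 weighting are illustrative choices, not values from the survey.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label cross-entropy term.

    temperature and alpha are assumed hyperparameters, not values from the survey.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft term on a scale comparable to the hard term.
    soft = (temperature ** 2) * kl_divergence(p_teacher, p_student)
    hard = -math.log(softmax(student_logits)[true_label] + 1e-12)
    return alpha * soft + (1.0 - alpha) * hard

teacher_logits = [6.0, 2.0, 1.0]
good_student = [5.0, 2.0, 1.0]   # agrees with the teacher
poor_student = [1.0, 2.0, 5.0]   # disagrees with the teacher

print(distillation_loss(good_student, teacher_logits, true_label=0))
print(distillation_loss(poor_student, teacher_logits, true_label=0))
```

The student that agrees with the teacher gets the lower loss; in a real setup this value would be minimized with respect to the student's parameters only.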


Different kinds of knowledge can be distilled: response-based, feature-based, and relation-based.

Figure adapted from Jianping Gou et al. (2020) | 📝 Paper
Sources of response-based knowledge, feature-based knowledge & relation-based knowledge in a deep teacher network | Figure adapted from Jianping Gou et al. (2020) | 📝 Paper
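
The three kinds of knowledge named in the caption above can be sketched as three loss terms. This is a simplified illustration under stated assumptions: the helper names are hypothetical, mean squared error stands in for whatever distance a given method uses, and the relation-based term covers only relations between data samples (pairwise distances within a batch).

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def response_loss(student_logits, teacher_logits):
    """Response-based: match the final-layer outputs."""
    return mse(student_logits, teacher_logits)

def feature_loss(student_feat, teacher_feat):
    """Feature-based: match an intermediate activation (assumes equal widths)."""
    return mse(student_feat, teacher_feat)

def pairwise_distances(batch):
    """Euclidean distance between every pair of samples in a batch."""
    return [math.dist(batch[i], batch[j])
            for i in range(len(batch)) for j in range(i + 1, len(batch))]

def relation_loss(student_batch, teacher_batch):
    """Relation-based: match the structure between samples, not each sample itself."""
    return mse(pairwise_distances(student_batch), pairwise_distances(teacher_batch))

# Hypothetical embeddings: the teacher uses 3 dimensions, the student only 2.
teacher_batch = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0]]
student_batch = [[1.0, 1.0], [4.0, 5.0]]

print(relation_loss(student_batch, teacher_batch))
print(response_loss([2.0, 0.5], [2.5, 0.0]))
```

Note that the relation-based term compares distances between samples, so it still works when the teacher and student embeddings have different widths, which the feature-based term as written here does not.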

Distillation schemes

There are 3 main categories, depending on whether the teacher and student models are updated simultaneously:

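  • Offline distillation: The teacher model is trained first and then kept fixed while its knowledge is transferred to the student model.
  • Online distillation: The teacher and student models are updated simultaneously in the same training process. Training can be parallelized with data and/or model distribution strategies, which makes the process efficient.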
  • Self-distillation: The teacher and student models have the same size and architecture. They can be a single model acting as both teacher and student, or two instances of the same model, one as teacher and the other as student. The main idea is that knowledge from the deeper layers can be used to train the shallower layers. The process usually makes the student model more robust and accurate.
Figure adapted from Jianping Gou et al. (2020) | 📝 Paper
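
The difference between training the teacher first and updating both models together is easiest to see in the training loop. The sketch below is a toy, assuming each "model" is a single scalar weight fitted to y = 2x; the function names, learning rate, and mimic weight are all hypothetical. Self-distillation is not covered here.

```python
def grad_step(w, x, y, peer_pred=None, lr=0.05, mimic_weight=0.5):
    """One SGD step on squared error for the toy model pred = w * x.

    If peer_pred is given, the prediction is also pulled toward the peer's.
    """
    pred = w * x
    grad = 2 * (pred - y) * x
    if peer_pred is not None:
        grad += mimic_weight * 2 * (pred - peer_pred) * x
    return w - lr * grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # samples of y = 2x

def train_offline(epochs=50):
    """Two stages: train the teacher alone, then freeze it and train the student."""
    teacher = 0.0
    for _ in range(epochs):
        for x, y in data:
            teacher = grad_step(teacher, x, y)
    student = 0.0
    for _ in range(epochs):
        for x, y in data:
            # The teacher is only read here, never updated.
            student = grad_step(student, x, y, peer_pred=teacher * x)
    return teacher, student

def train_online(epochs=50, teacher_init=0.0, student_init=1.0):
    """One stage: both models are updated in the same loop and learn from each other."""
    teacher, student = teacher_init, student_init
    for _ in range(epochs):
        for x, y in data:
            t_pred, s_pred = teacher * x, student * x
            teacher = grad_step(teacher, x, y, peer_pred=s_pred)
            student = grad_step(student, x, y, peer_pred=t_pred)
    return teacher, student

print(train_offline())
print(train_online())
```

In both cases the student ends up close to the true weight of 2; what differs is whether the teacher is a fixed, pre-trained reference or a moving target that is trained alongside the student.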

Teacher-student architecture:

The most common student architectures are:

  • a smaller version of the teacher model with fewer layers and fewer neurons per layer
  • same model as the teacher
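
As a rough illustration of what "fewer layers and fewer neurons per layer" means for model size, the sketch below counts the parameters of two fully connected networks. The layer widths are hypothetical and chosen only for the example; they do not come from the survey.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

teacher_layers = [784, 1200, 1200, 10]  # hypothetical teacher
student_layers = [784, 128, 10]         # hypothetical student: fewer layers, fewer neurons

teacher_params = dense_param_count(teacher_layers)
student_params = dense_param_count(student_layers)

print(teacher_params, student_params)
print(round(teacher_params / student_params, 1))
```

The ratio between the two counts gives a crude sense of the capacity gap between teacher and student.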

Distillation algorithms

  • Adversarial distillation: Uses the idea behind generative adversarial networks (GANs): a generator model poses difficult questions to the teacher model, and the student model learns from the teacher how to answer them.
  • Multi-teacher distillation: Multiple teacher models provide distinct kinds of knowledge to the student model.
  • Cross-modal (cross-disciplinary) distillation: A teacher model trained in one modality (say, vision) is used to train a student model in a different modality (say, text). Example application: visual question answering.
  • Graph-based distillation²: Knowledge of the teacher network's embedding procedure is distilled into a graph, which is in turn used to train the student model.
  • Attention-based distillation: Attention maps are used to transfer knowledge about feature embeddings to the student model.
  • Data-free distillation: When no training dataset is available (for example, due to privacy or security concerns), synthetic data is generated from the teacher model, or GANs are used to generate it.
  • Quantized distillation: Knowledge is transferred from a high-precision teacher (say, 32-bit) to a low-precision student model (say, 8-bit).
  • Lifelong distillation: Continuously learned knowledge of the teacher model is transferred to the student model (YouTube video reference).
  • Neural architecture search (NAS)-based distillation: AutoML is used to automatically identify an appropriate student model, in particular a suitable capacity gap between the teacher and student models.
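
Two of the algorithms above are simple enough to sketch in a few lines. The first function averages the softened outputs of several teachers, which is one common way (assumed here, not prescribed by the survey) to combine them in multi-teacher distillation. The second builds a spatial attention map by summing squared activations over channels, an assumed formulation of attention-based transfer. All names and shapes are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def multi_teacher_targets(teacher_logits_list, temperature=4.0, weights=None):
    """Weighted average of each teacher's softened class probabilities."""
    n = len(teacher_logits_list)
    weights = weights or [1.0 / n] * n  # equal weighting unless told otherwise
    probs = [softmax(logits, temperature) for logits in teacher_logits_list]
    num_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs)) for c in range(num_classes)]

def attention_map(feature_map):
    """Collapse a [channels][positions] activation into one L2-normalized spatial map."""
    positions = len(feature_map[0])
    raw = [sum(channel[i] ** 2 for channel in feature_map) for i in range(positions)]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # avoid division by zero
    return [v / norm for v in raw]

def attention_loss(student_fm, teacher_fm):
    """Squared distance between the student's and teacher's attention maps."""
    s, t = attention_map(student_fm), attention_map(teacher_fm)
    return sum((a - b) ** 2 for a, b in zip(s, t))

# Two hypothetical teachers that disagree on a 2-class problem.
targets = multi_teacher_targets([[2.0, 0.0], [0.0, 2.0]])
print(targets)

# Teacher has 2 channels, student has 1; both attend to the first position.
teacher_fm = [[1.0, 0.0], [1.0, 0.0]]
student_fm = [[3.0, 0.0]]
print(attention_loss(student_fm, teacher_fm))
```

Because the attention map sums over channels, the student and teacher can have different channel counts as long as the spatial positions line up.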

Performance aspects and conclusions

  • Offline distillation often transfers feature-based knowledge.
  • Online distillation often transfers response-based knowledge.
  • The performance of student models can be improved by knowledge transfer from high-capacity teacher models.

Applications of KD

  • KD has found applications in visual recognition, NLP, speech recognition, and various other areas.


Challenges

  • It’s a challenge to measure the quality of the knowledge or of the teacher-student architecture.

Future directions

  • It would be useful to integrate KD with other learning schemes like reinforcement learning, adversarial learning, etc.


References

  1. Knowledge Distillation: A Survey
  2. Graph-based Knowledge Distillation by Multi-head Attention Network


