Published in

Nerd For Tech

Review — Model Distillation: Distilling the Knowledge in a Neural Network (Image Classification)

Smaller Models are Obtained Using Distillation. Faster Training for AlexNet on JFT Dataset.

Higher Temperature for Distillation
  • In this paper, the knowledge in an ensemble of models is distilled into a single model.


  1. Higher Temperature for Model Distillation
  2. Experimental Results

1. Higher Temperature for Model Distillation

1.1. Higher Temperature for Soft Targets

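  • Neural networks typically produce class probabilities by using a “softmax” output layer that converts the logit, zi, computed for each class into a probability, qi, by comparing zi with the other logits:
    $$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
  • where T is a temperature that is normally set to 1. Using a higher value for T produces a softer probability distribution over classes.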
  • For example, one version of a 2 may be given a probability of 10^-6 of being a 3 and 10^-9 of being a 7, whereas for another version it may be the other way around. This is valuable information that defines a rich similarity structure over the data (i.e. it says which 2’s look like 3’s and which look like 7’s).
  • Knowledge is transferred to the distilled model by training it on a transfer set and using a soft target distribution (T>1) for each case in the transfer set that is produced by using the cumbersome model with a high temperature in its softmax.
  • The same high temperature is used when training the distilled model, but after it has been trained it uses a temperature T of 1.
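To make the effect of the temperature concrete, here is a minimal sketch of a softmax with temperature in plain Python. The logits are made-up values for an image of a "2" scored against the classes (2, 3, 7), and the function name is only illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return q_i = exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum improves numerical stability
    # and does not change the resulting probabilities.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the classes (2, 3, 7); not taken from the paper.
logits = [10.0, 2.0, -1.0]

hard = softmax_with_temperature(logits, temperature=1.0)  # close to one-hot
soft = softmax_with_temperature(logits, temperature=5.0)  # softer distribution

print("T=1:", [round(p, 6) for p in hard])
print("T=5:", [round(p, 6) for p in soft])
```

At T=1 almost all of the probability mass sits on the correct class, so the relative similarity to "3" versus "7" is barely visible. At the higher temperature the ranking of the classes is unchanged, but the small probabilities become large enough to carry a useful training signal.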

1.2. The Calculation of Gradients

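  • Each case in the transfer set contributes a cross-entropy gradient, dC/dzi, with respect to each logit, zi, of the distilled model.
  • If the cumbersome model has logits vi which produce soft target probabilities pi, and the transfer training is done at a temperature of T, the gradient is given by:
    $$\frac{\partial C}{\partial z_i} = \frac{1}{T}\left(q_i - p_i\right) = \frac{1}{T}\left(\frac{e^{z_i/T}}{\sum_j e^{z_j/T}} - \frac{e^{v_i/T}}{\sum_j e^{v_j/T}}\right)$$
  • If the temperature is high compared with the magnitude of the logits, it can be approximated as:
    $$\frac{\partial C}{\partial z_i} \approx \frac{1}{T}\left(\frac{1 + z_i/T}{N + \sum_j z_j/T} - \frac{1 + v_i/T}{N + \sum_j v_j/T}\right)$$
    where N is the number of classes.
  • Assuming that the logits z and v have been zero-meaned:
    $$\sum_j z_j = \sum_j v_j = 0$$
  • The gradient can be further simplified as:
    $$\frac{\partial C}{\partial z_i} \approx \frac{1}{N T^2}\left(z_i - v_i\right)$$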
  • It is later found that when the distilled model is much too small to capture all of the knowledge in the cumbersome model, intermediate temperatures work best.
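As a numerical sanity check on the high-temperature approximation, the sketch below compares the exact gradient, (qi − pi)/T, with the simplified form, (zi − vi)/(N T^2), where N is the number of classes. The logits are made-up, zero-mean values chosen only for illustration.

```python
import math

def softmax(logits, T):
    """Softmax with temperature T."""
    largest = max(logits)
    exps = [math.exp((x - largest) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def exact_gradient(z, v, T):
    """dC/dz_i = (q_i - p_i) / T, with q from student logits z and p from teacher logits v."""
    q = softmax(z, T)
    p = softmax(v, T)
    return [(qi - pi) / T for qi, pi in zip(q, p)]

def high_temperature_gradient(z, v, T):
    """Approximation (z_i - v_i) / (N * T^2), valid for zero-mean logits and large T."""
    n = len(z)
    return [(zi - vi) / (n * T * T) for zi, vi in zip(z, v)]

def max_relative_error(z, v, T):
    """Largest relative gap between the exact and approximate gradients."""
    exact = exact_gradient(z, v, T)
    approx = high_temperature_gradient(z, v, T)
    return max(abs(e - a) / abs(e) for e, a in zip(exact, approx))

# Hypothetical zero-mean logits: z for the distilled model, v for the cumbersome model.
z = [2.0, -1.0, 0.5, -1.5]
v = [3.0, -2.0, 1.0, -2.0]

for T in (1.0, 10.0, 100.0):
    print(f"T={T:>5}: max relative error = {max_relative_error(z, v, T):.4f}")
```

The gap shrinks as T grows relative to the logits, which is the regime in which distillation behaves like matching the logits of the cumbersome model directly.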

2. Experimental Results

2.1. MNIST

  • A single large neural net with two hidden layers of 1200 rectified linear hidden units is trained on all 60,000 training cases. Dropout is used. This net achieved 67 test errors.
  • A smaller net with two hidden layers of 800 rectified linear hidden units and no regularization achieved 146 errors.
  • If the smaller net was regularized solely by adding the additional task of matching the soft targets produced by the large net at a temperature of 20, it achieved 74 test errors.
  • When the distilled net had 300 or more units in each of its two hidden layers, all temperatures above 8 gave fairly similar results. But when this was radically reduced to 30 units per layer, temperatures in the range 2.5 to 4 worked significantly better than higher or lower temperatures.
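The idea of adding the soft-target task to the usual training objective can be sketched as a weighted sum of two cross-entropy terms. This is a minimal illustration rather than the authors' training code: the weight `alpha` and the example logits are assumptions, and the T^2 factor follows the paper's recommendation for keeping the two terms on a comparable scale when the temperature changes.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T."""
    largest = max(logits)
    exps = [math.exp((x - largest) / T) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(student_logits, teacher_logits, true_label, T=20.0, alpha=0.5):
    """Weighted sum of a soft-target term (at temperature T) and a hard-target term (at T=1).

    alpha is a hypothetical mixing weight, not a value taken from the paper.
    """
    soft_targets = softmax(teacher_logits, T)
    soft_predictions = softmax(student_logits, T)
    soft_loss = cross_entropy(soft_targets, soft_predictions)

    hard_predictions = softmax(student_logits, 1.0)
    hard_loss = -math.log(hard_predictions[true_label])

    # Soft-target gradients scale as 1/T^2, so the soft term is multiplied by T^2.
    return alpha * T * T * soft_loss + (1.0 - alpha) * hard_loss

# Made-up logits for a 3-class problem whose correct class is 0.
teacher_logits = [5.0, 1.0, -2.0]
loss_match = distillation_loss(teacher_logits, teacher_logits, true_label=0)
loss_mismatch = distillation_loss([-2.0, 1.0, 5.0], teacher_logits, true_label=0)
print(loss_match < loss_mismatch)
```

Note that the soft-target term does not reach zero even when the student matches the teacher exactly, because the cross-entropy then equals the entropy of the teacher's soft distribution.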

2.2. Speech Recognition

Frame classification accuracy and Word Error Rate (WER)
  • An architecture with 8 hidden layers each containing 2560 rectified linear units and a final softmax layer with 14,000 labels (HMM targets ht) is used.
  • The input is 26 frames of 40 Mel-scaled filterbank coefficients with a 10ms advance per frame, and the network predicts the HMM state of the 21st frame.
  • The total number of parameters is about 85M.
  • To train the DNN acoustic model we use about 2000 hours of spoken English data, which yields about 700M training examples. This system achieves a frame accuracy of 58.9%, and a Word Error Rate (WER) of 10.9% on our development set.
  • The ensemble gives a smaller improvement on the ultimate objective of WER (on a 23K-word test set) due to the mismatch in the objective function, but again, the improvement in WER achieved by the ensemble is transferred to the distilled model.
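The "about 85M" parameters and "about 700M" training examples quoted above can be checked with simple arithmetic. The sketch assumes fully connected layers with biases and one training example per 10 ms frame; both are assumptions made for this check, not details confirmed in the text.

```python
# Architecture figures from the text.
input_dim = 26 * 40          # 26 frames of 40 filterbank coefficients
hidden_layers = 8
hidden_units = 2560
output_labels = 14_000

# Assumption: plain fully connected layers, each with a bias vector.
layer_sizes = [input_dim] + [hidden_units] * hidden_layers + [output_labels]
weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
biases = sum(layer_sizes[1:])
total_parameters = weights + biases

# Assumption: one training example per 10 ms frame advance.
hours = 2000
frames_per_second = 1000 // 10
training_frames = hours * 3600 * frames_per_second

print(f"parameters: {total_parameters:,}")      # 84,412,080
print(f"training frames: {training_frames:,}")  # 720,000,000
```

Both results are consistent with the rounded figures in the text.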

2.3. JFT

Classification accuracy (top 1) on the JFT development set
  • JFT is an internal Google dataset that has 100 million labeled images with 15,000 labels.
  • Training AlexNet on JFT takes about 6 months, so waiting several years to train an ensemble of such models was not an option.
  • One way to reduce the cost of the ensemble is to use “specialist” models, each of which is trained on data that is highly enriched in examples from a very confusable subset of the classes (like different types of mushroom).
  • 61 specialist models are trained, each with 300 classes.
  • At test time we can use the predictions from the generalist model to decide which specialists are relevant and only these specialists need to be run.
  • Please feel free to read the paper for more details.
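The routing step described above, in which the generalist's predictions decide which specialists to run, can be sketched as follows. This is a simplified illustration: the function name, the `top_n` parameter, the class subsets, and the probabilities are all hypothetical, and the step that combines the specialists' outputs with the generalist's is not shown.

```python
def select_specialists(generalist_probs, specialist_classes, top_n=1):
    """Return the names of specialists whose class subset overlaps
    the top_n most probable classes of the generalist model."""
    ranked = sorted(
        range(len(generalist_probs)),
        key=lambda c: generalist_probs[c],
        reverse=True,
    )
    top_classes = set(ranked[:top_n])
    return [
        name
        for name, classes in specialist_classes.items()
        if top_classes & classes
    ]

# Hypothetical specialists, each covering a confusable subset of class indices.
specialists = {
    "mushrooms": {0, 1, 2},
    "cars": {3, 4},
    "bridges": {5, 6},
}

# Made-up generalist output over 7 classes.
probs = [0.05, 0.60, 0.10, 0.15, 0.04, 0.03, 0.03]

print(select_specialists(probs, specialists, top_n=1))  # ['mushrooms']
print(select_specialists(probs, specialists, top_n=3))  # ['mushrooms', 'cars']
```

Only the selected specialists need to be evaluated for a given image, which keeps the test-time cost well below that of running all 61 models.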

2.4. Soft Targets as Regularizers

Frame classification accuracy and Word Error Rate (WER)
  • A lot of helpful information can be carried in soft targets that could not possibly be encoded with a single hard target.
  • With only 3% of the data (about 20M examples), training the baseline model with hard targets leads to severe overfitting.
  • The soft targets, by contrast, are produced by a model that was trained on the full training set.


