Knowledge Distillation tool for binary classification

Amine Kherchouche · Published in CodeX · 3 min read · Aug 27, 2021

Introduction

Knowledge distillation (KD) [1] is a method for reducing a model's size while preserving its performance. It transfers the knowledge of a large pre-trained model, known as the teacher, to a new, smaller model, the student. Here we focus on minimizing a custom loss function to solve a binary classification problem, rather than on matching the teacher's softened logits together with the ground-truth labels.

Use case

Real-time applications in the web browser, for example, need to render quickly to keep the browsing experience smooth; KD is useful here because it reduces the number of floating-point operations (G-FLOPs) the model performs. We can improve the performance of almost any machine learning algorithm by training many different models on the same data and averaging their predictions. Unfortunately, making predictions with a whole ensemble of models is cumbersome and may be too computationally expensive to deploy to a large number of users. Distilling the ensemble into a single KD model can save a large amount of resources without any noticeable drop in performance.

Backpropagation

Here we discuss the loss function used for backpropagation, assuming a binary classification task, for example classifying cats and dogs:

Example of cats and dogs binary classification

To calculate the loss, we run the input samples through both the teacher and the student models; their predictions are then used to compute the student loss and the distillation loss.

Knowledge distillation process

The student loss is the binary cross-entropy between the student's predictions and the ground truth, whereas the distillation loss is the binary cross-entropy between the teacher's and the student's predictions. The general formula of the binary cross-entropy is as follows:

L = -[ y·log(ŷ) + (1 - y)·log(1 - ŷ) ]
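
For concreteness, here is a minimal NumPy sketch of this formula, averaged over a small batch; the example labels and predictions are hypothetical:

import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    # Binary cross-entropy averaged over a batch; eps avoids log(0).
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))

y_true = np.array([1.0, 0.0, 1.0, 0.0])   # ground-truth labels
y_pred = np.array([0.9, 0.2, 0.7, 0.1])   # sigmoid outputs of a model
print(binary_cross_entropy(y_true, y_pred))  # ~0.198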

Hence, by combining these two losses we obtain a loss function f that reflects our task (i.e. binary classification). The following equation shows how it is calculated:

f = α · student_loss + (1 - α) · distillation_loss

The parameter α can take any value in the interval [0, 1]. Since 0.5 is not always the best choice, we should try a set of values in this interval and keep the one that minimizes our loss function best.
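
As an illustration, the following is a minimal training-step sketch in TensorFlow/Keras that combines the two losses exactly as in the equation above. The teacher and student models, the Adam optimizer and the value alpha=0.3 are hypothetical assumptions, not taken from the article:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(x, y_true, teacher, student, alpha=0.3):
    # One backpropagation step on the combined loss
    # f = alpha * student_loss + (1 - alpha) * distillation_loss.
    with tf.GradientTape() as tape:
        teacher_pred = teacher(x, training=False)   # teacher predictions (inference mode)
        student_pred = student(x, training=True)    # student predictions

        student_loss = bce(y_true, student_pred)              # vs. ground truth
        distillation_loss = bce(teacher_pred, student_pred)   # vs. teacher predictions
        loss = alpha * student_loss + (1.0 - alpha) * distillation_loss

    # Only the student's weights are updated; the teacher stays fixed.
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss

In practice, this step would be run over the training data for several candidate values of α (e.g. 0.1, 0.3, 0.5, 0.7, 0.9), keeping the student that gives the best validation accuracy.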

By using this custom loss function we are likely to obtain promising results: a much smaller model that maintains its performance. To go a little further and boost the results, a data-centric approach [3] is recommended; it consists of rigorously curating the datasets (e.g. enhancing samples, removing blurred images, etc.) to improve the accuracy of a fixed model.

Conclusion

In this article we highlighted an important tool, knowledge distillation, and showed how to use it for a binary classification task based on a custom loss function. In this link you can find the GitHub repository of the knowledge distillation code.

Finally, big thanks to the researchers who brought this tool to life [1].

References

[1]: Geoffrey Hinton et al. (2015) Distilling the Knowledge in a Neural Network. https://arxiv.org/abs/1503.02531

[2]: Keras Knowledge Distillation. https://keras.io/examples/vision/knowledge_distillation/

[3]: From Model-centric to Data-centric Artificial Intelligence. https://towardsdatascience.com/from-model-centric-to-data-centric-artificial-intelligence-77e423f3f593
