Model Compression: an Introduction to Teacher-Student Knowledge Distillation

Alessandro Lamberti · Published in Artificialis · 5 min read · Oct 28, 2022



There’s a common misconception that bigger models are always better. GPT-3, for instance, was trained on 570 GB of text and has 175 billion parameters.

However, whilst training large models helps push state-of-the-art performance, deploying such cumbersome models, especially on edge devices, is neither straightforward nor always feasible.

The result is often a very large deep learning model that, despite achieving excellent accuracy on the validation set, fails to meet the latency, memory-footprint and overall performance requirements at inference time.

Knowledge distillation is one of many techniques that can help overcome these challenges: the knowledge of a large, complex model (the teacher) is ‘distilled’ into a smaller model (the student) that is much easier to deploy, with little loss in metrics on the validation data.
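To make the idea concrete, here is a minimal sketch in PyTorch of a single teacher-student training step. The tiny models, the temperature `T` and the weighting `alpha` are illustrative assumptions, not a prescription; the core pattern is a frozen teacher providing softened targets and a student trained on a blend of distillation loss and ordinary cross-entropy.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical teacher/student pair: any two classifiers with matching output size work.
teacher = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 1024), nn.ReLU(), nn.Linear(1024, 10))
student = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 10))

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend cross-entropy on the hard labels with a KL term that pushes
    the student's softened predictions toward the teacher's."""
    # Soft targets: temperature-scaled softmax of the teacher's logits.
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    # The T**2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(soft_student, soft_targets, reduction="batchmean") * (T ** 2)
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term

# One illustrative training step on a dummy batch.
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
x, y = torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,))
with torch.no_grad():          # the teacher is frozen; only the student learns
    teacher_logits = teacher(x)
student_logits = student(x)
loss = distillation_loss(student_logits, teacher_logits, y)
loss.backward()
optimizer.step()
```

In practice the teacher would be a large pretrained network and the loop would run over a real dataloader; the loss function itself stays the same.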

If you’re interested in a practical example, check out the article below!
