Shrinking the Giants: How Knowledge Distillation Is Changing the Landscape of Deep Learning Models

Zone24x7
5 min read · Aug 9, 2023


By Ishara Neranjana — Associate Machine Learning Engineer

Deep learning models have transformed how we think about artificial intelligence, but as they grow larger and more complex, they become harder to use in practical applications.

One of the most prominent pre-trained models is GPT-4, which was trained on vast amounts of text data and whose parameter count, while not officially disclosed, is widely estimated to exceed a trillion.

But what if we could combine the strength of large models with the portability of small ones to have the best of both worlds? One of the keys to opening this door is knowledge distillation.

What is knowledge distillation?

Knowledge distillation is a machine learning technique for reducing a model's size and complexity without significantly degrading its accuracy. It is especially valuable for shrinking large language models (LLMs) while maintaining, and sometimes even improving, accuracy.

LLMs are typically very large and computationally expensive, making them challenging to deploy locally. Knowledge distillation can create smaller, more efficient models that can be deployed on a broader range of devices.

This method is crucial in TinyML (machine learning on tiny devices), where model size and computational complexity are vital considerations. By shrinking these deep learning behemoths, knowledge distillation opens new possibilities in TinyML and in model optimization across application fields.

The generic framework for knowledge distillation (source)

The limited resources of small devices are one of TinyML’s biggest problems. These devices generally have little memory, modest processing speed, and limited battery life, which makes running large, complicated machine learning models challenging. To get around this restriction, a model’s size and computational complexity must be reduced while its accuracy is preserved. Knowledge distillation is one of the best methods for achieving this objective.

In knowledge distillation, a smaller model called the student model is trained to replicate the predictions of a bigger, more intricate model called the teacher model. The teacher model’s predictions are used as the target labels during the training of the student model. This lets the student benefit from the teacher model’s knowledge while keeping its own small size and low computational cost.

How knowledge distillation works

The basic process involves training the student model on a dataset while using the teacher model to generate soft targets, which are probability distributions over the output classes. These soft targets are used as target labels for training the student model, in addition to the one-hot encoded true labels. Training then minimizes the difference between the teacher’s soft targets and the student’s predictions.

The small model learns from the guidance of the large model (source)
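To make this concrete, here is a minimal PyTorch-style sketch of how a teacher’s soft targets might be produced. The teacher model, input shapes, and temperature value below are illustrative placeholders rather than part of any specific implementation; a temperature above 1 simply spreads the probability mass over more classes so the student can see which wrong answers the teacher considers plausible.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for a real pre-trained teacher and a data batch.
teacher = nn.Linear(784, 10)        # pretend teacher that outputs raw logits
inputs = torch.randn(32, 784)       # a batch of 32 flattened examples

temperature = 4.0                   # >1 softens the distribution over classes

with torch.no_grad():               # the teacher is never updated
    teacher_logits = teacher(inputs)                               # (32, 10)
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)  # (32, 10)
```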

This difference is frequently measured with the cross-entropy loss, which gauges the disparity between two probability distributions. The soft targets are given a lower weight in the loss function than the true labels to keep the student model from simply mimicking the teacher model.
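One common way to combine the two terms is to add a hard-label cross-entropy loss to a temperature-softened divergence between the student’s and teacher’s distributions. The sketch below assumes PyTorch; the `alpha` and `temperature` values are illustrative, and the squared-temperature factor is the usual correction that keeps gradient magnitudes comparable across temperatures.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of the hard-label loss and the soft-target loss.

    alpha weights the true labels; (1 - alpha) weights the teacher's
    soft targets, matching the weighting described above.
    """
    # Standard cross-entropy against the one-hot true labels.
    hard_loss = F.cross_entropy(student_logits, labels)

    # Divergence between the temperature-softened distributions.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * hard_loss + (1 - alpha) * soft_loss

# Toy usage with random logits and labels.
student_logits = torch.randn(32, 10)
teacher_logits = torch.randn(32, 10)
labels = torch.randint(0, 10, (32,))
loss = distillation_loss(student_logits, teacher_logits, labels)
```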

There are three types of knowledge distillation techniques:

Offline distillation

This is a knowledge transfer technique where the pre-trained teacher network remains frozen while the student network is trained. This method focuses on improving the knowledge transfer mechanism, with less attention given to the teacher network architecture. A probabilistic approach for knowledge transfer is proposed in this paper, matching the probability distribution of data in the feature space rather than their actual representation. This enables cross-modal knowledge transfer and the transfer of knowledge from handcrafted feature extractors into neural networks.
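The paper’s probabilistic feature-matching method is beyond a short snippet, but the basic offline setup it builds on (a frozen, pre-trained teacher guiding a trainable student) can be sketched as follows. The architectures, optimizer settings, and single toy batch are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Hypothetical small networks standing in for real architectures.
teacher = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
student = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 10))

teacher.eval()                          # the teacher stays frozen throughout
for p in teacher.parameters():
    p.requires_grad_(False)

optimizer = optim.Adam(student.parameters(), lr=1e-3)
temperature, alpha = 4.0, 0.7

# One training step on a toy batch; a real loop would iterate over a DataLoader.
inputs = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))

with torch.no_grad():
    teacher_logits = teacher(inputs)    # soft knowledge from the frozen teacher

student_logits = student(inputs)
soft_loss = F.kl_div(
    F.log_softmax(student_logits / temperature, dim=1),
    F.softmax(teacher_logits / temperature, dim=1),
    reduction="batchmean",
) * temperature ** 2
loss = alpha * F.cross_entropy(student_logits, labels) + (1 - alpha) * soft_loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```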

Online distillation

This addresses scenarios where a large pre-trained teacher model is not available. In contrast to offline distillation, the teacher and student networks are trained simultaneously in this approach. This paper presents an online mutual knowledge distillation method where sub-networks and a fusion module are learned through mutual teaching via knowledge distillation. The process involves fusing sub-network features, facilitating efficient knowledge transfer.
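The sub-network and fusion-module design from the paper is more involved, but the core idea of online distillation, where two networks are trained together and teach each other, can be sketched with a simplified mutual-learning setup. Everything below (architectures, the detached peer targets, the single toy batch) is an illustrative assumption, not the paper’s exact method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Two peer networks trained from scratch at the same time; neither is pre-trained.
net_a = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
net_b = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 10))
opt = optim.Adam(list(net_a.parameters()) + list(net_b.parameters()), lr=1e-3)

inputs = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))
T = 4.0

logits_a, logits_b = net_a(inputs), net_b(inputs)

def mutual_loss(own_logits, peer_logits):
    # Each network learns from the true labels and from its peer's softened,
    # detached predictions, so knowledge flows in both directions.
    ce = F.cross_entropy(own_logits, labels)
    kd = F.kl_div(
        F.log_softmax(own_logits / T, dim=1),
        F.softmax(peer_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * T ** 2
    return ce + kd

loss = mutual_loss(logits_a, logits_b) + mutual_loss(logits_b, logits_a)
opt.zero_grad()
loss.backward()
opt.step()
```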

Self-distillation

Conventional knowledge distillation faces challenges related to the selection of the teacher model and potential accuracy degradation in the student models. Self-distillation offers a solution in which the same network acts as both teacher and student. Attention-based shallow classifiers are attached after intermediate layers of the neural network, and the deeper classifiers serve as teachers during training, guiding the shallow classifiers through a divergence-based loss on their outputs and an L2 loss on their feature maps. In the inference phase, all additional shallow classifiers are dropped.
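A toy version of this idea is sketched below: one auxiliary (shallow) classifier is attached after an intermediate block, and the deepest classifier teaches it through a KL term on the softened outputs and an L2 term on the feature maps. The backbone, head placement, and loss weights are illustrative assumptions; the real method additionally uses attention-based heads and bottleneck layers to align features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfDistillNet(nn.Module):
    """Toy backbone with one auxiliary (shallow) classifier for illustration."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.block1 = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
        self.block2 = nn.Sequential(nn.Linear(128, 128), nn.ReLU())
        self.shallow_head = nn.Linear(128, num_classes)  # attached after block1
        self.deep_head = nn.Linear(128, num_classes)     # deepest "teacher" head

    def forward(self, x):
        f1 = self.block1(x)   # intermediate feature map
        f2 = self.block2(f1)  # deep feature map (same width in this toy model)
        return self.shallow_head(f1), self.deep_head(f2), f1, f2

model = SelfDistillNet()
inputs = torch.randn(32, 784)
labels = torch.randint(0, 10, (32,))
T = 3.0

shallow_logits, deep_logits, shallow_feat, deep_feat = model(inputs)

# The shallow head learns from the labels, from the deep head's softened
# predictions (divergence loss), and from the deep feature maps (L2 loss).
loss = (
    F.cross_entropy(deep_logits, labels)
    + F.cross_entropy(shallow_logits, labels)
    + F.kl_div(
        F.log_softmax(shallow_logits / T, dim=1),
        F.softmax(deep_logits.detach() / T, dim=1),
        reduction="batchmean",
    ) * T ** 2
    + F.mse_loss(shallow_feat, deep_feat.detach())
)
loss.backward()

# At inference time the shallow classifier is simply dropped.
with torch.no_grad():
    _, predictions, _, _ = model(inputs)
```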

Why knowledge distillation is important

One of the main benefits of knowledge distillation is its ability to transfer knowledge from a large, complex model to a smaller one. This is especially crucial for NLP applications built on large language models and for TinyML applications, where the constrained resources of small devices make running large models challenging. By transferring knowledge from a larger model, a much smaller model can attain comparable accuracy.

The potential to enhance the generalization of the student model is another benefit of knowledge distillation. The capacity of a model to perform well on unseen data is known as generalization. A model may frequently memorize the training data when trained on a large dataset, which may result in overfitting. By training the student model on the predictions of the teacher model rather than the training data, knowledge distillation helps to reduce overfitting.

In addition to shrinking a model's size and complexity, knowledge distillation can be used to enhance its performance on a particular task. For instance, knowledge can be transferred from a model trained on a large dataset to a model that only has a smaller dataset available.

Wrapping up

In conclusion, knowledge distillation emerges as a powerful technique capable of reducing the complexity and size of machine learning models without compromising accuracy. This is particularly crucial for two important domains: TinyML and NLP with Large Language Models (LLMs).

In the realm of TinyML, the resource limitations of small devices pose significant challenges when running large and complex models. However, knowledge distillation comes to the rescue by transferring knowledge from a larger model to a smaller one. This enables the smaller model to achieve equivalent accuracy, enhancing performance while significantly lowering computing complexity. As TinyML continues to evolve rapidly, knowledge distillation is poised to become an increasingly prevalent method for improving the functionality of machine learning models on portable devices.

For NLP with LLMs, the models are inherently large and computationally expensive, making their deployment in certain settings impractical. Knowledge distillation provides a remedy by enabling the creation of smaller, more efficient models that can be deployed on a wider range of devices. This is especially crucial for applications such as chatbots, question-answering systems, and natural language generation, which need to run seamlessly on mobile devices or embedded systems.


For more information on Data Science, Machine Learning, and AI solutions, reach out to our experts at Zone24x7.
