Making Large Language Models Leaner
How Pruning and Distillation Are Reshaping AI Accessibility
One of the main obstacles to improving the performance of foundation large language models (LLMs) is that they are expensive to train. Methods such as fine-tuning and knowledge distillation are becoming popular ways to address this limitation. Another option is to reduce the size of a model by selectively removing parts of it while minimizing the impact on performance.
In “Compact Language Models via Pruning and Knowledge Distillation” (Muralidharan et al., 2024), the authors investigate whether pruning an existing LLM and then retraining it with a fraction of the original training data can be a suitable alternative to repeated, full retraining. They propose a guide for reducing the size of (compressing) LLMs that combines depth, width, attention, and MLP pruning with knowledge distillation. They use this guide to compress the Nemotron-4 family of LLMs by a factor of 2–4× and compare the compressed models' performance to that of similarly sized models on a variety of language modeling tasks. The criteria used to compress the model are based on parameters that depend on the particular task.
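To make the two ingredients concrete, here is a minimal, pure-Python sketch of width pruning followed by a distillation objective. It is an illustration under stated assumptions, not the authors' implementation: the function names `prune_neurons` and `distillation_loss` are hypothetical, weight magnitude (row L2 norm) stands in for whatever importance criterion the paper actually uses, and the toy weight matrix and logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Assumption: a plain KL term on output logits; real recipes may add
    other loss terms or match intermediate states as well.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def prune_neurons(weights, keep):
    """Width pruning: keep the `keep` rows (neurons) with the largest L2 norm.

    Assumption: weight magnitude is used as a stand-in importance score.
    Returns the pruned matrix and the indices of the surviving rows.
    """
    norms = [math.sqrt(sum(w * w for w in row)) for row in weights]
    ranked = sorted(range(len(weights)), key=lambda i: norms[i], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original neuron order
    return [weights[i] for i in kept], kept

# Toy layer with four neurons; two of them have near-zero weights.
weights = [[0.9, -1.2], [0.01, 0.02], [0.5, 0.4], [-0.03, 0.0]]
pruned, kept = prune_neurons(weights, keep=2)  # kept == [0, 2]

# The pruned (student) model is then retrained to match the original (teacher).
loss = distillation_loss([2.0, 0.5, -1.0], [0.1, 0.2, 0.3])
```

In this sketch the loss is zero when the student reproduces the teacher's logits exactly and grows as the two output distributions diverge, which is the signal the retraining step minimizes.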
As the capabilities of LLMs continue to grow with their number of parameters, developing and improving such models is increasingly limited to research groups that can afford the rising costs. Approaches like this one are important because they make the development and improvement of LLMs for domain-specific use cases more accessible.

