Sparse Llama by Neural Magic, Cerebras and IST Austria

Sparse Llama: 70% Smaller, 3x Faster, Full Accuracy

Aakash Varma
3 min read · May 16, 2024
source: https://huggingface.co/papers/2405.03594

Introduction

Large language models (LLMs) are complex because of the high dimensionality of language data. Pruning, a technique that shrinks a model by removing weights, disrupts the delicate relationships between those weights. This leads to significant accuracy loss, especially on tasks like chat and coding.

Neural Magic, Cerebras Systems, and IST Austria came up with a new approach, "sparse fine-tuning" of sparse pretrained models, which overcomes the limitations of previous pruning techniques.

The Recipe:

(Figure: the Sparse Llama recipe. Source: Introducing Sparse Llama: 70% Smaller, 3x Faster, Full Accuracy — Cerebras)

Step 1: Sparse Pretraining

They use a technique called SparseGPT to remove unimportant weights from the model. For 50% sparsity, half of the weights are pruned, uniformly across all layers.
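SparseGPT itself decides which weights to drop by solving a layer-wise reconstruction problem with second-order information, which is beyond a short snippet. The sketch below only illustrates the resulting layout, the same fraction of weights zeroed in every linear layer, using plain magnitude pruning as a stand-in for SparseGPT's criterion; none of this is the authors' code.

```python
import torch
import torch.nn as nn

def prune_uniform(model: nn.Module, sparsity: float = 0.5) -> dict:
    """Zero out the smallest-magnitude weights in every Linear layer.

    Simplified stand-in for SparseGPT (which uses a Hessian-based
    reconstruction objective instead of magnitude). The key property
    shown here is the uniform per-layer sparsity level.
    """
    masks = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight.data
            k = int(sparsity * w.numel())
            # Threshold = k-th smallest absolute value in this layer.
            threshold = w.abs().flatten().kthvalue(k).values
            mask = w.abs() > threshold     # True = weight is kept
            w.mul_(mask)                   # remove "unimportant" weights
            masks[name] = mask             # remember what was pruned
    return masks
```

The returned masks record which positions were pruned; the later training steps reuse them to keep those weights at zero.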

To reach 70% sparsity, they do it in two steps:
A. Train the 50%-sparse model first.
B. Prune additional weights (to reach 70% total sparsity) and then continue training while keeping the removed weights inactive.

They freeze the pruned weights and enforce this sparsity mask throughout training to maintain the reduced size.
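The authors' actual runs rely on Cerebras hardware with native support for unstructured sparsity, but the core idea is easy to sketch in PyTorch: after every optimizer step, re-apply the fixed masks so pruned weights stay at zero. The snippet below assumes a Hugging Face-style causal LM whose batch already contains labels, and reuses the masks produced at pruning time (e.g. by the sketch above); it is an illustration, not the authors' implementation.

```python
import torch

def sparse_training_step(model, masks, batch, optimizer):
    """One training step that keeps pruned weights frozen at zero.

    `masks` maps module names to boolean tensors produced at pruning
    time. Pruned positions are zeroed again after the weight update,
    so the sparsity pattern is maintained throughout pretraining.
    """
    optimizer.zero_grad()
    loss = model(**batch).loss          # standard causal-LM loss
    loss.backward()
    optimizer.step()

    with torch.no_grad():
        for name, module in model.named_modules():
            if name in masks:
                module.weight.mul_(masks[name])  # re-apply fixed mask
    return loss.item()
```

Going from 50% to 70% then amounts to re-running the pruning step on the already-sparse checkpoint with the higher target and continuing with the same loop.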

Dataset mix:

The authors chose the SlimPajama dataset and the Python subset of The Stack for pretraining their sparse models because both datasets are heavily filtered, which ensures a high proportion of high-quality tokens in the training data.
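The paper does not prescribe a data-loading recipe, but one plausible way to assemble such a mixture with the Hugging Face `datasets` library is sketched below; the 80/20 mixing probabilities are purely illustrative and not the authors' ratios.

```python
from datasets import load_dataset, interleave_datasets

# Stream both corpora so nothing has to be fully downloaded up front.
slimpajama = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
stack_python = load_dataset(
    "bigcode/the-stack", data_dir="data/python", split="train", streaming=True
)

# Illustrative mixing ratio -- the paper's exact proportions may differ.
pretraining_mix = interleave_datasets(
    [slimpajama, stack_python], probabilities=[0.8, 0.2], seed=42
)

for example in pretraining_mix.take(3):
    print(list(example.keys()))
```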

Step 2: Sparse Fine-tuning

The authors built upon the SquareHead distillation approach, integrating it with their sparse pretrained models to enable sparse fine-tuning with per-layer distillation. This combination was crucial for the models to achieve high accuracy on complex tasks, especially at higher sparsity levels. At lower sparsity levels, the combination of sparse models and distillation even recovered higher accuracy than the baseline on simpler tasks.
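SquareHead distillation pairs the usual label and logit losses with a normalized per-layer feature-matching term between the dense teacher and the sparse student. A minimal sketch of that per-layer term is below, assuming both models are run with output_hidden_states=True; the overall loss weighting used by the authors is not shown here.

```python
import torch
import torch.nn.functional as F

def squarehead_layer_loss(student_hidden, teacher_hidden, eps=1e-6):
    """Normalized MSE between student and teacher hidden states,
    summed over layers (the per-layer term in SquareHead distillation).

    student_hidden / teacher_hidden: lists of [batch, seq, dim] tensors,
    one per transformer layer.
    """
    loss = 0.0
    for s, t in zip(student_hidden, teacher_hidden):
        # Dividing by the teacher's own norm keeps each layer's
        # contribution on a comparable scale.
        loss = loss + F.mse_loss(s, t) / (F.mse_loss(t, torch.zeros_like(t)) + eps)
    return loss
```

In the full objective this term is added to the standard cross-entropy loss and a KL divergence between teacher and student logits.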

Results

Sparse Llama-2 7B Performance from Sparse Pretraining

  • 50% sparsity: sparse pretraining recovers 96.1% of the Llama Evaluation Metrics.
  • 70% sparsity: sparse pretraining still recovers a notable 91.8% of the Llama Evaluation Metrics.

Sparse Llama-2 7B Performance from Sparse Fine-tuning

Sparse pretrained models consistently came close to full recovery at both 50% and 70% sparsity across fine-tuning tasks, matching or surpassing the accuracy of models that were instead pruned during fine-tuning.
