Llama 3.1 4B with NVIDIA Minitron

A new way to compress LLMs

Benjamin Marie
2 min read · Aug 17, 2024

Two weeks ago, I reviewed Minitron in the Weekly Salt (free article).

NVIDIA’s method reduces the size of LLMs by pruning attention heads and shrinking both the hidden size and the MLP intermediate dimension (width pruning). Minitron can also prune entire layers (depth pruning), but NVIDIA found that depth pruning hurts the model’s performance more than width pruning.
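To make the width-pruning idea concrete, here is a minimal PyTorch sketch that prunes the MLP intermediate dimension of one Llama-style block using activation-based importance scores. The helper, its scoring rule, and the calibration tensor are illustrative assumptions, not NVIDIA's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prune_mlp_intermediate(gate_proj: nn.Linear, up_proj: nn.Linear,
                           down_proj: nn.Linear, calib_hidden: torch.Tensor,
                           new_dim: int):
    """Keep the `new_dim` most important intermediate channels of a Llama MLP.

    calib_hidden: calibration activations of shape (num_tokens, hidden_size).
    """
    with torch.no_grad():
        # Importance of each intermediate channel, estimated from the magnitude
        # of its activation on the calibration set (an assumption; Minitron
        # aggregates activation-based scores, details are in the paper).
        acts = F.silu(calib_hidden @ gate_proj.weight.T) * (calib_hidden @ up_proj.weight.T)
        importance = acts.abs().mean(dim=0)              # (intermediate_size,)
        keep = importance.topk(new_dim).indices.sort().values

        # Slice the three projections along the intermediate dimension.
        gate_proj.weight = nn.Parameter(gate_proj.weight[keep, :])
        up_proj.weight = nn.Parameter(up_proj.weight[keep, :])
        down_proj.weight = nn.Parameter(down_proj.weight[:, keep])
        gate_proj.out_features = up_proj.out_features = new_dim
        down_proj.in_features = new_dim
    return keep
```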

NVIDIA applied their recipe to Llama 3.1 8B to create a 4B version.

To obtain this 4B parameter model, they pruned the MLP intermediate dimension from 14336 to 9216 and the hidden size from 4096 to 3072.
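As a rough sanity check on those numbers, the back-of-envelope count below assumes the rest of the Llama 3.1 8B architecture is left unchanged (32 layers, 32 query heads and 8 KV heads with a head dimension of 128, a 128,256-token vocabulary, untied embeddings) and ignores normalization weights.

```python
def llama_params(hidden, intermediate, layers=32, q_heads=32, kv_heads=8,
                 head_dim=128, vocab=128_256):
    attn = hidden * (q_heads * head_dim)          # q_proj
    attn += 2 * hidden * (kv_heads * head_dim)    # k_proj, v_proj
    attn += (q_heads * head_dim) * hidden         # o_proj
    mlp = 3 * hidden * intermediate               # gate, up, down projections
    embeddings = 2 * vocab * hidden               # input embeddings + lm_head
    return layers * (attn + mlp) + embeddings

print(f"8B: {llama_params(4096, 14336) / 1e9:.2f}B parameters")
print(f"4B: {llama_params(3072, 9216) / 1e9:.2f}B parameters")
# ~8.0B vs ~4.5B (norm weights and rotary buffers omitted)
```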

Then, they trained the resulting model with knowledge distillation, using Llama 3.1 8B as the teacher.
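A common way to implement this kind of logit distillation is a forward KL loss between the teacher's and the student's output distributions, as in the sketch below; this is one standard formulation, not necessarily NVIDIA's exact training objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL(teacher || student) on logits flattened to (num_tokens, vocab_size)."""
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")
    return kl * temperature ** 2

# Usage inside a training loop (the teacher runs without gradients):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits.flatten(0, 1),
#                          teacher_logits.flatten(0, 1))
```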


They released the model on Hugging Face.

They also benchmarked this 4B model.
