Llama 3.1 4B with NVIDIA Minitron

A new way to compress LLMs

Benjamin Marie
2 min read · Aug 17, 2024

Two weeks ago, I reviewed Minitron in the Weekly Salt (free article).

NVIDIA’s method reduces the size of LLMs by pruning attention heads and shrinking both the hidden size and the MLP intermediate dimension (width pruning). Minitron can also prune entire layers (depth pruning), but NVIDIA found that depth pruning hurts the model’s performance more than width pruning.
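To make the width-pruning idea concrete, here is a minimal PyTorch sketch that prunes the MLP intermediate dimension of one Llama-style block using activation-based importance scores. The helper, its scoring rule, and the calibration tensor are illustrative assumptions, not NVIDIA's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def prune_mlp_intermediate(gate_proj: nn.Linear, up_proj: nn.Linear,
                           down_proj: nn.Linear, calib_hidden: torch.Tensor,
                           new_dim: int):
    """Keep the `new_dim` most important intermediate channels of a Llama MLP.

    calib_hidden: calibration activations of shape (num_tokens, hidden_size).
    """
    with torch.no_grad():
        # Importance of each intermediate channel, estimated from the magnitude
        # of its activation on the calibration set (an assumption; Minitron
        # aggregates activation-based scores, details are in the paper).
        acts = F.silu(calib_hidden @ gate_proj.weight.T) * (calib_hidden @ up_proj.weight.T)
        importance = acts.abs().mean(dim=0)              # (intermediate_size,)
        keep = importance.topk(new_dim).indices.sort().values

        # Slice the three projections along the intermediate dimension.
        gate_proj.weight = nn.Parameter(gate_proj.weight[keep, :])
        up_proj.weight = nn.Parameter(up_proj.weight[keep, :])
        down_proj.weight = nn.Parameter(down_proj.weight[:, keep])
        gate_proj.out_features = up_proj.out_features = new_dim
        down_proj.in_features = new_dim
    return keep
```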

NVIDIA applied their recipe to Llama 3.1 8B to create a 4B version.

To obtain this 4B parameter model, they pruned the MLP intermediate dimension from 14336 to 9216 and the hidden size from 4096 to 3072.
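As a rough sanity check on those numbers, the back-of-envelope count below assumes the rest of the Llama 3.1 8B architecture is left unchanged (32 layers, 32 query heads and 8 KV heads with a head dimension of 128, a 128,256-token vocabulary, untied embeddings) and ignores normalization weights.

```python
def llama_params(hidden, intermediate, layers=32, q_heads=32, kv_heads=8,
                 head_dim=128, vocab=128_256):
    attn = hidden * (q_heads * head_dim)          # q_proj
    attn += 2 * hidden * (kv_heads * head_dim)    # k_proj, v_proj
    attn += (q_heads * head_dim) * hidden         # o_proj
    mlp = 3 * hidden * intermediate               # gate, up, down projections
    embeddings = 2 * vocab * hidden               # input embeddings + lm_head
    return layers * (attn + mlp) + embeddings

print(f"8B: {llama_params(4096, 14336) / 1e9:.2f}B parameters")
print(f"4B: {llama_params(3072, 9216) / 1e9:.2f}B parameters")
# ~8.0B vs ~4.5B (norm weights and rotary buffers omitted)
```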

Then, they trained the resulting model with knowledge distillation, using Llama 3.1 8B as the teacher.
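A common way to implement this kind of logit distillation is a forward KL loss between the teacher's and the student's output distributions, as in the sketch below; this is one standard formulation, not necessarily NVIDIA's exact training objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL(teacher || student) on logits flattened to (num_tokens, vocab_size)."""
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    kl = F.kl_div(s_logprobs, t_logprobs, log_target=True, reduction="batchmean")
    return kl * temperature ** 2

# Usage inside a training loop (the teacher runs without gradients):
# with torch.no_grad():
#     teacher_logits = teacher(input_ids).logits
# student_logits = student(input_ids).logits
# loss = distillation_loss(student_logits.flatten(0, 1),
#                          teacher_logits.flatten(0, 1))
```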


They released the model on Hugging Face.

They also benchmarked this 4B model.
