Training Small Language Models on a Budget

Jaideep Ray
Published in Better ML
2 min read · Apr 20, 2024

Context:

Small language models (SLMs) are smaller versions of large language models (LLMs) with fewer parameters. SLMs typically have a few billion parameters (<10B), while LLMs are usually much larger.

The cost of training or finetuning SLMs is not small, though: LLaMA-7B is still roughly 60x larger than the standard BERT model (~120M parameters).

Below is a list of must-do experiments to reduce your training compute budget.

Trainer settings:

  • Mixed Precision Training: Mixed precision training on compatible hardware significantly speeds up training without sacrificing accuracy. For example, use fp16 / bf16 on A100 and fp8 on H100 (a minimal sketch follows this list).
  • Early Stopping and Pruning: Identify poorly performing configurations early (e.g., after 3 hours) and keep only the top 50% after 12 hours (based on validation loss) so compute is spent only on promising runs.
  • Periodic checkpointing: Save checkpoints periodically (every N steps) so that node failures or training instability do not cost you the compute already spent.
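
Here is a minimal PyTorch sketch of two of the trainer settings above: bf16 autocast for mixed precision and a checkpoint every save_every steps. The HuggingFace-style model (returning .loss), the dataloader, and the checkpoint path are placeholder assumptions, not part of the original post.

```python
import torch

def train(model, dataloader, optimizer, num_steps, save_every=1000,
          ckpt_path="ckpt.pt"):
    """Minimal loop: bf16 mixed precision + periodic checkpointing."""
    model.cuda().train()
    step = 0
    while step < num_steps:
        for batch in dataloader:
            batch = {k: v.cuda() for k, v in batch.items()}
            # Mixed precision: run the forward pass in bf16 where it is safe.
            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
                loss = model(**batch).loss  # assumes a HF-style model output
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            # Periodic checkpointing: survive node failures / instability.
            if step % save_every == 0:
                torch.save({"step": step,
                            "model": model.state_dict(),
                            "optimizer": optimizer.state_dict()}, ckpt_path)
            if step >= num_steps:
                break
```

With bf16 no gradient scaler is needed; fp16 would additionally require one to avoid underflow.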

Hyperparameters of the training algorithm:

  • Gradient Accumulation: Accumulating gradients across multiple micro-batches before updating weights reduces communication overhead and improves training speed (see the sketch after this list).
  • Linear Learning Rate Decay: A learning rate that linearly decreases from a peak to zero over the training duration proved effective.
  • Batch size: Batch size tuning is one of the most effective ways to increase GPU utilization and thereby reduce training time.
  • Vocabulary size: Increasing the vocabulary size can slow down training, since the model must learn more embeddings and the updates for rare tokens become sparser.
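
Below is a sketch of gradient accumulation combined with a linear learning-rate decay, assuming the same HuggingFace-style model as above; accum_steps and the peak learning rate are illustrative values, not recommendations from the post.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def train_with_accumulation(model, dataloader, num_steps,
                            accum_steps=8, peak_lr=3e-4):
    """Gradient accumulation + linear LR decay from peak_lr to zero."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    # Linear decay: the LR factor goes from 1.0 down to 0.0 over num_steps updates.
    scheduler = LambdaLR(optimizer, lambda step: max(0.0, 1 - step / num_steps))
    model.cuda().train()
    updates = 0
    for i, batch in enumerate(dataloader):
        batch = {k: v.cuda() for k, v in batch.items()}
        loss = model(**batch).loss / accum_steps  # scale so gradients average
        loss.backward()                           # gradients accumulate in .grad
        # Update weights only once per accum_steps micro-batches
        # (larger effective batch size, fewer optimizer steps).
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            updates += 1
            if updates >= num_steps:
                break
```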

Quick ablation tests: understand what really matters

Efficient LLM training calls for fail-fast, quick ablation tests to understand which optimizations actually impact model quality and training throughput. With enough ablations, you can converge on an efficient training recipe!
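
A toy sketch of such a fail-fast sweep, assuming a hypothetical run_trial(config, hours) helper that trains a configuration for a time budget and returns its validation loss; the 3-hour and 12-hour cutoffs mirror the early-stopping setting above.

```python
def ablation_sweep(configs, run_trial):
    """Fail-fast sweep: prune weak configs early, keep the top half longer."""
    # Stage 1: short runs to discard clearly bad configurations.
    stage1 = {name: run_trial(cfg, hours=3) for name, cfg in configs.items()}
    survivors = sorted(stage1, key=stage1.get)[: max(1, len(stage1) // 2)]
    # Stage 2: longer runs only for the top 50% by validation loss.
    stage2 = {name: run_trial(configs[name], hours=12) for name in survivors}
    return min(stage2, key=stage2.get)  # name of the best config
```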

Reference:

[1] Llama-2-7b model card: https://huggingface.co/meta-llama/Llama-2-7b

[2] How to Train BERT with an Academic Budget (Izsak et al., 2021)
