Schedule-Free Learning — A New Way to Train Models

Zain ul Abideen
5 min read · Apr 18, 2024


Training 3 Llama models to compare cosine-scheduled and schedule-free optimizers.

Introduction

Scheduled Learning

In the realm of machine learning, we continually rely on intricate algorithms and techniques to train our models effectively. Among them, the learning rate stands out as a pivotal factor influencing a model’s convergence and performance. Traditionally, we have turned to learning rate schedulers as trusted allies on the road to optimization. There are several types of learning rate schedulers, including step decay, exponential decay, and cosine annealing, and you have likely come across them before. To give you a little idea about these learning rate (lr) schedulers: they help speed up training and improve model generalization. By dynamically adjusting the learning rate, they can help the model escape local minima and converge to a better minimum.

Schedule-Free Learning

However, it’s time to explore a different path, one that frees us from the confines of learning rate schedulers: Schedule-Free Learning. With schedule-free optimizers, training is faster, and there is no need to specify the stopping time or number of steps in advance.

Recently, facebookresearch open-sourced their schedule-free optimizers. There are currently 2 primary implementations (a minimal usage sketch follows the list):

  1. SGDScheduleFree
  2. AdamWScheduleFree
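
To get a feel for the API, here is a minimal usage sketch of AdamWScheduleFree, assuming the schedulefree package from the facebookresearch repo is installed; the model, data, and hyperparameters below are placeholders rather than the setup used in the experiments later in this post. Note that the schedule-free optimizers keep separate training and averaged iterates, so the optimizer itself is switched between train and eval modes alongside the model.

```python
# Minimal usage sketch of AdamWScheduleFree (assumes `pip install schedulefree`).
# The model, data, and hyperparameters are placeholders for illustration only.
import torch
import schedulefree

model = torch.nn.Linear(128, 2)  # stand-in for a real model
optimizer = schedulefree.AdamWScheduleFree(model.parameters(), lr=6.0e-4)

model.train()
optimizer.train()  # use the training iterates for gradient steps
for _ in range(10):
    inputs = torch.randn(32, 128)
    targets = torch.randint(0, 2, (32,))
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    loss.backward()
    optimizer.step()

model.eval()
optimizer.eval()  # evaluate at the averaged iterates
```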

Approach

Schedule-Free learning replaces the momentum of an underlying optimizer with a combination of interpolation and averaging. In the case of gradient descent, the Schedule-free update is:

y_t = (1 - β) z_t + β x_t
z_{t+1} = z_t - γ ∇f(y_t)
x_{t+1} = (1 - c_{t+1}) x_t + c_{t+1} z_{t+1},   with c_{t+1} = 1/(t+1)

Here x is the sequence at which evaluations of test/val loss should occur, which differs from the primary iterates z and the gradient evaluation locations y. The updates to z correspond to the underlying optimizer, which in this case is a simple gradient step.
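
To make the recursion concrete, below is a toy sketch of the schedule-free gradient-descent update on a simple quadratic. The objective, beta, step size, and step count are illustrative assumptions, not values from the paper or from the experiments later in this post.

```python
# Toy illustration of the schedule-free SGD recursion on f(y) = 0.5 * ||y||^2.
# beta, the step size, and the number of steps are arbitrary illustrative values.
import numpy as np

def grad_f(y):
    return y  # gradient of 0.5 * ||y||^2

beta, step_size, num_steps = 0.9, 0.1, 100
z = np.array([5.0, -3.0])  # z: base iterates of the underlying optimizer
x = z.copy()               # x: averaged iterates used for test/val evaluation

for t in range(1, num_steps + 1):
    y = (1 - beta) * z + beta * x   # gradients are evaluated at y
    z = z - step_size * grad_f(y)   # underlying optimizer step (plain SGD)
    c = 1.0 / (t + 1)               # equal-weight running average coefficient
    x = (1 - c) * x + c * z         # x tracks the average of the z iterates

print("averaged iterate x:", x)     # evaluate the model at x, not z or y
```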

How Is Schedule-Free Learning Better?

As the name suggests, Schedule-Free learning does not require a decreasing learning rate schedule, yet it typically out-performs, or at worst matches, SOTA schedules such as cosine decay and linear decay. Only two sequences need to be stored at a time (the third, y, can be recomputed from x and z on the fly via the interpolation step above), so this method has the same memory requirements as the base optimizer (parameter buffer + momentum).

Constraints

Here are some constraints to keep in mind when using schedule-free learning to get better results (a configuration sketch reflecting these tips follows the list).

  • There is no need to use a learning rate scheduler; however, the code is compatible with one.
  • Using learning rate warmup is recommended.
  • This method does require tuning: it won’t necessarily out-perform a scheduled approach without also tuning regularization and learning rate parameters.
  • For SGD, a learning rate 10x-50x larger than classical rates seems to be a good starting point.
  • For AdamW, learning rates in the range 1x-10x larger than with schedule-based approaches seem to work.
  • Training is more sensitive to the choice of betas than you may expect from standard momentum. The default of 0.9 works on most problems, but it may be necessary to increase the value to 0.95 or 0.98, particularly for very long training runs.
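
Putting these tips together, a schedule-free AdamW configuration might look like the sketch below. The values are assumptions for a hypothetical run, chosen only to reflect the guidance above (a learning rate 1x-10x larger than a scheduled baseline, warmup enabled, and a higher beta1 for long runs); they are not the settings used in the experiments that follow.

```python
# Hypothetical AdamWScheduleFree configuration reflecting the tips above.
# All values here are illustrative assumptions, not the article's exact settings.
import torch
import schedulefree

model = torch.nn.Linear(128, 2)  # stand-in for a real model

optimizer = schedulefree.AdamWScheduleFree(
    model.parameters(),
    lr=1.0e-3,            # roughly 1x-10x a typical scheduled AdamW learning rate
    betas=(0.95, 0.999),  # raise beta1 toward 0.95-0.98 for very long runs
    weight_decay=0.1,
    warmup_steps=2000,    # learning rate warmup is still recommended
)
```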

Experiments

I have performed 3 experiments to verify the claims of schedule-free learning. Two experiments were performed with the schedule-free optimizer but with different learning rates. In the third experiment, I compared the performance of schedule-free training with a scheduled one. You can track the performance through the provided wandb graphs using the following names:

  • AdamW Schedule Free
  • AdamW Schedule Free-2
  • AdamW (Scheduled-cosine)

For the experiments, I leveraged 2x RTX 5000 GPUs, offering substantial computational muscle. Alongside, the Llama-60m model served as the unifying thread across all trials.

AdamW Schedule Free

In this experiment, I used the schedule-free AdamW optimizer with a learning rate of 6.0e-4, weight_decay of 0.1, and betas of (0.9, 0.95).

Analyzing the graphs, this run gave a throughput of 127,060 tok/s, the highest among the 3 experiments. The following losses were observed in this experiment:

  • CrossEntropyLoss: 0.3302
  • Perplexity: 1.695
  • Aux_loss: 0.2605

AdamW Schedule Free-2

In this experiment, I used the schedule-free AdamW optimizer with a learning rate of 5.0e-3 (10x larger than the scheduled AdamW run), weight_decay of 0.1, and betas of (0.9, 0.95).

Analyzing the graphs, this run gave a throughput of 118,487 tok/s, the lowest among the 3 experiments. The following losses were observed in this experiment:

  • CrossEntropyLoss: 0.1329
  • Perplexity: 1.728
  • Aux_loss: 0.2603

AdamW (Scheduled-cosine)

In this experiment, I used the standard AdamW optimizer with a cosine learning rate schedule, a learning rate of 5.0e-4, weight_decay of 0.1, and betas of (0.9, 0.95). Warmup steps were set to 2000 and alpha_f to 0.1.

Analyzing the graphs, this run gave a throughput of 120,612 tok/s. The following losses were observed in this experiment:

  • CrossEntropyLoss: 0.6391
  • Perplexity: 1.706
  • Aux_loss: 0.2511

Wandb Graphs

You can analyze the throughput and losses through the following graphs:

  • Evaluation across downstream tasks (Hellaswag, Sciq, Arc, Openbookqa, Piqa)
  • Throughput in tok/sec (tps)
  • Loss
  • CSE, Perplexity, and Aux loss

Conclusion

Through these experiments, I have highlighted the benefits of schedule-free learning, including its simplicity, flexibility, and ability to enhance model performance. Due to limited compute, I have only experimented with a 60M model. In terms of speed, I observed the fastest throughput (tps) with the schedule-free optimizer at a learning rate of 6.0e-4. The evaluation across downstream tasks and the loss curves also suggests that the schedule-free optimizer with a constant learning rate improves convergence.

Special thanks to QueryLoopAI for sponsoring the compute of these experiments.

Also, feel free to drop me a message or:

  1. Connect and reach me on LinkedIn and Twitter
  2. Follow me on 📚 Medium
  3. Subscribe to my 📢 weekly AI newsletter!
  4. Check out my 🤗 Hugging Face
