Understanding LoRA Training, Part 1: Learning Rate Schedulers, Network Dimension and Alpha
As the title suggests, this is an intermediate level blog post!
If you’d like to catch up, there are a few great videos for beginners. Here is one, and there is another. Well-structured written guides are also available:
I am using the kohya-ss scripts with the bmaltais GUI for my LoRA training, not the d8ahazard Dreambooth extension for A1111, which is another popular option. From my experience, the bmaltais implementation is better maintained and offers more parameters to work with.
The readme docs for the kohya-ss scripts on GitHub are worth reading as well, of course:
The goal today is to understand the role of the Network Rank (Dimension) and Network Alpha parameters in character training. There will be quite a few takeaways on learning rate schedulers and class (regularisation) images along the way too!
The setup
Learning Rate Schedulers
Learning rate is the key parameter in model training. The bigger the number, the faster the model learns and the more detail it misses in the process. It is well explained visually here.
A constant training rate is one option. There are a few more available in the GUI at the time of writing:
An optimal training process involves a variable learning rate. You might want your model to recognize the key features of your dataset by learning fast at the beginning and then gradually decrease the learning rate for smaller details. A training process of this kind would look like this on a graph:
Here is what happens in the training process when you choose Constant, Linear, Cosine or Cosine with restarts schedulers:
Warmup steps for Constant and Linear are set in % and allow for gradual increase of the learning rate during the first X% steps of the training.
This is possibly beneficial for datasets with a ton of intricate details that might get lost in the sauce if you start too fast.
- Note that you can set LR warmup to 100% and get a gradual learning rate increase over the full course of the training.
Cosine needs no explanation.
Cosine with restarts is actually Cosine with warm restarts: each restart resets only the learning rate, while training continues from the weights already learned (a cold restart would reinitialize them with a new set of random numbers).
The number of restarts can be set in LR number of cycles at the Advanced Configuration tab. By default it is equal to the number of epochs.
Polynomial LR scheduler works like this:
Its power can be set in LR power at the Advanced Configuration tab.
With a power below 1, the Polynomial curve is more aggressive than the Cosine scheduler: it keeps the learning rate high for most of the run and only drops it steeply near the end. That can be beneficial when your dataset doesn't have many distinct features to be learned, but you want the training to be more efficient than with a Constant scheduler.
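To make these shapes concrete, here is a minimal sketch of the schedules above (warmup, cosine, cosine with warm restarts, polynomial) written as plain multipliers on the base learning rate. This is not the kohya implementation, which pulls its schedulers from a library, just the common textbook formulas.

```python
# Learning rate multipliers for the schedulers discussed above (a sketch).
# `step` counts from 0 to total_steps.
import math

def constant_with_warmup(step, total_steps, warmup_pct):
    # Ramp from 0 to 1 over the first warmup_pct% of steps, then stay at 1.
    warmup_steps = total_steps * warmup_pct / 100
    return min(1.0, step / warmup_steps) if warmup_steps else 1.0

def cosine(step, total_steps):
    # Smooth decay from 1.0 down to 0.0 over the whole run.
    return 0.5 * (1 + math.cos(math.pi * step / total_steps))

def cosine_with_restarts(step, total_steps, cycles):
    # Same shape, restarted `cycles` times; the weights carry over (warm restart).
    progress = (step * cycles / total_steps) % 1.0
    return 0.5 * (1 + math.cos(math.pi * progress))

def polynomial(step, total_steps, power):
    # power = 1 is Linear; power < 1 stays high for most of the run and then
    # drops sharply; power > 1 drops quickly right from the start.
    return (1 - step / total_steps) ** power

total = 2340  # e.g. the step count of the SL run below
for s in range(0, total, total // 6):
    print(s,
          round(constant_with_warmup(s, total, 10), 2),
          round(cosine(s, total), 2),
          round(cosine_with_restarts(s, total, 3), 2),
          round(polynomial(s, total, 0.1), 2))
```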
Adafactor is probably an adaptive LR scheduler (??), which presumably pairs with the Adafactor optimizer. More on adaptive optimizers below.
Optimizers
A quick word about optimizers before this post gets too long.
I haven’t looked into adaptive optimizers. The LoRA training process already has way too many volatile variables, which makes it difficult to pinpoint the areas worth debugging. Adding a black box like an adaptive optimizer would probably make things even cloudier. I might be wrong, but that’s where I am at the moment.
As for the non-adaptive ones, the Lion optimizer (short for EvoLved Sign Momentum, discovered through automated program search) looks fascinating, but I haven’t found common ground with it so far.
I will be using AdamW for my tests today.
The dataset
Let’s teach the Reliberate source model to recognize someone it has no idea about.
Turns out it has no idea about Elina Löwensohn’s character Sofia Ludens from Hal Hartley’s 1994 film Amateur, nor does it know anything about Belgian model and actress Delfine Bafort.
I will use two small datasets (in separate tests) so that the results don’t hinge on a single dataset.
DB Dataset
This dataset is near perfect for face training if you are not looking to generate a range of emotions. The facial expressions are stable, which makes it easier to recognize the person, the lighting is solid and there are plenty of different details to caption out:
dfw is a single-token character combination I will be using to identify the subject in training.
- You can get a list of all single tokens up to 4 characters here. Ideally check if your pick has any specific meaning for the source model.
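If you want to verify the single-token part programmatically, here is a hedged sketch using the CLIP tokenizer that SD 1.x checkpoints (Reliberate included) are built on; whether the token already carries meaning for the model is best checked by simply generating a few images from the bare token. The candidate words are illustrative, not a recommendation.

```python
# Check whether a candidate instance token encodes to exactly one CLIP token.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

for candidate in ("dfw", "slo", "delfine"):
    pieces = tokenizer.tokenize(candidate)
    status = "single token" if len(pieces) == 1 else "splits into several tokens"
    print(f"{candidate!r}: {pieces} -> {status}")
```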
SL Dataset
SL is much more difficult. The lighting on most of the images is flat (higher contrast is better for subject recognition). There are too many angles and facial expressions for such a small dataset. On top of that I only captioned out the backgrounds and the phone handset, so that the model learns the character as a whole, dress and necklace included.
Class images
I generated 10 class (regularisation) images for each training image in both datasets. All images were generated using the source model (Reliberate) with the same captions as the training images, only the instance token was replaced with ‘woman’.
Properly captioned and generated class images make the training process more efficient by showing the model what a generic ‘woman’ already looks like to it, so the updates can concentrate on what is unique about the subject.
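The article doesn’t say which tool was used to generate them, but as a hedged sketch, here is how the same thing could be scripted with diffusers, assuming the checkpoint is available as a local diffusers-format folder (the paths, folder names and the ‘dfw’ token substitution are illustrative placeholders):

```python
# Generate 10 class images per training image, reusing each training caption
# with the instance token swapped for the class word.
from pathlib import Path
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "./reliberate-diffusers", torch_dtype=torch.float16
).to("cuda")

train_dir = Path("img/10_dfw woman")   # training images with .txt captions
reg_dir = Path("reg/1_woman")          # class (regularisation) images land here
reg_dir.mkdir(parents=True, exist_ok=True)

for caption_file in sorted(train_dir.glob("*.txt")):
    prompt = caption_file.read_text().strip().replace("dfw", "woman")
    for i in range(10):
        image = pipe(prompt, width=512, height=512).images[0]
        image.save(reg_dir / f"{caption_file.stem}_{i:02d}.png")
```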
Folder Structure
I have 13 training images in SL dataset and I want them to be trained alongside all 130 class images over the course of each epoch, so I am going to repeat them 10 times.
- Note that the folder name is ignored by kohya scripts when caption files are present.
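For reference, a sketch of the resulting layout for the SL dataset (the 10_ prefix is the repeat count, slo is the instance token; the file names are placeholders):

```
train_SL/
├── img/
│   └── 10_slo woman/      # 13 training images + .txt captions, repeated 10x per epoch
│       ├── 001.png
│       ├── 001.txt
│       └── ...
└── reg/
    └── 1_woman/           # 130 class images
        ├── reg_001.png
        └── ...
```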
Core Settings
Source model: Reliberate
Resolution: 512x512
Epochs: 18
LR Scheduler: Cosine
Text Encoder Learning Rate: 5e-5
Unet Learning Rate: 1e-4
Prior loss weight: 1
I will be training for 18 epochs, which gives me 2,340 steps for SL dataset (18x130) and 2,520 steps for DB (18x140).
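For reference, here is a hedged sketch of what the equivalent kohya-ss command line might look like for the SL run; the GUI assembles something similar under the hood, the paths are placeholders and flag names can shift between versions:

```
accelerate launch train_network.py \
  --pretrained_model_name_or_path="/models/reliberate.safetensors" \
  --train_data_dir="train_SL/img" --reg_data_dir="train_SL/reg" \
  --output_dir="output/slo_lora" \
  --network_module=networks.lora --network_dim=128 --network_alpha=128 \
  --resolution="512,512" --max_train_epochs=18 \
  --optimizer_type="AdamW" --lr_scheduler="cosine" \
  --text_encoder_lr=5e-5 --unet_lr=1e-4 --prior_loss_weight=1.0
```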
The Results
DB dataset
The same seed was used for these generations. Not much is clear apart from the fact that fewer details are learned with Dim 8 and the colors are a bit more vivid with Alpha 1.
Let’s look under the hood of the training process. I was generating samples at the end of each epoch with the same prompt (dfw woman, headshot, sunflower field) and the same seed.
- You can generate samples with the same seed by adding --d 1234 at the end of your sample prompt. Any other number instead of 1234 will do, of course.
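In flag terms that means pointing --sample_prompts at a text file and setting --sample_every_n_epochs=1 (flag names per kohya sd-scripts). A hedged example of such a prompt line, using the kohya sample-prompt options (--w/--h size, --s steps, --l CFG scale, --d seed):

```
dfw woman, headshot, sunflower field --w 512 --h 512 --s 28 --l 7 --d 1234
```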
Dim 128, Alpha 128:
Dim 128, Alpha 1:
Dim 8, Alpha 8:
Dim 8, Alpha 1:
Okay! This is more interesting.
Looks like Network Rank (Dimension) sets how many features can be learned, while Network Alpha controls how strongly the learned weights are allowed to alter the source model.
The colors seem to get more vivid when LoRA and the source model are not properly aligned, as if their handshake is unstable. It’s happening at the first steps of every test and doesn’t really go away when Alpha=1.
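To make that relationship concrete, here is a toy sketch of a LoRA-style layer (not the kohya code): the learned low-rank update is scaled by Alpha/Dim before being added to the frozen weight, so Dim 128 with Alpha 1 passes through only 1/128 of what was learned, while Alpha = Dim passes it at full strength.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer plus a low-rank update scaled by alpha / dim."""
    def __init__(self, base: nn.Linear, dim: int, alpha: float):
        super().__init__()
        self.base = base                                          # frozen W
        self.down = nn.Linear(base.in_features, dim, bias=False)  # A (rank = dim)
        self.up = nn.Linear(dim, base.out_features, bias=False)   # B
        nn.init.zeros_(self.up.weight)                            # start as a no-op
        self.scale = alpha / dim                                  # the Alpha/Dim factor

    def forward(self, x):
        return self.base(x) + self.up(self.down(x)) * self.scale

for dim, alpha in ((128, 128), (128, 1), (8, 8), (8, 1)):
    layer = LoRALinear(nn.Linear(320, 320), dim, alpha)
    print(f"Dim {dim}, Alpha {alpha}: update scaled by {layer.scale:.4f}")
```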
To make sure Dim 8 is actually limited in trained features, let’s try increasing both learning rates.
Dim 8, Alpha 8, Text Encoder LR 1e-4, Unet LR 5e-4:
Much better than with the original learning rates, and yet still limited in trained features.
The key takeaway here is that we need to increase the learning rate when we limit Network Dimension. Our tool is dull and small with lower Dim, so we need to work faster to achieve something.
No wonder kohya scripts default settings are at Dim 8, Alpha 1. It promotes a speedy training process, but the dataset requirements must be quite strict in this case.
Let’s look at a couple more sample sets.
Dim 128, Alpha 128, Text Encoder LR 0, Unet LR 1e-4:
Dim 128, Alpha 128, Text Encoder LR 5e-5, Unet LR 0:
As expected, lol.
Okay, it looks like we are stalling at the last ~ 6 epochs. What if we change the scheduler from Cosine to Cosine with restarts?
Dim 128, Alpha 128, Cosine with 3 restarts (every 6 epochs):
Helpful, but not too much.
Dim 128, Alpha 128, Cosine with 18 restarts (every epoch):
Quite helpful, actually! Good to know.
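In the command-line sketch from earlier, that change amounts to swapping two flags (names per kohya sd-scripts; the GUI exposes the second one as LR number of cycles):

```
  --lr_scheduler="cosine_with_restarts" \
  --lr_scheduler_num_cycles=18
```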
Before I get to the SL dataset, here is an interesting mistake to look at. At the beginning of my tests I accidentally trained LoRA on a different source model (Paragon instead of Reliberate) using the class images generated with Reliberate.
Dim 128, Alpha 128, Paragon source model with Reliberate class images:
The colors are more vivid here, again, possibly an indication of a weaker handshake between LoRA and the model. The symmetrical sunflowers in the last epochs (seen at the original Dim128/Alpha128 above as well) got much more ridiculous here.
This sample set also suggests that training is more efficient with native class images, i.e. ones generated by the source model itself. A few more examples:
Dim 128, Alpha 1, Paragon source model with Reliberate class images:
Dim 8, Alpha 8, Paragon source model with Reliberate class images:
Dim 8, Alpha 1, Paragon source model with Reliberate class images:
Yeah.
SL dataset
This is where things get even more interesting!
In case you forgot what this dataset looks like, here is a reminder:
Let’s start with a slo woman, headshot, sunflower field prompt sampled over the course of 18 epochs with the original core settings.
Dim 128, Alpha 128:
Hello? I knew it would be challenging, but I still expected a bit more than this by the end of the training process. The 1st epoch sample was so promising, and then somehow we didn’t get anywhere.
For the sake of the experiment I’d rather keep the number of steps as is. Let’s try changing the Dimension instead.
Dim 256, Alpha 256:
You can see that it’s trying; maybe it needs a more intense learning curve?
Feels like the right time for Polynomial!
Dim 256, Alpha 256, Polynomial with LR Power of 0.1:
Right! Got pretty overbaked at the end, but the direction is good.
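Again in flag terms, this run bumps the rank and swaps the scheduler and its power (flag names per kohya sd-scripts; the GUI’s LR power maps to --lr_scheduler_power):

```
  --network_dim=256 --network_alpha=256 \
  --lr_scheduler="polynomial" \
  --lr_scheduler_power=0.1
```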
They say the dataset is the king of training, and these samples definitely reflect that. The mix of facial expressions in the SL dataset of 13 sparsely captioned images (only the backgrounds and the phone handset were captioned out), combined with high-contrast shots mixed with flat lighting, results in a nearly impossible training task for the model.
To wrap things up, let’s see if Cosine with restarts could save us here.
Dim 256, Alpha 256, Cosine with 18 restarts:
No, not really. Better than everything else though!
I hope you’ve got a better understanding of balancing Learning Rates and Schedulers with Network Dimension and Alpha settings depending on your dataset and training goals. Feel free to reach out in the comments to discuss the key takeaways and testing practices.
My next adventure is training a style with block weights. Subscribe to stay on board!