A Comprehensive Guide to Training a Stable Diffusion XL LoRA: Optimal Settings, Dataset Building, Captioning, and Model Evaluation
In this guide, we will be sharing our tried and tested method for training a high-quality SDXL 1.0 LoRA model using the Kohya SS GUI (Kohya). Training LoRAs can seem like a daunting process at first, but once you dive into it, it soon becomes straightforward. Today we will cover the following key topics:
- Building your dataset
- Mastering captioning
- Using the right training configuration
- Testing your model
If you don’t have Kohya set up already, you can follow the installation instructions provided in this GitHub repo: https://github.com/bmaltais/kohya_ss.git
You will need at least 24 GB of VRAM to use Kohya with our configuration. Installing and running Kohya in the cloud is a topic that deserves its own guide; we covered it in detail here.
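If you are not sure how much VRAM your GPU has, a quick check with PyTorch (already installed as part of a working Kohya environment) looks something like this:

```python
# Quick VRAM check; assumes an NVIDIA GPU and a working PyTorch install.
import torch

if torch.cuda.is_available():
    total_vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_vram_gb:.1f} GB VRAM")
    if total_vram_gb < 24:
        print("Warning: less than 24 GB of VRAM; this configuration may not fit.")
else:
    print("No CUDA-capable GPU detected.")
```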
Building Your Image Dataset
The training data is the backbone of any AI model and, as such, this part deserves a great deal of attention. A single low-quality image in a set of 20 can dramatically impact the quality of your output. Even with larger datasets, this can be a problem. For example, if you have a 400-image style dataset split between pictures of people, landscapes, and so on, and only 20 of those images are landscapes, a single low-quality landscape can leave the model struggling to make high-quality landscapes later on.
We recommend going through your images one by one and asking yourself: ‘Would I be happy if the model produced an image like this?’ If yes, the second question to ask is whether the image contains anything odd that the model could find confusing. This takes a bit more experience to get right, but it mainly comes down to whether the concept you want the model to learn is clearly visible, and whether things like lighting and composition are reasonably ordinary.
In most cases, you will be aiming for a dataset of between 15 and 30 images. Adding more images can help increase the flexibility of your model later on, but make sure that any extra image you add provides something new or unique. For example, if you are training a style and you don’t have any landscape images in that style, adding a landscape image can improve your model’s flexibility. Adding images simply for the sake of it will increase training time without adding any value, and could even make it harder to train a high-quality model.
In general, if you are training for an object or a character, incorporating diverse styles and points of view is a good idea. Likewise, if you are training for a style, it is good to have images with a variety of compositions and featuring different content.
Another important rule to note is that your model will find it harder to generate kinds of images that were not represented in your dataset. For example, if you don’t have any landscape images in your style dataset, your model will find it harder to make landscape images in that style down the line.
Lastly, you should make sure that all your training images are at least 1024x1024. There is no need to crop your images to a certain resolution — as long as they hit the minimum threshold, the training script will handle them.
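To quickly flag images below that threshold, a small script along these lines will do (a sketch assuming your images sit in a single folder and that the Pillow library is installed; the folder path is a placeholder):

```python
# Flag training images smaller than 1024 px on either side.
from pathlib import Path
from PIL import Image

DATASET_DIR = Path("/path/to/your/image/folder")  # placeholder path
MIN_SIDE = 1024

for path in sorted(DATASET_DIR.iterdir()):
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    with Image.open(path) as img:
        width, height = img.size
    if min(width, height) < MIN_SIDE:
        print(f"Too small: {path.name} ({width}x{height})")
```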
Making Captions
The key concept when it comes to captioning is to accurately describe every element in the image that holds significance, or that you want to be able to avoid during the generation process.
For example, if you want to have a lot of control over points of view later on, make sure to caption the point of view in all of your images. Similarly, if you are training a character that wears a hat in all the training images and you want to be able to generate the character without the hat, make sure to caption the hat on every image (otherwise, the model will assume the hat is part of the character).
Pro tip: This line of thinking applies to image imperfections too. If an image has a blurry background and you want to avoid blurry backgrounds later on, make sure to include “blurry background” in your caption.
Another important concept that comes into play when making captions is rare tokens. Rare tokens are sequences of letters that Stable Diffusion doesn’t already associate with any concepts, such as skw, ukj, or most other random combinations of letters. These effectively act as blank sheets that can be associated with your concept during training. They are particularly useful if you need to train something unique and you don’t want Stable Diffusion’s prior knowledge to get mixed up with your concept. For example, if you want to train a model on yourself or a unique art style that Stable Diffusion doesn’t know, you should use a rare token to describe those concepts. In those cases, examples of valid captions would be: “Photo of skw man” or “drawing in skw style”.
Pro tip: The location of the rare token inside the caption will affect the meaning Stable Diffusion associates with it. For example, with “Photo of skw man wearing a suit” the model will associate the token with the man, while with “Photo of a man wearing a skw suit” it will associate the token with the suit.
When you are done, save all your captions in text files with the same names as the images they are paired with. You can put all your images and captions in the same folder — this is your training dataset.
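A quick way to verify that every image has a non-empty caption with a matching name (a sketch assuming the same single-folder layout as above):

```python
# Check that every image in the dataset folder has a matching .txt caption.
from pathlib import Path

DATASET_DIR = Path("/path/to/your/image/folder")  # placeholder path
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

for path in sorted(DATASET_DIR.iterdir()):
    if path.suffix.lower() not in IMAGE_EXTS:
        continue
    caption_path = path.with_suffix(".txt")
    if not caption_path.exists():
        print(f"Missing caption: {caption_path.name}")
    elif not caption_path.read_text(encoding="utf-8").strip():
        print(f"Empty caption: {caption_path.name}")
```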
Training Config: Balancing Likeness, Diversity, and Flexibility
Here is the set of configs we usually use for LoRA training:
Folders and source model
Source model: sd_xl_base_1.0_0.9vae.safetensors (you can also use stable-diffusion-xl-base-1.0)
Image folder: <path to your image folder>
Output folder: <path to the folder where you want to save your outputs>
Model output name: <your model name>
Basic Parameters
Epoch: 30
Max train epoch: 30
Caption Extension: .txt
Cache latents to disk: true
LR Scheduler: constant
Optimizer: AdamW
Learning rate: 3e-05 (0.00003)
LR warmup (% of steps): 0
Max resolution: 1024,1024
Text Encoder learning rate: 3e-05 (0.00003)
Unet learning rate: 3e-05 (0.00003)
Network Rank (Dimension): 32
Network Alpha: 32
Advanced Parameters
Gradient checkpointing: true
Min SNR gamma: 5
Rate of caption dropout: 0.05
You can copy these configs directly into the LoRA training tab of the Kohya user interface; you don’t need to touch the other parameters. With these settings, you should be able to train a model with a good balance of likeness and flexibility in fewer than 30 epochs. The optimal epoch varies between datasets, so we recommend saving as many epochs as possible and testing them thoroughly when evaluating the model.
When naming your training data folder, set the repeat rate to 3 without any class token, meaning your folder name should be “3_” (Kohya reads the repeat rate and the class token from the folder name; don’t worry if you don’t know what they are, as long as your folder is named correctly Kohya will handle it for you). For LoRA training, you don’t need any regularisation images, so you can ignore that part when you add the path to your training dataset in Kohya.
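If you prefer launching training from the command line instead of the GUI, the settings above map roughly onto the underlying kohya-ss/sd-scripts trainer. The sketch below builds the call from Python; the paths and output name are placeholders, and flag names can vary slightly between sd-scripts versions, so treat it as a starting point rather than a drop-in command:

```python
# Rough command-line equivalent of the GUI settings above,
# using the sdxl_train_network.py script from kohya-ss/sd-scripts.
import subprocess

cmd = [
    "accelerate", "launch", "sdxl_train_network.py",
    "--pretrained_model_name_or_path", "/path/to/sd_xl_base_1.0_0.9vae.safetensors",
    "--train_data_dir", "/path/to/dataset",   # parent folder containing the "3_" subfolder
    "--output_dir", "/path/to/outputs",
    "--output_name", "my_sdxl_lora",          # placeholder model name
    "--network_module", "networks.lora",
    "--network_dim", "32",
    "--network_alpha", "32",
    "--resolution", "1024,1024",
    "--max_train_epochs", "30",
    "--save_every_n_epochs", "1",             # keep every epoch so you can test them later
    "--learning_rate", "3e-5",
    "--unet_lr", "3e-5",
    "--text_encoder_lr", "3e-5",
    "--optimizer_type", "AdamW",
    "--lr_scheduler", "constant",
    "--caption_extension", ".txt",
    "--caption_dropout_rate", "0.05",
    "--cache_latents", "--cache_latents_to_disk",
    "--gradient_checkpointing",
    "--min_snr_gamma", "5",
    "--save_model_as", "safetensors",
]
subprocess.run(cmd, check=True)
```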
The only tweaking that might be required between training runs is the learning rate, because different datasets can work better with higher or lower rates. For LoRA training, we use values between 3e-6 and 8e-5. You can start with 3e-5 and adjust it in future training runs if you are not happy with the results. Increasing the learning rate will usually increase the rate at which the model learns your concept, at the expense of flexibility later on. A high learning rate can also make it harder for the model to learn subtle details.
Evaluating Your Model
Once you are done with training, it is time to evaluate your model to see how well it performs. Unfortunately, there is no easy way to do this, and you will have to evaluate the images you generate manually. Broadly, you will be trying to assess whether the model is undertrained or overtrained.
If overtrained, the model will be inflexible and tend to reproduce images that are very similar to the ones in the training set. Or, if it is really overbaked, it will start to generate images with defects. On the other hand, an undertrained model will struggle to make new images of your concept with an acceptable level of likeness.
Start by using some of your captions (verbatim) as prompts to generate images. If the generated images look very similar to the training image that shares the same prompt, it can be a sign of overtraining. On the other hand, if your concept is not being picked up during the generation, or looks very different to how it should, it can be a sign of undertraining. If the model passes this test, you can start testing it with more generic prompts.
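One way to run this test systematically is to generate one image per training caption with the diffusers library (this sits outside the Kohya workflow; the paths below are placeholders, and it assumes your LoRA was saved as a .safetensors file and that you have a CUDA GPU with diffusers, transformers and torch installed):

```python
# Generate one test image per training caption with the trained LoRA applied.
from pathlib import Path

import torch
from diffusers import StableDiffusionXLPipeline

DATASET_DIR = Path("/path/to/your/image/folder")         # placeholder path
LORA_PATH = "/path/to/outputs/my_sdxl_lora.safetensors"  # placeholder path
OUT_DIR = Path("lora_eval")
OUT_DIR.mkdir(exist_ok=True)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights(LORA_PATH)

for caption_file in sorted(DATASET_DIR.glob("*.txt")):
    prompt = caption_file.read_text(encoding="utf-8").strip()
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(OUT_DIR / f"{caption_file.stem}.png")
```

You can then compare each output side by side with the training image that shares the same caption, and swap different epoch checkpoints into LORA_PATH to compare them.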
The first thing to do if your model isn’t quite right is to try different epochs. If your model feels overtrained, earlier epochs might perform better; if it seems undertrained, later epochs might. If you are still out of luck, you will have to do another training run. Before tweaking the parameters, check your dataset and try to identify elements that might be causing issues. Making small changes to your data will often be enough to fix your model. If your dataset is flawless, it is time to dive into the training parameters. Experimenting with different learning rates is a good place to start if it comes down to that. Another easy option is to use a different base model that is closer to what you are aiming for than vanilla SDXL.
And that’s it! I hope you find this guide useful.
If you have more questions about training and deploying solutions, get in touch with the team at www.lightsketch.ai.