Fine-Tuning Stable Diffusion 3 Medium with 16GB VRAM

Filippo Santiano
5 min read · Jul 2, 2024


Stable Diffusion 3 (SD3) Medium is the most advanced text-to-image model that Stability AI has released. It’s smaller than other models such as SDXL, yet it still produces high-quality images, understands complex prompts and performs inference quickly. Despite its smaller size, fine-tuning SD3 Medium out of the box isn’t possible on a GPU with 16GB VRAM, and GPUs with more than 16GB VRAM cost significantly more, whether you’re buying a GPU directly or using one through a cloud service.

Fortunately, quantizing one of the text encoders significantly reduces the memory used during fine-tuning, allowing customisation on a 16GB VRAM GPU. This drastically reduces costs and makes model customisation far more accessible. We also use LoRA (Low-Rank Adaptation) to further reduce VRAM usage during fine-tuning.

This post provides you with all the files and steps needed to achieve this. For reference, I fine-tuned my model on a g4dn.2xlarge instance on AWS, which has one GPU (16GB VRAM), 8 vCPUs and 32GB RAM.

Overview of Quantization

In deep learning, quantization refers to reducing the precision (number of bits) of weights and activations of neural networks, whilst aiming to minimise losses in accuracy. This significantly decreases memory usage, allowing models to be trained more efficiently. You can find out more about quantization here.
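To make the idea concrete, below is a minimal sketch of absmax quantization applied to a single weight tensor. It only illustrates the principle; real 8-bit libraries such as bitsandbytes use more sophisticated per-block variants of this scheme.

import torch

# Full-precision weights (fp32): 4 bytes per value.
weights = torch.randn(4, 4)

# Absmax quantization: map the largest magnitude onto the int8 range [-127, 127].
scale = weights.abs().max() / 127
q_weights = torch.round(weights / scale).to(torch.int8)  # 1 byte per value

# At compute time the weights are dequantized back to floating point.
dequantized = q_weights.float() * scale
print((weights - dequantized).abs().max())  # small rounding error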

Fine-Tuning Example

Below are a couple of photos of my adorable cat, Lily! We want to fine-tune SD3 Medium to generate an image of a cat that looks like Lily when we specify her name in the prompt.

Using the method from this post, I trained SD3 Medium on ten images of Lily. Below, you can see the outputs of the original and fine-tuned models for the prompt “A cat called Lily, sat on a mat”. The trained model produced an image of a cat that looked much more like Lily! This is just one of the many ways you can customise the model to suit your needs.

Original SD3 Medium output (left) and fine-tuned SD3 Medium output (right).

Install gcc and g++ 9.5.0

Make sure you have gcc and g++ installed. These are both part of the GNU Compiler Collection and are needed to compile C and C++ programs.

sudo apt-get install gcc-9 g++-9

Install Conda

Next, ensure that you have Conda installed; I used the lightweight Miniforge installer.

Required Files

Clone the GitHub repository with the required files, which can be found here.

git clone https://github.com/FilippoO2/Quantized-Training-of-SD3.git
cd Quantized-Training-of-SD3

Create a Conda Environment

Create a Conda environment (train_SD3) from the conda_config.yaml file.

conda env create -f conda_config.yaml
conda activate train_SD3

The diffusers and diffusers-0.30.0.dev0.dist-info directories in the repository contain changes that are required for the quantization to work. Place them into your Conda environment’s site-packages directory (e.g. ~/miniforge3/envs/train_SD3/lib/python3.12/site-packages/). You can do this using:

mv diffusers/ ~/miniforge3/envs/train_SD3/lib/python3.12/site-packages/
mv diffusers-0.30.0.dev0.dist-info/ ~/miniforge3/envs/train_SD3/lib/python3.12/site-packages/

If you are not using Miniforge, the destination path will differ slightly.

Accessing SD3 Medium from Hugging Face

Head over to SD3 Medium on Hugging Face, where you must create an account if you don’t already have one and agree to the SD3 Medium license.

Next, go to your profile settings in Hugging Face and select Access Tokens from the left menu. Create and copy a token, then use it to log in to Hugging Face from your terminal with the huggingface-cli login command. The token should now be stored in ~/.cache/huggingface/token.
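If you prefer to stay in Python, the huggingface_hub library also exposes a login() helper that stores the token for you. A minimal sketch (the token value below is a placeholder for your own):

from huggingface_hub import login

# Paste the access token created in your Hugging Face settings (placeholder shown).
login(token="hf_xxxxxxxxxxxxxxxx")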

Next, we must configure accelerate. You can do this by running accelerate config, but editing the file directly at ~/.cache/huggingface/accelerate/default_config.yaml is easier. Note that you might have to run and complete accelerate config once for the accelerate directory to appear in the cache.

mv default_config.yaml ~/.cache/huggingface/accelerate/

We will train our model using the train_dreambooth_lora_sd3.py script, which has been adapted to reduce memory usage. The largest text encoder, text_encoder_3, and its tokenizer have been quantized. Text encoders 1 and 2 remain unchanged.
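As a rough illustration of the idea (the actual changes live in the modified diffusers package from the repository and may differ in detail), the T5-based text_encoder_3 can be loaded in 8-bit along these lines, assuming bitsandbytes is installed:

import torch
from transformers import BitsAndBytesConfig, T5EncoderModel

# Load the largest text encoder (a T5 model) in 8-bit via bitsandbytes.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_3 = T5EncoderModel.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    subfolder="text_encoder_3",
    quantization_config=quant_config,
    device_map="auto",
)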

Fine-tuning the Model

Specify the model being used (MODEL_NAME), where the training images are located (INSTANCE_DIR), and where to save our model (OUTPUT_DIR).

export MODEL_NAME="stabilityai/stable-diffusion-3-medium-diffusers"
export INSTANCE_DIR="path/to/training_images"
export OUTPUT_DIR="./fine_tuned_model"

We can now begin training! Make sure to change instance_prompt to the appropriate prompt for your images.

accelerate launch train_dreambooth_lora_sd3.py \
--pretrained_model_name_or_path=${MODEL_NAME} \
--instance_data_dir=${INSTANCE_DIR} \
--output_dir=${OUTPUT_DIR} \
--mixed_precision="bf16" \
--instance_prompt="PROMPT FOR TRAINING IMAGES" \
--resolution=512 \
--train_batch_size=4 \
--gradient_accumulation_steps=4 \
--learning_rate=0.0001 \
--report_to="wandb" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=1000 \
--weighting_scheme="logit_normal" \
--seed="42" \
--use_8bit_adam \
--gradient_checkpointing \
--prior_generation_precision="bf16"

Running Inference

It may take some time, but your model should train and output a .safetensors file in your OUTPUT_DIR. Before we can test it, we need to add config.json to the OUTPUT_DIR:

mv config.json ${OUTPUT_DIR}

You can now run inference with your new model (run_trained.py):

python run_trained.py

You can adjust the balance between the original and fine-tuned model by changing lora_scale. Increasing the value of the scale produces results more similar to the fine-tuned examples, whereas a lower scale value returns an image more similar to the base SD3 Medium output.
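The repository’s run_trained.py handles this step; purely as a sketch of what such a script typically looks like with the standard diffusers LoRA API (the paths and scale value here are illustrative, not the repo’s exact code):

import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()  # keeps peak VRAM usage lower than moving everything to the GPU

# Attach the LoRA weights produced by training.
pipe.load_lora_weights("./fine_tuned_model")

# lora_scale: higher values lean towards the fine-tuned style, lower values towards base SD3.
image = pipe(
    "A cat called Lily, sat on a mat",
    joint_attention_kwargs={"scale": 0.8},
).images[0]
image.save("output.png")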

Check out output.png to see the results! To compare your trained model with untrained SD3 Medium, make these changes in run_trained.py:

  1. Comment out lines 10 and 11.
  2. Change the name of your output image.
  3. Re-run python run_trained.py
  4. Compare the two images to see the difference!

Summary

In this post, we explored how to reduce VRAM usage during Stable Diffusion 3 Medium training. By quantizing the largest text encoder and making small adjustments to the diffusers package, we can perform training and inference on smaller GPUs, significantly lowering the cost of these processes.

If you want to train your model using multiple prompts, check out this post.
