Monitoring your Stable Diffusion fine-tuning with Neptune

Pedro Gengo Lourenço
12 min read · Jul 11, 2023

In recent months, you might have noticed a surge in the creation of personalized avatars, each with their unique styles. These creations are the products of diffusion models, facilitated by a novel technique known as DreamBooth, designed to customize text2image models using a minimal number of images.

A cursory online search yields a wealth of resources, including notebooks, blog posts, and more, guiding you to train your model independently. However, you may have experienced that balancing between underfitting and overfitting your model can be tricky. So, how do you get more control over the fine-tuning process? How can you monitor the learning progress of your model? How can you tell which checkpoint outperforms another? This is where neptune.ai comes in.

Understanding DreamBooth

Consider a model well-versed in diverse concepts such as cats, dogs, zebras, giraffes, and others. If one were to instruct this model to generate an image of a dog inside a bucket, it would likely succeed without issue. However, what if the task was to depict your specific dog within a bucket, especially considering the model has never seen your dog before? Furthermore, how could one distinguish your dog from any other canine?

Addressing these questions, Google researchers have developed a technique called DreamBooth. The authors of this method describe it as follows:

Given ∼ 3−5 images of a subject, we finetune a text-to-image diffusion model with the input images paired with a text prompt containing a unique identifier and the name of the class the subject belongs to (e.g., “A [V] dog”). In parallel, we apply a class-specific prior preservation loss, which leverages the semantic prior that the model has on the class and encourages it to generate diverse instances belonging to the subject’s class using the class name in a text prompt (e.g., “A dog”).

Fine-tuning process using DreamBooth. Extracted from DreamBooth paper.

In other words, you instruct your model to produce a unique “variation” of a dog and associate this variation with a unique identifier. By doing this, the model learns that the token “[V]” should generate images that resemble your input images. Leveraging this understanding, you can then generate new images of your specific dog, for instance.
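
For illustration only (these prompts are not taken from the training script), the pairing could look like the snippet below, with "sks" standing in for the unique identifier [V]:

# Hypothetical example of how DreamBooth pairs a rare identifier token with a class name.
# "sks" plays the role of the unique identifier [V]; "dog" is the subject's class.
instance_prompt = "a photo of sks dog"  # caption attached to your 3-5 subject images
class_prompt = "a photo of a dog"       # used by the prior-preservation loss (omitted in this guide)

# After fine-tuning, prompts containing "sks dog" generate your specific dog,
# while plain "dog" prompts should still generate generic dogs.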

Base code for DreamBooth

In this guide, we built our code on top of the DreamBooth training code from Hugging Face. All files corresponding to this tutorial can be accessed at: https://github.com/pedrogengo/dreambooth/tree/main. To keep things simple, we’ve omitted some portions of the code. In this version, we focus solely on training the diffusion model, without incorporating any class prior.

We can split the code into three principal components:

  • Dataset definition
  • Training loop
  • Model saving

Dataset definition

Building the dataset for DreamBooth requires two elements:

  • A set of images
  • An identifier

Since we only have a few images of the new concept we want the model to pick up, we can use image augmentation techniques to improve the training. But we have to be careful not to use an augmentation that changes the image so much that it doesn’t look like the original one.

The DreamBoothDataset class, which is provided below, is a key part of this process. The most important method is __getitem__. This is where we set up the system to return the image and its tokenized identifier. In this same method, we apply image augmentation. We only use center crop or random crop because they don’t change the original images too much.

# Imports needed by this class (in the full training script they sit at the top of the file);
# tokenize_prompt is a helper defined elsewhere in the same script.
from pathlib import Path

from PIL import Image
from PIL.ImageOps import exif_transpose
from torch.utils.data import Dataset
from torchvision import transforms


class DreamBoothDataset(Dataset):
    """
    A dataset to prepare the instance and class images with the prompts for fine-tuning the model.
    It pre-processes the images and tokenizes the prompts.
    """

    def __init__(
        self,
        instance_data_root,
        instance_prompt,
        tokenizer,
        size=512,
        center_crop=False,
        encoder_hidden_states=None,
        instance_prompt_encoder_hidden_states=None,
        tokenizer_max_length=None,
    ):
        self.size = size
        self.center_crop = center_crop
        self.tokenizer = tokenizer
        self.encoder_hidden_states = encoder_hidden_states
        self.instance_prompt_encoder_hidden_states = instance_prompt_encoder_hidden_states
        self.tokenizer_max_length = tokenizer_max_length

        self.instance_data_root = Path(instance_data_root)
        if not self.instance_data_root.exists():
            raise ValueError(f"Instance {self.instance_data_root} images root doesn't exist.")

        self.instance_images_path = list(Path(instance_data_root).iterdir())
        self.num_instance_images = len(self.instance_images_path)
        self.instance_prompt = instance_prompt
        self._length = self.num_instance_images

        # Resize, then center/random crop, convert to tensor, and normalize to [-1, 1].
        self.image_transforms = transforms.Compose(
            [
                transforms.Resize(size, interpolation=transforms.InterpolationMode.BILINEAR),
                transforms.CenterCrop(size) if center_crop else transforms.RandomCrop(size),
                transforms.ToTensor(),
                transforms.Normalize([0.5], [0.5]),
            ]
        )

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        example = {}
        instance_image = Image.open(self.instance_images_path[index % self.num_instance_images])
        instance_image = exif_transpose(instance_image)

        if not instance_image.mode == "RGB":
            instance_image = instance_image.convert("RGB")
        example["instance_images"] = self.image_transforms(instance_image)

        if self.encoder_hidden_states is not None:
            example["instance_prompt_ids"] = self.encoder_hidden_states
        else:
            text_inputs = tokenize_prompt(
                self.tokenizer, self.instance_prompt, tokenizer_max_length=self.tokenizer_max_length
            )
            example["instance_prompt_ids"] = text_inputs.input_ids
            example["instance_attention_mask"] = text_inputs.attention_mask

        return example
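
To make the class concrete, here is a hypothetical way of instantiating it; the folder path and identifier are placeholders, the tokenizer matches the Stable Diffusion checkpoint used later in this post, and tokenize_prompt must already be defined (it is a helper in the training script):

from transformers import CLIPTokenizer

# Tokenizer that ships with the Stable Diffusion v1-4 checkpoint used in this tutorial.
tokenizer = CLIPTokenizer.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="tokenizer"
)

# "<INPUT FOLDER>" is a placeholder for the directory with the 3-5 subject images;
# "sks" is the unique identifier chosen for the subject.
dataset = DreamBoothDataset(
    instance_data_root="<INPUT FOLDER>",
    instance_prompt="a photo of sks dog",
    tokenizer=tokenizer,
    size=512,
    center_crop=False,
)

# Each example contains the transformed image tensor and the tokenized identifier prompt.
example = dataset[0]
print(example["instance_images"].shape)      # e.g. torch.Size([3, 512, 512])
print(example["instance_prompt_ids"].shape)

In the actual training script, the dataset is wrapped in a DataLoader with a custom collate function that builds the batches consumed by the training loop below.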

Training loop

This is the most critical part of the entire code. To make it easier to explain what’s going on inside the training loop, we’ve drawn the diagram below:

High level view of the training loop in DreamBooth

First, we have our inputs: the images and the unique identifier. Using a text encoder, we transform the identifier into what’s known as embeddings (depicted as the green array in the image). Then, with the help of an image encoder, typically a Variational Autoencoder (VAE), we turn the images into a compressed latent representation (shown as the pink array in the image).

On each iteration of the training loop, we add a random amount of noise to the image latents. The diffusion model then tries to predict this noise using both the noisy image latents and the text embeddings.

Next, we use a loss function to compare the known noise with the noise the model predicted. We repeat this process until the diffusion model learns to predict the noise in the image latents when conditioned on the given identifier. In other words, the model learns to generate the new concept.

In the code below, we first get the pixel_values, encode them with the VAE into latents, and add random noise according to a sampled timestep. Then, we encode the input_ids, which are essentially the tokenized identifier prompt. Afterward, we call the diffusion model (the UNet) to predict the noise. Finally, we compute the loss between the predicted noise and the actual noise.

pixel_values = batch["pixel_values"].to(dtype=weight_dtype)

if vae is not None:
    # Convert images to latent space
    model_input = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample()
    model_input = model_input * vae.config.scaling_factor
else:
    model_input = pixel_values

# Sample noise that we'll add to the model input
if args.offset_noise:
    noise = torch.randn_like(model_input) + 0.1 * torch.randn(
        model_input.shape[0], model_input.shape[1], 1, 1, device=model_input.device
    )
else:
    noise = torch.randn_like(model_input)
bsz, channels, height, width = model_input.shape
# Sample a random timestep for each image
timesteps = torch.randint(
    0, noise_scheduler.config.num_train_timesteps, (bsz,), device=model_input.device
)
timesteps = timesteps.long()

# Add noise to the model input according to the noise magnitude at each timestep
# (this is the forward diffusion process)
noisy_model_input = noise_scheduler.add_noise(model_input, noise, timesteps)

# Get the text embedding for conditioning
if args.pre_compute_text_embeddings:
    encoder_hidden_states = batch["input_ids"]
else:
    encoder_hidden_states = encode_prompt(
        text_encoder,
        batch["input_ids"],
        batch["attention_mask"],
        text_encoder_use_attention_mask=args.text_encoder_use_attention_mask,
    )

if accelerator.unwrap_model(unet).config.in_channels == channels * 2:
    noisy_model_input = torch.cat([noisy_model_input, noisy_model_input], dim=1)

# Predict the noise residual
model_pred = unet(
    noisy_model_input, timesteps, encoder_hidden_states, class_labels=None
).sample

if model_pred.shape[1] == 6:
    model_pred, _ = torch.chunk(model_pred, 2, dim=1)

# Get the target for loss depending on the prediction type
if noise_scheduler.config.prediction_type == "epsilon":
    target = noise
elif noise_scheduler.config.prediction_type == "v_prediction":
    target = noise_scheduler.get_velocity(model_input, noise, timesteps)
else:
    raise ValueError(f"Unknown prediction type {noise_scheduler.config.prediction_type}")
loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
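
Although this excerpt only computes the loss, it is easy to also stream it to Neptune so you can watch the loss curve live. A minimal sketch, assuming run is the Neptune run created in the instrumentation section below and is initialized to None when Neptune is not configured:

# Log the scalar training loss as a Neptune series, one point per optimization step.
# "run" is the Neptune run created later in this post (assumed to be None when Neptune
# credentials are not passed); "global_step" comes from the surrounding training loop.
if run is not None:
    run["train/loss"].append(value=loss.item(), step=global_step)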

Model saving

Since we’re using the diffusers library from Hugging Face, saving the model is as simple as calling the save_pretrained method of the StableDiffusionPipeline class.

pipeline.save_pretrained(args.output_dir)
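
After training, the saved directory can be loaded back with the same diffusers API for inference. A short sketch; the output path and prompt are placeholders:

import torch
from diffusers import StableDiffusionPipeline

# Load the weights written by pipeline.save_pretrained(args.output_dir).
pipe = StableDiffusionPipeline.from_pretrained("<OUTPUT FOLDER>", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Use the unique identifier chosen during fine-tuning in the prompt.
image = pipe("A photo of sks dog in a bucket").images[0]
image.save("sks_dog_in_a_bucket.png")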

Instrumenting our code using Neptune.ai

Before delving into the fine-tuning process of Stable Diffusion and monitoring it with Neptune, it’s essential to be aware of potential pitfalls that can arise from a lack of adequate monitoring or tracking:

  • Running numerous experiments without carefully noting the parameters used in each can lead to confusion and difficulty in replicating the results of successful trials;
  • If you only save the final checkpoint, which might be overfitted, you miss out on the opportunity to use intermediate checkpoints. In such a scenario, you’d need to re-run the training to see whether you can achieve a satisfactory result in fewer epochs. This could have been avoided if the intermediate results had been saved;
  • You may end up with numerous checkpoints but no qualitative image evaluations for each. This omission would require loading each checkpoint individually and running predictions to conduct a qualitative evaluation, which is very time-consuming.

With these potential issues in mind, it’s vital to select the training aspects we intend to monitor or log. When fine-tuning Stable Diffusion with DreamBooth, it is useful to record training session parameters such as the number of epochs, identifier, batch size, and others.

Furthermore, generating and storing validation images every N epochs, along with their corresponding checkpoints, can provide a useful visual record of the learning process. This practice also allows for the identification and selection of the best checkpoint for our trained model.

So, in short, we want to:

  • Log training arguments
  • Save validation images every N epochs
  • Save model checkpoint every N epochs

Now, let’s see how we can easily achieve these tasks using neptune.ai!

Logging training arguments

To log training arguments, we’ll first need to create a run. For this, we need two things: a Neptune API token and a project name. We can pass both of these as arguments to our DreamBooth file. Here’s how:

parser.add_argument(
    "--neptune_token",
    required=False,
    default=None,
    help="Token to use Neptune.ai to log your fine-tuning task.",
)

parser.add_argument(
    "--neptune_project",
    required=False,
    default=None,
    help="Project name to log dreambooth information and artifacts.",
)

Each time the user passes neptune_token and neptune_project as arguments, we’ll need to initiate a run and log all other arguments within that run. To do this, we should insert the following conditional statement at the start of the main function:

if args.neptune_token or args.neptune_project:
    if args.neptune_project and args.neptune_token:
        import neptune
        from neptune.utils import stringify_unsupported

        global run
        run = neptune.init_run(
            project=args.neptune_project,
            api_token=args.neptune_token)

        # Log every training argument except the Neptune credentials themselves.
        neptune_args = dict()
        for k, v in vars(args).items():
            if k != "neptune_project" and k != "neptune_token":
                neptune_args[k] = stringify_unsupported(v)
        run["parameters"] = neptune_args
    else:
        raise ValueError("You should specify both a Neptune project name and an API token.")

With this setup, every time we fine-tune the model, all the corresponding arguments will be saved in the Neptune run.
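
One small addition worth making at the very end of the main function (not shown in the snippet above) is to stop the run, so that any queued metadata is flushed before the process exits:

# Close the Neptune run once training finishes; reuses the same guard as the initialization.
if args.neptune_token and args.neptune_project:
    run.stop()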

Saving validation images

To save validation images, we first need to generate these images! To accomplish this, we’ll create a helper function that we’ll call every N steps. This is because validation takes time, and calling it at every step would significantly extend the training duration.

images = []
for _ in range(args.num_validation_images):
    with torch.autocast("cuda"):
        image = pipeline(**pipeline_args,
                         num_inference_steps=25,
                         generator=generator).images[0]
    images.append(image)

# Concatenate the generated images horizontally into a single image.
widths, heights = zip(*(i.size for i in images))
total_width = sum(widths)
max_height = max(heights)

concat_img = Image.new('RGB', (total_width, max_height))

x_offset = 0
for im in images:
    concat_img.paste(im, (x_offset, 0))
    x_offset += im.size[0]

# Log the concatenated image as one step of the "validation/images" series in Neptune.
run["validation/images"].append(concat_img, step=global_step,
                                description=f"Step: {global_step} Prompt: {args.validation_prompt}")

Saving model checkpoint

When it comes to artifacts, such as the model checkpoint at every N steps, Neptune provides us with two ways to handle files:

  • Tracking files: You can log metadata about datasets, models, and any other artifacts that can be stored as files;
  • Uploading files: You literally upload the file to Neptune.

So, for our use case, which method should we choose? Given that we have local disk space available, we can save the different checkpoints locally and merely track their metadata. This allows us to locate the checkpoint we prefer after reviewing the validation images.

If we opted to upload the files, since these files are quite large (often >5GB), we’d need to be very careful not to overwrite a file while it’s being sent to Neptune. This would probably mean we’d have to duplicate files locally or wait for the upload to finish, which would significantly extend the training time.
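
For reference, the two options correspond to two different Neptune calls; a quick sketch with example field names and paths (in this tutorial we use the first option, as the next snippet shows):

# Option 1 - track only metadata (hashes, sizes, locations) of files kept on local disk.
run["train/checkpoint/step-100"].track_files("checkpoints/step-100")

# Option 2 - upload the checkpoint files themselves to Neptune (heavy for multi-GB models).
run["train/checkpoint/step-100-files"].upload_files("checkpoints/step-100/*")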

To keep track of the files, we simply need to add the following lines to our helper function, which we discussed in the previous section.

output_step_dir = os.path.join(args.output_dir, f"step-{global_step}")
pipeline.save_pretrained(output_step_dir)
run[f"train/checkpoint/step-{global_step}"].track_files(output_step_dir)

Hands-on: Customizing a txt2image model

Let’s delve into the implications of running an experiment without any tracking or monitoring system, like Neptune. Consider the following scenario: We perform an experiment over 1200 epochs, employing a learning rate of 5e-7, while only saving the checkpoint from the final epoch. After completing the training, we instruct the model to generate “A photo of sks dog in a bucket” using the saved checkpoint. Here’s the outcome:

Images generated at epoch 1200 using a learning rate of 5e-7.

Upon examination, we observe that the model has overfitted. The image lacks a discernible bucket, and the model appears to have predominantly learned to generate the sks dog in the output. But consider this: we trained for 1200 epochs. What if we had saved some intermediate checkpoints and validation images? Could they have helped us select a checkpoint better suited to our needs?

Absolutely! If we had implemented monitoring and tracking, we would have discovered that the model had successfully grasped the concept without any degradation by the 800th epoch. Furthermore, we would have been in a position to choose the checkpoint associated with the best-generated images during validation due to our checkpoint tracking. Now, let’s dive into how we can achieve all this using Neptune!

First of all, you need to log in to your Neptune account, create a project, and get your API token.

After this, you can follow the provided Colab notebook (you should be able to run it on a T4 instance) or run the same code in your local environment. You just need to carry out the following steps:

  • Clone the repository and install the dependencies;
  • Download your input images into a folder;
  • Choose an identifier, such as 'sks';
  • Replace the placeholders in the code below with your specific definitions.

%%shell
export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export INSTANCE_DIR="<INPUT FOLDER>"
export OUTPUT_DIR="<OUTPUT FOLDER>"

accelerate launch dreambooth/train_dreambooth_neptune.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--instance_prompt="<IDENTIFIER>" \
--resolution=512 \
--train_batch_size=1 \
--gradient_accumulation_steps=2 --gradient_checkpointing \
--use_8bit_adam \
--learning_rate=5e-6 \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_train_steps=800 \
--validation_prompt="<SOME PROMPT YOU WANT TO USE TO EVALUATE>" \
--neptune_token="<NEPTUNE API TOKEN>" \
--neptune_project="<NEPTUNE PROJECT NAME>"

You also have the flexibility to experiment with various parameters like the learning rate, whether or not to train the text encoder, the number of epochs, and so forth. For this experiment, we’ll use the default values for most of the parameters. This means we’ll run a validation step every 100 steps and generate 4 images from the same validation prompt.

As input images, we use photos of the dog below, and we will use the prompt "A photo of sks dog in a bucket" to evaluate the learning process.

Custom dog that we want to learn to generate

As a result of the fine-tuning process, you will be able to view the logged training arguments under the parameters section:

Logged training arguments

You will also be able to view the generated validation images, which are logged every 100 steps, under the Images tab:

Images logged every 100 steps

And, finally, you can monitor each checkpoint under the Artifacts tab:

Checkpoint artifacts tracked using Neptune

Results

In the early iterations (such as step 100), the model clearly didn’t know how to generate our unique dog. But towards the end of the fine-tuning process, it’s clear that it learned to represent our dog accurately.

Images generated at step 100 using a learning rate of 5e-7
Images generated at step 800 using a learning rate of 5e-7

Conclusion

In conclusion, we’ve seen that the DreamBooth technique delivers excellent results when we want to customize a text2image model. We’ve also observed how a tool like Neptune, used to log and organize experiments, aids decision-making and lets us visualize intermediate results as the training unfolds. Moreover, using Neptune saved us a significant amount of time: because we kept intermediate checkpoints, we avoided having to retrain the model for fewer epochs. It also reduced manual errors, since validation was carried out during training itself rather than by loading each checkpoint by hand afterwards. Finally, it saved computational resources, because validation reused components already loaded for training instead of reloading them.
