ControlNet: Adding Conditional Control to Text-to-Image Diffusion Models
Paper: https://arxiv.org/abs/2302.05543
Text-to-image generation is a challenging and exciting task, aiming to transform natural language descriptions into realistic images. However, a significant limitation of most existing methods is their inability to exert precise control over the image generation process using additional inputs such as edge maps, depth maps, segmentation maps, or keypoints. This limitation not only constrains the diversity and quality of the generated images but also hampers the potential applications of text-to-image generation.
In this blog post, I am delighted to introduce ControlNet — a neural network designed to provide enhanced control over pretrained diffusion models, allowing for seamless integration of additional input conditions. We will also delve into the process of training ControlNet with DiffEngine.
ControlNet Method
ControlNet fine-tunes a pretrained diffusion model so that it can generate images conditioned on additional inputs such as edge maps or keypoints. During fine-tuning, the pretrained model stays frozen and only the weights of the added conditioning branch are updated. To avoid injecting harmful noise at the start of training, this branch is connected to the frozen model through zero convolutions: 1x1 convolution layers whose weights are initialized to zero and gradually learned during training.
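To make the zero-convolution idea concrete, here is a minimal PyTorch sketch (not DiffEngine's or the paper's exact implementation): a 1x1 convolution whose weight and bias start at zero, so the extra branch contributes nothing to the frozen model's output at the beginning of training.

import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution with weight and bias initialized to zero.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

# At initialization the zero convolution outputs all zeros, so adding the
# ControlNet branch leaves the frozen model's predictions unchanged.
feature = torch.randn(1, 320, 64, 64)
print(zero_conv(320)(feature).abs().max())  # tensor(0.)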
In the original paper, the ControlNet weights are copied directly from the base model's encoder. With a large model such as SDXL, however, these copied weights also become substantial: as Table 1 of the SDXL paper shows, the SDXL UNet encoder uses 0, 2, and 10 transformer blocks per feature level. To keep the checkpoint size manageable, the Diffusers team limits the number of transformer blocks in the ControlNet.
For ControlNet Small, the Diffusers team removes all transformer blocks from the encoder, giving transformer blocks of [0, 0, 0]; this checkpoint is roughly seven times smaller than the original SDXL ControlNet checkpoint. ControlNet Mid keeps a single transformer block at the lowest level, i.e. transformer blocks of [0, 0, 1].
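To see the size difference for yourself, you can compare parameter counts of the checkpoints the Diffusers team published. A minimal sketch, assuming the diffusers/controlnet-canny-sdxl-1.0 and diffusers/controlnet-canny-sdxl-1.0-small repository names on the Hugging Face Hub:

from diffusers import ControlNetModel

def n_params(model):
    # Total number of parameters in the model.
    return sum(p.numel() for p in model.parameters())

# Repository names are assumed from the SDXL ControlNet checkpoints on the Hub.
full = ControlNetModel.from_pretrained('diffusers/controlnet-canny-sdxl-1.0')
small = ControlNetModel.from_pretrained('diffusers/controlnet-canny-sdxl-1.0-small')

print(f'full SDXL ControlNet : {n_params(full) / 1e6:.0f}M parameters')
print(f'small SDXL ControlNet: {n_params(small) / 1e6:.0f}M parameters')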
Train ControlNet with DiffEngine
DiffEngine GitHub: https://github.com/okotaku/diffengine
DiffEngine documentation: https://diffengine.readthedocs.io/en/latest/
Installation
Before installing DiffEngine, please ensure that PyTorch has been successfully installed following the official guide.
https://pytorch.org/get-started/locally/
Install DiffEngine
pip install openmim
pip install git+https://github.com/okotaku/diffengine.git
Training with a pre-defined config
A variety of pre-defined configs can be found in the configs directory of the DiffEngine repository.
ControlNet Configs: https://github.com/okotaku/diffengine/tree/main/configs/stable_diffusion_controlnet
ControlNet SDXL Configs: https://github.com/okotaku/diffengine/tree/main/configs/stable_diffusion_xl_controlnet
ControlNet Small / Mid SDXL Configs: https://github.com/okotaku/diffengine/tree/main/configs/stable_diffusion_xl_controlnet_small
For example, if you wish to train a ControlNet Small SDXL model, use the config file configs/stable_diffusion_xl_controlnet_small/stable_diffusion_xl_controlnet_small_fill50k.py.
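If you only want to tweak a few settings, DiffEngine follows the MMEngine config convention, so you can also write a small config that inherits from a pre-defined one. The sketch below is illustrative: the base file name comes from the repository, while the overridden fields assume the usual MMEngine layout, so check the base config for the actual field names.

# my_controlnet_small_config.py (hypothetical file name)
# Inherit everything from the pre-defined config and override selected fields.
_base_ = ['stable_diffusion_xl_controlnet_small_fill50k.py']

# Illustrative MMEngine-style overrides; verify the field names in the base config.
train_dataloader = dict(batch_size=2)
optim_wrapper = dict(optimizer=dict(lr=1e-5))

Such a custom config can then be passed to mim train in the same way as the pre-defined ones.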
To train with a selected config, open a terminal and run the following command:
mim train diffengine stable_diffusion_xl_controlnet_small_fill50k.py
Training will start, and you can follow its progress in the terminal. With the stable_diffusion_xl_controlnet_small_fill50k config, all outputs are written to the work_dirs/stable_diffusion_xl_controlnet_small_fill50k directory.
Inference with diffusers.pipeline
Once you have trained a model, simply point to the saved checkpoint and run inference with the diffusers pipeline:
import torch
from diffusers import AutoencoderKL, ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

checkpoint = 'work_dirs/stable_diffusion_xl_controlnet_small_fill50k/step37500'
prompt = 'cyan circle with brown floral background'

# Conditioning image from the fill50k dataset, resized to the SDXL resolution.
condition_image = load_image(
    'https://datasets-server.huggingface.co/assets/fusing/fill50k/--/default/train/74/conditioning_image/image.jpg'
).resize((1024, 1024))

# Load the ControlNet weights saved by DiffEngine.
controlnet = ControlNetModel.from_pretrained(
    checkpoint, subfolder='controlnet', torch_dtype=torch.float16)

# fp16-friendly SDXL VAE to avoid numerical issues in half precision.
vae = AutoencoderKL.from_pretrained(
    'madebyollin/sdxl-vae-fp16-fix', torch_dtype=torch.float16)

pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16)
pipe.to('cuda')

image = pipe(
    prompt,
    image=condition_image,  # pass the condition image by keyword; the second positional argument is prompt_2
    num_inference_steps=50,
).images[0]
image.save('demo.png')
An illustrative output example is provided below:
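If you want the output to follow the condition image more loosely, the diffusers pipeline also accepts a controlnet_conditioning_scale argument (default 1.0) that scales the ControlNet residuals. For example:

image = pipe(
    prompt,
    image=condition_image,
    num_inference_steps=50,
    controlnet_conditioning_scale=0.5,  # values below 1.0 weaken the ControlNet guidance
).images[0]
image.save('demo_scale_0.5.png')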
Conclusion
DiffEngine supports ControlNet training. Please try it out ;)
Thank you for reading.
Sponsors
I am a member of the Z by HP Data Science Global Ambassadors program. Special thanks to Z by HP for sponsoring me with a Z8 G4 Workstation with dual A6000 GPUs and a ZBook with an RTX 5000 GPU.
Reference
- Huggingface ControlNet: https://huggingface.co/docs/diffusers/api/models/controlnet
- Huggingface ControlNet with Stable Diffusion XL: https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet_sdxl
- Train your ControlNet with diffusers 🧨 : https://huggingface.co/blog/train-your-controlnet