Making Text-to-Image Models Smarter with ControlNet

Amritangshu Mukherjee
12 min read · Mar 19, 2023


Control diffusion models by adding extra conditions

Authors: Sai Bharghav, Prathmesh, Kavya, Alisha, Amritangshu

Text-to-image diffusion models have made impressive strides in recent years in generating high-quality images from textual descriptions. These models learn to map natural language into pixel space, enabling them to generate realistic images that match the input description. However, there is still room for improvement when it comes to control over the generated images. Adding conditional control to text-to-image diffusion models lets users guide image generation more precisely, producing more accurate and tailored outputs, and the use of sketches, outlines, and poses as conditioning signals is becoming increasingly popular.

In this blog, we will explore the concept of combining ControlNet with Stable Diffusion and its potential for enhancing text-to-image generation.

What is Stable Diffusion?

Stable Diffusion is a latent diffusion model, a type of generative model that learns to produce high-quality images by reversing a gradual noising process. The reverse (denoising) process is sequential: it starts from pure noise and applies a series of learned denoising steps, each of which nudges the sample closer to the target data distribution. The final sample is decoded into the output image. Diffusion models have proven very effective at generating high-quality images, but the iterative sampling is computationally expensive, which is why Stable Diffusion runs it in a compressed latent space rather than directly in pixel space.

A quick demo of what stable diffusion can do

Diffusion is the process that takes place inside the pink “image information creator” component. Given the token embeddings that represent the input text and a random starting image information array (also called latents), the process produces an information array that the image decoder uses to paint the final image.

A quick look at the stable diffusion architecture

This procedure is carried out in steps, each of which adds more relevant information. To get a sense of the process, we can inspect the random latents array and see how it translates to visual noise; visual inspection here means running it through the image decoder. Diffusion occurs over multiple steps, with each step operating on an input latents array and producing another latents array that more closely resembles the input text, as well as the visual information the model picked up from the images it was trained on. A set of these latents can be visualized to see what information is added at each step.
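To make that loop concrete, here is a minimal, conceptual sketch in Python; the names (unet, scheduler, decode) stand in for the components inside the diffusers pipeline and are illustrative rather than the exact API:

import torch

def generate(unet, scheduler, decode, text_embeddings, num_steps=25):
    # Start from a random latents array (pure visual noise in latent space).
    latents = torch.randn(1, 4, 64, 64)
    scheduler.set_timesteps(num_steps)
    for t in scheduler.timesteps:
        # Predict the noise present in the current latents, conditioned on the prompt...
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        # ...and remove a little of it, so the latents match the prompt slightly better.
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    # The image decoder turns the final latents into pixels.
    return decode(latents)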

Using Stable Diffusion in Python with the diffusers pipeline

#!pip install -q diffusers==0.14.0 transformers xformers git+https://github.com/huggingface/accelerate.git
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
import torch

# Load the Stable Diffusion 2 base weights in half precision.
repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")

# Swap in a faster multistep scheduler and move the pipeline to the GPU.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "Panda in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("panda_in_space.png")
A quick demo of using Stable Diffusion in our notebook

What is ControlNet?

ControlNet is a neural network structure that adds explicit control to pretrained large diffusion models by supporting additional input conditions. ControlNet learns these task-specific conditions in an end-to-end way, and the learning remains robust even when the training dataset is small (fewer than 50k samples). This makes ControlNet a powerful tool for text-to-image generation, where the control inputs can specify desired attributes of the generated images, such as shape, color, or texture.

How it works

ControlNet works by making two copies of the weights of the pretrained network blocks: a “locked” copy and a “trainable” copy. The “locked” copy preserves the original model, while the “trainable” copy is used to learn the new conditions.

The two copies are connected through “zero convolutions”: 1x1 convolution layers whose weights and biases are initialized to zero. Before training, every zero convolution outputs zeros, so ControlNet introduces no distortion to the original model. During training, the “trainable” copy learns the specified conditions while the “locked” copy remains unchanged.

This approach has several advantages. First, it allows training on small datasets of image pairs without compromising the production-ready diffusion model. Second, no layer is trained from scratch, so the original model’s weights are preserved. Third, training is cheap enough to run on small-scale or even personal devices, which makes the method practical for a wide range of applications.
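To make the locked/trainable split and the zero convolutions concrete, here is a minimal PyTorch sketch of wrapping a generic pretrained block. It is an illustration of the idea, not the exact code from the ControlNet repository, and it assumes the conditioning tensor already has the same shape as the block input:

import copy
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution whose weights and bias start at zero,
    # so it outputs zeros until training updates it.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    def __init__(self, pretrained_block, channels):
        super().__init__()
        self.locked = pretrained_block                     # frozen original weights
        self.trainable = copy.deepcopy(pretrained_block)   # trainable copy
        self.zero_in = zero_conv(channels)
        self.zero_out = zero_conv(channels)
        for p in self.locked.parameters():
            p.requires_grad = False

    def forward(self, x, condition):
        # The locked path behaves exactly like the original model.
        y = self.locked(x)
        # The trainable path sees the condition; at initialization both zero
        # convolutions output zeros, so this term contributes nothing.
        c = self.trainable(x + self.zero_in(condition))
        return y + self.zero_out(c)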

The approach to apply a ControlNet to an arbitrary neural network block
Potential application of control net on branded content — Nike in this example

Using the StableDiffusion ControlNet Pipeline

We will use the StableDiffusionControlNetPipeline, which performs text-to-image generation with Stable Diffusion under ControlNet guidance.

Pre-trained Models available:

  • lllyasviel/sd-controlnet-canny: Trained with Canny edge detection, used for controlling image edges.
  • lllyasviel/sd-controlnet-openpose: Trained with OpenPose bone images, used for controlling human poses.
  • lllyasviel/sd-controlnet-scribble: Trained with human scribbles, used for controlling image outlines.
  • lllyasviel/sd-controlnet-depth: Trained with Midas depth estimation, used for controlling image depth.
  • lllyasviel/sd-controlnet-hed: Trained with HED edge detection (soft edges), used for controlling image soft edges.
  • lllyasviel/sd-controlnet-mlsd: Trained with M-LSD line detection, used for controlling straight lines in an image.
  • lllyasviel/sd-controlnet-normal: Trained with normal maps, used for controlling image normals.
  • lllyasviel/sd-controlnet-seg: Trained with semantic segmentation, used for controlling image semantic segmentation.

Step 1: Setting up the environment and pipeline

First, we have to set up our environment and load Stable Diffusion from Hugging Face; here we are using stable-diffusion-v1-5.

The code installs some packages using pip and then imports modules and classes from these packages.

The first line installs the following packages:

  • diffusers==0.14.0: A package for implementing diffusion models in PyTorch.
  • transformers: A package for natural language processing (NLP) tasks such as text classification and language generation.
  • xformers: A package of optimized transformer building blocks (including memory-efficient attention) used to speed up inference and reduce memory usage.
  • git+https://github.com/huggingface/accelerate.git: Hugging Face’s accelerate library, which simplifies running PyTorch models efficiently across different hardware setups.
!pip install -q diffusers==0.14.0 transformers xformers git+https://github.com/huggingface/accelerate.git
!pip install -q opencv-contrib-python
!pip install -q controlnet_aux

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch

# Load the Canny ControlNet and attach it to the Stable Diffusion v1.5 pipeline.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

Step 2: Loading our test image

from diffusers.utils import load_image

# Load the reference photo we want to condition on.
image = load_image(
    "/content/daniel-craig-007.jpg-303a730.png"
)
image

Step 3: Using the Canny edge detection algorithm

This code applies the Canny edge detection algorithm to the image: it converts the image to a NumPy array, sets the low and high threshold values, and uses the cv2.Canny() function to produce a binary edge map with white pixels marking edges. The single-channel edge map is then stacked into three channels and converted back to a PIL Image with Image.fromarray().

import cv2
from PIL import Image
import numpy as np

image = np.array(image)

low_threshold = 100
high_threshold = 200

# Detect edges, then stack the single-channel edge map into three channels.
image = cv2.Canny(image, low_threshold, high_threshold)
image = image[:, :, None]
image = np.concatenate([image, image, image], axis=2)
canny_image = Image.fromarray(image)
canny_image

Step 4: Using the pipeline to generate outputs from the prompts and the edge map

The pipe object produces a set of four images, generated from the prompts and the Canny edge map. The prompts are provided as a list of four strings, each a celebrity name followed by a suffix requesting “best quality” and “extremely detailed” images. The negative_prompt parameter lists qualities to avoid: “monochrome”, “lowres”, “bad anatomy”, “worst quality”, and “low quality”. The generator parameter controls the randomness (seed) of the generated images, and num_inference_steps sets the number of denoising steps used for each image.

def image_grid(imgs, rows, cols):
    assert len(imgs) == rows * cols

    w, h = imgs[0].size
    grid = Image.new("RGB", size=(cols * w, rows * h))

    for i, img in enumerate(imgs):
        grid.paste(img, box=(i % cols * w, i // cols * h))
    return grid

prompt = ", best quality, extremely detailed"
prompt = [t + prompt for t in ["Tom Cruise", "Donald Trump", "rihanna", "taylor swift"]]
generator = [torch.Generator(device="cpu").manual_seed(2) for i in range(len(prompt))]

output = pipe(
    prompt,
    canny_image,
    negative_prompt=["monochrome, lowres, bad anatomy, worst quality, low quality"] * len(prompt),
    generator=generator,
    num_inference_steps=20,
)

image_grid(output.images, 2, 2)
Using the Canny edge map and the prompts, we are able to generate this output

Step 5: Using OpenposeDetector to detect the pose in the image

This code loads a pre-trained OpenposeDetector from the Hugging Face model hub using the from_pretrained() method and applies it to the photo to detect the human pose. It then loads a pre-trained ControlNetModel and a StableDiffusionControlNetPipeline from the diffusers package, swaps in the UniPCMultistepScheduler, and enables some memory and performance optimizations. Note that we reload the original photo first, since the image variable currently holds the Canny edge map from Step 3.

from controlnet_aux import OpenposeDetector
from diffusers import UniPCMultistepScheduler

# Reload the original photo: `image` currently holds the Canny edge map from Step 3.
image = load_image("/content/daniel-craig-007.jpg-303a730.png")

model = OpenposeDetector.from_pretrained("lllyasviel/ControlNet")
poses = model(image)
#image_grid(poses, 2, 2)
poses

controlnet = ControlNetModel.from_pretrained(
    "fusing/stable-diffusion-v1-5-controlnet-openpose", torch_dtype=torch.float16
)
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    model_id,
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
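The code above prepares the pose-conditioned pipeline but stops short of generating anything with it. Here is a short sketch of that generation call, mirroring Step 4; the prompt and seed below are just illustrative choices:

import torch

# Generate an image conditioned on the detected pose.
generator = torch.Generator(device="cpu").manual_seed(2)
output = pipe(
    "a man in a suit, best quality, extremely detailed",
    poses,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    generator=generator,
    num_inference_steps=20,
)
output.images[0]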

Training a Model: Combining ControlNet with Stable Diffusion

We wanted to see how we can train a ControlNet model on a dataset of our own. To demonstrate controlling SD, we took on the simple task of filling circles with colors. This simple task shows how ControlNet can learn task-specific conditions in an end-to-end way. We will use a prompt that describes our target, together with a control image (the source image).

Stable Diffusion has already been trained on billions of images, and it already knows what colors and shapes are. However, it does not understand the meaning of the control image (source image) in our prompt. By adding ControlNet, we can teach SD to understand this input condition and fill the circle with the specified color.

The task we are trying to achieve

The Dataset

The “fill50k” dataset contains a collection of 50,000 image pairs of circles, which are stored in two folders: “source” and “target”. The “source” folder contains images of circles with lines only, while the “target” folder contains images of circles filled with a specific color.

In addition to the images, the dataset includes a “prompt.json” file, which provides information about each image. Each prompt is in the form of a description of the circle and the color of the background. For example, a prompt may read “a blue circle on a yellow background.”

To use this dataset with PyTorch, you will need to write a script that reads the images and their corresponding prompts from the “fill50k” folder and prepares them for training. This will involve loading the images into PyTorch tensors, creating batches of data, and performing any necessary pre-processing such as normalization or data augmentation.

We have the input and prompt here on which we will train our model
Expected output

Overall, the “fill50k” dataset is designed to be used for training machine learning models to fill circles with specific colors based on the prompts provided in the dataset. It is a useful resource for researchers and developers working on image generation and related tasks using deep learning techniques.

Loading the dataset

import json
import cv2
import numpy as np

from torch.utils.data import Dataset


class MyDataset(Dataset):
    def __init__(self):
        self.data = []
        with open('./training/fill50k/prompt.json', 'rt') as f:
            for line in f:
                self.data.append(json.loads(line))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]

        source_filename = item['source']
        target_filename = item['target']
        prompt = item['prompt']

        source = cv2.imread('./training/fill50k/' + source_filename)
        target = cv2.imread('./training/fill50k/' + target_filename)

        # Do not forget that OpenCV reads images in BGR order.
        source = cv2.cvtColor(source, cv2.COLOR_BGR2RGB)
        target = cv2.cvtColor(target, cv2.COLOR_BGR2RGB)

        # Normalize source images to [0, 1].
        source = source.astype(np.float32) / 255.0

        # Normalize target images to [-1, 1].
        target = (target.astype(np.float32) / 127.5) - 1.0

        return dict(jpg=target, txt=prompt, hint=source)
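As a quick sanity check (assuming the fill50k folder is in place and the MyDataset class above has been defined), we can instantiate the dataset and inspect one sample:

dataset = MyDataset()
print(len(dataset))        # 50000 image pairs

item = dataset[1234]
print(item['txt'])         # e.g. "a blue circle on a yellow background"
print(item['jpg'].shape)   # target image: HxWx3 float array in [-1, 1]
print(item['hint'].shape)  # source (control) image: HxWx3 float array in [0, 1]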

Which SD model to use

In this example we attach ControlNet to Stable Diffusion 1.5; the resulting initialization checkpoint, ./models/control_sd15_ini.ckpt, is what the training script below resumes from. Note that all weights inside the ControlNet are also copied from SD, so no layer is trained from scratch and you are still finetuning the entire model.

Training the model

This Python script uses PyTorch Lightning to train a custom machine learning model on a custom dataset. It loads the pre-trained model, sets the configurations such as batch size, logger frequency, learning rate, and more. It then creates a DataLoader to load data from the custom dataset, creates an image logger for recording the training progress, and sets up the PyTorch Lightning Trainer to handle the training process. Finally, it trains the model using the Trainer and DataLoader, and logs the progress using the ImageLogger.

import pytorch_lightning as pl
from torch.utils.data import DataLoader
from tutorial_dataset import MyDataset
from cldm.logger import ImageLogger
from cldm.model import create_model, load_state_dict


# Configs
resume_path = './models/control_sd15_ini.ckpt'
batch_size = 4
logger_freq = 300
learning_rate = 1e-5
sd_locked = True
only_mid_control = False


# First use cpu to load models. Pytorch Lightning will automatically move it to GPUs.
model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict(resume_path, location='cpu'))
model.learning_rate = learning_rate
model.sd_locked = sd_locked
model.only_mid_control = only_mid_control


# Misc
dataset = MyDataset()
dataloader = DataLoader(dataset, num_workers=0, batch_size=batch_size, shuffle=True)
logger = ImageLogger(batch_frequency=logger_freq)
trainer = pl.Trainer(gpus=1, precision=32, callbacks=[logger])


# Train!
trainer.fit(model, dataloader)

As the model trains, it learns to combine the prompt and the control signal to generate an output image that best matches the target image in the dataset. Once the model is fully trained, it can be used to generate new images: given a prompt and a control signal, it produces an output image that corresponds to that input.

Overall, the process of generating an image from a prompt and control signal involves training a deep learning model on a dataset of input-output pairs, and adjusting the model’s parameters so that it learns to combine the inputs to generate the outputs.
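As a rough sketch of that inference step, assuming the trained ControlNet has been converted to the diffusers format (the local path "./fill50k-controlnet" and the source filename below are hypothetical), generation follows the same pattern as the pipelines used earlier:

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Hypothetical local path to our trained (and converted) fill50k ControlNet.
controlnet = ControlNetModel.from_pretrained("./fill50k-controlnet", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Control image: a circle outline from the dataset's source folder (filename is illustrative).
control = load_image("./training/fill50k/source/1234.png")
image = pipe("a blue circle on a yellow background", control, num_inference_steps=20).images[0]
image.save("filled_circle.png")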

Learnings

Data and Model Preparation:

  • Pre-trained models can be useful for a variety of tasks and can be easily integrated into projects.
  • Different pre-trained models are available for controlling different aspects of an image, such as edges, depth, and human poses.
  • Open-source libraries like PyTorch can be used for training and deploying machine learning models.

Image Generation:

  • Stable Diffusion is a powerful technique for generating high-quality images, and it can be steered with control inputs.
  • Pairing a diffusion model with a control network allows for more precise control over the generated image.
  • Multiple prompts can be used to generate different images from the same control input.

Image Processing:

  • Image processing techniques like Canny edge detection and OpenPose can be used to extract specific information from images, which can then be used as control inputs for image generation.
  • Different types of image processing techniques are suited for different types of control inputs, such as edges, poses, and outlines.
