Sketch-guided Image Generation with Stable Diffusion

Use T2I adapters and SD XL for sketch-to-image generation

Geronimo
Nov 25, 2023 · 7 min read

Stable Diffusion (SD) generates images from text. Amazing images, but they might not always be exactly what you had in mind. Read on to learn how to guide the diffusion process with a sketch using text-to-image (T2I) adapters.

Sketch-guided image generation. Prompt “a robot dog in real world, 4k photo, highly detailed”

I began my personal journey into the SD world with a drawing from my girlfriend.

a sketch from my girlfriend turned into .. something, at least. Tools: stable-diffusion-xl-base-1.0 with t2iadapter sketch_sdxl_1.0

Picture this: you have a vision in your mind, and you try to convey it with a prompt like “tree, swing chair, sun, water, and a few fish”. You quickly realise that it’s challenging to capture all the intricacies of your imagination in a string of words.

This tutorial will show you how to overcome this:

  • Use Stable Diffusion XL and text-to-image (T2I) adapters to
    generate a variation of your input image guided by a text prompt
  • Tune the most important parameters
  • Focus on Python code, parameters, and example images rather than the popular SD UIs (they gave me a headache)

T2I Adapters

The core idea is to take an existing image and transfer some of its elements to a new image.

Prompt “a robot dog in real world, 4k photo, highly detailed”

Text-to-image (T2I) adapters were introduced in early 2023 (paper). They guide the diffusion process using another image in addition to the text prompt. There are various adapters for different purposes; read more on Hugging Face. We will use the sketch adapter sketch_sdxl_1.0.

Figure 1 of T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models by Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie.

I won’t go into other aspects of Stable Diffusion here, as there are plenty of resources available; start with the Reddit Wiki.

Code 🚀

Let’s dive right in.

Prerequisites

pip install these packages:

controlnet-aux==0.0.7
diffusers==0.23.1
open-clip-torch==2.20.0
Pillow==9.5.0
pytorch-lightning==1.9.4
torch==2.1.0
torch-grammar==0.3.3
torchaudio==2.1.0
torchdiffeq==0.2.3
torchmetrics==1.2.0
torchsde==0.2.5
torchvision==0.16.0
transformers==4.30.2

Python version 3.10.12, running on Ubuntu 22.04.3 LTS. All of this was done on a single Nvidia GeForce RTX 3090 GPU.

Load the input image

We’ll begin with an image of a dog, a German Shepherd.

from PIL import Image
import io, requests

# Load a random image from a URL
url="http://..."
image_input = Image.open(io.BytesIO(requests.get(url, stream=True).content))
A good boy

Preprocess image: Convert to sketch

We will use the Pixel Difference Network (PiDiNet) here, a fast and efficient model for finding edges in images. PiDiNet is a lightweight convolutional network that builds ideas from traditional edge detection operators directly into its convolutions.

from controlnet_aux.pidi import PidiNetDetector

# Initialize the PiDiNet Detector for edge detection
preprocessor = PidiNetDetector.from_pretrained("lllyasviel/Annotators").to("cuda")

# Preprocess the image to convert it into a sketch-like format
image_preprocessed = preprocessor(
    image_input,
    detect_resolution=1024,
    image_resolution=1024,
    apply_filter=True,
).convert("L")
Turning the input into a sketch-like image with PidiNetDetector

Other preprocessors are available, some of them for purposes other than sketch guidance; read more here.

Other controlnet_aux preprocessors
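Swapping the detector only changes the preprocessing step. As an illustration, here is a minimal sketch using the HED-based soft-edge detector from controlnet_aux instead of PiDiNet; treat the exact arguments as an assumption and check the version of controlnet-aux you have installed.

from controlnet_aux import HEDdetector

# Swap in a HED-based soft-edge detector instead of PiDiNet
# (assumption: HEDdetector and the "lllyasviel/Annotators" weights are
# available in your controlnet_aux version)
hed = HEDdetector.from_pretrained("lllyasviel/Annotators")

# Same call pattern as before; the output is again a PIL image
image_hed = hed(
    image_input,
    detect_resolution=1024,
    image_resolution=1024,
).convert("L")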

For this tutorial we will stay with the PidiNetDetector and sketches.

Load the SD model and t2i adapters

We’ll start by loading the stabilityai/stable-diffusion-xl-base-1.0 model and t2i adapter sketch_sdxl_1.0, code taken from the Hugging Face TencentARC/t2i-adapter page.

import torch
from diffusers import (
    T2IAdapter,
    StableDiffusionXLAdapterPipeline,
    DDPMScheduler,
    AutoencoderKL,
    EulerAncestralDiscreteScheduler,
)
from diffusers.utils import load_image

# the SD XL model
model_id = "stabilityai/stable-diffusion-xl-base-1.0"

# load the T2I adapter
adapter = T2IAdapter.from_pretrained(
    "Adapter/t2iadapter",
    subfolder="sketch_sdxl_1.0",
    torch_dtype=torch.float16,
    adapter_type="full_adapter_xl",
)

# load variational autoencoder (VAE)
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix",
    torch_dtype=torch.float16,
)

# load scheduler
euler_a = EulerAncestralDiscreteScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
)

# instantiate HF pipeline to combine all the components
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    model_id,
    adapter=adapter,
    vae=vae,
    scheduler=euler_a,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

Diffuse

Now, let’s generate an image. As a prompt we use “a robot dog in real world, 4k photo, highly detailed”.

image = pipe(
    prompt="a robot dog in real world, 4k photo, highly detailed",
    # negative prompt to steer the model away from unwanted traits
    negative_prompt="disfigured, extra digit, fewer digits, cropped, worst quality, low quality",
    image=image_preprocessed,        # input sketch
    guidance_scale=7.5,              # we will talk about this one
    num_inference_steps=40,          # and this one
    adapter_conditioning_scale=0.9,  # yes
    adapter_conditioning_factor=0.9, # same
).images[0]
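The pipeline returns a regular PIL image, so you can write it to disk directly (the filename is just an example):

image.save("robot_dog.png")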

And here’s the result — a robot dog!

Final image. Input: sketch + “a robot dog in real world, 4k photo, highly detailed”

Now, let’s explore how to control the diffusion process. We’ll go through each parameter and see how it changes the final image.

Generation parameters

Inference Steps

The most crucial parameter is num_inference_steps. Stable Diffusion starts from an image of pure noise and denoises it step by step towards the final image. More steps generally produce a cleaner, more detailed result, but with diminishing returns, and each additional step adds generation time.
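If you want to see this trade-off yourself, you can reuse pipe and image_preprocessed from above and sweep over a few step counts. This is only a sketch; the value grid and filenames are arbitrary, and the seed is fixed so the images stay comparable across runs.

# Sweep num_inference_steps with everything else fixed
for steps in [10, 20, 40, 80]:
    result = pipe(
        prompt="a robot dog in real world, 4k photo, highly detailed",
        negative_prompt="disfigured, extra digit, fewer digits, cropped, worst quality, low quality",
        image=image_preprocessed,
        guidance_scale=7.5,
        num_inference_steps=steps,
        adapter_conditioning_scale=0.9,
        adapter_conditioning_factor=0.9,
        generator=torch.Generator("cuda").manual_seed(0),  # fixed seed for comparability
    ).images[0]
    result.save(f"steps_{steps}.png")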

Guidance Scale

guidance_scale determines how much the prompt influences the image.

  • A value of 0 results in random images unrelated to the prompt.
  • Lower values generate more creative, but sometimes unrelated, images.
  • Higher values produce images that closely match the prompt but may lack creativity.
  • The optimal range is typically between 5 and 15.

You’ll notice that when guidance_scale is very low, the "robot" in our robot dog disappears.

Increasing the influence of the prompt on the final image by increasing guidance_scale
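A sweep like the one in the figure can be reproduced along these lines (again a sketch that reuses pipe and image_preprocessed; the particular guidance values are just examples):

# Sweep guidance_scale: 0 ignores the prompt, higher values follow it more strictly
for scale in [0.0, 2.5, 5.0, 7.5, 15.0]:
    result = pipe(
        prompt="a robot dog in real world, 4k photo, highly detailed",
        image=image_preprocessed,
        guidance_scale=scale,
        num_inference_steps=40,
        adapter_conditioning_scale=0.9,
        adapter_conditioning_factor=0.9,
        generator=torch.Generator("cuda").manual_seed(0),  # same seed for every value
    ).images[0]
    result.save(f"guidance_{scale}.png")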

In a similar way, the prompt itself can be tuned down by making it more generic. For example, if we use a prompt like “high-quality image” (without telling the model that we want to see a dog), the model might generate something completely unexpected.

Prompt “highly detailed” (no mention of a dog)
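An image like the one above takes nothing more than swapping the prompt, roughly:

# Same settings as before, but with a generic prompt that never mentions a dog
image_generic = pipe(
    prompt="highly detailed",
    image=image_preprocessed,
    guidance_scale=7.5,
    num_inference_steps=40,
    adapter_conditioning_scale=0.9,
    adapter_conditioning_factor=0.9,
).images[0]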

Adapter Conditioning Scale and Factor

These parameters control how much the sketch guides the final image.

  • adapter_conditioning_scale: Determines the influence of the adapter on image generation. High values mean stronger conditioning.
  • adapter_conditioning_factor: Specifies the fraction of timesteps for which the adapter is applied. A value of 0.5 means the adapter is applied only to the first half of the timesteps; with 1.0 it is applied to all of them.

0.9 for both is a good start in my experience. In practice you will simply have to try and see how different values change your image.

Let’s see how these parameters mess with our robot dog. Here’s one sweep for adapter_conditioning_scale and one for adapter_conditioning_factor.
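Sweeps like the two below can be generated along these lines (a sketch; the value grids are examples, and the fixed seed only keeps the images comparable across runs):

# One sweep per adapter parameter, everything else fixed
for name, values in [
    ("adapter_conditioning_scale", [0.0, 0.3, 0.6, 0.9, 1.0]),
    ("adapter_conditioning_factor", [0.1, 0.3, 0.5, 0.7, 1.0]),
]:
    for value in values:
        kwargs = {"adapter_conditioning_scale": 0.9, "adapter_conditioning_factor": 0.9}
        kwargs[name] = value  # override the parameter being swept
        result = pipe(
            prompt="a robot dog in real world, 4k photo, highly detailed",
            image=image_preprocessed,
            guidance_scale=7.5,
            num_inference_steps=40,
            generator=torch.Generator("cuda").manual_seed(0),
            **kwargs,
        ).images[0]
        result.save(f"{name}_{value}.png")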

Effect of adapter_conditioning_scale. 0=prompt only. With increasing values, the pose of the dog is respected, subsequently details like the position of ears and legs make it into the final image.
Effect of adapter_conditioning_factor. While adapter_conditioning_scale leads to gradual changes, the effect of adapter_conditioning_factor is more drastic and hardly any change is seen with values >0.5

It’s interesting to note that for adapter_conditioning_factor values >0.5, the final image remains the same. Increasing adapter_conditioning_scale leads to more continuous and subtle changes.

This pattern holds for various images, not just the robot dog, as I will show you in just a second.

Let’s do a more thorough sweep first.

Still, same picture here, no change for adapter_conditioning_factor values >0.5.

Maybe it’s something about this particular image? Try Jimi Hendrix.

Same. So, for the examples we’ve explored, the provided parameter values seem to be good starting points.

However, keep in mind that optimal values may vary depending on your specific goals, and the devil is often in the details when fine-tuning for desired outcomes.

Summary

You now possess the fundamental tools necessary to create images using SD XL and T2I adapters.

  • Stable Diffusion is an amazing AI tool; it's an entire, rapidly expanding universe of models and tooling. If you don’t feel it yet, check out Civitai
  • Generative AI is easy, even without a fancy user interface
  • T2I adapters are a great innovation, providing additional control over the image generation process beyond the text prompt
  • Use the T2I sketch-adapter to guide the diffusion process from an input image

I gathered all the code in a notebook. Use it to build your own generative AI pipeline, or even an app; I provide a basic implementation built with Streamlit to get you started.

I hope you enjoyed this story! If you have any feedback, additional ideas, or questions, feel free to leave a comment here or reach out on Twitter.
