Using Civitai Models with Diffusers
Pipeline for generating synthetic images on Google Colab.
In previous articles we covered running stable diffusion models with the diffusers package, upscaling images with Real-ESRGAN, and using long prompts and CLIP skip. All of these steps need to be combined to obtain high-resolution, high-quality outputs from stable diffusion.
In this article we put everything together into a single pipeline and demonstrate how to download and use models from Civitai with the diffusers package.
Using Civitai Models with Diffusers
Civitai is a great place to hunt for all sorts of stable diffusion models trained by the community. Although these models are typically used with UIs, with a bit of work they can be used with the diffusers package as well.
Package Versions
For this article we will be using the following package versions.
pip install -q transformers==4.31.0
pip install -q accelerate==0.21.0
pip install -q diffusers==0.20.0
pip install -q huggingface_hub==0.16.4
pip install -q omegaconf==2.3.0
Preparing the Environment
First of all, we load the necessary packages and check whether a GPU is available, since stable diffusion models should be run on a GPU. If a GPU is found, half precision (float16) will be used.
import diffusers
import transformers
import sys
import os
import shutil
import time
import torch
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
if torch.cuda.is_available():
    device_name = torch.device("cuda")
    torch_dtype = torch.float16
else:
    device_name = torch.device("cpu")
    torch_dtype = torch.float32
Next, we download the model safetensors file from Civitai. For this article we use this model, released by Celsia, which generates beautiful landscape scenes in the style of Chinese watercolour paintings.
wget https://civitai.com/api/download/models/130803 --content-disposition
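If you prefer to stay in Python rather than shelling out to wget, a minimal download sketch with the requests package is shown below; the output filename is an assumption chosen to match what wget's --content-disposition flag would produce and what the conversion step below expects.
import requests

# Stream the checkpoint to disk to avoid holding the whole file in memory.
# The filename is assumed to match the one used in the conversion step below.
url = "https://civitai.com/api/download/models/130803"
with requests.get(url, stream = True) as response:
    response.raise_for_status()
    with open("ChineseLandscapeArt_v10.safetensors", "wb") as f:
        for chunk in response.iter_content(chunk_size = 1 << 20):
            f.write(chunk)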
We will need to convert the downloaded safetensors file into another format which can be used with diffusers. For this conversion we download and use a Python script provided in the HuggingFace diffusers GitHub repository. Note that the version of this script should match the version of diffusers, which in this case is 0.20.0.
wget https://raw.githubusercontent.com/huggingface/diffusers/v0.20.0/scripts/convert_original_stable_diffusion_to_diffusers.py
python convert_original_stable_diffusion_to_diffusers.py \
    --checkpoint_path ChineseLandscapeArt_v10.safetensors \
    --dump_path ChineseLandscapeArt_v10/ \
    --from_safetensors
Using the downloaded conversion script, the converted model files will be saved under the directory ChineseLandscapeArt_v10/. These files are now ready to be loaded by the diffusers stable diffusion pipeline.
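As a side note, recent diffusers releases can also load a .safetensors checkpoint directly, which may let you skip the conversion script entirely; treat the sketch below as an assumption to verify against your installed version (StableDiffusionPipeline.from_single_file should be available in 0.20.0).
# Untested alternative: load the checkpoint directly from the safetensors file,
# assuming from_single_file is available in the installed diffusers version.
pipe_direct = diffusers.StableDiffusionPipeline.from_single_file(
    "ChineseLandscapeArt_v10.safetensors",
    torch_dtype = torch_dtype
)
The rest of this article sticks with the converted directory.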
CLIP Skip
Next is the code to implement CLIP skip with diffusers, as detailed by Patrick von Platen on GitHub; CLIP skip works by skipping the last layers of the CLIP text encoder. For now we use CLIP skip = 1, which uses all layers of the text encoder.
# Follows community convention.
# Clip skip = 1 uses all the text encoder layers.
# Clip skip = 2 skips the last text encoder layer.
clip_skip = 1

if clip_skip > 1:
    text_encoder = transformers.CLIPTextModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        subfolder = "text_encoder",
        num_hidden_layers = 12 - (clip_skip - 1),
        torch_dtype = torch_dtype
    )
The converted model files are then loaded into the diffusers diffusion pipeline, passing in the modified text encoder when CLIP skip is greater than 1.
# Load the pipeline.
model_path = "ChineseLandscapeArt_v10"

if clip_skip > 1:
    # TODO clean this up with the condition below.
    pipe = diffusers.DiffusionPipeline.from_pretrained(
        model_path,
        torch_dtype = torch_dtype,
        safety_checker = None,
        text_encoder = text_encoder,
    )
else:
    pipe = diffusers.DiffusionPipeline.from_pretrained(
        model_path,
        torch_dtype = torch_dtype,
        safety_checker = None
    )

pipe = pipe.to(device_name)

# Change the pipe scheduler to EADS (Euler Ancestral Discrete Scheduler).
pipe.scheduler = diffusers.EulerAncestralDiscreteScheduler.from_config(
    pipe.scheduler.config
)
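On memory-constrained Colab GPUs, it can also help to enable attention slicing on the pipeline before generating at higher resolutions; this is an optional tweak, not part of the original pipeline.
# Optional: trade a little speed for lower peak VRAM usage.
pipe.enable_attention_slicing()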
Prompt Embeddings for Long Prompts
Stable diffusion's CLIP text encoder has a limit of 77 tokens and will truncate encoded prompts longer than this limit, so prompt embeddings are required to overcome this limitation. We use a modified version of the solution proposed by Andre van Zuydam on GitHub to create prompt embeddings with the diffusers pipeline.
# Prompt embeddings to overcome CLIP 77 token limit.
# https://github.com/huggingface/diffusers/issues/2136
def get_prompt_embeddings(
    pipe,
    prompt,
    negative_prompt,
    split_character = ",",
    device = torch.device("cpu")
):
    max_length = pipe.tokenizer.model_max_length
    # Simple method of checking if the prompt is longer than the negative
    # prompt - split the input strings using `split_character`.
    count_prompt = len(prompt.split(split_character))
    count_negative_prompt = len(negative_prompt.split(split_character))

    # If prompt is longer than negative prompt.
    if count_prompt >= count_negative_prompt:
        input_ids = pipe.tokenizer(
            prompt, return_tensors = "pt", truncation = False
        ).input_ids.to(device)
        shape_max_length = input_ids.shape[-1]
        negative_ids = pipe.tokenizer(
            negative_prompt,
            truncation = False,
            padding = "max_length",
            max_length = shape_max_length,
            return_tensors = "pt"
        ).input_ids.to(device)
    # If negative prompt is longer than prompt.
    else:
        negative_ids = pipe.tokenizer(
            negative_prompt, return_tensors = "pt", truncation = False
        ).input_ids.to(device)
        shape_max_length = negative_ids.shape[-1]
        input_ids = pipe.tokenizer(
            prompt,
            return_tensors = "pt",
            truncation = False,
            padding = "max_length",
            max_length = shape_max_length
        ).input_ids.to(device)

    # Concatenate the individual prompt embeddings.
    concat_embeds = []
    neg_embeds = []
    for i in range(0, shape_max_length, max_length):
        concat_embeds.append(
            pipe.text_encoder(input_ids[:, i: i + max_length])[0]
        )
        neg_embeds.append(
            pipe.text_encoder(negative_ids[:, i: i + max_length])[0]
        )

    return torch.cat(concat_embeds, dim = 1), torch.cat(neg_embeds, dim = 1)
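As a quick sanity check, the returned prompt and negative prompt embeddings should share the same shape, since the shorter of the two inputs is padded to the length of the longer one before being encoded in 77-token chunks. A small illustrative test (the prompts here are throwaway examples):
# Illustrative only: a deliberately long prompt forces the chunked encoding path.
test_embeds, test_neg_embeds = get_prompt_embeddings(
    pipe,
    "snowy mountain, " * 30,
    "worst quality",
    device = device_name
)
print(test_embeds.shape, test_neg_embeds.shape)  # shapes should match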
Running the Diffusion Pipeline
Finally, all the fundamental building blocks are in place for us to run the diffusion pipeline with the model downloaded from Civitai.
Prompt and Prompt Embeddings
As always, the first step in the pipeline is to create the prompts and negative prompts, as well as the corresponding prompt embeddings.
prompt = """beautiful Chinese Landscape Art, best quality, intricate,
water colors, snowy mountains, glacier, snow, starry night sky, stars,
milkyway"""
negative_prompt = """deformed, weird, bad resolution, bad depiction,
not Chinese style, weird, has people, worst quality, worst resolution,
too blurry, not relevant"""
prompt_embeds, negative_prompt_embeds = get_prompt_embeddings(
    pipe,
    prompt,
    negative_prompt,
    split_character = ",",
    device = device_name
)
Generating Synthetic Images
Next is the actual loop for generating synthetic images with all the different parts put together.
For now we generate 10 images over 20 inference steps, with a guidance scale of 7. The output images will have a resolution of 768×512; we will upscale them to a higher resolution with Real-ESRGAN later on.
# Set to True to use prompt embeddings, and False to
# use the prompt strings.
use_prompt_embeddings = True
# Seed and batch size.
start_idx = 0
batch_size = 10
seeds = [i for i in range(start_idx, start_idx + batch_size, 1)]
# Number of inference steps.
num_inference_steps = 20
# Guidance scale.
guidance_scale = 7
# Image dimensions - limited to GPU memory.
width = 768
height = 512
images = []

for count, seed in enumerate(seeds):
    start_time = time.time()

    if use_prompt_embeddings is False:
        new_img = pipe(
            prompt = prompt,
            negative_prompt = negative_prompt,
            width = width,
            height = height,
            guidance_scale = guidance_scale,
            num_inference_steps = num_inference_steps,
            num_images_per_prompt = 1,
            generator = torch.manual_seed(seed),
        ).images
    else:
        new_img = pipe(
            prompt_embeds = prompt_embeds,
            negative_prompt_embeds = negative_prompt_embeds,
            width = width,
            height = height,
            guidance_scale = guidance_scale,
            num_inference_steps = num_inference_steps,
            num_images_per_prompt = 1,
            generator = torch.manual_seed(seed),
        ).images

    images = images + new_img
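Since Colab sessions are ephemeral, it is worth writing the generated images to disk (and ideally copying them to Google Drive) once the loop finishes. The directory and filename pattern below are just illustrative choices:
# Save each generated image, tagged with the seed that produced it.
os.makedirs("outputs", exist_ok = True)
for seed, img in zip(seeds, images):
    img.save(os.path.join("outputs", f"image_seed_{seed:04d}.png"))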
Visualizing the Generated Images
The generated images are easily visualized with the function below.
# Plot pipeline outputs.
def plot_images(images, labels = None):
    N = len(images)
    n_cols = 5
    n_rows = int(np.ceil(N / n_cols))
    plt.figure(figsize = (20, 5 * n_rows))
    for i in range(len(images)):
        plt.subplot(n_rows, n_cols, i + 1)
        if labels is not None:
            plt.title(labels[i])
        plt.imshow(np.array(images[i]))
        plt.axis(False)
    plt.show()
plot_images(images, seeds[:len(images)])
In particular, image 5 looks quite nice; however, at 768×512 it is pretty small. Let's upscale it using Real-ESRGAN!
Upscaling Images with Real-ESRGAN
We use this Real-ESRGAN space created by doevent on HuggingFace to upscale the images output by the diffusion pipeline. This space runs on a T4 GPU, making it quite fast.
Running the image through Real-ESRGAN twice with an upscaling factor of 2× results in an image 4× the original size!
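For reference, HuggingFace spaces can also be called from code with the gradio_client package instead of through the web UI. The space id and the arguments to predict below are assumptions for illustration only and should be checked against the space's "Use via API" page:
# Hypothetical sketch - the space id and predict() arguments are assumptions,
# not the space's documented API.
from gradio_client import Client

client = Client("doevent/Face-Real-ESRGAN")   # assumed space id
upscaled_path = client.predict(
    "outputs/image_seed_0005.png",            # image saved earlier
    "2x",                                     # assumed scale argument
    api_name = "/predict"
)
print(upscaled_path)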
Summary
In this article we demonstrated a pipeline for using Civitai models with the diffusers package, from downloading and converting the model to implementing CLIP skip and prompt embeddings. Real-ESRGAN was then used at the end to upscale the outputs from the diffusion pipeline. The full pipeline has been released as a Jupyter notebook on my GitHub repository. Thank you for reading!
References
- https://huggingface.co/docs/diffusers/index
- https://github.com/huggingface/diffusers/issues/3212
- https://github.com/huggingface/diffusers/issues/2136
- https://civitai.com/models/120298/chinese-landscape-art