Using Civitai Models with Diffusers
Pipeline for generating synthetic images on Google Colab.
In previous articles we covered running stable diffusion models with the diffusers package, upscaling images with Real-ESRGAN, and using long prompts and CLIP skip. All of these steps need to be combined to obtain high-resolution, high-quality outputs from stable diffusion.
In this article we put everything together into a single pipeline and demonstrate how to download and use models from Civitai with the diffusers package.
Using Civitai Models with Diffusers
Civitai is a great place to hunt for all sorts of stable diffusion models trained by the community. Although these models are typically used with UIs, with a bit of work they can be used with the diffusers package as well.
Package Versions
For this article we will be using the following package versions.
pip install -q transformers==4.31.0
pip install -q accelerate==0.21.0
pip install -q diffusers==0.20.0
pip install -q huggingface_hub==0.16.4
pip install -q omegaconf==2.3.0
Preparing the Environment
First of all, we load the necessary packages and check whether a GPU is available, since stable diffusion models should be run on a GPU. If a GPU is found, half precision (float16) will be used.
import diffusers
import transformers
import sys
import os
import shutil
import time
import torch
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
if torch.cuda.is_available():
    device_name = torch.device("cuda")
    torch_dtype = torch.float16
else:
    device_name = torch.device("cpu")
    torch_dtype = torch.float32
Next, we download the model safetensors file from Civitai. For this article we use this model, released by Celsia, which generates beautiful landscape scenes in the style of Chinese watercolour paintings.
wget https://civitai.com/api/download/models/130803 --content-disposition
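If you prefer to stay in Python rather than shelling out to wget, a minimal download sketch with the requests package is shown below; the output filename is an assumption chosen to match what wget's --content-disposition flag would produce and what the conversion step below expects.
import requests

# Stream the checkpoint to disk to avoid holding the whole file in memory.
# The filename is assumed to match the one used in the conversion step below.
url = "https://civitai.com/api/download/models/130803"
with requests.get(url, stream = True) as response:
    response.raise_for_status()
    with open("ChineseLandscapeArt_v10.safetensors", "wb") as f:
        for chunk in response.iter_content(chunk_size = 1 << 20):
            f.write(chunk)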
We will need to convert the downloaded safetensors file into another format which can be used with diffusers. For this conversion we download and use a Python script provided in the HuggingFace diffusers GitHub repository. Note that the version of this script should match the version of diffusers, which in this case is 0.20.0.
wget https://raw.githubusercontent.com/huggingface/diffusers/v0.20.0/scripts/convert_original_stable_diffusion_to_diffusers.py
python convert_original_stable_diffusion_to_diffusers.py \
    --checkpoint_path ChineseLandscapeArt_v10.safetensors \
    --dump_path ChineseLandscapeArt_v10/ \
    --from_safetensors
Using the downloaded conversion script, the converted model files will be saved under the directory ChineseLandscapeArt_v10/. These files are now ready to be loaded by the diffusers stable diffusion pipeline.
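As a side note, recent diffusers releases can also load a .safetensors checkpoint directly, which may let you skip the conversion script entirely; treat the sketch below as an assumption to verify against your installed version (StableDiffusionPipeline.from_single_file should be available in 0.20.0).
# Untested alternative: load the checkpoint directly from the safetensors file,
# assuming from_single_file is available in the installed diffusers version.
pipe_direct = diffusers.StableDiffusionPipeline.from_single_file(
    "ChineseLandscapeArt_v10.safetensors",
    torch_dtype = torch_dtype
)
The rest of this article sticks with the converted directory.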
CLIP Skip
Next is the code to implement CLIP skip with diffusers, as detailed by Patrick von Platen on GitHub; CLIP skip works by skipping the last layers of the CLIP text encoder. For now we use CLIP skip = 1, which uses all layers of the text encoder.
# Follows community convention.
# Clip skip = 1 uses all the text encoder layers.
# Clip skip = 2 skips the last text encoder layer.
clip_skip = 1

if clip_skip > 1:
    text_encoder = transformers.CLIPTextModel.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        subfolder = "text_encoder",
        num_hidden_layers = 12 - (clip_skip - 1),
        torch_dtype = torch_dtype
    )
The converted model files are then loaded into the diffusers diffusion pipeline, passing in the modified text encoder when CLIP skip is greater than 1.
# Load the pipeline.
model_path = "ChineseLandscapeArt_v10"

if clip_skip > 1:
    # TODO clean this up with the condition below.
    pipe = diffusers.DiffusionPipeline.from_pretrained(
        model_path,
        torch_dtype = torch_dtype,
        safety_checker = None,
        text_encoder = text_encoder,
    )
else:
    pipe = diffusers.DiffusionPipeline.from_pretrained(
        model_path,
        torch_dtype = torch_dtype,
        safety_checker = None
    )

pipe = pipe.to(device_name)

# Change the pipe scheduler to EADS (Euler Ancestral Discrete Scheduler).
pipe.scheduler = diffusers.EulerAncestralDiscreteScheduler.from_config(
    pipe.scheduler.config
)
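On memory-constrained Colab GPUs, it can also help to enable attention slicing on the pipeline before generating at higher resolutions; this is an optional tweak, not part of the original pipeline.
# Optional: trade a little speed for lower peak VRAM usage.
pipe.enable_attention_slicing()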
Prompt Embeddings for Long Prompts
Stable diffusion's CLIP text encoder has a limit of 77 tokens and will truncate encoded prompts longer than this limit, so prompt embeddings are required to overcome this limitation. We use a modified version of the solution proposed by Andre van Zuydam on GitHub to create prompt embeddings with the diffusers pipeline.
# Prompt embeddings to overcome CLIP 77 token limit.
# https://github.com/huggingface/diffusers/issues/2136
def get_prompt_embeddings(
    pipe,
    prompt,
    negative_prompt,
    split_character = ",",
    device = torch.device("cpu")
):
    max_length = pipe.tokenizer.model_max_length
    # Simple method of checking if the prompt is longer than the negative
    # prompt - split the input strings using `split_character`.
    count_prompt = len(prompt.split(split_character))
    count_negative_prompt = len(negative_prompt.split(split_character))

    # If prompt is longer than negative prompt.
    if count_prompt >= count_negative_prompt:
        input_ids = pipe.tokenizer(
            prompt, return_tensors = "pt", truncation = False
        ).input_ids.to(device)
        shape_max_length = input_ids.shape[-1]
        negative_ids = pipe.tokenizer(
            negative_prompt,
            truncation = False,
            padding = "max_length",
            max_length = shape_max_length,
            return_tensors = "pt"
        ).input_ids.to(device)
    # If negative prompt is longer than prompt.
    else:
        negative_ids = pipe.tokenizer(
            negative_prompt, return_tensors = "pt", truncation = False
        ).input_ids.to(device)
        shape_max_length = negative_ids.shape[-1]
        input_ids = pipe.tokenizer(
            prompt,
            return_tensors = "pt",
            truncation = False,
            padding = "max_length",
            max_length = shape_max_length
        ).input_ids.to(device)

    # Concatenate the individual prompt embeddings.
    concat_embeds = []
    neg_embeds = []
    for i in range(0, shape_max_length, max_length):
        concat_embeds.append(
            pipe.text_encoder(input_ids[:, i: i + max_length])[0]
        )
        neg_embeds.append(
            pipe.text_encoder(negative_ids[:, i: i + max_length])[0]
        )

    return torch.cat(concat_embeds, dim = 1), torch.cat(neg_embeds, dim = 1)
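As a quick sanity check, the returned prompt and negative prompt embeddings should share the same shape, since the shorter of the two inputs is padded to the length of the longer one before being encoded in 77-token chunks. A small illustrative test (the prompts here are throwaway examples):
# Illustrative only: a deliberately long prompt forces the chunked encoding path.
test_embeds, test_neg_embeds = get_prompt_embeddings(
    pipe,
    "snowy mountain, " * 30,
    "worst quality",
    device = device_name
)
print(test_embeds.shape, test_neg_embeds.shape)  # shapes should match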
Running the Diffusion Pipeline
Finally, all the fundamental building blocks are in place for us to run the diffusion pipeline with the model downloaded from Civitai.
Prompt and Prompt Embeddings
As always, the first step in the pipeline is to create the prompts and negative prompts, as well as the corresponding prompt embeddings.
prompt = """beautiful Chinese Landscape Art, best quality, intricate,
water colors, snowy mountains, glacier, snow, starry night sky, stars,
milkyway"""
negative_prompt = """deformed, weird, bad resolution, bad depiction,
not Chinese style, weird, has people, worst quality, worst resolution,
too blurry, not relevant"""
prompt_embeds, negative_prompt_embeds = get_prompt_embeddings(
    pipe,
    prompt,
    negative_prompt,
    split_character = ",",
    device = device_name
)
Generating Synthetic Images
Next is the actual loop for generating synthetic images with all the different parts put together.
For now we generate 10 images over 20 inference steps, with a guidance scale of 7. The output images will have a resolution of 768×512; we will upscale them to a higher resolution with Real-ESRGAN later on.
# Set to True to use prompt embeddings, and False to
# use the prompt strings.
use_prompt_embeddings = True
# Seed and batch size.
start_idx = 0
batch_size = 10
seeds = [i for i in range(start_idx, start_idx + batch_size, 1)]
# Number of inference steps.
num_inference_steps = 20
# Guidance scale.
guidance_scale = 7
# Image dimensions - limited to GPU memory.
width = 768
height = 512
images = []

for count, seed in enumerate(seeds):
    start_time = time.time()

    if use_prompt_embeddings is False:
        new_img = pipe(
            prompt = prompt,
            negative_prompt = negative_prompt,
            width = width,
            height = height,
            guidance_scale = guidance_scale,
            num_inference_steps = num_inference_steps,
            num_images_per_prompt = 1,
            generator = torch.manual_seed(seed),
        ).images
    else:
        new_img = pipe(
            prompt_embeds = prompt_embeds,
            negative_prompt_embeds = negative_prompt_embeds,
            width = width,
            height = height,
            guidance_scale = guidance_scale,
            num_inference_steps = num_inference_steps,
            num_images_per_prompt = 1,
            generator = torch.manual_seed(seed),
        ).images

    images = images + new_img
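Since Colab sessions are ephemeral, it is worth writing the generated images to disk (and ideally copying them to Google Drive) once the loop finishes. The directory and filename pattern below are just illustrative choices:
# Save each generated image, tagged with the seed that produced it.
os.makedirs("outputs", exist_ok = True)
for seed, img in zip(seeds, images):
    img.save(os.path.join("outputs", f"image_seed_{seed:04d}.png"))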
Visualizing the Generated Images
The generated images are easily visualized with the function below.
# Plot pipeline outputs.
def plot_images(images, labels = None):
    N = len(images)
    n_cols = 5
    n_rows = int(np.ceil(N / n_cols))
    plt.figure(figsize = (20, 5 * n_rows))
    for i in range(len(images)):
        plt.subplot(n_rows, n_cols, i + 1)
        if labels is not None:
            plt.title(labels[i])
        plt.imshow(np.array(images[i]))
        plt.axis(False)
    plt.show()
plot_images(images, seeds[:len(images)])
In particular, image 5 looks quite nice; however, at 768×512 it is pretty small. Let's upscale it using Real-ESRGAN!
Upscaling Images with Real-ESRGAN
We use this Real-ESRGAN space created by doevent on HuggingFace to upscale the images output by the diffusion pipeline. This space runs on a T4 GPU, making it quite fast.
Running the image through Real-ESRGAN twice with an upscaling factor of 2× results in an image 4× the original size!
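For reference, HuggingFace spaces can also be called from code with the gradio_client package instead of through the web UI. The space id and the arguments to predict below are assumptions for illustration only and should be checked against the space's "Use via API" page:
# Hypothetical sketch - the space id and predict() arguments are assumptions,
# not the space's documented API.
from gradio_client import Client

client = Client("doevent/Face-Real-ESRGAN")   # assumed space id
upscaled_path = client.predict(
    "outputs/image_seed_0005.png",            # image saved earlier
    "2x",                                     # assumed scale argument
    api_name = "/predict"
)
print(upscaled_path)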
Summary
In this article we demonstrated a pipeline for using Civitai models with the diffusers package, from downloading and converting the model to implementing CLIP skip and prompt embeddings. Real-ESRGAN was then used at the end to upscale the outputs from the diffusion pipeline. The full pipeline has been released as a Jupyter notebook on my GitHub repository. Thank you for reading!
References
- https://huggingface.co/docs/diffusers/index
- https://github.com/huggingface/diffusers/issues/3212
- https://github.com/huggingface/diffusers/issues/2136
- https://civitai.com/models/120298/chinese-landscape-art