On Architectural Compression of Text-to-Image Diffusion Models

takuoko · 5 min read · Oct 24, 2023

Paper: https://arxiv.org/abs/2305.15798

Unofficial implementation: https://github.com/segmind/distill-sd

Text-to-image generation is a fascinating and challenging task that aims to create realistic images from natural language descriptions. Recently, Stable Diffusion has achieved impressive results on this task, but it also comes with high computational costs. In this blog post, I will introduce a new paper that proposes a novel method to compress the architecture of Stable Diffusion and make it more efficient without sacrificing quality.

The proposed approach

  • The paper proposes a novel method to compress the architecture of Stable Diffusion models (SDMs) by removing some residual and attention blocks from the U-Net network that performs diffusion in the latent space.
  • The paper uses knowledge distillation to transfer the knowledge from the original SDM to the compressed one, using only a small fraction of the training data (a conceptual sketch of the objective follows this list).
  • The paper shows that the compressed models, called BK-SDMs, can achieve over 51% reduction in the number of parameters, and 43% improvement in latency on CPU and GPU compared to SDMs, while maintaining competitive results.
  • The paper also demonstrates the applicability of BK-SDMs in personalized generation with DreamBooth fine-tuning.
  • The unofficial implementation open-sources model weights and training code for two architectures: SD Small and SD Tiny.
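
To make the distillation step concrete, below is a minimal PyTorch-style sketch of a BK-SDM-style training objective: the usual denoising (task) loss plus output-level and feature-level distillation terms from the frozen teacher. The function name, tensor arguments, and loss weights are illustrative placeholders, not the paper's or DiffEngine's actual code:

import torch.nn.functional as F

def distillation_loss(student_pred, teacher_pred, student_feats, teacher_feats,
                      noise, lambda_out=1.0, lambda_feat=1.0):
    # Task loss: the standard denoising objective against the added noise.
    loss_task = F.mse_loss(student_pred, noise)
    # Output-level KD: match the frozen teacher's predicted noise.
    loss_out_kd = F.mse_loss(student_pred, teacher_pred)
    # Feature-level KD: match intermediate U-Net activations block by block.
    loss_feat_kd = sum(
        F.mse_loss(s, t) for s, t in zip(student_feats, teacher_feats))
    return loss_task + lambda_out * loss_out_kd + lambda_feat * loss_feat_kd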

Train Distill SD with DiffEngine

DiffEngine GitHub: https://github.com/okotaku/diffengine
DiffEngine documentation: https://diffengine.readthedocs.io/en/latest/

In this section, we share how we implemented Distill SD for the SDXL model.

To apply Distill SD to SDXL, we did the following steps:

  • We skipped the deletion of the 4th down/up blocks, because SDXL does not have these blocks. This differs from the original paper, which applies this deletion to Stable Diffusion models.
  • We removed one Attention layer from each U-Net block, except for the first block, which doesn’t have an Attention layer. We also adjusted the distillation operation based on this modification.
  • We deleted one Residual Block from each U-Net block. This is consistent with the unofficial implementation.
  • We deleted the middle blocks for Tiny SDXL, which are the blocks between the encoder and decoder of the U-Net. This is also consistent with the unofficial implementation.

By making these modifications, we obtained a smaller and faster version of SDXL, which we call Distill SDXL. A rough sketch of the idea is shown below.
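
As a rough illustration (this is not the actual DiffEngine implementation, which is linked below), one way to obtain such a compressed student is to rebuild the SDXL U-Net with a single residual block per stage and initialize it from the teacher wherever the weight shapes still match:

from diffusers import UNet2DConditionModel

# Teacher: the original SDXL U-Net.
teacher = UNet2DConditionModel.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0', subfolder='unet')

# Student: same config, but only one residual block per U-Net block.
config = dict(teacher.config)
config['layers_per_block'] = 1
student = UNet2DConditionModel.from_config(config)

# Copy the teacher weights wherever names and shapes still match.
student_state = student.state_dict()
compatible = {
    k: v for k, v in teacher.state_dict().items()
    if k in student_state and v.shape == student_state[k].shape
}
student.load_state_dict(compatible, strict=False)

The attention-layer and mid-block deletions follow the same spirit; see the linked file for the actual DiffEngine logic.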

You can check our implementation here: https://github.com/okotaku/diffengine/blob/main/diffengine/models/editors/distill_sd/distill_sd_xl.py

Installation

Before installing DiffEngine, please ensure that PyTorch has been successfully installed following the official guide.

https://pytorch.org/get-started/locally/

Install DiffEngine

pip install openmim
pip install git+https://github.com/okotaku/diffengine.git

Train Distill SDXL with DiffEngine

A variety of pre-defined configs can be found in the configs directory of the DiffEngine repository.

Distill SDXL Configs: https://github.com/okotaku/diffengine/tree/main/configs/distill_sd

For example, if you wish to train a Tiny SDXL model with the pokemon blip dataset, access the file https://github.com/okotaku/diffengine/blob/main/configs/distill_sd/tiny_sd_xl_pokemon_blip.py.

To train with a selected config, open a terminal and run the following command:

mim train diffengine tiny_sd_xl_pokemon_blip.py
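
If you want to adjust the training setup, DiffEngine uses MMEngine-style configs, so you can usually write a small config that inherits a predefined one and overrides a few fields. The file name and the overridden keys below (batch size and learning rate) are illustrative assumptions; check the predefined config for the exact keys it defines:

# my_tiny_sd_xl_pokemon_blip.py (hypothetical custom config)
_base_ = ['tiny_sd_xl_pokemon_blip.py']

# Example overrides; adjust or remove depending on the base config.
train_dataloader = dict(batch_size=2)
optim_wrapper = dict(optimizer=dict(lr=1e-5))

You can then train with it in the same way: mim train diffengine my_tiny_sd_xl_pokemon_blip.py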

Inference Distill SDXL with diffusers.pipeline

I have uploaded the trained model weights to the Hugging Face Hub. You can use them for inference.

Trained model weight: https://huggingface.co/takuoko/tiny_sd_xl_pokemon_blip

import torch
from diffusers import DiffusionPipeline, UNet2DConditionModel, AutoencoderKL

checkpoint = 'takuoko/tiny_sd_xl_pokemon_blip'
prompt = 'a very cute looking pokemon with a hat on its head'

# Load the distilled (Tiny SDXL) U-Net trained with DiffEngine.
unet = UNet2DConditionModel.from_pretrained(
    checkpoint, torch_dtype=torch.bfloat16
)
# The fp16-fix VAE avoids numerical issues when decoding in half precision.
vae = AutoencoderKL.from_pretrained(
    'madebyollin/sdxl-vae-fp16-fix',
    torch_dtype=torch.bfloat16,
)
# Build the standard SDXL pipeline, swapping in the distilled U-Net.
pipe = DiffusionPipeline.from_pretrained(
    'stabilityai/stable-diffusion-xl-base-1.0',
    unet=unet,
    vae=vae,
    torch_dtype=torch.bfloat16,
)
pipe.to('cuda')

image = pipe(
    prompt,
    num_inference_steps=50,
).images[0]
image.save('demo.png')

An illustrative output example is provided below:

Train Distill SD DreamBooth with DiffEngine

In this section, we show the results of personalized generation obtained by DreamBooth fine-tuning with Tiny SD.

Distill SD DreamBooth Configs: https://github.com/okotaku/diffengine/tree/main/configs/distill_sd_dreambooth

Tiny SD Checkpoints: https://huggingface.co/segmind/tiny-sd

To train with https://github.com/okotaku/diffengine/blob/main/configs/distill_sd_dreambooth/small_sd_dreambooth_lora_dog.py, open a terminal and run the following command:

mim train diffengine small_sd_dreambooth_lora_dog.py

Tiny SD reduces training time by nearly 30%.

Inference Distill SD DreamBooth with diffusers.pipeline

Once you have trained a model, simply specify the path to the saved model and run inference with the diffusers.pipeline module.

I have uploaded the trained model weights to the Hugging Face Hub. You can use them for inference.

Trained model weight: https://huggingface.co/takuoko/small-sd-dreambooth-lora-dog

import torch
from diffusers import DiffusionPipeline

checkpoint = 'takuoko/small-sd-dreambooth-lora-dog'
prompt = 'A photo of sks dog in a bucket'

# Load the Small SD base model and attach the DreamBooth LoRA weights.
pipe = DiffusionPipeline.from_pretrained(
    'segmind/small-sd', torch_dtype=torch.float16)
pipe.to('cuda')
pipe.load_lora_weights(checkpoint, weight_name='pytorch_lora_weights.bin')

image = pipe(
    prompt,
    num_inference_steps=50,
).images[0]
image.save('demo.png')

An illustrative output example is provided below:

Conclusion

DiffEngine supports Distill SD training. Give it a try ;)

Thank you for reading.

Sponsors

I am a member of the Z by HP Data Science Global Ambassadors. Special thanks to Z by HP for sponsoring me with a Z8G4 Workstation with dual A6000 GPUs and a ZBook with an RTX5000 GPU.

Reference

Bo-Kyeong Kim, Hyoung-Kyu Song, Thibault Castells, Shinkook Choi. "On Architectural Compression of Text-to-Image Diffusion Models." arXiv:2305.15798, 2023. https://arxiv.org/abs/2305.15798