Faster Stable Diffusion Inference with Intel Extension for Transformers

Faster, High-Quality Stable Diffusion on Intel Platforms

Jul 27, 2023

Xinyu Ye, Haihao Shen, and Hanwen Chang, Intel Corporation

We previously demonstrated how to leverage post-training quantization (PTQ) in Intel Neural Compressor to accelerate the inference of Stable Diffusion fine-tuned on the Pokémon dataset.

In this blog, we show how to combine quantization-aware training with knowledge distillation to quantize the UNet of a pretrained Stable Diffusion model on Intel platforms, achieving better inference performance with output image quality comparable to the fp32 counterpart.

Quantization-Aware Training

Quantization-aware training emulates inference-time quantization during the forward pass of training by inserting fake quantization ops before the quantizable ops. With quantization-aware training, all weights and activations are fake quantized during the forward and backward passes (i.e., float values are rounded to mimic int8 values, but all computations are still done in floating point). Thus, all weight adjustments during training are made while the model is aware that it will ultimately be quantized. After quantization, this method usually yields higher accuracy than post-training quantization.
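
To make this concrete, here is a minimal sketch of what a fake quantization op computes (illustrative only, not the Intel Neural Compressor implementation):

import torch

def fake_quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # Map to the int8 grid, clamp, then map back to float: downstream
    # computation stays in floating point, but the values already carry
    # the int8 rounding error.
    q = torch.clamp(torch.round(x / scale + zero_point), qmin, qmax)
    return (q - zero_point) * scale

In the backward pass, the rounding is typically treated as an identity (the straight-through estimator) so that gradients can still flow to the weights.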

As shown in our previous article, UNet inference accounts for most of the computation in the Stable Diffusion pipeline, and within the UNet, convolution and linear operations account for most of the computation. Because quantization speeds up inference but usually degrades output quality, we quantize only the convolution and linear operations of the UNet to achieve a better tradeoff between inference performance and output quality.
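
For example, the set of operations targeted for quantization could be enumerated as follows (a sketch assuming a PyTorch UNet such as diffusers' UNet2DConditionModel; the helper name is ours):

import torch.nn as nn

def quantizable_modules(unet):
    # Only the Conv2d and Linear submodules of the UNet are quantized;
    # everything else stays in fp32.
    return {name: module for name, module in unet.named_modules()
            if isinstance(module, (nn.Conv2d, nn.Linear))}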

Knowledge Distillation

Knowledge distillation is a popular network compression approach that transfers knowledge from a large model to a smaller one without loss of validity. Because smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device). In this example, we use the fp32 UNet as the teacher model and the fake-quantized UNet as the student model, and transfer the knowledge of the teacher to the student through the mean squared error between their outputs. The diagram below shows the workflow for combining quantization-aware training and knowledge distillation:

Combining quantization-aware training and knowledge distillation on Stable Diffusion
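
A minimal sketch of the combined loss is shown below, assuming a diffusers-style UNet whose forward pass returns an object with a .sample field; the function and variable names are illustrative:

import torch
import torch.nn.functional as F

def training_loss(student_unet, teacher_unet, noisy_latents, timesteps, text_emb, target):
    # Student: the fake-quantized UNet being trained; teacher: the frozen fp32 UNet.
    student_pred = student_unet(noisy_latents, timesteps, text_emb).sample
    with torch.no_grad():
        teacher_pred = teacher_unet(noisy_latents, timesteps, text_emb).sample
    # Keep the original denoising loss and add the MSE distillation loss.
    denoise_loss = F.mse_loss(student_pred, target)
    kd_loss = F.mse_loss(student_pred, teacher_pred)
    return denoise_loss + kd_loss

The add_origin_loss=True setting in the example below plays the same role as keeping denoise_loss here: the original training loss is retained alongside the distillation loss.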

Quantization Parameter Initialization

During quantization-aware training, quantization parameters such as scale and zero point are needed to do fake quantization. At the start of training, scale is usually initialized to 1 and zero point to 0, but such initialization often causes large quantization error, giving quantization-aware training a bad starting point and leading to slower convergence or even divergence. Inspired by post-training quantization, we introduced quantization parameter initialization, which runs an initial calibration pass to create a good starting point for quantization-aware training. Our experiments show that this step gives better output image quality for the same number of training steps.
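
The idea, roughly, is to derive each scale and zero point from the value range observed on calibration data instead of starting from scale = 1 and zero point = 0 (an illustrative asymmetric int8 scheme; the actual calibration logic is in the example code):

import torch

def init_qparams(calib_values, qmin=-128, qmax=127):
    # Use the observed value range, as post-training quantization
    # calibration does, to initialize scale and zero point.
    x_min, x_max = calib_values.min(), calib_values.max()
    scale = torch.clamp((x_max - x_min) / (qmax - qmin), min=1e-8)
    zero_point = torch.round(qmin - x_min / scale)
    return scale, zero_point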

Mixed Precision in the Denoising Loop

During inference of the Stable Diffusion pipeline, the UNet is used in the denoising loop to denoise the latent image over N steps (normally N = 50). We found that the quality of the generated image can be improved by applying mixed precision to the quantized UNet model at the beginning and ending steps of the denoising loop (e.g., the first and last k steps, where k ≥ 3).
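
A sketch of such a loop is shown below, assuming that mixed precision here means falling back to the fp32 UNet for the first and last k steps; the scheduler and variable names follow the diffusers style and are illustrative:

def denoise(latents, timesteps, text_emb, unet_int8, unet_fp32, scheduler, k=3):
    n = len(timesteps)
    for i, t in enumerate(timesteps):
        # fp32 UNet for the first and last k steps, quantized UNet elsewhere.
        unet = unet_fp32 if i < k or i >= n - k else unet_int8
        noise_pred = unet(latents, t, text_emb).sample
        latents = scheduler.step(noise_pred, t, latents).prev_sample
    return latents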

Example

The code snippets below show how to leverage quantization-aware training along with knowledge distillation to optimize the UNet. The complete code is available here (see readme for documentation).

from quantization_modules import find_and_replace, convert2quantized_model
find_and_replace(unet)  # replace quantizable modules with fake quantization modules
# knowledge distillation configuration
from neural_compressor.config import DistillationConfig, IntermediateLayersKnowledgeDistillationLossConfig
teacher_model = unet_fp32
layer_mappings = [
    [
        ["", lambda x: x.sample]  # "" selects the model's own output; the lambda extracts its .sample field
    ],
]
distillation_criterion = IntermediateLayersKnowledgeDistillationLossConfig(
    layer_mappings=layer_mappings,
    loss_types=["MSE"] * len(layer_mappings),
    loss_weights=[1.0 / len(layer_mappings)] * len(layer_mappings),
    add_origin_loss=True,
)
d_conf = DistillationConfig(teacher_model=teacher_model, criterion=distillation_criterion)

from neural_compressor.training import prepare_compression
compression_manager = prepare_compression(unet, d_conf)
compression_manager.callbacks.on_train_begin()
for epoch in range(num_train_epochs):
    for step, batch in enumerate(train_dataloader):
        ...
        loss = compression_manager.callbacks.on_after_compute_loss(unet_inputs, model_pred, loss)
        ...
compression_manager.callbacks.on_train_end()
unet = convert2quantized_model(unet) # convert fake quantization modules to quantized modules
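
After conversion, the quantized UNet can be dropped back into the Stable Diffusion pipeline for generation; for example (a sketch using diffusers, with saving/loading and backend setup covered in the linked example code):

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.unet = unet  # swap in the quantized UNet
prompt = "The Milky Way lies in the sky, with the golden snow mountain lies below, high definition"
image = pipe(prompt).images[0]
image.save("output.png")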

We ran experiments on the runwayml/stable-diffusion-v1-5 Stable Diffusion model with a small portion of the LAION-400M dataset (20,000 image-text pairs) used for both training and quantization parameter initialization. We trained the UNet with quantization-aware training and knowledge distillation for only 300 steps, with a total training batch size of four. The resulting quantized UNet, together with the other components of the Stable Diffusion pipeline, produces images with quality comparable to that of the fp32 counterpart.

The results of the original Stable Diffusion, Stable Diffusion with quantized UNet, and Stable Diffusion with quantized UNet and mixed precision (k=3) are shown in the left, middle, and right panes, respectively:

Prompt: “The Milky Way lies in the sky, with the golden snow mountain lies below, high definition”

We also measured FID scores on the COCO2017 validation set for objective comparison. The results are shown in the following table:

Summary

We have released the source code in Intel Extension for Transformers for both training and evaluation. We encourage you to try it out on Intel CPUs and explore other Intel AI tools and optimizations as part of your AI workflows. Please add a star to the Intel Extension for Transformers repository if you would like to receive notifications about our latest optimizations. You are also welcome to create pull requests or submit issues to the repository. Feel free to contact us if you have any questions.
