Optimizing Latent Consistency Model for Image Generation with OpenVINO and NNCF


Authors: Liubov Talamanova, Ekaterina Aidova, Alexander Kozlov

Introduction

Latent Diffusion Models (LDMs) have revolutionized AI-generated art: they enable the creation of high-quality images simply from a text prompt. While LDMs such as Stable Diffusion achieve outstanding generation quality, they often suffer from a slow, iterative image-denoising process. The Latent Consistency Model (LCM) is an optimized version of an LDM. Inspired by Consistency Models (CMs), a new family of generative models that enables one-step or few-step generation, LCMs allow swift inference with minimal steps on any pre-trained LDM. More details about the approach and models can be found in the following resources: project page, paper, and repository.

Similar to the original Stable Diffusion pipeline, the LCM pipeline consists of three important parts:

  • Text Encoder to create a condition for generating an image from a text prompt.
  • U-Net for step-by-step denoising of the latent image representation.
  • Autoencoder (VAE) for decoding the latent representation into an image.

In this post, we explain how to optimize LCM inference with OpenVINO™ on Intel® hardware. Since LCM is trained to be resistant to perturbations, we can also apply common optimization methods such as quantization to lower the precision while expecting consistent generation results. We apply 8-bit Post-training Quantization from the Neural Network Compression Framework (NNCF).

Convert models to OpenVINO format

To leverage efficient inference with the OpenVINO runtime on Intel platforms, the original model should be converted to OpenVINO Intermediate Representation (IR). OpenVINO supports the conversion of PyTorch models directly via the Model Conversion API. The ov.convert_model function accepts an instance of a PyTorch model along with example inputs for tracing and returns an ov.Model object, ready to use or to save on disk with the ov.save_model function. You can find the conversion details for LCM in the OpenVINO LCM Notebook.
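As an illustration, the snippet below sketches how the U-Net could be converted. It is a minimal sketch, assuming a diffusers LCM pipeline with a pipe.unet attribute and SD 1.5-style input shapes; the notebook contains the exact conversion code.

import torch
import openvino as ov
from diffusers import DiffusionPipeline

# Load the original PyTorch LCM pipeline (the model id is an example).
pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7")

# Example inputs for tracing: latents, timestep, text-encoder hidden
# states, and the LCM guidance embedding (shapes assume the SD 1.5 layout).
example_input = {
    "sample": torch.randn(1, 4, 64, 64),
    "timestep": torch.tensor(1.0),
    "encoder_hidden_states": torch.randn(1, 77, 768),
    "timestep_cond": torch.randn(1, 256),
}

ov_unet = ov.convert_model(pipe.unet, example_input=example_input)
ov.save_model(ov_unet, "unet.xml")  # UNET_OV_PATH in the notebook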

Processing time of the diffusion model

The diffusion pipeline requires multiple iterations to generate an image, and each iteration takes a non-negligible amount of time, depending on your inference device. We benchmarked the pipeline on an Intel® Core™ i9-10980XE CPU @ 3.00GHz with the number of inference steps set to 4.

Benchmarking results:

  • Average latency: 6.54 seconds
  • Text encoding: 0.05 seconds
  • Denoising loop: 4.28 seconds
      • U-Net (4 iterations): 4.27 seconds
      • Scheduler: 0.01 seconds
  • VAE decoding: 2.21 seconds

The U-Net part of the denoising loop takes more than 60% of the full pipeline execution time. That is why the computational cost and speed of U-Net denoising are the critical path in the pipeline.
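For context, per-stage timings like those above can be collected with a simple helper around each pipeline stage. This is a minimal sketch; the stage and variable names in the usage comment are illustrative, not the notebook's exact code.

import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name):
    # Report the wall-clock time of a single pipeline stage.
    start = time.perf_counter()
    yield
    print(f"{name}: {time.perf_counter() - start:.2f} s")

# Usage inside the pipeline, e.g.:
# with stage_timer("U-Net"):
#     noise_pred = unet(latent_input, t, text_embeddings)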

In this blog, we use the Neural Network Compression Framework (NNCF) Post-Training Quantization (PTQ) API to quantize the U-Net model, which further boosts inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but can lead to substantial accuracy degradation.

Quantization

The quantization process includes the following steps:

  1. Create a calibration dataset for the quantization.
  2. Run nncf.quantize to obtain a quantized model.
  3. Save the Int8 model using the ov.save_model function.

You can look at the dataset preparation for the U-Net model in the OpenVINO LCM Notebook. General rules about dataset preparation can be found in the OpenVINO documentation.
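The general idea is to intercept the tensors that actually reach the U-Net while the FP16 pipeline generates images for a set of text prompts. The wrapper below is a simplified sketch; the class name and pipeline wiring are illustrative, and the notebook has the full version.

unet_calibration_data = []

class CalibrationCollector:
    # Wraps the compiled U-Net and records every set of inputs it receives.
    def __init__(self, compiled_unet):
        self.compiled_unet = compiled_unet

    def __call__(self, inputs):
        unet_calibration_data.append(inputs)  # cache real denoising inputs
        return self.compiled_unet(inputs)

# Temporarily swap the collector into the pipeline in place of the U-Net,
# then run text-to-image generation over calibration prompts; every
# denoising step contributes one sample to unet_calibration_data.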

For Int8 quantization of LCM, we found some useful tricks to mitigate accuracy degradation caused by accuracy-sensitive layers:

  • The U-Net part of the LCM pipeline has a transformer-based backbone that operates on latent patches. To better preserve accuracy after NNCF PTQ, we should pass model_type=nncf.ModelType.TRANSFORMER to the nncf.quantize function. It keeps several accuracy-sensitive layers in FP16 precision.
  • Default symmetric quantization of both weights and activations also leads to accuracy degradation of LCM. We recommend preset=nncf.QuantizationPreset.MIXED, which uses symmetric quantization of weights and asymmetric quantization of activations. Activations are more sensitive and impact the generation results more, so applying asymmetric quantization to them helps represent their values better and leads to better accuracy with no impact on inference latency.
  • It was also discovered that the Fast Bias Correction (FBC) algorithm, which is enabled by default in NNCF PTQ, results in unexpected artifacts in the generated images. To disable FBC, we should pass advanced_parameters=nncf.AdvancedQuantizationParameters(disable_bias_correction=True) to the nncf.quantize function.

Once the dataset is ready and the model object is instantiated, you can apply 8-bit quantization to it using the optimization workflow below:

import nncf
import openvino as ov

core = ov.Core()
unet = core.read_model(UNET_OV_PATH)  # FP16 U-Net in OpenVINO IR format

quantized_unet = nncf.quantize(
    model=unet,
    preset=nncf.QuantizationPreset.MIXED,
    calibration_dataset=nncf.Dataset(unet_calibration_data),
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(
        disable_bias_correction=True
    ),
)
ov.save_model(quantized_unet, UNET_INT8_OV_PATH)
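After saving, the Int8 U-Net can be compiled for the target device and swapped into the pipeline in place of the FP16 version. This is a sketch; the pipeline attribute in the comment is a placeholder for the notebook's wiring.

# Compile the quantized U-Net and plug it into the existing pipeline.
int8_unet = core.compile_model(UNET_INT8_OV_PATH, device_name="CPU")
# ov_pipe.unet = int8_unet  # placeholder for the notebook's pipeline wiring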

Text-to-image generation

The left image was generated using the original PyTorch LCM pipeline. The middle image was generated using the model converted to OpenVINO FP16. The right image was generated using LCM with the quantized Int8 U-Net. The input prompt is “a beautiful pink unicorn, 8k”, the seed is 1234567, and the number of inference steps is 4.
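For reference, a call along these lines reproduces the setting above. Here ov_pipe stands in for the OpenVINO LCM pipeline assembled in the notebook; its exact signature may differ.

import torch

# Fixed seed and few-step sampling, matching the comparison images.
result = ov_pipe(
    prompt="a beautiful pink unicorn, 8k",
    num_inference_steps=4,
    generator=torch.Generator().manual_seed(1234567),
)
image = result.images[0]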

If you would like to generate your own images and compare the original and quantized models, you can run the interactive demo at the end of the OpenVINO LCM Notebook.

We also measured the image generation time of the LCM pipeline with the input prompt “a beautiful pink unicorn, 8k”, seed 1234567, and 4 inference steps.

Performance speedup of OpenVINO+NNCF over PyTorch is 1.38x (average time across 3 independent runs).

Conclusion

In this blog, we showed how to enable and quantize the Latent Consistency Model with the OpenVINO™ runtime and NNCF:

  • The proposed NNCF Int8 PTQ improves the performance of the image generation pipeline while preserving generation quality.
  • The provided OpenVINO LCM Notebook covers model enabling, quantization, a comparison of FP16 and Int8 inference times, and deployment with OpenVINO™ and NNCF.

As the next step, you can consider migrating to the native OpenVINO C++ API for even faster pipeline inference and the possibility of embedding it into a client or edge-device application. You can find an example of such a pipeline here.

Please give a star to the NNCF and OpenVINO repositories if you find them useful.

Notices and Disclaimers:

Performance varies by use, configuration, and other factors. Learn more at www.intel.com/PerformanceIndex. Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure. Intel technologies may require enabled hardware, software, or service activation.

The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.​​

Test Configuration: Intel® Core™ i9-10980XE CPU Processor at 3.00GHz with DDR4 128 GB at 3600MHz, OS: Ubuntu 22.04.2 LTS. Tested with the OpenVINO LCM Notebook.

The test was conducted by Intel on November 7, 2023.
