Stable Diffusion ControlNet Pipeline with OpenVINO™ In C++

Published in OpenVINO-toolkit, Sep 9, 2024

Why Choose C++?

OpenVINO™ already supports the Stable Diffusion 1.5 and ControlNet pipelines in Python. Why rewrite the pipeline in C++?

In many cases, C++ offers better runtime efficiency and memory management compared to Python, and C++ projects can support a wider range of edge devices. Therefore, implementing this pipeline in C++ is highly beneficial.

The OpenVINO™ GenAI project has already implemented the Stable Diffusion 1.5 pipeline and adapted it for LCM LoRA in C++. This article will cover the journey of implementing the ControlNet component using C++ based on this pipeline. This project is also part of a Google Summer of Code (GSoC) initiative.

The Principle Behind the ControlNet Pipeline

ControlNet can be understood as a small auxiliary neural network that operates alongside Stable Diffusion. Its main function is to guide the U-Net model’s results by introducing specific conditions. During each denoising step, ControlNet takes in a predefined condition and generates data that influences particular layers within the Stable Diffusion U-Net, effectively shaping the final output.
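As a rough illustration of that last step, the sketch below adds ControlNet outputs to U-Net feature maps element-wise. It is a conceptual sketch, not the project's code: tensors are modeled as flat float buffers, and the names `Tensor`, `apply_controlnet_residuals`, and `conditioning_scale` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A tensor is modeled as a flat float buffer, for illustration only.
using Tensor = std::vector<float>;

// Adds each ControlNet residual to the matching U-Net feature map in place.
// `conditioning_scale` is a hypothetical knob that controls how strongly the
// condition steers the result (0 disables ControlNet entirely).
void apply_controlnet_residuals(std::vector<Tensor>& unet_features,
                                const std::vector<Tensor>& residuals,
                                float conditioning_scale) {
    assert(unet_features.size() == residuals.size());
    for (std::size_t i = 0; i < unet_features.size(); ++i) {
        assert(unet_features[i].size() == residuals[i].size());
        for (std::size_t j = 0; j < unet_features[i].size(); ++j) {
            unet_features[i][j] += conditioning_scale * residuals[i][j];
        }
    }
}
```

In a real pipeline this addition would typically happen once per denoising step, with the residuals recomputed from the current latent and the fixed condition image.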

The overall workflow can be visualized like this:

ControlNet typically operates in conjunction with annotators (referred to as detectors in this article). These detectors extract specific information from an image, which is then used as a control condition. ControlNet adapts to this condition and utilizes it as input within the process outlined earlier.

There are several commonly used detectors. For this example, we selected OpenPose, one of the most widely used.

OpenPose extracts skeletal information from an image and generates a pose map. This pose map is then used as input for ControlNet during the inference process.

Implementation

Detectors

First, we need to convert the detector models from the Python ecosystem to the OpenVINO IR (Intermediate Representation) format. You can refer to the process here.

In the second step, following the Python implementation, we re-implemented the preprocessing and post-processing for the detectors using the same algorithms, but in C++.

The main challenges in the implementation process lie in the inconsistencies between the Python and C++ ecosystems, as well as ensuring result alignment between the two.

To address the first issue, we replace Python libraries such as NumPy and PyTorch with their OpenVINO equivalents in C++. Operations related to scaling, rotation, and alignment are all re-implemented using C++.
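To give a sense of what such a port involves, here is a minimal sketch of one operation that is a one-liner in NumPy (a transpose followed by a division) but has to be written out by hand in C++. The function name, the HWC-to-CHW layout change, and the [0, 1] normalization are illustrative assumptions, not code taken from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Converts an interleaved HWC uint8 image into a planar CHW float buffer
// scaled to [0, 1]: the hand-written counterpart of a NumPy
// transpose(2, 0, 1) followed by a division by 255.
std::vector<float> hwc_to_chw_normalized(const std::vector<std::uint8_t>& hwc,
                                         std::size_t height,
                                         std::size_t width,
                                         std::size_t channels) {
    assert(hwc.size() == height * width * channels);
    std::vector<float> chw(hwc.size());
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            for (std::size_t c = 0; c < channels; ++c) {
                const std::size_t src = (y * width + x) * channels + c;
                const std::size_t dst = (c * height + y) * width + x;
                chw[dst] = static_cast<float>(hwc[src]) / 255.0f;
            }
        }
    }
    return chw;
}
```

Writing the index arithmetic explicitly makes it easier to check each intermediate buffer against its Python counterpart.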

The drawing logic used in Python, which renders the keypoints into a pose map, is implemented using the equivalent OpenCV-dependent C++ interfaces.

To ensure result alignment, a common approach is to export the tensor computation results from both Python and C++ into text files before and after each operation. These results can then be analyzed for differences using Jupyter notebooks.
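The C++ side of that workflow might look like the sketch below. It assumes a one-value-per-line text format, which is a choice made here for illustration. The helper names are hypothetical, and `max_abs_diff` merely stands in for the notebook-based analysis described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <limits>
#include <ostream>
#include <sstream>  // convenient for in-memory round trips
#include <vector>

// Writes one value per line with enough digits to round-trip a float exactly.
void dump_tensor(std::ostream& out, const std::vector<float>& data) {
    out << std::setprecision(std::numeric_limits<float>::max_digits10);
    for (float v : data) {
        out << v << '\n';
    }
}

// Reads whitespace-separated values, e.g. a dump written by dump_tensor or
// by the Python side (assuming it uses the same one-value-per-line layout).
std::vector<float> load_tensor(std::istream& in) {
    std::vector<float> data;
    float v = 0.0f;
    while (in >> v) {
        data.push_back(v);
    }
    return data;
}

// Largest element-wise absolute difference between two equally sized tensors.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        worst = std::max(worst, std::fabs(a[i] - b[i]));
    }
    return worst;
}
```

Passing a `std::ofstream` to `dump_tensor` after each operation produces files that can be diffed against the Python dumps step by step, which narrows down where the two implementations first diverge.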

Finally, after completing the entire pipeline, we conduct alignment checks both visually and at the pixel level. Below are some of the results we achieved:

I tested the implementation using data from the COCO dataset. For some images, the C++ and Python outputs were identical at the pixel level. For others, there were slight differences that are nearly imperceptible to the human eye.

You can view the complete test results here.

ControlNet

The implementation of the ControlNet component involves straightforward preprocessing and post-processing, primarily focusing on scaling and alignment. As a result, most of this work can reuse the relevant portions of the code implemented for the detectors.
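One plausible form of this scaling and alignment step is sketched below: the condition image is scaled so that its shorter side matches a target resolution, and each dimension is then rounded to a multiple of 8, the downsampling factor of the Stable Diffusion 1.5 VAE. The exact rule used by the project is not stated in this article, so the rule and the names `Size`, `round_to_multiple`, and `aligned_size` are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct Size {
    int width;
    int height;
};

// Rounds `value` to the nearest multiple of `multiple`, never below one multiple.
int round_to_multiple(int value, int multiple) {
    assert(multiple > 0);
    const int rounded = ((value + multiple / 2) / multiple) * multiple;
    return std::max(rounded, multiple);
}

// Scales `input` so its shorter side equals `target_short_side`, then aligns
// both dimensions to `multiple` (8 is assumed here for Stable Diffusion 1.5).
Size aligned_size(Size input, int target_short_side, int multiple = 8) {
    assert(input.width > 0 && input.height > 0 && target_short_side > 0);
    const double scale = static_cast<double>(target_short_side) /
                         std::min(input.width, input.height);
    const int w = static_cast<int>(std::lround(input.width * scale));
    const int h = static_cast<int>(std::lround(input.height * scale));
    return {round_to_multiple(w, multiple), round_to_multiple(h, multiple)};
}
```

For a 640x480 input and a target of 512, this yields 680x512. Whichever rule is used, it has to match the Python side exactly, because a one-pixel difference in the target size changes every resized value downstream.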

Once ControlNet is integrated into the Stable Diffusion denoising loop, the entire pipeline is complete. The most challenging aspect remains ensuring that the results are aligned with those from Python.

We compared the results from Python and C++ using the previously described methods. Key areas where the results differ include:

1. Resize Algorithms: In Python, resizing operations are performed using a mix of OpenCV and PIL libraries. In C++, all resizing operations use OpenCV. This difference in libraries can cause variations in the final output.
2. Scheduler Inconsistencies: The schedulers in Python and C++ use the same algorithms, but the numerical results can still differ slightly. These discrepancies are minor and may arise from differences in standard library functions (e.g., log, exp operations) or variations in numerical types used during computation.
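The second point can be reproduced in isolation. The sketch below evaluates the same formula in `float` and in `double`: the largest noise level (sigma) of the scaled-linear beta schedule commonly used with Stable Diffusion 1.5. The schedule parameters are an assumption taken from common defaults, not from this project's scheduler code, and `final_sigma` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Computes sigma at the last training timestep for a scaled-linear beta
// schedule, using floating-point type T for every intermediate value.
template <typename T>
T final_sigma(std::size_t num_steps = 1000,
              T beta_start = T(0.00085),
              T beta_end = T(0.012)) {
    assert(num_steps > 1);
    const T start = std::sqrt(beta_start);
    const T end = std::sqrt(beta_end);
    T alpha_cumprod = T(1);
    for (std::size_t i = 0; i < num_steps; ++i) {
        // Linear interpolation in sqrt(beta) space, then squared.
        const T root = start + (end - start) * static_cast<T>(i) /
                                   static_cast<T>(num_steps - 1);
        alpha_cumprod *= T(1) - root * root;
    }
    return std::sqrt((T(1) - alpha_cumprod) / alpha_cumprod);
}
```

Comparing `final_sigma<float>()` with `final_sigma<double>()` typically shows agreement to only a few decimal places, because rounding error accumulates across the 1000 multiplications. Differences of that size, repeated at every denoising step, are enough to explain small pixel-level deviations between two otherwise identical pipelines.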

Below are the differences between the complete pipelines in C++ and Python, using the same seed and under different conditions.

In each row:

The first image shows the result from the Python pipeline.
The second image presents the result from the C++ pipeline, including the accumulated differences from both points above.
The third image displays the result from the C++ pipeline using detector outputs produced by Python, which eliminates the differences from the first point because it uses Python’s detector results directly.
The last two images illustrate the average grayscale maps of pixel differences in both scenarios, with brighter areas indicating larger discrepancies.
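A difference map of this kind can be computed as in the sketch below. The article does not spell out the exact formula, so averaging the absolute per-channel difference for each pixel is an assumption, and `mean_abs_diff_map` is a hypothetical helper operating on interleaved 8-bit buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Returns one grayscale value per pixel: the absolute difference between the
// two images, averaged over the channels. Brighter means a larger discrepancy.
std::vector<std::uint8_t> mean_abs_diff_map(const std::vector<std::uint8_t>& a,
                                            const std::vector<std::uint8_t>& b,
                                            std::size_t channels) {
    assert(channels > 0);
    assert(a.size() == b.size() && a.size() % channels == 0);
    std::vector<std::uint8_t> map(a.size() / channels);
    for (std::size_t p = 0; p < map.size(); ++p) {
        int sum = 0;
        for (std::size_t c = 0; c < channels; ++c) {
            const std::size_t i = p * channels + c;
            sum += std::abs(static_cast<int>(a[i]) - static_cast<int>(b[i]));
        }
        map[p] = static_cast<std::uint8_t>(sum / static_cast<int>(channels));
    }
    return map;
}
```

An all-zero map corresponds to the pixel-identical case, so the same helper doubles as a quick regression check.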

You can view the complete test results here.

Usage

Clone Repo

git clone --recursive -b detectors https://github.com/chux0519/openvino.genai.git

cd openvino.genai/image_generation/stable_diffusion_1_5_controlnet/cpp/

Setup Environment

Prerequisites:

Conda (installation guide)

C++ Packages:

CMake: Cross-platform build tool
OpenVINO: Model inference
OpenCV: OpenPose dependency

Prepare a Python environment and install dependencies:

conda create -n ov_sd_controlnet python==3.11
conda activate ov_sd_controlnet
pip install -r ../../common/detectors/scripts/requirements.txt
pip install -r scripts/requirements.txt

Convert Models

Convert tokenizer

pip install openvino-tokenizers

convert_tokenizer openai/clip-vit-large-patch14 --with-detokenizer -o models/tokenizer

Convert the rest of the models

python scripts/convert_sd_controlnet.py

The script downloads any missing models and converts them to the OpenVINO IR format.

Build the Application

On Windows, open a command prompt (x86 Native Tools Command Prompt for VS 2022) and initialize the OpenVINO and OpenCV environments before configuring and building:

"C:\Program Files (x86)\Intel\openvino_2024\setupvars.bat"
"D:\opencv\opencv\build\setup_vars_opencv4.cmd"

cmake -S . -B build -DOpenCV_DIR="D:\opencv\opencv\build\x64\vc16\lib" -DOpenVINO_DIR="C:\Program Files (x86)\Intel\openvino_2024\runtime\cmake"

cmake --build build --parallel --config Release

Run the Pipeline

.\build\Release\stable_diffusion_controlnet.exe -p "Dancing Darth Vader, best quality, extremely detailed" -n "monochrome, lowres, bad anatomy, worst quality, low quality" -i ".\scripts\pose.png" -s 42 --step 20 -d GPU.1

Additionally, if you’re interested in the GUI, I’ve created a demo using ImGui. You can compile the code from this branch and give it a try.

The overall layout is inspired by the img2img tab of the AUTOMATIC1111 (A1111) web UI, and it looks something like this:

Acknowledgment

I am deeply honored to have been accepted into GSoC and to have had the opportunity to collaborate with so many talented individuals. It’s particularly exciting that this year’s GSoC opened up participation to non-students like myself, giving me the chance to explore new areas of interest.

In this rapidly evolving AI landscape, the knowledge I’ve gained will undoubtedly serve as valuable experience for my future work.

I am truly grateful to the OpenVINO community and GSoC for providing such a wonderful opportunity. I would also like to extend special thanks to my mentors, Fiona and Su Yang, for their patience and invaluable guidance throughout the implementation process.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.