What Is ControlNet? ControlNet Simple Explanation (Stable Diffusion)

Antalpha.ai
9 min read · Apr 5, 2023


After using Stable Diffusion for a while, you have probably noticed that image generation can go in endless directions. Every click feels like a gamble: will this one produce a good image or not? Even with intricate, carefully written prompts, there is a limit to how much you can take the wheel and control the AI. The model will only follow your prompt or reference image so far, and rarely in exactly the way you intended.

Image to Image

As an example, take this simple output generated with Image to Image. Yes, the AI captured the main points of the reference, but it still doesn't feel like the same person, let alone the hands. It also lost some details of the face and hair.

Stable Diffusion has plenty of controls to learn and master beyond the basic parameters. Each of them plays an important role depending on what you are aiming for in the final result. This article covers the additional controls you need to generate that one exact picture, and shows how each of them can help bring your images closer to your preferences.

ControlNet is a neural network architecture developed by researchers at Stanford University and available as an extension for Stable Diffusion. It aims to let creators easily control the objects in AI-generated images and video by conditioning image generation on inputs such as edge detection, depth analysis, sketch processing, or human pose. ControlNet can be summarized as a simple way to add this kind of conditional control, a lightweight form of fine-tuning, on top of Stable Diffusion.

This allows for greater precision and improved control when creating images with the Text-to-Image and Image-to-Image features. ControlNet owes its popularity to how simple it is to use: a pose or shape can be extracted from any photo sourced from the internet. Below, we will go through each part of ControlNet to see how each setting affects the output.
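
If you prefer working in code to the WebUI, the same idea can be sketched with the diffusers library. The snippet below is a minimal example, not part of the extension discussed in this article: it assumes the diffusers and torch packages, a CUDA GPU, and the publicly hosted checkpoints lllyasviel/sd-controlnet-canny and runwayml/stable-diffusion-v1-5. The conditioning image steers the composition while the prompt still decides style and content.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a ControlNet trained on Canny edges and attach it to an SD 1.5 pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The conditioning image (an edge map prepared beforehand) controls composition,
# while the prompt still decides style and content.
edge_map = load_image("canny_edges.png")
result = pipe(
    "portrait photo of a woman, studio lighting",
    image=edge_map,
    num_inference_steps=20,
).images[0]
result.save("controlled_output.png")
```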

Inside the ControlNet image checkboxes, there are several options (see the code sketch after this list):

  • Enable: check this box to turn ControlNet on for your image-generation process.
  • Invert Input Color: used when detecting the image you upload. Preprocessors and models usually expect white lines on a black background, so if your drawing has black lines on a white background, use this option to swap the colors.
  • RGB to BGR: changes the channel order of the color information in user-imported images. Sometimes the color information in an image is arranged differently from what the extension expects. If you are using the Normal Map preprocessor, you can skip this.
  • Low VRAM: slows down generation (longer ETA) but uses less GPU memory, which helps on cards with less than 6 GB of VRAM.
  • Guess Mode: lets ControlNet recognize the object in the imported image with the selected preprocessor, without needing a prompt or negative prompt.
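
For readers who drive the WebUI through its HTTP API rather than the browser, these checkboxes correspond to fields of a ControlNet "unit" in the request payload. The sketch below is illustrative only: it assumes the AUTOMATIC1111 WebUI is running locally with the --api flag and the sd-webui-controlnet extension installed, and the field names have changed between extension versions, so verify them against your own /docs page.

```python
import base64
import requests

with open("pose_reference.png", "rb") as f:
    control_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "prompt": "a knight standing in a forest",
    "steps": 20,
    "alwayson_scripts": {
        "controlnet": {
            "args": [{
                "enabled": True,        # the "Enable" checkbox
                "module": "openpose",   # preprocessor
                "model": "control_sd15_openpose",
                "image": control_image,
                "lowvram": True,        # the "Low VRAM" checkbox
                # Invert Input Color, RGB to BGR and Guess Mode have their own
                # flags in the extension's API; names vary by version, so check /docs.
            }]
        }
    },
}

# Default local address of the WebUI when launched with --api (an assumption).
r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
r.raise_for_status()
```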

The Resize Mode options adjust the ControlNet image to the size and aspect ratio of the images you are generating (a short calculation after this list illustrates the difference between the modes).

  • Envelope (Outer Fit): resizes the ControlNet image so that the Txt2Img dimensions fit within it. The image is scaled until it fully covers the width and height set in the Txt2Img settings.
  • Scale to Fit (Inner Fit): resizes the ControlNet image so that it fits within the Txt2Img dimensions. The image is scaled until it fits inside the width and height of the Txt2Img settings.
  • Just Resize: stretches or compresses the ControlNet image to match the width and height of the Txt2Img settings exactly, without preserving the aspect ratio.
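
The difference between the three modes boils down to a small calculation. The helper below is a rough sketch, not the extension's actual code: it only computes the dimensions a control image would be scaled to before any cropping or padding takes place.

```python
def resized_control_image(src_w, src_h, target_w, target_h, mode):
    """Return the width/height the control image is scaled to for each mode."""
    if mode == "just_resize":
        # Stretch to the Txt2Img size; aspect ratio is not preserved.
        return target_w, target_h

    scale_w = target_w / src_w
    scale_h = target_h / src_h
    if mode == "inner_fit":      # Scale to Fit: image fits inside the target
        scale = min(scale_w, scale_h)
    elif mode == "outer_fit":    # Envelope: image fully covers the target
        scale = max(scale_w, scale_h)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return round(src_w * scale), round(src_h * scale)


# A 1024x768 control image targeted at a 512x512 generation:
print(resized_control_image(1024, 768, 512, 512, "inner_fit"))    # (512, 384)
print(resized_control_image(1024, 768, 512, 512, "outer_fit"))    # (683, 512)
print(resized_control_image(1024, 768, 512, 512, "just_resize"))  # (512, 512)
```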

The Canvas Width and Canvas Height are only used when you want to draw or sketch directly in Stable Diffusion without uploading an image (preferably with the Scribble preprocessor for good results). They resize the blank canvas you draw on and do not affect any image you have uploaded.

The “Preview Annotator Result” button gives you a quick look at how the chosen preprocessor will transform the image or drawing you uploaded into a detectmap for ControlNet. This is particularly helpful for trying out various preprocessors before rendering the output images, since it saves time. To remove the preview image, click the “Hide Annotator Result” option.

The Preprocessor and Model are the main tools of ControlNet. You are given a selection of controls to choose from based on the desired output. Note that there is no single best or worst choice in this list; each has its own criteria and compatibility depending on the input image. Every preprocessor has a model that was designed for it and named accordingly. Nothing stops you from mixing and matching preprocessors and models; we have tried, but most of the time the results were not what we expected, so it is still best to use the matching model. Below, we cover several of the most popular ControlNet types in Stable Diffusion and give examples of how they are used.
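
Because each preprocessor expects its matching model, it can help to keep the pairing explicit if you script your workflow. The mapping below simply restates the combinations covered in this article using the SD 1.5 filenames; the labels are informal, and the exact names on your machine depend on which checkpoints you downloaded.

```python
# Preprocessor -> matching ControlNet model, restating the pairs covered below.
# The keys are this article's informal names; the WebUI dropdown may differ slightly.
PREPROCESSOR_TO_MODEL = {
    "canny":        "control_sd15_canny",
    "depth":        "control_sd15_depth",
    "depth_leres":  "control_sd15_depth",
    "hed":          "control_sd15_hed",
    "mlsd":         "control_sd15_mlsd",
    "normal_map":   "control_sd15_normal",
    "openpose":     "control_sd15_openpose",
    "scribble":     "control_sd15_scribble",
    "segmentation": "control_sd15_seg",
}

def model_for(preprocessor: str) -> str:
    """Look up the matching model, failing loudly on an unknown preprocessor."""
    if preprocessor not in PREPROCESSOR_TO_MODEL:
        raise ValueError(f"no matching model known for '{preprocessor}'")
    return PREPROCESSOR_TO_MODEL[preprocessor]

print(model_for("canny"))  # control_sd15_canny
```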

  • Canny

In the workflow illustrated below, Canny traces the input image by drawing outlines around high-contrast areas with an edge detector. Canny lines can capture very detailed information, but if your image has objects in the background, there is a high probability it will pick up unwanted ones too; it works best when there are fewer objects in the background. The optimal model to use with this preprocessor is control_sd15_canny. We can say, the results are canny!

Preprocessor: Canny & Model: control_sd15_canny
Canny Map — https://doi.org/10.48550/arXiv.2302.05543
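
To get a feel for what the Canny preprocessor produces, you can run OpenCV's edge detector yourself; the detectmap shown in the annotator preview is essentially this kind of white-on-black edge image. The thresholds below are arbitrary example values rather than the extension's defaults.

```python
import cv2

image = cv2.imread("portrait.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Low/high thresholds control how much detail survives: raise them to drop
# faint background edges, lower them to keep fine lines (hair, fabric, etc.).
edges = cv2.Canny(gray, 100, 200)

cv2.imwrite("canny_detectmap.png", edges)  # white lines on black, as ControlNet expects
```
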
  • Depth & Depth Leres

This preprocessor generates a depth estimation of the input image. Depth is commonly used to control the spatial positioning of objects in the image: light areas are closer to the viewer, while darker areas are further away, fading into the background.

It can still generate a similar image and capture the bigger picture, but it may lose details inside the image (face, facial expression, etc.). It is usually used with the control_sd15_depth model. The Midas Resolution setting increases or decreases the size and level of detail of the detectmap: the higher it is, the more VRAM is used, but the generated image can be of higher quality, and vice versa.

Meanwhile, Depth Leres has the same basic concept as Depth but includes a wider range of the scene in the map. Sometimes it captures too much information from the picture and generates an image slightly different from the original. It is best to try both preprocessors first and decide which one fits your criteria.

Preprocessor: Depth & Model: control_sd15_depth
Preprocessor: Depth Leres & Model: control_sd15_depth
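
Outside the WebUI, a comparable depth detectmap can be produced with the controlnet_aux helper package, which repackages the annotators from the ControlNet repository. Treat this as a sketch under that assumption: the package's exact return values and the handling of the Midas Resolution setting can differ between versions.

```python
from controlnet_aux import MidasDetector
from PIL import Image

# MiDaS monocular depth estimation: brighter pixels = closer to the viewer.
midas = MidasDetector.from_pretrained("lllyasviel/Annotators")

image = Image.open("portrait.jpg").convert("RGB")
depth_map = midas(image)          # a PIL image usable as the ControlNet input
depth_map.save("depth_detectmap.png")
```
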
  • HED (Holistically-Nested Edge Detection)

The HED preprocessor is a method of image processing that creates clear, refined boundaries around objects; the output resembles Canny but with less noise and softer edges. Its effectiveness lies in its ability to capture complex details and contours while retaining fine features (facial expression, hair, fingers, etc.). The HED preprocessor can be used for modifying the style and color of images. The optimal model to use with this preprocessor is control_sd15_hed.

Preprocessor: HED & Model: control_sd15_hed
HED Map — https://doi.org/10.48550/arXiv.2302.05543
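
The HED boundary map can be previewed the same way, again assuming the controlnet_aux package (the WebUI bundles its own copy of the annotator):

```python
from controlnet_aux import HEDdetector
from PIL import Image

hed = HEDdetector.from_pretrained("lllyasviel/Annotators")

image = Image.open("portrait.jpg").convert("RGB")
# Soft, low-noise boundaries: compare with the harder lines from cv2.Canny above.
soft_edges = hed(image)
soft_edges.save("hed_detectmap.png")
```
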
  • MLSD (Mobile Line Segment Detection)

The MLSD preprocessor works best for producing strong lines that can detect architecture and other man-made structures requiring distinct, rigid outlines. It is not suitable for non-rigid or curved objects. MLSD is a good fit for generating interior room layouts or building architecture, since it accentuates straight lines and edges. The optimal model to use with this preprocessor is control_sd15_mlsd.

MLSD Map — https://doi.org/10.48550/arXiv.2302.05543
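
A rough way to preview the MLSD line map outside the WebUI, under the same controlnet_aux assumption:

```python
from controlnet_aux import MLSDdetector
from PIL import Image

mlsd = MLSDdetector.from_pretrained("lllyasviel/Annotators")

image = Image.open("living_room.jpg").convert("RGB")
# Only straight segments are kept, which is why curved or organic
# shapes tend to disappear from this detectmap.
line_map = mlsd(image)
line_map.save("mlsd_detectmap.png")
```
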
  • Normal map

The normal map uses three main colors (red, green, blue) to encode the orientation of surfaces, pinpointing how rough or smooth an object appears from different angles. The preprocessor generates a rudimentary estimation of a normal map that can retain a considerable amount of detail, but it may produce unintended outcomes, since the map is derived solely from the image and not constructed in 3D modeling software.

Normal mapping is advantageous for highlighting intricate details and contours, and it is also effective in situating objects, particularly in terms of proximity and distance. The “Normal Background Threshold” adjusts the background components in the map: setting a higher threshold removes the distant parts of the background (blending them into the purple color), while lowering the threshold tells the AI to retain or even reveal additional background elements. The optimal model to use with this preprocessor is control_sd15_normal.

Preprocessor: Normal Map & Model: control_sd15_normal
Normal Map — https://doi.org/10.48550/arXiv.2302.05543
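
To see why a normal map derived solely from the image is only an approximation, here is a toy conversion from a depth map to an RGB normal map using image gradients. This is not the extension's actual algorithm (which estimates depth with MiDaS and applies its own background threshold); it only illustrates how surface orientation ends up encoded in the three color channels.

```python
import numpy as np
from PIL import Image

def depth_to_normal(depth: np.ndarray, strength: float = 2.0) -> np.ndarray:
    """Approximate a normal map from a 2-D depth array (larger value = closer)."""
    depth = depth.astype(np.float32)
    dz_dy, dz_dx = np.gradient(depth)              # local slope of the surface
    normals = np.dstack((-dz_dx * strength,        # red   = left/right tilt
                         -dz_dy * strength,        # green = up/down tilt
                         np.ones_like(depth)))     # blue  = facing the viewer
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    return ((normals * 0.5 + 0.5) * 255).astype(np.uint8)   # map [-1, 1] to RGB

depth = np.asarray(Image.open("depth_detectmap.png").convert("L"))
Image.fromarray(depth_to_normal(depth)).save("normal_detectmap.png")
```
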
  • OpenPose

This preprocessor generates a basic skeletal stick-figure map to be used as a controller. The technique is widely employed because multiple OpenPose skeletons can be combined into a single image, which helps guide Stable Diffusion to generate multiple consistent subjects. The skeleton figure has many keypoints, each representing one of the subject's joints, and OpenPose detects them automatically. To get the best map from OpenPose, upload an image of a person (either full body or half body) in the pose you want to extract. The optimal model to use with this preprocessor is control_sd15_openpose.

OpenPose Points — https://viso.ai/deep-learning/openpose/
OpenPose Map — https://doi.org/10.48550/arXiv.2302.05543
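
The stick-figure map itself can be generated with the OpenPose annotator from controlnet_aux (again an assumption; the WebUI ships its own copy). Several skeletons produced this way can be composited into one image before handing it to ControlNet.

```python
from controlnet_aux import OpenposeDetector
from PIL import Image

openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

photo = Image.open("dancer.jpg").convert("RGB")
# The detector draws each detected person's joints as a colored stick figure
# on a black background, which is the map the openpose model is trained on.
skeleton = openpose(photo)
skeleton.save("openpose_detectmap.png")
```
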
  • Scribble

Scribble is designed to generate an image from simple black-and-white line drawings and sketches. The drawings are typically uploaded as an image, but you can also use the “Canvas” option to create a blank canvas of a specific size for manual sketching. If the sketch consists of black lines on a white background, the “Invert Input Color” checkbox must be checked. The optimal model to use with this preprocessor is control_sd15_scribble.

Scribble Map — https://doi.org/10.48550/arXiv.2302.05543
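
The Invert Input Color step mentioned above is just a pixel-wise color flip. If you prepare scribbles outside the WebUI, you can perform it yourself with Pillow before uploading, so the sketch arrives as white lines on black, the convention the model expects.

```python
from PIL import Image, ImageOps

# A typical hand-drawn sketch: black lines on a white background.
sketch = Image.open("scribble.png").convert("RGB")

# Flip it to white lines on a black background before uploading.
inverted = ImageOps.invert(sketch)
inverted.save("scribble_white_on_black.png")
```
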
  • Segmentation

The Segmentation preprocessor detects the uploaded image and divides it into segments, or regions, within the same image. The model then applies this detectmap, together with the text prompt, when generating a new set of images. The optimal model to use with this preprocessor is control_sd15_seg.

Segmentation Map (1) — https://doi.org/10.48550/arXiv.2302.05543
Segmentation Map (2) — https://doi.org/10.48550/arXiv.2302.05543
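
One way to produce a segmentation map outside the WebUI is a semantic segmentation model from the transformers library; the snippet below is a sketch under that assumption, using the openmmlab/upernet-convnext-small checkpoint. Keep in mind that the control_sd15_seg model was trained on maps colored with the ADE20K palette, so the grayscale ids saved here are for inspection only, not a drop-in detectmap.

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, UperNetForSemanticSegmentation

# An ADE20K-trained semantic segmentation model from the Hugging Face hub.
processor = AutoImageProcessor.from_pretrained("openmmlab/upernet-convnext-small")
model = UperNetForSemanticSegmentation.from_pretrained("openmmlab/upernet-convnext-small")

image = Image.open("living_room.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Per-pixel class ids at the original resolution.
seg = processor.post_process_semantic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]

# A real detectmap paints every class id with its ADE20K palette color;
# here the ids are just rescaled to grayscale so the regions are visible.
ids = seg.numpy().astype(np.uint8)
Image.fromarray(ids * (255 // max(int(ids.max()), 1))).save("segmentation_regions.png")
```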

Even after learning the basics and foundations of ControlNet, you will still need to experiment and get used to its settings. Each image and prompt may react differently to the chosen ControlNet. Using ControlNet also keeps you from piling up messy prompts just to force a certain image.

If you want to learn more about ControlNet, you can access this link. ControlNet will keep being updated in the future, and new models and preprocessors will likely keep being added as well. If you missed any detail, you can refer to this table's Preprocessor and ControlNet Model columns.

If you have anything to ask or want to brainstorm with us, feel free to join our Antalpha.io Telegram and stay tuned to our Twitter! :)
