Understand Classifier Guidance and Classifier-free Guidance in diffusion models via Python pseudo-code

Baicen Xiao
7 min read · Apr 28, 2024


Image generated from stable-diffusion-xl-base

When first encountering diffusion models, we typically start by learning about the forward process (from image to noise) and the backward process (from noise to image), where images are generally generated from noise without specific conditions. However, we often want to control the generated images, such as generating only dogs or cats, or creating a cartoon image from a selfie.

In such cases, we need to introduce conditional controls, which involves understanding classifier guidance and classifier-free guidance. The purpose of this article is to quickly introduce the reader to the key differences between these two approaches, via pseudo-code and some basic mathematical explanation.

First, let's summarize the main differences between classifier guidance and classifier-free guidance:

- Classifier guidance: keeps the diffusion model unchanged, but needs a separate classifier (ideally trained on noisy images) that provides a guiding gradient at sampling time; it is limited to the classes that classifier knows.
- Classifier-free guidance: needs no external classifier, but requires training (or retraining) the diffusion model with the condition as an extra input; once trained, it supports arbitrary conditions such as class labels, text, or images.

Classifier Guidance

The Markov chain of forward (reverse) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020 with a few additional annotations from lilianweng blog)

Each step of the reverse process of a diffusion model generates a less noisy image x_{t-1} from the noisy image x_t:

p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))

We won’t dive deep into the derivation of equations, but if you are interested in the mathematical derivation, please refer to this nice blog and tutorial article.

The most important part is the mean of the Gaussian distribution. If we can get the mean of the distribution, then we can simply use the mean as x_{t-1} directly. There are three equivalent ways to parameterize diffusion models (i.e., learn x_0, learn the noise, or learn the score function; refer to this nice tutorial if you are interested). For the ease of introducing classifier guidance and classifier-free guidance, we choose to estimate the mean of the Gaussian distribution via score-based estimation:

\mu_t(x_t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t + (1 - \alpha_t) \nabla_{x_t} \log p(x_t) \right)    (1)

where the gradient of the log distribution of x_t, \nabla_{x_t} \log p(x_t), is called the 'score'; it is usually not directly available, but it can be approximated by learning a score function with a neural network:

s_\theta(x_t, t) \approx \nabla_{x_t} \log p(x_t)

and then the mean of the Gaussian distribution becomes:

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t + (1 - \alpha_t) s_\theta(x_t, t) \right)
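In practice, the network usually predicts the noise \epsilon_\theta(x_t, t) (as in the pseudo-code later in this article) rather than the score itself; the two only differ by a rescaling. Here is a minimal sketch of how the mean could be computed from the predicted noise, assuming standard DDPM notation (alpha_t = 1 - beta_t, alpha_bar_t the cumulative product of the alphas); the function name is illustrative:

def mean_from_predicted_noise(x_t, noise_pred, alpha_t, alpha_bar_t):
    # The score is a rescaled negative of the predicted noise:
    # score ~= -epsilon_theta(x_t, t) / sqrt(1 - alpha_bar_t)
    score = -noise_pred / (1.0 - alpha_bar_t) ** 0.5
    # Mean of the reverse step: (1 / sqrt(alpha_t)) * (x_t + (1 - alpha_t) * score)
    return (x_t + (1.0 - alpha_t) * score) / alpha_t ** 0.5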

As you can see, the mean is a function of x_t (and the timestep t) only, which means no condition has been introduced so far. The question of how to inject a condition (denoted y) becomes the question of how to modify the score function.

After introducing the condition y, the probability p(x_t) inside the gradient of equation (1) becomes the conditional probability p(x_t | y):

\nabla_{x_t} \log p(x_t) \rightarrow \nabla_{x_t} \log p(x_t | y)

By applying Bayes' Theorem, we obtain a new score function:

\nabla_{x_t} \log p(x_t | y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y | x_t)

The first term is the score of the unconditional diffusion model itself; the only addition is the second term, the gradient of a classifier's log-probability of y given the noisy image x_t. This means that for classifier guidance, conditional generation only requires adding an extra gradient from a classifier. This equation is the same as equation (12) in the paper.
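For completeness, this decomposition follows directly from Bayes' Theorem, together with the fact that p(y) does not depend on x_t, so its gradient with respect to x_t vanishes:

\nabla_{x_t} \log p(x_t | y) = \nabla_{x_t} \log \frac{p(x_t) \, p(y | x_t)}{p(y)}
= \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y | x_t) - \nabla_{x_t} \log p(y)
= \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y | x_t)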

To control the strength of guidance, we can add a guidance_scale parameter \lambda:

\nabla_{x_t} \log p(x_t) + \lambda \nabla_{x_t} \log p(y | x_t)    (2)

To help you understand how equation (2) works, here is pseudo-code for classifier guidance:

model = ...             # Load a pre-trained diffusion model that predicts noise
scheduler = ...         # Noise scheduler that defines the timesteps and the update rule
classifier_model = ...  # Load a pre-trained image classification model
y = 1                   # We want to generate an image of class 1; let's assume class 1 corresponds to the "cat" category
guidance_scale = 7.5    # Controls the strength of the class guidance; the higher, the stronger
input = get_noise(...)  # Randomly draw noise with the same shape as the output image from a Gaussian distribution

for t in tqdm(scheduler.timesteps):
    # Use the diffusion model to predict noise (the score, i.e. the first term in equation (2))
    with torch.no_grad():
        noise_pred = model(input, t).sample

    # Classifier guidance step
    class_guidance = classifier_model.get_class_guidance(input, y)  # Compute the gradient using the classifier (second term in equation (2))
    noise_pred += class_guidance * guidance_scale  # Apply the gradient

    # Calculate x_{t-1} using the updated noise
    input = scheduler.step(noise_pred, t, input).prev_sample

The key is the classifier guidance step:

class_guidance = classifier_model.get_class_guidance(input, y)

What this line does is feed the current generated image ‘input’ and the desired category ‘y’ into the classification model.

The class predicted by the classifier will not necessarily be ‘y’, but we want it to be as close to ‘y’ as possible, so we calculate a loss between the classifier’s prediction and ‘y’.

Then, similar to how gradient back-propagation is done during classifier model training, we calculate the gradient. The difference is that, while training a classifier model requires obtaining gradients of the weight parameters for updating via gradient descent, here we only need to retain the gradient with respect to the ‘input’.

Finally, we add the calculated gradient, scaled by the ‘guidance_scale’, to the predicted noise; this shifts each denoising step toward images that the classifier recognizes as class ‘y’.
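The method get_class_guidance is not a standard API; below is a minimal sketch of how such a helper could be implemented with PyTorch autograd. It assumes the classifier outputs logits over classes; in practice the classifier should be trained on noisy images (and is often also conditioned on the timestep t), and the exact sign and scaling conventions depend on your implementation.

import torch
import torch.nn.functional as F

def get_class_guidance(classifier, x_t, y):
    # Gradient of log p(y | x_t) with respect to the current noisy image x_t
    x_in = x_t.detach().requires_grad_(True)   # track gradients w.r.t. the image, not the weights
    logits = classifier(x_in)                  # classifier forward pass on the noisy image
    log_probs = F.log_softmax(logits, dim=-1)  # log p(class | x_t)
    selected = log_probs[:, y]                 # log-probability of the desired class y
    return torch.autograd.grad(selected.sum(), x_in)[0]

In the loop above, this could be used as class_guidance = get_class_guidance(classifier_model, input, y).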

If you are interested in the real code implementation, you may refer to this line from OpenAI.

Classifier-free Guidance

Classifier-free guidance requires only a slight modification on top of classifier guidance. By rearranging the Bayes decomposition above, the classifier gradient can be expressed using a conditional and an unconditional score:

\nabla_{x_t} \log p(y | x_t) = \nabla_{x_t} \log p(x_t | y) - \nabla_{x_t} \log p(x_t)

Plugging this into equation (2) from the classifier guidance section, we have:

\nabla_{x_t} \log p(x_t) + \lambda \left( \nabla_{x_t} \log p(x_t | y) - \nabla_{x_t} \log p(x_t) \right) = (1 - \lambda) \nabla_{x_t} \log p(x_t) + \lambda \nabla_{x_t} \log p(x_t | y)
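Since the score is just a rescaled (and negated) version of the predicted noise, the same combination can be applied directly to the noise predictions. This is exactly the update you will see in the pseudo-code below:

noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

where noise_pred_uncond is the prediction under the empty condition and noise_pred_text is the prediction under the text condition.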

Classifier guidance can only control the categories known to the classification model. If the classifier distinguishes 10 classes, then classifier guidance can only steer the diffusion model toward those fixed 10 classes; it cannot handle any additional class.

Classifier-free guidance is a powerful and flexible method. Although it requires retraining the diffusion model, once trained it can be used directly, without any limitation on the number of categories. It works even when the condition is text or an image.

Let’s take a look at an example where the condition is text:

clip_model = ...      # Load an official CLIP model (used here as the text encoder)
model = ...           # Load a diffusion model (e.g. a UNet) trained with classifier-free guidance
scheduler = ...       # Noise scheduler
guidance_scale = 7.5  # Controls the strength of the text guidance

text = "a dog"                                  # Input text
text_embeddings = clip_model.text_encode(text)  # Encode the conditional text
empty_embeddings = clip_model.text_encode("")   # Encode empty text (the unconditional input)
text_embeddings = torch.cat([empty_embeddings, text_embeddings])  # Concatenate them together as the condition

input = get_noise(...)  # Randomly draw noise with the same shape as the output image from a Gaussian distribution

for t in tqdm(scheduler.timesteps):
    # Use the UNet for inference, predicting noise for the empty-text and the text condition in one batch
    with torch.no_grad():
        model_input = torch.cat([input] * 2)  # Duplicate the input to match the two concatenated conditions
        noise_pred = model(model_input, t, encoder_hidden_states=text_embeddings).sample

    # Classifier-free guidance step
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)  # Split into unconditional and conditional noise
    # Take the vector from the "unconditional noise" towards the "conditional noise",
    # and scale this vector according to the value of guidance_scale
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

    # Calculate x_{t-1} from x_t using the guided noise prediction
    input = scheduler.step(noise_pred, t, input).prev_sample

If we want the generated image to adhere more closely to the textual semantics of “a dog,” we set guidance_scale higher, at the cost of reduced diversity. Conversely, if we want the generated images to be more varied and rich, we set guidance_scale lower. Typically, a value of 7.5–10 is used.

Images generated from Stable Diffusion v1.5 under different guidance scales. Image source: https://zhuanlan.zhihu.com/p/660518657

If you want to see real code for classifier-free guidance, you may check this line of the Stable Diffusion pipeline in Hugging Face diffusers.

In summary, classifier-free guidance requires the model to learn two capabilities during training: conditional generation and unconditional generation. Both can be handled by a single network that takes the condition as an additional input (as in the pseudo-code above); during training, the condition is randomly dropped (e.g., replaced with an empty condition) so that the same network also learns unconditional generation.
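As a concrete illustration, here is a minimal sketch of what such a training loop could look like, assuming a noise-prediction (epsilon) objective and a diffusers-style scheduler with an add_noise method; the names dataloader, optimizer, num_train_timesteps and the drop probability p_uncond are illustrative, not from the article.

import torch
import torch.nn.functional as F

p_uncond = 0.1             # probability of dropping the text condition (illustrative value)
num_train_timesteps = 1000 # length of the forward noising process (illustrative value)

for images, texts in dataloader:
    # Encode the text condition; with probability p_uncond, replace it with the empty condition
    if torch.rand(1).item() < p_uncond:
        text_embeddings = clip_model.text_encode([""] * len(texts))
    else:
        text_embeddings = clip_model.text_encode(texts)

    # Forward process: sample a timestep and add noise to the clean images
    t = torch.randint(0, num_train_timesteps, (images.shape[0],))
    noise = torch.randn_like(images)
    noisy_images = scheduler.add_noise(images, noise, t)

    # A single network learns both conditional and unconditional prediction
    noise_pred = model(noisy_images, t, encoder_hidden_states=text_embeddings).sample

    loss = F.mse_loss(noise_pred, noise)  # standard noise-prediction objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()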

With enough high-quality training data, classifier-free guidance tends to yield better results and can generate an almost unlimited variety of image categories, without the need to train a separate classifier on noisy images. Therefore, classifier-free guidance is the most common approach currently used for text2image and image2image generation (such as OpenAI’s DALL·E 2 and Google’s Imagen).


Baicen Xiao

Machine Learning Engineer @ Adobe | PhD @ University of Washington