Latent Consistency Models (LCMs) Explained

Abhinav Gopal
5 min read · Nov 20, 2023

So we’ve all heard about diffusion models and Stable Diffusion. This month, there was a sudden surge in the popularity of “latent consistency models” (LCMs), a new adaptation of diffusion models that makes them blazingly fast.

I tried to find something online that explains how they actually work, and the paper was a pretty complicated read. I thought I’d publish this blog so that someone in my position can understand them more quickly!

Latent Diffusion Models (LDMs)

First, a quick overview of latent diffusion models:

  • An input image x comes into the model. x is encoded by an encoder Ɛ into a lower-dimensional vector z that lives in a “latent space.”
  • Gaussian noise is repeatedly added to this latent vector z for T steps; this is the “diffusion” process. The vector z after noise has been added T times is called z_T.
  • When generating images, we frequently have captions that guide the generation, or maybe other contextual information. For each of these contexts, we use a dedicated encoder τ_θ to map the context into the same latent space as z.
  • There is a series of denoising U-Nets whose purpose is to guess the amount of noise added in the corresponding “noise-adding” step of the diffusion process. So the first denoising U-Net estimates the noise added at step T of the diffusion process. The denoising U-Net takes in the encoded contexts along with z_T.
  • After T denoising U-Nets, we have a decoder D to take us back to the image space, producing a cool image! (A toy sketch of this whole loop is below.)
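To make the flow concrete, here is a toy sketch of that generation loop in Python. Every module here (the context encoder, U-Net, decoder, and the reverse update) is a random stand-in of my own, not real Stable Diffusion code; only the structure of the loop is the point.

    import torch

    # Toy stand-ins: random linear maps playing the roles of τ_θ, the U-Net, and D.
    torch.manual_seed(0)
    latent_dim, ctx_dim, T = 16, 8, 50
    tau_theta = torch.nn.Linear(ctx_dim, latent_dim)        # context encoder τ_θ
    unet = torch.nn.Linear(2 * latent_dim + 1, latent_dim)  # "denoising U-Net"
    decoder = torch.nn.Linear(latent_dim, 3 * 32 * 32)      # decoder D

    prompt_features = torch.randn(1, ctx_dim)               # pretend caption features
    ctx = tau_theta(prompt_features)

    z = torch.randn(1, latent_dim)                          # z_T: pure Gaussian noise
    for t in reversed(range(1, T + 1)):
        t_feat = torch.full((1, 1), t / T)                  # crude timestep embedding
        eps_hat = unet(torch.cat([z, ctx, t_feat], dim=-1)) # predicted noise at step t
        z = z - eps_hat / T                                 # simplified reverse update

    image = decoder(z).reshape(1, 3, 32, 32)                # back to "pixel" space
    print(image.shape)                                      # torch.Size([1, 3, 32, 32])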

Consistency Models (CMs)

Okay, now that we’re done with diffusion, let’s talk about consistency models (CMs).

The motivation behind CMs comes from a significant drawback of diffusion models: the slow, T-step denoising process needed to get the final image.

Consistency models try to accomplish this denoising task in one step. In CMs, we take a look at the noisy versions of the data (z_1 to z_T in LDMs) and learn a function that takes us directly to the denoised version (z_0 in LDMs). Mathematically, this looks like the following function:
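Roughly, in the paper’s notation:

    f_θ(z_t, t) = z*_0,  for any noised latent z_t at timestep t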

Here z*_0 is the function’s estimate of z_0. Note that this means that regardless of which timestep you choose, the estimate is supposed to be the same:
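In symbols (paraphrasing the paper):

    f_θ(z_t, t) = f_θ(z_{t'}, t')  for any two timesteps t and t'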

This is why they are called consistency models.

This brings about our consistency loss that is used to train CMs:
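A sketch of it, in the paper’s notation (the expectation is over data and sampled timesteps):

    L(θ, θ⁻; Φ) = E[ d( f_θ(x_{t_{n+1}}, t_{n+1}), f_{θ⁻}(x̂^Φ_{t_n}, t_n) ) ]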

θ⁻ is our “target” model, a slowly updated (running-average) copy of our current CM from earlier in training. θ is our current CM. x̂^Φ_{t_n} is an estimate of x at timestep t_n, obtained from x_{t_{n+1}} by one step of a numerical ODE solver Φ; it is what the target model is evaluated on. d is a distance function, like L2 distance.

This loss ensures that whether we start at x_{t_{n+1}} or x_{t_n}, we try to get the same approximation of x_0. Consistent values of x_0 are maintained!

Latent Consistency Models (LCMs)

Similar to LDMs, LCMs operate in a latent space that is smaller than the gigantic pixel space.

The idea behind LCMs is that we do the same process as in LDMs, but in one step! We stay in the latent space for quick computation and then decode the z*_0 that we get from the consistency function described earlier. LCMs are created from an already well-trained diffusion model.

The way they do this is by minimizing the consistency distillation loss. Here’s the equation; we’ll explain it right after!
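Sketched in the paper’s notation (c is the context/prompt embedding):

    L_CD(θ, θ⁻; Ψ) = E[ d( f_θ(z_{t_{n+1}}, c, t_{n+1}), f_{θ⁻}(ẑ^Ψ_{t_n}, c, t_n) ) ]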

θ⁻ is the “target” copy of the LCM (initialized from the teacher’s weights and then updated as a running average of θ), and θ is the LCM we are trying to train; the well-trained teacher diffusion model enters through the solver Ψ below. The f function is the exact same consistency function we talked about previously and predicts z_0.

ẑ^Ψ_{t_n} is an estimate of the latent at timestep t_n: the teacher LDM, together with a numerical ODE solver Ψ, approximates the integral of the probability-flow ODE from t_{n+1} back down to t_n (there’s some math here; if you’re interested, check out the paper in detail).

Putting it all together, the loss is the distance between what the student LCM says z_0 is starting from timestep t_{n+1}, and what the target network says z_0 is starting from the teacher’s estimate at timestep t_n.
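To ground this, here is a toy, self-contained sketch of one consistency-distillation update. All names and modules (f_student, f_target, teacher_eps, solver_step) are illustrative stand-ins of my own, not the paper’s code; d is taken to be L2 distance, and the context/prompt conditioning is omitted for brevity.

    import copy
    import torch

    latent_dim = 16
    f_student = torch.nn.Linear(latent_dim + 1, latent_dim)    # θ: predicts z_0 directly
    f_target = copy.deepcopy(f_student)                        # θ⁻: running-average copy
    teacher_eps = torch.nn.Linear(latent_dim + 1, latent_dim)  # frozen teacher noise net
    opt = torch.optim.Adam(f_student.parameters(), lr=1e-4)

    def f(net, z, t):
        # Consistency function: maps a noisy latent (and its timestep) straight to z_0.
        return net(torch.cat([z, t], dim=-1))

    def solver_step(z, t_hi, t_lo):
        # Stand-in for Ψ: one crude Euler-style teacher step from t_{n+1} down to t_n.
        eps = teacher_eps(torch.cat([z, t_hi], dim=-1))
        return z - (t_hi - t_lo) * eps

    z_next = torch.randn(4, latent_dim)              # a batch of z_{t_{n+1}}
    t_next = torch.full((4, 1), 0.8)                 # t_{n+1}
    t_n = torch.full((4, 1), 0.6)                    # t_n

    with torch.no_grad():
        z_hat = solver_step(z_next, t_next, t_n)     # ẑ^Ψ_{t_n}
        target = f(f_target, z_hat, t_n)             # θ⁻'s guess of z_0

    pred = f(f_student, z_next, t_next)              # θ's guess of z_0
    loss = torch.nn.functional.mse_loss(pred, target)
    loss.backward()
    opt.step()
    opt.zero_grad()

    # Nudge the target θ⁻ toward the student θ (running average).
    with torch.no_grad():
        for p_t, p_s in zip(f_target.parameters(), f_student.parameters()):
            p_t.mul_(0.95).add_(p_s, alpha=0.05)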

Additional Info

The LCM paper includes some additional details that will interest people who actually want to develop and train these models.

Classifier-free Guidance

First, the inclusion of classifier-free guidance (CFG). In the denoising stage of diffusion models, it turns out to be useful to also run the model with the context (prompts, other images, etc.) dropped, and to mix that unconditional prediction with the conditional one when computing the previous z_{t-1}. The combined estimate of the noise that our U-Net predicts between z_{t-1} and z_t is written as
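Paraphrasing the paper, with ∅ denoting the empty (dropped) context:

    ε̃_θ(z_t, ω, c, t) = (1 + ω) · ε_θ(z_t, c, t) − ω · ε_θ(z_t, ∅, t)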

Here ω (the guidance scale) balances the conditional and unconditional predictions. We incorporate this into our consistency distillation loss in LCMs by changing it to:
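Roughly, ω becomes an extra input to f and to the solver estimate:

    L_CD(θ, θ⁻; Ψ) = E[ d( f_θ(z_{t_{n+1}}, ω, c, t_{n+1}), f_{θ⁻}(ẑ^{Ψ,ω}_{t_n}, ω, c, t_n) ) ]

where ẑ^{Ψ,ω}_{t_n} is the solver’s one-step estimate computed with the CFG-augmented noise prediction above.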

This way, the calculation of z_0 by our f also takes the guidance scale ω into account.

Skipping Time Steps

LDMs sometimes have thousands of diffusion steps, which means that in order for us to create our consistency models, we have to calculate our consistency loss from timestep 1000 to 999, 999 to 998, …, 1 to 0.

The difference in z across a single timestep is probably tiny, and trying to learn those small changes is tough; getting this model to train would take a while.

Instead, the paper says to change our consistency distillation loss to look at k timesteps away from the current one:
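That is, the pair (t_{n+1}, t_n) roughly becomes (t_{n+k}, t_n):

    L_CD(θ, θ⁻; Ψ) = E[ d( f_θ(z_{t_{n+k}}, ω, c, t_{n+k}), f_{θ⁻}(ẑ^{Ψ,ω}_{t_n}, ω, c, t_n) ) ]

with ẑ^{Ψ,ω}_{t_n} now estimated by the solver from z_{t_{n+k}}, k steps away.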

Now, we don’t have to sample at extremely close intervals like z_1000 and z_999. The paper used k = 20, so we would now pair z_1000 with z_980. The difference in the z values is probably much bigger now!

Conclusion

That’s the overview of LCMs! A couple of key takeaways:

  • LCMs directly try to approximate (in one step) ALL of the noise that the diffusion process adds.
  • LCMs operate in the latent space, just like LDMs.
  • LCMs are “distilled” from LDMs using a smart consistency distillation loss that ensures that, regardless of how many diffusion timesteps you have taken, the denoiser gets you back to the same clean image.

Hope that helps!
