Denoising Diffusion-based Generative Modeling: Foundations and Applications

Wei-Chih Chen
Jun 18, 2023


Note that all contents are from the CVPR 2022 tutorial presented by Arash Vahdat, Karsten Kreis, and Ruiqi Gao. The full video is available here.

Introduction to Denoising Diffusion Models

Many diffusion-based generative models have been proposed around similar ideas, including the denoising diffusion probabilistic model (DDPM; Jonathan Ho et al.) and the denoising diffusion implicit model (DDIM; Jiaming Song et al.). Diffusion models consist of two processes: a forward diffusion process and a reverse denoising process. In the forward process, noise is gradually added to the input, while the reverse process learns how to generate data by denoising.

Forward Diffusion Process

Let us define the forward diffusion process, in which a small amount of Gaussian noise is added to the samples over T steps. Specifically, the forward diffusion is defined as a Markov chain whose step size is controlled by a variance schedule.
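
Since the original equation images are not reproduced here, the following uses the standard notation of Ho et al.: with a variance schedule β₁, …, β_T, the forward process is

```
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big),
\qquad
q(\mathbf{x}_{1:T} \mid \mathbf{x}_0) = \prod_{t=1}^{T} q(\mathbf{x}_t \mid \mathbf{x}_{t-1})
```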

The Markovian process has a nice property: the sample at any arbitrary time step is accessible in closed form, without going through all the intermediate steps. Additionally, the variance schedule is designed so that the sample becomes white noise at time step T.
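
Concretely, in the same standard notation, the closed-form diffusion kernel is

```
q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\,\mathbf{I}\big),
\qquad
\alpha_t = 1-\beta_t, \quad \bar{\alpha}_t = \textstyle\prod_{s=1}^{t} \alpha_s
```

so a noisy sample can be drawn directly as x_t = √ᾱ_t x₀ + √(1−ᾱ_t) ε with ε ~ N(0, I), and the schedule drives ᾱ_T toward 0, making q(x_T | x₀) approximately N(0, I).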

Note that the forward diffusion process is analogous to the encoder of an autoencoder. However, the forward process is fixed by the pre-defined variance schedule, whereas the encoder of an autoencoder requires training.

And how about generation? Recall that the diffusion parameters are designed such that the distribution of samples at time step T is a Gaussian distribution, so generation can start from pure noise and invert the diffusion step by step. Unfortunately, the true denoising distribution is intractable in general, but the small variance of each forward diffusion step makes it possible to approximate the denoising distribution with a Normal distribution.

Reverse Denoising Process

Although the true denoising distribution can be approximated by a Normal distribution, it is still difficult to estimate, since doing so requires knowledge of the entire data distribution. Therefore, we learn a model to approximate these conditional probabilities, in order to generate data from noisy samples.
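
In the standard parameterization, the learned reverse process is also a Markov chain with Normal transitions:

```
p(\mathbf{x}_T) = \mathcal{N}(\mathbf{x}_T;\, \mathbf{0}, \mathbf{I}),
\qquad
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\, \sigma_t^2 \mathbf{I}\big)
```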

To train a model for this approximation, a training objective needs to be derived; the variational upper bound on the negative log-likelihood is commonly used for such models.

Sohl-Dickstein et al. (ICML 2015) and Ho et al. (NeurIPS 2020) show that the variational upper bound can be rewritten as follows.
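
In the notation of Ho et al., the bound decomposes into per-step KL terms:

```
L = \mathbb{E}_q\Big[
\underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_T \mid \mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{L_T}
+ \sum_{t>1} \underbrace{D_{\mathrm{KL}}\big(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)\,\|\,p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)\big)}_{L_{t-1}}
\;-\; \underbrace{\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1)}_{L_0}
\Big]
```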

In the variational upper bound, the first term is independent of the model parameters and is a constant, so it can be neglected during training. The first conditional probability in the middle terms, q(x_{t-1} | x_t, x_0), is a tractable posterior distribution and is derived as follows. Given the clean data and the noisy data, predicting the less noisy data can be viewed as an interpolation between the clean and noisy data.
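
Using the diffusion kernel and Bayes' rule, this posterior works out to

```
q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde{\beta}_t \mathbf{I}\big),
```
```
\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)
= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0
+ \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t,
\qquad
\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t
```

Note that the posterior mean is exactly a linear combination of the clean data x₀ and the noisy data x_t, which is the interpolation described above.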

The KL divergence in the middle terms has a simple form since both conditional probabilities are Normal distributions.
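
Because both are Normal distributions with fixed variances, each middle term reduces to a weighted l2 distance between the means (C is a constant independent of the model parameters):

```
L_{t-1} = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\,\big\|\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t, t)\big\|^2\right] + C
```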

Recall that the sample at time step t can be computed with the diffusion kernel, so the above equation can be rewritten as follows.
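
Writing x_t via the diffusion kernel and letting the network predict the noise, i.e. parameterizing μ_θ through a noise-prediction model ε_θ, gives

```
\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big),
```
```
L_{t-1} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}}\left[
\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t\,(1-\bar{\alpha}_t)}\,
\big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\big)\big\|^2
\right]
```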

The time-dependent weight of the l2 distance ensures that the training objective is weighted properly for maximum data-likelihood training. However, this weight is quite large for small time steps t. As a result, Ho et al. (NeurIPS 2020) observed that simply setting the weight to 1 improves sample quality. The training and sampling algorithms in DDPM are presented as follows.
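
As a rough illustration, here is a minimal PyTorch-style sketch of the two algorithms with the simplified weight-1 objective. The schedule values and the `model` interface (a noise-prediction network ε_θ(x_t, t) applied to 4-D image batches) are assumptions for illustration, not the tutorial's reference code.

```python
import torch

# Linear variance schedule and derived quantities (assumed hyperparameters).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def train_step(model, x0):
    """Training (Algorithm 1): sample t and noise, regress the noise."""
    t = torch.randint(0, T, (x0.shape[0],))              # uniform random time steps
    eps = torch.randn_like(x0)                           # target noise
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)              # broadcast over image dims
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # diffusion kernel in closed form
    return ((eps - model(x_t, t)) ** 2).mean()           # simplified objective (weight = 1)

@torch.no_grad()
def sample(model, shape):
    """Sampling (Algorithm 2): ancestral sampling from x_T ~ N(0, I)."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps_hat = model(x, torch.full((shape[0],), t))
        coef = betas[t] / (1 - alphas_bar[t]).sqrt()
        # mu_theta(x_t, t), plus sigma_t * z with sigma_t^2 = beta_t
        x = (x - coef * eps_hat) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x
```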

For the implementation, diffusion models often employ U-Net architectures. The network predicts the noise given the noisy image and a time representation, which often uses sinusoidal positional embeddings or random Fourier features.
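
For instance, a minimal sketch of a sinusoidal time embedding (the 10000 base and sin/cos layout follow the Transformer convention; treat the details as assumptions rather than the exact embedding of any particular model):

```python
import math
import torch

def timestep_embedding(t, dim):
    """Sinusoidal embedding of time steps.

    t: (batch,) tensor of integer time steps; dim: embedding size (even).
    """
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/10000.
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = t.float()[:, None] * freqs[None, :]            # (batch, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)  # (batch, dim)
```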

Recall that the noisy data is sampled using the diffusion kernel. If we apply the Fourier transform to the diffusion kernel and observe the response in the frequency domain, it turns out that most images have a very high response at low frequencies. For small time steps t, the noise does not perturb the low-frequency content of the image, but it does perturb the high-frequency content. On the other hand, for large time steps t, all frequency content of the image is pushed down by the weight √ᾱ_t, so the low-frequency content of the image is easily perturbed by the noise.

Hence, there is a tradeoff between content and detail: the denoising model specializes in generating high-frequency content (low-level details) at small time steps t, and low-frequency content (coarse content) at large time steps t.

Connection to VAEs

Diffusion models can be considered a special form of hierarchical VAEs. However, in diffusion models:

  • The encoder is fixed
  • The latent variables have the same dimension as the data
  • The denoising model is shared across different time steps
  • The model is trained with a reweighted variational bound

What makes a good generative model?

There is a generative learning trilemma among fast sampling, mode coverage/diversity, and high-quality samples. Generative adversarial networks (GANs) lack mode coverage, while likelihood-based models such as VAEs and normalizing flows sacrifice sample quality. The drawback of diffusion models is slow sampling, since the denoising process requires many iterations. Accelerating diffusion models would tackle the trilemma, but how can we speed them up?

How to accelerate diffusion models?

A naive acceleration method is to reduce the number of time steps in training or sampling, but this usually leads to worse performance. A question immediately pops up: does the process have to be Markovian?

Song et al. (ICLR 2021) design a family of non-Markovian diffusion processes and corresponding reverse processes. Under this new definition of the diffusion process, the model can be optimized with the same surrogate objective as the original diffusion model. Therefore, a pre-trained diffusion model can be reused with a wider choice of sampling procedures.

The forward process and corresponding reverse process are derived as follows.
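
Using ᾱ_t as defined earlier (the DDIM paper writes α_t for this quantity), the reverse update is

```
\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}
\underbrace{\left(\frac{\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}\right)}_{\text{predicted } \mathbf{x}_0}
+ \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\;\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)
+ \sigma_t\,\boldsymbol{\epsilon}_t
```

Setting σ_t = 0 makes the sampler deterministic (DDIM), and the update can be applied over a short subsequence of time steps, which is what accelerates sampling.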

Conclusion

Here we presented two diffusion-based models that are commonly used nowadays: DDPM and DDIM. Many topics from the CVPR tutorial are not covered in this article; see here to look into the rest of the tutorial.

References

[1] Lil’Log, “What are Diffusion Models?” https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
