Denoising Diffusion-Based Generative Modeling
#CVPR2022 Tutorial “Unofficial” Minutes
The following essay represents an unofficial set of meeting minutes, recording (at a high level) the discussions presented at the IEEE Computer Vision and Pattern Recognition conference tutorial Denoising Diffusion-based Generative Modeling: Foundations and Applications. (Yes, I know that is a mouthful; it took me like two whole minutes to type that sentence.) I recorded these notes as an attendee and, given how outstanding the workshop turned out, wanted to share them with the community as a form of contribution to online attendees. If any of the speakers desire that I withdraw this content, please feel free to contact me as you see fit. Please note that the images are included by permission of the original presenters. These image excerpts are of slides by those authors, as are a majority of the talking points presented herein, which in a few cases were recorded nearly verbatim and in other cases vastly abbreviated. All errors in grammar and content are by this author; all material contributions are by the original workshop presenters or prior work. We'll note a few citations adjacent to slide excerpts; for additional relevant citations, a link to the tutorial home page is provided below.
Yeah so fine print complete, presented here are the unofficial meeting minutes for your erudition and amusement. Enjoy.
Denoising Diffusion-based Generative Modeling: Foundations and Applications
Originally presented in live presentations by Karsten Kreis, Ruiqi Gao, and Arash Vahdat (affiliations with Nvidia and Google AI)
A general way to think about generative models is that the training data follows some underlying distribution, and in generation we attempt to sample from that distribution to produce novel compositions that adhere to the distributions of the training data. Extensions of generative modeling may include representation learning, in which semantic features are extracted with only limited labeling. Generative models may also serve as artistic tools.
Denoising diffusion models (DDMs) are a new framework that will likely revolutionize generative deep learning in the near future. As we'll describe further below, denoising diffusion models may enable higher tiers of "super-resolution," and they are at the foundation of emerging text-to-image platforms, as demonstrated by OpenAI's "DALL-E 2" or, arguably even more impressively, by Google's "Imagen" models.
Part 1 — Denoising Diffusion Probabilistic Models
Derived from presentation by Arash Vahdat
In a broad sense, the training of denoising diffusion models follows a forward and backward noise ablation process. In the forward “diffusion” process, noise is gradually added to input training images. In a reverse denoising process, the model learns to generate data by way of sequentially and iteratively denoising to recover an unobscured form.
Traditionally, at every noise-adding step, the noise is drawn from a Gaussian distribution, with the mean rescaled toward the preceding version of the image and the variance set to some very small value.
Because this type of translation can be considered a Markov process, one can derive a joint distribution of the collective set of progressively noised images based on the conditionals from each step, which in each case follow a Gaussian progression. One could even form a reparameterization to skip forward between noise-adding steps if desired, by framing the result as a product of the input distribution and a diffusion kernel, where the kernel itself is Gaussian.
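That skip-ahead reparameterization can be sketched in a few lines. Here is a minimal NumPy illustration; the linear beta schedule and the toy four-pixel "image" are this author's own assumptions for demonstration, not from the tutorial:

```python
import numpy as np

def diffuse(x0, t, betas, rng=np.random.default_rng(0)):
    """Sample x_t directly from x_0 using the closed-form diffusion kernel
    q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I),
    skipping all the intermediate noise-adding steps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)[t]      # product of (1 - beta_s) up through step t
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * noise

# A linear beta schedule over 1000 steps (a common choice in the literature)
betas = np.linspace(1e-4, 0.02, 1000)
x0 = np.ones(4)                            # a toy "image" of four pixels
xt = diffuse(x0, t=999, betas=betas)       # at t=999, alpha_bar is near zero,
                                           # so x_t is approximately pure noise
```

Because alpha_bar shrinks toward zero as t grows, the sampled x_t interpolates from the clean image to pure Gaussian noise, matching the forward process described above.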
As demonstrated here, each additional noise application smooths the input distribution further, asymptotically approaching the point where the diffused data distribution matches the noise distribution.
The application of sampled noise in this manner is very computationally efficient, especially since training doesn't require walking through the intermediate stages one by one; they can be recovered directly when needed.
At face value, a denoising distribution should be considered intractable. One of the tricks is that it becomes tractable when the forward noising passes are applied with a very small variance, which, with further assumptions we'll note below, allows the denoising distribution itself to be approximated as Gaussian as well.
Thus denoising is just trying to predict the mean image corresponding to a noisier version of itself. This is conducted by training an architecture referred to as a "U-Net," in which embeddings of time and positional representations are presented alongside ResNet blocks and self-attention layers, trained against labels derived from the corresponding noise sampling. The trained noise-prediction network is then repurposed to represent the mean of the denoising model for generative synthesis.
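As a rough sketch of that training objective, here is one Monte Carlo sample of the simplified noise-prediction loss. The `noise_predictor` function below is a hypothetical placeholder standing in for the actual U-Net, which this author has not reproduced:

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_predictor(x_t, t):
    """Stand-in for the U-Net epsilon-predictor; a real model would embed t
    and run ResNet / self-attention blocks. Here: a trivial placeholder."""
    return np.zeros_like(x_t)

def ddpm_loss(x0, betas):
    """One Monte Carlo sample of the simplified DDPM objective:
    E_{t, eps} || eps - eps_theta(x_t, t) ||^2."""
    T = len(betas)
    t = rng.integers(T)                        # draw a random timestep
    alpha_bar = np.cumprod(1.0 - betas)[t]
    eps = rng.standard_normal(x0.shape)        # the noise to be predicted
    x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return np.mean((eps - noise_predictor(x_t, t)) ** 2)

betas = np.linspace(1e-4, 0.02, 1000)
loss = ddpm_loss(np.ones(4), betas)
```

In practice this loss would be backpropagated through a real network; the point here is only the structure of the objective: noise the image, ask the network for the noise, penalize the squared error.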
This process can also be described as a latent variable model, similar to what is learned in variational autoencoders (VAEs), as we are mapping data to another space to make denoising tractable. Since both noising and denoising are approximated as normal, evaluating the resulting KL divergence has a simplified form.
Another way to think about the forward diffusion process is through the lens of a Fourier transform: in the Fourier domain there are distinct responses associated with the input image versus the noise (recall that the Fourier transform of a Gaussian is itself Gaussian). In the forward noising process, high-frequency content is perturbed faster, and then in the denoising process each step specializes in recovering image features of progressively higher frequency.
Part 2 — Score-based Generative Modeling with Differential Equations
Derived from presentation by Karsten Kreis
Ok, fair warning: in part 2 we're going to get a little more theory-bound. Time to put on your reading glasses.
Consider our forward diffusion process, but this time approaching the limit of smaller and smaller noise scalings. We'll realize a framing for sampling in which we can consider the image after noise as a function of the preceding image, with sampling parameterized by beta times a time step size, i.e. β(t)·Δt, which can then go through a Taylor expansion. At the infinitesimal limit, we can replace Δt with dt, and eureka: we've established a differential equation, one that can be solved by a simple sampling and update rule, and thus we've framed our diffusion process as a stochastic differential equation (SDE).
Ok let’s back up for a second. How many of us remember differential equations, can I see a show of hands? Ok those that raised their hand can skip forward a few steps.
Consider an ordinary differential equation (ODE). If we don't know x(t), but do know dx/dt, in simple cases we may be able to achieve an analytical solution through integration. In practice it will usually be far too difficult to perform the integration analytically, but fortunately we can instead perform a numerical integration iteratively.
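For those who didn't raise their hands, a minimal fixed-step Euler integrator looks like this. The example ODE dx/dt = -x is this author's own choice, picked because it has the known analytical solution x(t) = x(0)·e^(-t):

```python
def euler_ode(f, x0, t0, t1, n_steps):
    """Numerically integrate dx/dt = f(x, t) with fixed-step Euler updates:
    repeatedly nudge x along the known derivative."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + f(x, t) * dt
        t += dt
    return x

# dx/dt = -x, starting at x(0) = 1, integrated to t = 1:
# the numerical result should land close to exp(-1) ≈ 0.3679
x1 = euler_ode(lambda x, t: -x, x0=1.0, t0=0.0, t1=1.0, n_steps=10000)
```

Shrinking the step size trades compute for accuracy, which foreshadows the sampling-speed discussion in part 3.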
Extending to a stochastic differential equation, we can frame the update as composed of two additive terms: a drift coefficient and a separate diffusion coefficient that injects noise at each time step. Through iterations, the drift term will "pull" updates toward the underlying model, while the diffusion term will inject stochasticity. This drift/diffusion additive framing is actually a special case of more generalized stochastic differential equation framings.
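That drift-plus-diffusion update is commonly discretized with the Euler-Maruyama scheme: a deterministic drift step plus Gaussian noise scaled by the square root of the step size. The Ornstein-Uhlenbeck toy process below is this author's own illustration, not from the slides:

```python
import numpy as np

def euler_maruyama(drift, diffusion, x0, t0, t1, n_steps,
                   rng=np.random.default_rng(0)):
    """Simulate dx = drift(x, t) dt + diffusion(t) dW by iterating
    a drift update plus Gaussian noise scaled by sqrt(dt)."""
    x, t = x0, t0
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + drift(x, t) * dt + diffusion(t) * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return x

# Ornstein-Uhlenbeck process: the drift pulls x toward 0
# while the diffusion term keeps jittering it
x1 = euler_maruyama(drift=lambda x, t: -x, diffusion=lambda t: 0.5,
                    x0=2.0, t0=0.0, t1=5.0, n_steps=5000)
```

Run many trajectories and the samples settle into a stationary distribution around zero, which is exactly the behavior the forward diffusion SDE exploits: whatever the starting image, the process converges to the noise prior.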
We thus have an SDE framing for the forward noise-injection process, but does this still translate to the reverse direction of denoising? The speaker considered it an amazing result by Song et al. that in fact it does: simply adding a "score function" to the drift term realizes a noise-sampling-based data generation process for denoising.
That does leave us with the question of how we derive that score function. One naive idea could be to train a score-function neural network; unfortunately the score of the marginally diffused density, q_t(x_t), is on its own not a tractable distribution. The preferred method is known as "denoising score matching," with a small and important difference: by instead framing a conditional distribution q_t(x_t|x_0) with respect to individual data points x_0, a neural network model can be derived. And funny enough, after this and another few steps not shown, the resulting expectations for the conditional end up approximating the full marginal, the same one we just considered intractable. How about that?
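To make the conditional concrete: since q_t(x_t|x_0) is Gaussian, its score has a closed form, and that closed form is the regression target in denoising score matching. A small NumPy sketch, with the particular alpha-bar value being this author's own toy choice:

```python
import numpy as np

def conditional_score(x_t, x0, alpha_bar):
    """Score of the tractable conditional
    q_t(x_t | x_0) = N(sqrt(alpha_bar) * x0, (1 - alpha_bar) * I):
        grad_{x_t} log q = -(x_t - sqrt(alpha_bar) * x0) / (1 - alpha_bar).
    Denoising score matching regresses a network s_theta(x_t, t) onto this."""
    return -(x_t - np.sqrt(alpha_bar) * x0) / (1.0 - alpha_bar)

rng = np.random.default_rng(0)
x0 = np.ones(4)
alpha_bar = 0.5
eps = rng.standard_normal(4)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
target = conditional_score(x_t, x0, alpha_bar)
# Note the target equals -eps / sqrt(1 - alpha_bar): score matching and
# noise prediction are the same objective up to a per-timestep scale factor.
```

That final identity is why the "score network" of part 2 and the "noise-prediction U-Net" of part 1 are, up to rescaling, the same model.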
(As a further explanation in support of this proof, the speaker was going pretty quickly here and I decided to take a few sips of coffee, so yeah didn’t follow along with everything, including those slides where he derived three different ways to implement score matching for stochastic differential equations. You know, man not a machine and etc.)
An important distinction in framing the denoising is the difference between SDE synthesis and a generative probability flow: the SDE implements a stochastic synthesis, while the probability flow is a deterministic type of synthesis. Translating to a deterministic form enables the use of advanced, faster ODE solvers and further opens the door to advanced use cases like semantic interpolation between images. This diffusion mode can be considered a kind of continuous normalizing flow, which makes it much more scalable (as in training on gigantic amounts of data), and the speaker noted several different types of ODE solvers that could be applied to the "continuous-time" framing, in which the diffusion models are essentially learning the gradients of an energy function.
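As a minimal sketch of the deterministic alternative: the probability flow ODE replaces the SDE's noise injection with a score-corrected drift. The VP-style drift, unit diffusion coefficient, and standard-normal score below are this author's own toy assumptions, chosen so the cancellation is easy to check by hand:

```python
def probability_flow_step(x, t, f, g, score, dt):
    """One Euler step of the deterministic probability flow ODE
        dx = [f(x, t) - 0.5 * g(t)**2 * score(x, t)] dt,
    which shares its marginal distributions with the corresponding SDE
    but follows a deterministic trajectory."""
    return x + (f(x, t) - 0.5 * g(t) ** 2 * score(x, t)) * dt

# Toy sanity check: with drift f = -x/2, diffusion g = 1, and data already
# at the standard-normal prior (whose score is -x), the drift and score
# terms cancel exactly, so the flow leaves samples where they are.
f = lambda x, t: -0.5 * x
g = lambda t: 1.0
score = lambda x, t: -x
x_next = probability_flow_step(1.3, 0.0, f, g, score, dt=0.01)
```

Determinism is what enables the advanced ODE solvers and the semantic interpolations mentioned above: the same starting noise always maps to the same image.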
(Traditionally, energy-based models are hard to train and require sampling via Langevin dynamics, but here we only require the gradients of the energy, not the energy itself, and so voila. For maximum performance a stochastic form of synthesis may still be preferred, though.)
Part 3 — Advanced Techniques in Diffusion Models: Accelerated Sampling, Conditional Generation, and Beyond
Derived from presentation by Ruiqi Gao
In comparing diffusion models to prior art for generative modeling, diffusion models demonstrate superior performance to generative adversarial networks with respect to mode coverage and diversity, and also outshine variational autoencoders with respect to sample quality. However, an important tradeoff can be identified, associated with the latency of synthesis sampling.
Several lines of inquiry are associated with identifying sampling speedups. How can we advance the forward diffusion? How can we accelerate denoising? Catching up with GANs would signal that diffusion models could be considered suitable for any generative application.
A naive way to speed up denoising could be to reduce the number of denoising time steps in synthesis (the current state of the art falls in the range of 4–10 steps); however, fewer time steps leads to degraded performance. The speaker noted several strategies that have been considered instead, including appending Fourier features to the U-Net input to improve log likelihood estimation, applying a non-Markovian forward diffusion process, applying an Euler method to the associated ODE, critically damped Langevin diffusion (aka fast-mixing diffusion), and momentum-based diffusion. (If you didn't understand a word of anything I just said, that makes two of us.)
To attempt speedups in diffusion, the speaker noted approaches like applying denoising diffusion GANs, distilling a trained diffusion model into a smaller one, or pre-training a VAE and running the diffusion in its latent space. (In each case the speaker offered citations of prior work that has explored these tactics in more depth.)
The speaker went into some depth on several of these advanced techniques that almost universally flew way way over this author’s head, so in their place I present to the reader here a Penrose triangle because it is pretty to look at.
Enough with the words. Let’s let the generative models speak for themselves. Presented here are a few representative demonstrations for your erudition and amusement.
That’s all folks.
Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W. (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 8780–8794. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper/2021/file/49ad23d1ec9fa4bd8d77d02681df5cfa-Paper.pdf.
Kreis, K., Gao, R., and Vahdat, A. Denoising diffusion-based generative modeling: foundations and applications, 2022. URL https://cvpr2022-tutorial-diffusion-models.github.io/.
Meng, C., He, Y., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Guided image synthesis and editing with stochastic differential equations, 2021. URL https://arxiv.org/abs/2108.01073.
Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents, 2022. URL https://arxiv.org/abs/2204.06125.
Saharia, C., Chan, W., Chang, H., Lee, C. A., Ho, J., Salimans, T., Fleet, D. J., and Norouzi, M. Palette: Image-to-image diffusion models, 2021. URL https://arxiv.org/abs/2111.05826.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS.