Understanding Diffusion through Denoising Diffusion Probabilistic Models (DDPM), Part 1

Luv Verma
8 min read · Apr 11, 2023


Welcome to Part 1 of the exciting blog series, where we’ll dive deep into the world of diffusion models — a cutting-edge class of generative models that are taking the AI landscape by storm!

In this introductory blog, we’ll unravel the fundamental concepts and terminologies used in the groundbreaking Denoising Diffusion Probabilistic Models (DDPM) paper. This foundation will be instrumental in understanding the advanced topics we’ll explore in the upcoming articles of this series. I hope that this blog sparks your curiosity and ignites your passion for this innovative field of AI research.

Denoising Diffusion Probabilistic Models (DDPM) is one of the papers that built on the foundations laid by the paper Deep Unsupervised Learning using Nonequilibrium Thermodynamics to establish a novel class of generative models based on a process of diffusion.

The process of diffusion refers to gradually diffusing noise (sampled randomly from the normal distribution) into an image until there is nothing left but noise. This is called the iterative forward diffusion process (Figure 1). Once the original image is completely noisy, a process of reverse diffusion (Figure 1) is carried out using a denoising objective. In simpler terms, the noise that was added during forward diffusion needs to be removed.

This is amazing, since unlike GANs (which have training difficulties due to the adversarial relationship between the generator and discriminator networks), there is a single, simple loss function (the denoising objective) that has to be minimized.

Figure 1: The forward arrow shows the forward diffusion process; the reverse arrow marks the reverse diffusion process (credits: CVPR 2022 Tutorial)

Let me reiterate: this denoising objective is a big deal because it eliminates the need for adversarial components. With this objective, there is a single denoising network, leading to a more efficient and manageable training process.

But, how the heck do you even denoise an image which is nothing but noise? How does that denoising objective look? How can we think of getting to such a denoising objective function?

In this post, I will try to answer some of the questions above related to the denoising objective. Please note that there are already many beautiful blogs about how to code it (by the authors themselves, and by Hugging Face), so this blog will focus on how to develop the intuition needed to get to the objective function.

Some basics before diving into the process of reaching the objective function:

Figure 2: Basics of Gaussian(Normal) distribution

  1. Gaussian Distribution and Noise (Figure 2): In Figure 2, N(μ, σ²) represents a Normal (Gaussian) distribution with mean μ and variance σ². The mean (μ) indicates the center of the distribution, while the variance (σ²) determines its spread or dispersion. A higher variance implies a greater spread in the noise values, while a lower variance indicates a tighter clustering of noise values around the mean. Equation 1 of Figure 2 represents the re-parametrization trick, with which random samples can be generated from the standard normal distribution. How? All thanks to ε, a random variable sampled from the standard normal distribution N(0, 1).
  2. Representing/fixing the variance by a linear equation: From Figure 2, we know that N(μ, σ²) represents a normal (Gaussian) distribution with mean μ and variance σ². Say σ²_start is the variance at the start of the diffusion process. There are T total diffusion time steps, and at every time step I want to increase the variance linearly. Can I do that? Of course. Represent the variance by a linear equation such as σ²(t) = σ²_start + (σ²_end - σ²_start) * (t / T), where t is the current diffusion step (0 ≤ t ≤ T) and σ²_end is the final variance at the end of the diffusion process. Can we say σ²(t) is known, since we are increasing it at a linear rate at each time step? Yes.
  3. Time steps: let t denote the current time step, t-1 the previous time step, and 0 the initial time. Why is this needed? In the forward diffusion process we are adding noise: at time step 0 we have the original image, we add some noise at time step 1, and so on. Assume the current time step is t.
  4. Forward diffusion process: transforming an original image into complete noise (Figure 1). As explained in point 3, assume the forward diffusion started at time 0 and the current/last iteration is time step t.
  5. Reverse diffusion process: reverse the noise and go back from a completely noisy image to the original image. From point 3, we know that the forward diffusion process has left us at time step t. Where do we go back to from there? Obviously to time step 0 (Figure 1). So in the reverse diffusion process, we will go back from time step t to time step 0.
  6. Two simple functions (Figure 3): the first is ‘q’, marked as number 1, and the second is ‘p’, marked as number 2 in Figure 3. What are these?
    Thinking behind the first function: Say we have the information about the image at time ‘t-1’. The first function, ‘q’, is defined for the forward process: it estimates the image at the current time step t, given the image at the previous time step (t-1). Isn’t that logical? If I know the image at time t-1 and I know what I am adding/diffusing into the image between each time step, then I know the image at the current time step. But how do we know what we are adding between each time step?

Figure 3: Definitions for forward and reverse diffusion process

Thinking behind the second function: Say we have the information about the image at time ‘t’. The second function, ‘p’, is defined for the reverse process: it estimates the image at time step t-1, given the image at time step t.

This seems illogical and mind-blowing at the same time. How the heck can I move towards removing the noise and generating an image that becomes more meaningful with each time step of the reverse process?
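Before going further, the two building blocks from points 1 and 2 can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the DDPM paper; the names (sample_gaussian, linear_variance, sigma2_start, sigma2_end) and the endpoint values are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian(mu, sigma2, size):
    """Re-parametrization trick (point 1): x = mu + sqrt(sigma2) * eps,
    where eps is drawn from the standard normal N(0, 1)."""
    eps = rng.standard_normal(size)
    return mu + np.sqrt(sigma2) * eps

def linear_variance(t, T, sigma2_start=1e-4, sigma2_end=0.02):
    """Linear schedule (point 2): variance grows from sigma2_start at t=0
    to sigma2_end at t=T."""
    return sigma2_start + (sigma2_end - sigma2_start) * (t / T)

# Draw 4 samples with the variance scheduled for the halfway step.
x = sample_gaussian(mu=0.0, sigma2=linear_variance(t=500, T=1000), size=4)
```

Because the schedule is a fixed function of t, σ²(t) is known at every step — nothing about it has to be learned.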

7. Definition of the first function (Figures 3 and 4): We said above that in the forward process we are adding noise sampled randomly from the normal distribution. In the DDPM paper, it is defined as below (Figure 4):

Figure 4: Definition of forward diffusion process in terms of Normal distribution

What the heck is β(t) in Figure 4? Why does it appear in both the mean and the variance of the forward diffusion process? In the DDPM paper, β(t) is the noise level (variance) at step t according to the noise schedule. Yes, as explained in point 2, noise can be scheduled; in the DDPM paper, it is scheduled linearly.

Explanation of the mean (Figure 4): the sqrt(1 - β(t)) factor in the mean indicates how the image at time step t-1 is scaled down before noise is added. When β(t) is small, the scaling factor sqrt(1 - β(t)) is close to 1, meaning the data remains mostly unchanged. As β(t) increases, the scaling factor decreases, causing the data to be scaled down more significantly. This scaling ensures that the added noise has a larger effect on the data as the diffusion process progresses.

Explanation of the variance (Figure 4): β(t) * I is the covariance matrix representing the amount of noise added to the data at step t. The noise is assumed to be isotropic, meaning it affects all dimensions equally, which is why it is multiplied by the identity matrix. The covariance determines the spread of the noise: a larger β(t) leads to a wider spread and a more significant impact on the data.

β(t) appears in both the mean and covariance of the forward diffusion process equation to control the scaling of the data and the amount of noise added at each step. By following the noise schedule β(t), the forward diffusion process gradually corrupts the original data with increasing levels of noise.
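One forward step as defined in Figure 4 can be sampled with the re-parametrization trick from point 1: x_t = sqrt(1 - β_t) * x_{t-1} + sqrt(β_t) * ε. Here is a minimal NumPy sketch; the toy 8×8 “image” and the schedule endpoints (1e-4 to 0.02, the values used in the DDPM paper) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_step(x_prev, beta_t):
    """One forward diffusion step:
    q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I),
    sampled via the re-parametrization trick."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
x = rng.standard_normal((8, 8))      # toy stand-in for an image
for t in range(T):                   # corrupt the image step by step
    x = forward_step(x, betas[t])
```

After enough steps, the repeated scaling by sqrt(1 - β_t) has shrunk the original signal toward zero, and x is statistically indistinguishable from pure Gaussian noise — exactly the endpoint of Figure 1.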

8. Definition of the second function (Figures 3 and 5): In the forward process, we added noise and defined the process with the help of the normal distribution. Carrying the same intuition to the reverse process, can we say that it can also be defined with the help of the normal distribution? (The answer is yes.) Let us represent it as shown below:

Figure 5: Definition of reverse diffusion process in terms of Normal distribution

However, as can be seen from Figure 5, we are not aware of the mean of this normal distribution/transition function (moving from the image at time t to t-1): the mean μ is unknown, while σ² is known (as explained in point 2).

Thus, all we have to do to figure out the reverse diffusion process represented by the function p (Figure 5) is to get μ. How can we do that? Let us invoke a neural network. What if we parametrize μ? Say I represent μ in the reverse diffusion process by the following function:

Figure 6: Formulation of mean in reverse diffusion process as a neural network

In Figure 6, f is a neural network that takes x_t and the time step t as inputs and produces the mean μ(x_t, t) as output. By using a neural network, the model can learn the complex, nonlinear transformations of the noisy data that are necessary to recover the original data. Therefore, in DDPM, the authors parametrized μ, and thus the equation in Figure 5 can be re-written as:

Figure 7: Definition of reverse diffusion process in terms of parametrized Normal distribution

Thus, I will put a neural network between time step t and time step t-1 in the reverse diffusion process (Figure 8). In DDPM, a U-Net backbone is used to represent the reverse process (Figure 8).

Figure 8: Visual representation of how reverse diffusion works. Reduction in noise from time t to time t-1.
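The reverse step of Figures 7 and 8 can be sketched the same way. A real DDPM uses a trained U-Net for f; here an untrained placeholder function stands in for it, just to show where the network sits in the sampling loop. All names below (f, reverse_step, sigma2) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x_t, t):
    """Placeholder for the neural network (a U-Net in DDPM) that predicts
    the mean mu(x_t, t) of p(x_{t-1} | x_t). Untrained stand-in:
    just shrink x_t slightly instead of actually denoising."""
    return 0.99 * x_t

def reverse_step(x_t, t, sigma2_t):
    """One reverse diffusion step:
    p(x_{t-1} | x_t) = N(mu(x_t, t), sigma2_t * I),
    with mu produced by the network and sigma2_t known from the schedule."""
    mu = f(x_t, t)
    eps = rng.standard_normal(x_t.shape)
    return mu + np.sqrt(sigma2_t) * eps

T = 1000
sigma2 = np.linspace(1e-4, 0.02, T)  # known variance schedule (point 2)
x = rng.standard_normal((8, 8))      # start from pure noise at step T
for t in reversed(range(T)):         # walk back from step T to step 0
    x = reverse_step(x, t, sigma2[t])
```

With a trained network in place of f, this exact loop is what turns pure noise back into an image — which is why everything now hinges on how f is trained, i.e., on the objective function.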

…since this is getting long, the derivation of the objective function is continued in Part 2 and Part 3.

(link to part 2)…

(link to part 3)…

If you like it or find it useful, please clap and share.
