A Friendly Introduction to Denoising Diffusion Probabilistic Models

Antony M. Gitau
9 min read · Jul 9, 2023

I recently attended a Nordic probabilistic AI school, ProbAI 2023, which inspired my interest in generative models. I'm building an understanding as I document and share my learning in this exciting space of “writing” computers.

In this first of a series of short write-ups on denoising diffusion probabilistic models (DDPMs), I want to demystify DDPMs by giving general context about this class of models, followed by a vivid example with easy-to-grasp maths formulas.

What are DDPMs?

They are a class of generative models that work by iteratively adding noise to an input signal (like an image, text, or audio) and then learning to denoise from the noisy signal to generate new samples. Huh, let's break that statement down and then give a step-by-step example of the process.

Generative models

Generative models are a type of model that can generate new data instances. Traditionally, machine learning models have done a good job of learning differences in data and then performing prediction or classification tasks. For example, a model trained on a digits dataset like MNIST can recognize a 0 from a 1. Generative models, on the other hand, learn the distribution of digits and can create a “fake digit” which closely resembles a real digit.

Fig 1. From a Machine Learning course by Google

From Figure one above, you can observe that the earlier class of models, trained to predict the class label of an input based on its features (aka a discriminative model), tries to differentiate a 0 from a 1 without really caring about the data space. The discriminative model just draws a line between the two classes to mark the difference.

In contrast, generative models try to model how data is placed throughout the space in order to generate new samples that fall close to the real digits. This is what we shall be aiming to do with DDPMs: learn the distribution of a training data sample, then generate a new sample that closely resembles it. One benefit of understanding the data distribution is that it gives us a notion of uncertainty in our prediction or classification tasks, which is valuable for building robust intelligent systems. There are more benefits of generative models; some are highlighted in this 14-minute video, and you can find many more on the internet.

Types of Generative Models

Before we dive deeper into DDPMs, let’s take a step back to appreciate different techniques developed to tackle the task we just outlined earlier — training models that can learn data distribution and then generate new samples.

Fig 2. By Cosmia Nebula

The chart in Figure two above shows two main types of generative models that can learn via the principle of maximum likelihood, as described in the NIPS 2016 tutorial article, Generative Adversarial Networks.

  1. The explicit density models define a density function (aka a probability distribution) over the data. This means that the model can calculate the probability of any given data point, which can be used to generate new data points that are likely to be similar to the training data. Some of these densities are tractable and can be computed directly; others are intractable and require variational or Monte Carlo approximations to make the calculation feasible. A toy sketch of the explicit-density idea follows this list.
  2. Implicit density models do not explicitly define a probability distribution over the data. Instead, they provide a way of interacting with the probability distribution indirectly. For example, an implicit model like GAN learns by comparing new data points to the training data. So GANs learn to distinguish between data points that are likely to be drawn from the same distribution as the training data and data points that are not. However, implicit models can be more difficult to interpret because the model does not explicitly define a probability distribution, so it is not always clear how the model is generating new data.
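To make the explicit-density idea concrete, here is a toy sketch (my own illustration, not from any of the cited works) that fits a one-dimensional Gaussian to synthetic data by maximum likelihood, then scores and samples points. Real explicit-density models do the same thing with far more flexible densities.

import torch

# toy illustration of an explicit density model: fit a 1-D Gaussian to
# synthetic "training data" by maximum likelihood (closed form for a Gaussian)
data = torch.randn(1000) * 2.0 + 5.0
mu, sigma = data.mean(), data.std()
model = torch.distributions.Normal(mu, sigma)

# because the density is explicit, we can score any data point...
print(model.log_prob(torch.tensor(5.0)))
# ...and generate new samples that fall close to the training data
print(model.sample((3,)))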
Fig 3. From a Deep Learning Course at VU University of Amsterdam
Fig 4. From Lilian Weng's Blog

Generative models have recently become very popular due to advances in training, sampling, and inference techniques. Figures three and four above show two taxonomies of generative models.

With an appreciation of generative models, let's now get into some actual experimentation on diffusion models, popularized by Jonathan Ho’s DDPM paper. I also found an interesting story about how DDPMs were inspired by the physics principle of nonequilibrium thermodynamics, which governs phenomena like the spread of fluids and gases.

Let's implement a simple diffusion and denoising experiment on some handwritten digits.

Jonathan Ho proposed a two-step process: a noising step (the diffusion process) and a denoising step (the reverse process), as we mentioned earlier under what DDPMs are.

Diffusion process

So, we have an original input signal of a digit from the MNIST dataset, as shown in the square grid below in Figure five. The digits are high-dimensional, meaning they have a complex distribution, as illustrated in Figure six. Ho proposed that we reduce this complexity by adding noise to the digits until the structure of the data distribution becomes simple Gaussian noise that is easy to sample from. This process of systematically and slowly destroying the structure in the original input data distribution is called the diffusion process.

Fig 5. Image by Author
Fig. 6. From CVPR Tutorial on DDPM
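Before the full implementation, here is a minimal sketch (my own illustration, with an illustrative beta value) of a single noising step: scale the current image down slightly and mix in fresh Gaussian noise. Repeating this many times drives the data towards pure noise.

import torch

# one step of the diffusion process: shrink the signal a little and add noise
# (beta here is just an illustrative value from a typical schedule)
beta = 0.02
x = torch.randn(1, 1, 28, 28)      # stand-in for a normalized MNIST digit
noise = torch.randn_like(x)
x_noisier = (1 - beta) ** 0.5 * x + beta ** 0.5 * noise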

Implementation of the diffusion process

Fig 7. Image by Author

To get the noisy digits from the original digits as illustrated in Figure seven, we needed three pieces:

  • Defining the diffusion steps. Simply, that means how many times we are going to add noise to the original image. This included the minimum and maximum values of the noise scheduler (the betas), that is, the smallest and largest amounts of noise we can add to the input data. In the original work by Ho et al., the betas are linearly spaced from 0.0001 to 0.02 over 1000 diffusion steps, and that is what we also used (see the schedule sketch after this list).
  • Generating Gaussian noise of the same shape as the input data and adding it iteratively, as shown in Figure seven, over “t” diffusion timesteps.
  • Sampling from the Gaussian using the formula shown in Figure eight and visualizing the samples.
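As a reference, a minimal sketch of that linear schedule and the derived quantities the code below relies on (the variable names are ours) could look like this:

import torch

# linear noise schedule from Ho et al.: 1000 steps, betas from 0.0001 to 0.02
diff_steps = 1000
betas = torch.linspace(0.0001, 0.02, diff_steps)

# alphas and their cumulative products (the "alpha bars" used in Figure 8)
alphas = 1 - betas
alpha_bars = torch.cumprod(alphas, dim=0)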

Below is a draft of the code implementation; the diffusion kernel (aka the diffusion sampling formula) is shown in Figure eight.

def forward_diffusion_process(self, x_0, t, eta=None):
    """
    A skeleton of what our forward diffusion function resembled.

    Args:
        x_0: original input images, of shape (n, c, h, w)
        t: the diffusion timestep(s) at which to sample
        eta: Gaussian noise used to perturb the input images

    Returns:
        the diffused samples
    """
    # extract the number of samples, channels, height, and width
    n, c, h, w = x_0.shape

    # get the cumulative alpha products (alpha bars) at timestep t
    a_bar = self.alpha_bars[t]

    # generate Gaussian noise of the same shape as the input if none was given
    if eta is None:
        eta = torch.randn(n, c, h, w).to(self.device)

    # apply the sampling formula (the diffusion kernel) shown in Figure 8 below
    noisy = a_bar.sqrt().reshape(n, 1, 1, 1) * x_0 + (1 - a_bar).sqrt().reshape(n, 1, 1, 1) * eta

    # return the diffused images
    return noisy

def visualized_forward_process(ddpm, loader, device):
    """
    Show images from the first training batch at increasing noise levels.

    Args:
        ddpm: the instance of the denoising diffusion model
        loader: the training data loader
        device: the device (e.g. a GPU) used for training

    Displays the noisy images at 25%, 50%, 75%, and 100% of the diffusion steps.
    """
    for batch in loader:
        images = batch[0]
        for noise_fraction in [0.25, 0.5, 0.75, 1]:
            # diffuse every image in the batch to the chosen timestep
            timesteps = [int(noise_fraction * ddpm.diff_steps) - 1 for _ in range(len(images))]
            show_images(
                ddpm(images.to(device), timesteps),
                f"Noisy images {int(noise_fraction * 100)}%",
            )
        break  # we only visualize the first batch
Fig. 8. From CVPR Tutorial on DDPM
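To tie the two functions together, a hypothetical usage sketch (assuming `ddpm` is an instance of a model class exposing the method above, and `loader` yields batches of MNIST images) might be:

# hypothetical usage: diffuse one batch of digits halfway through the schedule
images = next(iter(loader))[0]
t = [ddpm.diff_steps // 2] * len(images)   # timestep 500 out of 1000
noisy = ddpm.forward_diffusion_process(images.to(ddpm.device), t)
print(noisy.shape)                          # same shape as the input batch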

Reverse Process

Now that we have a simple distribution from the diffusion process, we can then learn a reverse process of the diffusion process that restores structure in data and yields a highly flexible and tractable generative model of the data as illustrated in Figure nine.

In this reverse process, we put several pieces together to generate the noisy samples shown in Figure 10; Figure 11 is a GIF of Figure 10. The samples are noisy because we sampled without training the sampler to denoise, that is, without teaching it how much noise to remove at each step to generate realistic digits. We will train the model and show the results in the next blog. Hopefully, if things go well with training, we'll be able to see more realistic digits.

Fig. 9. From CVPR Tutorial on DDPM
Fig 10. Image by Author
Fig 11. Image by Author

Implementation of the reverse process

To generate realistic samples such as those illustrated earlier in Figure nine, we needed three pieces: Gaussian distribution parameterization, model architecture, and training. We will not discuss training in this blog.

Model architecture: Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent the denoising network. For example, the Ho et al. paper used a U-Net based on a Wide ResNet with four feature-map resolutions, two convolutional residual blocks per resolution level, and self-attention blocks.

We created a custom U-Net network with 3 down-sampling blocks, a bottleneck in the middle, and 3 up-sampling blocks with skip concatenations, drawing inspiration from Brian Pulfer’s work. We also used sinusoidal positional embeddings to represent time.
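For the time representation, a common form of sinusoidal positional embedding (a sketch of the general technique, with an illustrative dimension; our actual network's details may differ) looks like this:

import torch

def sinusoidal_embedding(timesteps, dim=100):
    # frequencies decay geometrically across the embedding dimensions,
    # as in the positional encodings from "Attention Is All You Need"
    freqs = 10_000 ** (-torch.arange(0, dim, 2).float() / dim)
    args = timesteps.float().unsqueeze(1) * freqs.unsqueeze(0)
    # half the dimensions get a sine, the other half a cosine
    return torch.cat([torch.sin(args), torch.cos(args)], dim=1)

# e.g. embed timesteps 0, 500, and 999 into 100-dimensional vectors
emb = sinusoidal_embedding(torch.tensor([0, 500, 999]))
print(emb.shape)  # torch.Size([3, 100])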

Gaussian distribution parameterization: We just need the model to predict the distribution’s mean and standard deviation given the noisy image and the timestep. Ho et al. predicted only the mean of the Gaussian and kept the variance fixed, and that is what we did as well.

During the forward process, we use the beta values to control the variance of the forward diffusion; during the reverse denoising process, we use the sigma shown in the denoising formula in Figure nine. With a linear schedule, the betas are often set equal to sigma squared, as we show in the code draft below.

Fig 12. From Ho et al DDPM paper
def reverse_process(ddpm, n_samples=64, c=1, h=28, w=28, device=None):
    """
    A draft of our reverse process; it's a more detailed outline of the
    pseudocode in Figure 12.

    Args:
        ddpm: instance of the denoising diffusion model
        n_samples: number of samples to be generated
        device: the device used for sampling
        c, h, w: number of channels, the height, and the width of the images

    Returns:
        newly generated samples
    """
    # start from a tensor of pure noise with the same shape as the input images
    x = torch.randn(n_samples, c, h, w).to(device)

    # loop through all timesteps, from the noisiest image back towards the data:
    # estimate the noise to be removed, apply the partial denoising formula by
    # subtracting the scaled estimate, then add back a little fresh noise
    for t in range(ddpm.diff_steps)[::-1]:
        time_tensor = (torch.ones(n_samples, 1) * t).to(device).long()

        # the U-Net predicts the noise, from which the mean of the Gaussian is
        # derived; `ddpm.network` is a placeholder name, as we won't show the
        # U-Net implementation in this blog
        eta_theta = ddpm.network(x, time_tensor)

        # controls how much noise is removed from the image at this step
        alpha_t = ddpm.alphas[t]

        # cumulative product of the alphas, used to scale the predicted noise
        alpha_t_bar = ddpm.alpha_bars[t]

        # denoise the image using the formula in the pseudocode in Figure 12
        x = (1 / alpha_t.sqrt()) * (x - (1 - alpha_t) / (1 - alpha_t_bar).sqrt() * eta_theta)

        # add some noise back to the image if the timestep is greater than 0
        if t > 0:
            z = torch.randn(n_samples, c, h, w).to(device)

            # sigma_t squared = beta_t, as described in the Gaussian
            # parameterization paragraph above
            beta_t = ddpm.betas[t]
            sigma_t = beta_t.sqrt()

            # add the noise in a Langevin-dynamics-like fashion
            x = x + sigma_t * z

    return x

# To visualize the generated samples, we used imageio. You can check how
# Brian Pulfer implemented the visualization.
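Once the pieces above are in place, generating the (for now, untrained) samples in Figure 10 is a single call; a hypothetical usage example:

# hypothetical usage: generate 64 samples from the (still untrained) model
samples = reverse_process(ddpm, n_samples=64, device=torch.device("cpu"))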

Conclusion

We have introduced the idea of generative models, including highlighting the popular classes of deep generative models: GANs, VAEs, and flow-based models. We also touched on the importance of generative models. Then we focused on DDPMs, defining how they operate and showing samples of digits that we diffused and denoised using PyTorch, a deep-learning framework. We did not include many of the mathematical aspects of the two processes, nor did we train the reverse process of the DDPM model.

As next steps, we will show the significant difference that training the reverse process makes to the quality of generated samples. We also aim to cement our understanding of how these models work by diving deeper into the mathematical proofs.

References

I’m grateful for the work of the other people whose writing and code helped me achieve this implementation.
