Summarizing the Evolution of Diffusion Models: Insights from Three Research Papers

Jaskaran Bhatia
5 min read · Jul 13, 2023

--

In the world of machine learning, generative models have been making waves with their ability to create new data instances that resemble the training data. Among these, diffusion models have emerged as a powerful class of generative models. In this blog post, we will summarize the findings of three key research papers that have contributed significantly to the evolution of diffusion models.

This blog does not cover any mathematical details; instead, it gives you an understanding of how diffusion models evolved through a brief overview of the three research papers below. Read this overview to build a basic understanding of diffusion models, then try reading the papers in the order given. This will help you get started with diffusion models and generative AI in computer vision.

Denoising Diffusion Probabilistic Models by UC Berkeley

The first paper lays the groundwork for modern diffusion models. It builds on the idea of using a random walk, or diffusion process, in the space of data instances to transform simple random noise into a complex data distribution. This transformation is guided by a neural network, which is trained to gradually shape the noise into the target data distribution.

The training process of a diffusion model involves two key steps:

  • Forward process: The model starts with the original data and gradually adds noise to it, diffusing the data until it becomes indistinguishable from a simple Gaussian distribution.
  • Reverse process: The model starts with Gaussian noise and gradually refines it, step by step, to generate a sample that resembles the original data. The model learns to predict the next step in the reverse process based on the current state, effectively learning to undo the diffusion process (a minimal sketch of both steps follows this list).
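
Concretely, the forward process admits a closed form for jumping to any noise level, while the reverse process is a learned, step-by-step denoiser. Below is a minimal PyTorch-style sketch of both; the linear beta schedule and the eps_model noise-prediction network are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the DDPM forward (noising) and reverse (denoising) steps.
# The linear beta schedule and the `eps_model` network are illustrative
# assumptions, not the exact configuration used in the paper.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # noise schedule beta_1 .. beta_T
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products alpha_bar_t

def forward_diffuse(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in closed form: the data noised up to step t."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

@torch.no_grad()
def reverse_step(eps_model, x_t, t):
    """One denoising step of p(x_{t-1} | x_t), using the network's noise prediction."""
    eps = eps_model(x_t, t)                  # network predicts the noise added at step t
    beta, alpha, a_bar = betas[t], alphas[t], alpha_bars[t]
    mean = (x_t - beta / (1 - a_bar).sqrt() * eps) / alpha.sqrt()
    if t == 0:
        return mean
    return mean + beta.sqrt() * torch.randn_like(x_t)   # add fresh noise except at the last step
```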

The authors also describe the training procedure for the diffusion model. It is trained with stochastic gradient descent to optimize a variational bound on the log-likelihood, which in practice reduces to a simple objective: the neural network learns to predict the noise that was added to a sample at a randomly chosen step, and its parameters are updated iteratively to minimize the discrepancy between generated samples and the original data.
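
One training step under this simplified objective, reusing forward_diffuse and the noise schedule from the previous snippet (the optimizer and eps_model are again assumptions), might look like this:

```python
# Sketch of one step of the simplified DDPM training objective (noise-prediction MSE).
# Reuses T and forward_diffuse from the previous snippet.
import torch
import torch.nn.functional as F

def training_step(eps_model, optimizer, x0):
    t = torch.randint(0, T, (x0.shape[0],))       # random timestep for each sample
    noise = torch.randn_like(x0)                  # the noise the network must recover
    x_t = forward_diffuse(x0, t, noise)           # noised input at step t
    loss = F.mse_loss(eps_model(x_t, t), noise)   # simplified DDPM loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```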

Diffusion Models Beat GANs on Image Synthesis by OpenAI

The second paper, titled “Diffusion Models Beat GANs on Image Synthesis” by OpenAI, builds upon the foundational concepts of diffusion models and introduces a new method for improving the quality of generated samples.

The paper begins by comparing the computational requirements of the proposed models with those of other generative models such as StyleGAN2 and BigGAN-deep. The authors show that their models can achieve better Fréchet Inception Distance (FID) scores, a measure of the quality of generated samples, with a similar computational budget.

The authors then introduce the concept of a conditional diffusion process, which allows for conditional sampling with a transition operator proportional to the product of two terms: the approximation of the unconditional reverse process and the label distribution for a noised sample. This approach allows the model to generate samples that are more closely aligned with the desired class.

They also developed a technique that uses the gradients of a classifier, trained on noised images, to guide the diffusion model during sampling. A single hyperparameter, the scale applied to the classifier gradients, can be tuned to trade off diversity for fidelity.
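
In code, classifier guidance amounts to shifting the mean of each reverse step in the direction that makes the classifier more confident in the target class, scaled by the guidance hyperparameter. The sketch below reuses the schedule from the first snippet; the classifier interface and the use of beta_t as the step variance are assumptions for illustration.

```python
# Sketch of a classifier-guided reverse step. The guided transition is proportional to
# p(x_{t-1} | x_t) * p(y | x_{t-1}); in practice the denoising mean is shifted by the
# gradient of log p(y | x_t) evaluated at the current noised sample.
import torch

def guided_reverse_step(eps_model, classifier, x_t, t, y, guidance_scale=1.0):
    # Unconditional mean of p(x_{t-1} | x_t), as in the earlier reverse_step sketch.
    eps = eps_model(x_t, t)
    beta, alpha, a_bar = betas[t], alphas[t], alpha_bars[t]
    mean = (x_t - beta / (1 - a_bar).sqrt() * eps) / alpha.sqrt()

    # Gradient of log p(y | x_t) with respect to x_t, from a classifier trained on noised images.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[range(len(y)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]

    # Larger guidance_scale pushes samples toward class y: higher fidelity, lower diversity.
    mean = mean + guidance_scale * beta * grad
    if t == 0:
        return mean
    return mean + beta.sqrt() * torch.randn_like(x_t)
```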

The results demonstrated that:

  • Diffusion models can obtain better sample quality than state-of-the-art GANs.
  • On class-conditional tasks, the scale of the classifier gradients can be adjusted to trade off diversity for fidelity.
  • Integrating guidance with upsampling enables further enhancement of sample quality for conditional image synthesis at high resolutions.

High-Resolution Image Synthesis with Latent Diffusion Models

The developers of Stable Diffusion set out to address the high computational cost of training and the expensive inference of diffusion models (DMs), which were already known for their state-of-the-art synthesis results on image data.

In the paper “High-Resolution Image Synthesis with Latent Diffusion Models”, the researchers proposed a new approach to make diffusion models far more efficient. They introduced a new class of models called Latent Diffusion Models (LDMs) that run the diffusion process in a learned latent space rather than in pixel space, which preserves sample quality while drastically reducing the cost of training and sampling.

What are Latent Diffusion Models?

Latent Diffusion Models operate in a learned latent space, as opposed to the pixel space where traditional diffusion models operate. The researchers first trained an autoencoder to compress images into this lower-dimensional latent space, and then trained the diffusion model on the compressed representations. Because the latent space discards imperceptible high-frequency detail, the diffusion model can spend its capacity on the semantic structure of the data, while the decoder restores the detail needed for high-quality samples.
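
A rough sketch of the resulting generation pipeline: sample Gaussian noise in the latent space, run the reverse diffusion there, and decode the result back to pixels. The autoencoder interface, eps_model, and the latent shape below are illustrative assumptions, not the actual Stable Diffusion API; reverse_step and T come from the earlier DDPM sketch.

```python
# Sketch of LDM generation: denoise in latent space, then decode to pixels.
import torch

@torch.no_grad()
def generate(autoencoder, eps_model, latent_shape=(1, 4, 64, 64)):
    z = torch.randn(latent_shape)            # start from Gaussian noise in the latent space
    for t in reversed(range(T)):             # run the reverse diffusion entirely in latent space
        z = reverse_step(eps_model, z, t)    # reuse the DDPM reverse step from the first sketch
    return autoencoder.decode(z)             # decode the latent back to a full-resolution image
```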

How are LDMs Trained?

The autoencoder is trained with a perceptual loss and a patch-based adversarial objective, which keeps reconstructions close to the image manifold. To avoid arbitrarily scaled latent spaces, the authors experimented with two different regularizations: a low-weighted Kullback-Leibler term and a vector quantization layer. This approach allows for more efficient and stable training of the models.

The research group suggested separating training into two distinct phases (a rough sketch follows the list):

  • Training an autoencoder to provide a lower-dimensional and perceptually equivalent representational space.
  • Training diffusion models in the learned latent space, resulting in Latent Diffusion Models (LDMs).
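
The sketch below shows that two-phase schedule, assuming a VAE-style autoencoder regularized with the low-weighted KL term; the perceptual and patch-based adversarial losses used in the paper are omitted for brevity, and training_step is reused from the DDPM sketch above.

```python
# Sketch of two-phase LDM training: (1) autoencoder, (2) diffusion in latent space.
import torch
import torch.nn.functional as F

def train_autoencoder_step(autoencoder, optimizer, x, kl_weight=1e-6):
    """Phase 1: learn a compact, perceptually equivalent latent space."""
    z, mean, logvar = autoencoder.encode(x)          # assumed VAE-style encoder
    recon = autoencoder.decode(z)
    kl = -0.5 * torch.mean(1 + logvar - mean.pow(2) - logvar.exp())
    loss = F.mse_loss(recon, x) + kl_weight * kl     # reconstruction + low-weighted KL term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def train_latent_diffusion_step(autoencoder, eps_model, optimizer, x):
    """Phase 2: train the diffusion model on latents from the frozen autoencoder."""
    with torch.no_grad():
        z, _, _ = autoencoder.encode(x)              # the autoencoder is kept frozen
    return training_step(eps_model, optimizer, z)    # reuse the DDPM training step
```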

Post-hoc Image-Guiding

The paper also describes a method for conditioning diffusion models called “post-hoc image-guiding”, which allows the model to be guided at test time and can be used for tasks like image-to-image translation. This is a significant advantage, as the model can adapt to new tasks without requiring retraining.

Performance of LDMs

The researchers conducted several experiments to demonstrate the effectiveness of their proposed methods. They showed that their LDMs can generate high-quality samples in various tasks, including unconditional and class-conditional image synthesis, text-to-image synthesis, layout-to-image synthesis, and super-resolution. They also showed that their models can be trained more efficiently and stably than traditional diffusion models.

Key Takeaways

  • Diffusion models are a class of generative models that gradually transform a simple random noise into a complex data distribution.
  • Diffusion models can obtain better sample quality than state-of-the-art GANs.
  • Latent Diffusion Models (LDMs) operate in a learned latent space, allowing the model to capture more complex patterns and generate high-quality samples.
  • The “post-hoc image-guiding” method introduced in LDMs allows the model to be conditioned at test time, enabling it to adapt to new tasks without requiring retraining.
