A new SotA for generative modelling — Denoising Diffusion Probabilistic Models

Published in Graphcore, Oct 7, 2021 (10 min read)

Author: Sebastian Orbell, PhD Student at University of Oxford, Department of Materials

Generative models create latent representations that distil information from large datasets in order to generate realistic, novel data points. In the long term, these models could be vital in developing accurate world models, as well as in learning categorical and continuous features of a dataset in an unsupervised way. Currently, generative models are demonstrating their value in a variety of downstream tasks such as inpainting, super-resolution, and generating continuous exploration spaces for reinforcement learning. Generative Adversarial Networks (GANs) have represented the state of the art (SotA) for some time. Recently, however, OpenAI has published results that make a strong case for a new era in which Denoising Diffusion Probabilistic Models dominate SotA generative applications.

In this article, I shall introduce the theory behind this method and describe the contributions which have enabled this relatively unstudied technique to topple GANs. I will then describe how to leverage the capabilities of specialist AI hardware, namely the Graphcore IPU, using their custom TensorFlow framework. Lastly, I will discuss the future of denoising diffusion models and the role of specialist AI hardware in the progress of machine learning research.

Selected samples from an ImageNet 512×512 model

Introduction

Denoising Diffusion Probabilistic Models are a class of generative model inspired by statistical thermodynamics (J. Sohl-Dickstein et al.); for clarity, I shall refer to them as diffusion models from here on. The general concept is to sequentially destroy information in a distribution of data through a forward diffusion process. A model is then trained to reverse this diffusion process and thus recover information, much like Maxwell’s eponymous demon. This approach makes it possible to learn, sample from, and evaluate probabilities in deep generative models efficiently, and to compute conditional and posterior probabilities under the learned model. Sampling from these models generates realistic data via a Markov chain that starts from white noise.

Summary of a training step in the diffusion framework

Diffusion models are parameterised Markov chains, which reverse a diffusion process and thus generate information. If this parameterisation is constructed so that the model predicts the Gaussian noise added in the forward diffusion step, an equivalence can be demonstrated between denoising diffusion and annealed Langevin dynamics. This parameterisation also reduces the variational lower bound to an objective shared with denoising score matching methods. Crucially, utilising this specific parameterisation of the Markov chain (identified by J. Ho et al.) produces the highest sample quality.
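As a minimal sketch of the training step described above, the following assumes the closed-form forward process of Ho et al., x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps, and a linear noise schedule running from 1e-4 to 0.02 over 1000 steps. The function names are my own, and `zero_model` is a hypothetical placeholder for the neural network, so this illustrates the idea rather than any particular implementation.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (start and end values are assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product of (1 - beta_s) up to t: the signal variance left at step t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) in one shot; also return the noise that was added."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    a = math.sqrt(alpha_bars[t])
    b = math.sqrt(1.0 - alpha_bars[t])
    x_t = [a * x + b * e for x, e in zip(x0, noise)]
    return x_t, noise

def noise_prediction_loss(predict_noise, x0, alpha_bars, rng):
    """One training step: sample a timestep, noise the data, score the noise prediction."""
    t = rng.randrange(len(alpha_bars))
    x_t, noise = q_sample(x0, t, alpha_bars, rng)
    predicted = predict_noise(x_t, t)
    return sum((p - e) ** 2 for p, e in zip(predicted, noise)) / len(x0)

rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_betas(1000))
x0 = [0.5, -0.25, 1.0, 0.0]  # a toy four-pixel "image"

def zero_model(x_t, t):
    """Hypothetical stand-in for the network: always predicts zero noise."""
    return [0.0] * len(x_t)

loss = noise_prediction_loss(zero_model, x0, alpha_bars, rng)
print(f"signal left at the final step: {alpha_bars[-1]:.2e}, loss: {loss:.3f}")
```

In a real model the placeholder would be replaced by the UNet, and the loss would be averaged over a batch of images and timesteps.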

It is notoriously difficult to evaluate image synthesis quality when studying generative models. A popular metric, however, is the Fréchet Inception Distance (FID), which measures the distance between the distributions of feature vectors extracted from real and generated images. A pre-trained Inception v3 model is used to extract these features.
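For reference, the FID is usually written as the Fréchet distance between two Gaussians fitted to the extracted features. The symbols are introduced here for illustration: (\mu_r, \Sigma_r) and (\mu_g, \Sigma_g) denote the mean and covariance of the feature vectors of real and generated images respectively.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

A lower score indicates that the generated feature distribution is closer to the real one.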

FID scores with respect to image augmentation

Recent Contributions

After a seminal paper by J. Ho et al., there has been a flurry of recent work in the field of diffusion methods, with two notable contributions from OpenAI in “Improved Denoising Diffusion Probabilistic Models” and “Diffusion Models Beat GANs on Image Synthesis”. Here, I will outline the key contributions from these papers.

(i) Variance prediction

The model variance can be expressed as an interpolation between its theoretical bounds, and a neural network can therefore be used to parameterise the interpolation factor. This method proved to be more stable than using neural networks to learn the variance directly, and the log likelihood was improved in comparison to methods using a fixed model variance parameter.
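The interpolation can be sketched as follows, assuming the log-domain form used in “Improved Denoising Diffusion Probabilistic Models”, where the two bounds are beta_t and the posterior variance beta_tilde_t. The interpolation factor `v` would come from the network; here it is simply passed in, and the numerical values are illustrative assumptions.

```python
import math

def posterior_variance(beta_t, alpha_bar_t, alpha_bar_prev):
    """Lower bound: beta_tilde_t = (1 - alpha_bar_{t-1}) / (1 - alpha_bar_t) * beta_t."""
    return (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t

def interpolated_variance(v, beta_t, beta_tilde_t):
    """Interpolate in the log domain: v = 1 gives beta_t, v = 0 gives beta_tilde_t.

    Not defined at the very first step, where beta_tilde_t is zero.
    """
    return math.exp(v * math.log(beta_t) + (1.0 - v) * math.log(beta_tilde_t))

# Illustrative values for a single timestep (assumptions, not taken from the paper).
beta_t = 0.02
alpha_bar_prev = 0.5
alpha_bar_t = alpha_bar_prev * (1.0 - beta_t)
beta_tilde_t = posterior_variance(beta_t, alpha_bar_t, alpha_bar_prev)

for v in (0.0, 0.5, 1.0):
    print(v, interpolated_variance(v, beta_t, beta_tilde_t))
```

Because the output is always bounded by the two extremes, the network cannot produce an arbitrarily large or small variance, which is one plausible reason for the improved stability.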

(ii) Noise scheduling

Dhariwal and Nichol recognised that a significant portion (20%) of a linear noise schedule can be removed without a significant impact on sample quality and, therefore, proposed a cosine schedule. They demonstrated that, with this adapted noise schedule, the model learns more throughout the whole diffusion process.

Latent samples from linear (top) and cosine (bottom) schedules respectively at linearly spaced values of t from 0 to T. The latents in the last quarter of the linear schedule are almost purely noise, whereas the cosine schedule adds noise more slowly.
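To make the comparison concrete, here is a small sketch of both schedules. It assumes the cosine form from “Improved Denoising Diffusion Probabilistic Models”, with offset s = 0.008 and variances clipped at 0.999, and a linear schedule from 1e-4 to 0.02; these constants are assumptions about the usual settings rather than values stated in this article.

```python
import math

def cosine_alpha_bars(num_steps, s=0.008):
    """alpha_bar_t = f(t) / f(0) with f(t) = cos^2(((t / T + s) / (1 + s)) * pi / 2)."""
    def f(t):
        return math.cos((t / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(num_steps + 1)]

def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for linearly spaced beta_t; index 0 is 1.0."""
    step = (beta_end - beta_start) / (num_steps - 1)
    alpha_bars = [1.0]
    for i in range(num_steps):
        alpha_bars.append(alpha_bars[-1] * (1.0 - (beta_start + i * step)))
    return alpha_bars

def betas_from_alpha_bars(alpha_bars, max_beta=0.999):
    """Recover per-step variances, clipped to avoid a singular final step."""
    return [min(1.0 - cur / prev, max_beta) for prev, cur in zip(alpha_bars, alpha_bars[1:])]

T = 1000
cosine = cosine_alpha_bars(T)
linear = linear_alpha_bars(T)
cosine_betas = betas_from_alpha_bars(cosine)

# Fraction of the original signal variance remaining at a few points in the chain.
for fraction in (0.25, 0.5, 0.75, 1.0):
    t = int(fraction * T)
    print(f"t/T={fraction:.2f}  linear={linear[t]:.4f}  cosine={cosine[t]:.4f}")
```

Under these assumptions the linear schedule has lost almost all of the signal three quarters of the way through the chain, while the cosine schedule still retains a noticeable fraction, which matches the behaviour described in the caption above.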

(iii) Learning objective (parameterisation)

Intuitively, the learning objective for a denoising diffusion model could be parameterised by the mean of the un-noised sample. However, experiments have shown that parameterising the model by the mean of the added noise leads to more stable training and higher quality image synthesis. This result may arise because the distribution of the noise is simpler to represent than the distribution of natural images.
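The two parameterisations are linked by the closed form of the forward process. In the notation of Ho et al. (assumed here, with \alpha_t = 1 - \beta_t and \bar{\alpha}_t the cumulative product of the \alpha_s), a network \epsilon_\theta that predicts the noise determines the mean of the reverse step as follows:

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}
  \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
```

Predicting the noise or predicting the mean is therefore a choice of output target rather than a change to the underlying model.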

(iv) Loss function

When using a fixed value to express the model variance, a loss function defined by a reweighted variational lower bound was observed to produce the highest image quality. However, this reweighting does not include a term for the variance. Therefore, in order to learn the interpolation factor for the model variance, a hybrid loss function can be defined as a weighted linear combination of the true variational lower bound and the mean squared error. When the relative weight of the true variational lower bound was set to 0.001, the authors achieved their highest image quality. This hybrid loss function also effectively reduced the noise levels in the gradients of the model during training.
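A minimal sketch of the hybrid objective for a single scalar dimension is given below. It assumes the variational term is the KL divergence between the true Gaussian posterior and the model's Gaussian reverse step; the function names and the toy numbers are illustrative, and the stop-gradient that the authors apply to the mean inside the variational term is not modelled here.

```python
import math

def gaussian_kl(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two one-dimensional Gaussians."""
    return 0.5 * (math.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)

def mean_squared_error(predicted_noise, true_noise):
    """The simple, reweighted objective: squared error on the predicted noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_noise, true_noise)) / len(true_noise)

def hybrid_loss(simple_term, vlb_term, vlb_weight=0.001):
    """Weighted sum; 0.001 is the relative weight quoted in the text."""
    return simple_term + vlb_weight * vlb_term

# Toy numbers for one training example (assumptions for illustration only).
simple_term = mean_squared_error([0.1, -0.3], [0.0, -0.5])
vlb_term = gaussian_kl(mean_p=0.2, var_p=0.0196, mean_q=0.25, var_q=0.02)
print(hybrid_loss(simple_term, vlb_term))
```

The small weight keeps the squared error in charge of the mean, while the variational term supplies the only learning signal for the variance.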

(v) Model architecture

Ho et al. introduced the use of a UNet architecture into the diffusion model framework, which proved to be crucial for improving image synthesis quality. Dhariwal and Nichol made further improvements to the architecture by adding a global attention layer and a projection of the timestep embedding into each residual block.

(vi) Classifier guidance

Inspired by the success of GANs in class conditional synthesis, Dhariwal and Nichol employ a classifier, trained on noisy images, to guide the diffusion sampling process. Here, a pre-trained diffusion model is conditioned using the gradients of the classifier to guide the sampling process towards an arbitrary class label. The scale of the classifier gradients can be utilised to effectively balance the trade-off between diversity and fidelity of the samples.
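The guidance step can be sketched as a shift of the mean of each reverse step by the scaled classifier gradient, multiplied by the step variance. The toy classifier below is entirely hypothetical (a quadratic log-probability centred on a made-up class centre); a real classifier would be a network trained on noisy images, with the gradient obtained by automatic differentiation.

```python
def guided_mean(mean, variance, classifier_grad, scale):
    """Shift the reverse-step mean: mean + scale * variance * grad of log p(y | x)."""
    return [m + scale * v * g for m, v, g in zip(mean, variance, classifier_grad)]

def toy_log_prob_grad(x, class_centre):
    """Gradient of a hypothetical log p(y | x) = -0.5 * ||x - class_centre||^2."""
    return [c - xi for xi, c in zip(x, class_centre)]

mean = [0.0, 0.0]          # unguided mean predicted by the diffusion model
variance = [0.1, 0.1]      # diagonal variance of the reverse step
class_centre = [1.0, -1.0]  # made-up location favoured by the toy classifier

grad = toy_log_prob_grad(mean, class_centre)
for scale in (0.0, 1.0, 10.0):
    print(scale, guided_mean(mean, variance, grad, scale))
```

A scale of zero recovers the unguided model, and larger scales pull samples more strongly towards the class, trading diversity for fidelity.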

(vii) Reducing the number of sampling steps

The length of the Markov chain should be as large as possible so that the Gaussian conditional distribution model of the generative process becomes a good approximation. However, this requirement makes diffusion methods prohibitively expensive, so some effort has been made towards reducing the number of sampling steps.

Dhariwal and Nichol recognised that a model trained with 4000 diffusion steps could be sampled from using as few as 100 steps without significantly reducing image synthesis quality. To reduce the number of sampling steps, evenly spaced real numbers are selected between 0 and T. The Markov chain is then run through the model with these time embeddings, and the sampling variance is automatically rescaled according to the noise schedule.
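The respacing can be sketched as follows. The sketch assumes zero-indexed timesteps, rounding of the evenly spaced values to the nearest integer, and a linear schedule whose variances are scaled by 1000/4000 for the longer chain; all of these are assumptions made for illustration. The rescaled variances are chosen so that the shortened chain destroys the same total amount of signal.

```python
import math

def scaled_linear_alpha_bars(num_steps):
    """alpha_bar for a linear schedule whose betas are scaled by 1000 / num_steps."""
    scale = 1000.0 / num_steps
    beta_start, beta_end = scale * 1e-4, scale * 0.02
    step = (beta_end - beta_start) / (num_steps - 1)
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        running *= 1.0 - (beta_start + i * step)
        alpha_bars.append(running)
    return alpha_bars

def strided_timesteps(train_steps, sample_steps):
    """Evenly spaced real numbers across the training range, rounded to integers."""
    stride = (train_steps - 1) / (sample_steps - 1)
    return [round(i * stride) for i in range(sample_steps)]

def respaced_betas(alpha_bars, timesteps):
    """New beta for each kept step: 1 - alpha_bar[current] / alpha_bar[previous kept]."""
    betas, prev = [], 1.0
    for t in timesteps:
        betas.append(1.0 - alpha_bars[t] / prev)
        prev = alpha_bars[t]
    return betas

alpha_bars = scaled_linear_alpha_bars(4000)
timesteps = strided_timesteps(4000, 100)
betas = respaced_betas(alpha_bars, timesteps)

# The short chain ends with the same remaining signal as the full 4000-step chain.
signal_left = math.prod(1.0 - b for b in betas)
print(timesteps[:4], timesteps[-1], signal_left, alpha_bars[-1])
```

At sampling time the chain is traversed in reverse, from the last kept timestep down to the first, and the model is still given the original timestep values as its time embedding.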

Song et al. have constructed a method, termed denoising diffusion implicit models (DDIM), which parameterises the forward process as non-Markovian while leaving the training objective unchanged. This enables them to model the reverse process as an implicit probabilistic mapping. The consequence of this is that the generative process can be achieved with far fewer sampling steps, without a significant reduction in image synthesis quality.
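A single deterministic DDIM update (the case with no added sampling noise) can be sketched as below. It assumes a noise-predicting model and uses my own function and variable names; the true noise is passed in place of a network prediction so that the behaviour can be checked by hand.

```python
import math

def ddim_step(x_t, predicted_noise, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM update from noise level alpha_bar_t to alpha_bar_prev."""
    sqrt_a = math.sqrt(alpha_bar_t)
    sqrt_one_minus_a = math.sqrt(1.0 - alpha_bar_t)
    # Estimate the clean sample implied by the current noise prediction.
    x0_pred = [(x - sqrt_one_minus_a * e) / sqrt_a for x, e in zip(x_t, predicted_noise)]
    # Re-noise that estimate to the (lower) target noise level with the same noise.
    sqrt_a_prev = math.sqrt(alpha_bar_prev)
    sqrt_one_minus_a_prev = math.sqrt(1.0 - alpha_bar_prev)
    return [sqrt_a_prev * x0 + sqrt_one_minus_a_prev * e
            for x0, e in zip(x0_pred, predicted_noise)]

# Toy check: build x_t from a known x_0 and noise, then step with a perfect prediction.
x0 = [0.5, -1.0]
eps = [0.3, 0.7]
alpha_bar_t = 0.3
x_t = [math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e for x, e in zip(x0, eps)]

print(ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=0.6))  # partially denoised
print(ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0))  # fully denoised
```

Because the update only depends on the two noise levels, the target level can be any earlier point in the schedule, which is what allows large jumps between sampling steps.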

Hardware focused model optimisation

The IPU (Intelligence Processing Unit) is distinct from its competitors in many ways. Most significantly, the IPU has been designed with a large amount of high-bandwidth on-chip memory, located adjacent to the 1,472 processor cores on the silicon die. As each core has 6 threads, each IPU is capable of running 8,832 threads concurrently, allowing for massive fine-grained parallelisation. This parallelisation is of significant benefit when sampling from diffusion-based models, where an iterative Markov chain must be computed. The IPU also has an on-chip pseudo-random number generator, which enables a significant speed-up when training diffusion-based models, where the time-step and added noise must be sampled at each training iteration.

Graphcore has engineered a proprietary software stack to best utilise the IPU’s architectural advantages. The model can be implemented in popular frameworks, such as TensorFlow, and will be converted to a static graph by the compiler, which will optimally distribute the computation across the IPU.

Improving FID score with the model size and compute resources

As with many contemporary SotA machine learning methods, the performance of diffusion models scales with the model size and the compute resources (see above figure). Larger models do not fit on a single chip and require better engineered solutions to distribute them efficiently across multiple chips. Graphcore has developed a pipelining strategy to optimally distribute the model across multiple computational stages, with each stage allocated to a different IPU.

The training can be optimally achieved by utilising Graphcore’s scheduler which enables maximum usage of the available devices. Here, the forward passes are grouped to run in parallel across the devices, followed by the grouped backward passes. This scheme ensures that the compute load on each device is similar at all times as backward passes tend to be more computationally demanding.

An example of a grouped pipeline schedule

Memory usage can become prohibitive when training very large networks. However, hardware-aware software optimisations can mitigate these issues. Graphcore’s Poplar SDK enables recomputation of activations in the backward pass, which significantly reduces the number of activations that need to be saved between the forward and backward passes. This compromise between compute and memory demand is fundamental in the optimisation of contemporary machine learning workflows targeted at specialist hardware. Graphcore’s software stack provides developers with a suite of tools to exercise fine-grained control of this balance between compute and memory resources.

Other uses for diffusion models

The renewed interest in denoising diffusion methods has seen them become competitive in a variety of domains. Here, I will briefly outline some of the most interesting examples.

Multivariate probabilistic time series forecasting

Rasul et al. propose TimeGrad, a method that learns the gradient of the data distribution by optimising the variational lower bound, and then generates samples by feeding white noise through a Markov chain that follows Langevin dynamics. Using a model architecture defined by LSTM, residual, and dilated convolutional layers, the authors achieve SotA results across six popular time series datasets.
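The sampling idea can be illustrated with a toy Langevin sampler. This is not TimeGrad itself: the learned gradient of the data distribution is replaced here by the analytically known score of a one-dimensional Gaussian, and the target mean, step size and chain length are arbitrary assumptions.

```python
import math
import random

def langevin_sample(score, x_init, step_size, num_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (step / 2) * score(x) + sqrt(step) * z."""
    x = x_init
    for _ in range(num_steps):
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * rng.gauss(0.0, 1.0)
    return x

TARGET_MEAN, TARGET_STD = 3.0, 1.0

def gaussian_score(x):
    """Gradient of the log density of the toy target N(TARGET_MEAN, TARGET_STD^2)."""
    return -(x - TARGET_MEAN) / TARGET_STD ** 2

rng = random.Random(0)
# Each chain starts from white noise and drifts towards the target distribution.
samples = [langevin_sample(gaussian_score, rng.gauss(0.0, 1.0), 0.1, 200, rng)
           for _ in range(500)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

In a learned model the score function would be supplied by the network, conditioned in TimeGrad's case on the history of the time series.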

Prediction intervals and ground truth values across two dimensions in a traffic dataset

Image-to-image translation

Sasaki et al. achieve SotA results in image-to-image translation tasks across multiple public datasets, beating CycleGAN. They simultaneously train two denoising networks, one conditioned on the source domain and one on the target domain. This approach allows for stable (non-adversarial) training, whereby the joint probability distribution over both domains is learnt by optimising a denoising score matching objective conditioned on the other domain. Thus, a target image can be generated by a denoising Markov Chain Monte Carlo approach, based on Langevin dynamics and conditioned on the source image.

An illustration of the image translation approach

Class conditional image generation

Ho et al. report SotA results in class conditional image synthesis (outperforming BigGAN-deep and VQ-VAE-2) by designing a cascade of diffusion models. Cascading is a technique whereby a high-resolution data distribution is learned by a pipeline of separately trained models at multiple resolutions. They found that image synthesis quality from cascading pipelines could be effectively improved by training each super-resolution model using low-resolution inputs subject to some form of data augmentation. They claim that this alleviates compounding error in the cascading pipelines (exposure bias).

Image super-resolution

Saharia et al. adapt a diffusion model to tackle the task of image super-resolution. In this instance, the reference low resolution image is interpolated to the target resolution and concatenated to the noisy target image to form the input of the UNet. Successive networks can then be cascaded together to synthesise very high-resolution images.

Super-resolution via iterative refinement

Improvements by sampling from non-Gaussian noise distributions

Nachmani et al. explore the efficacy of sampling from non-Gaussian noise distributions. They find that fitting distributions with more degrees of freedom, such as mixtures of Gaussians, can improve the performance of generative models.

The future

Progress and interest in the field of denoising diffusion models are rapidly gaining momentum. We are, therefore, likely to see many more papers presenting incremental improvements to the SotA algorithms for a variety of downstream tasks. It looks probable that diffusion models, once properly investigated and optimised, will topple GANs from their dominant position among generative models.

As machine learning models become increasingly large, the compute resources required to train them will become prohibitively expensive. Using conventional hardware, this scenario will make a thorough investigation of new architectures or methods impossible. Therefore, specialised AI hardware such as the IPU will not only accelerate research but also open new avenues of research.