Computer Vision Paper: High-Resolution Image Synthesis with Latent Diffusion Models

Christian Lin
Apr 23, 2023


The article discusses the limitations of diffusion models (DMs) for image generation due to their direct operation in pixel space, which makes optimization and inference expensive. To address these limitations, the authors propose the use of DMs in the latent space of pretrained autoencoders, which allows for a better balance between complexity reduction and detail preservation. This approach, known as latent diffusion models (LDMs), also enables the use of cross-attention layers to generate high-resolution images from general conditioning inputs such as text or bounding boxes. The authors demonstrate that LDMs achieve state-of-the-art results in image inpainting and class-conditional image synthesis, and perform competitively in unconditional image generation, text-to-image synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

If you want to know more about the fundamental knowledge behind diffusion models, please feel free to read the following articles.

Motivation

Democratizing High-Resolution Image Synthesis

Diffusion models (DMs) are likelihood-based models that can spend a lot of computational resources on modeling imperceptible details of the data. Training and evaluating such models on RGB images requires significant computational resources, and their mode-covering behavior often results in excessive capacity usage. As a result, DMs are not accessible to many researchers and have a large carbon footprint. To address this, a method is needed to reduce the computational complexity of DMs without impairing their performance, to make them more accessible to researchers and users.

Departure to Latent Space

The researchers start by analyzing already-trained diffusion models in pixel space in order to find a more computationally efficient space for training diffusion models for high-resolution image synthesis. They divide learning into two stages: a perceptual compression stage and a semantic compression stage (a figure in the paper illustrates this via the rate-distortion trade-off of a trained model). The goal is a space that is perceptually equivalent to the data space but computationally much more suitable for training diffusion models. To this end, they train an autoencoder that provides a lower-dimensional, efficient representational space. They then train DMs in the learned latent space, which exhibits better scaling properties with respect to the spatial dimensionality, resulting in a model class called Latent Diffusion Models (LDMs).

Methodology

The authors propose a method to reduce the computational demands of training diffusion models for high-resolution image synthesis. They observe that, although diffusion models can ignore perceptually irrelevant details, they still require costly function evaluations in pixel space. To address this, they introduce an autoencoding model that learns a low-dimensional space that is perceptually equivalent to the image space. This offers several advantages: it makes DMs computationally more efficient, and the learned latent space can be reused to train multiple generative models for different tasks. The resulting model class is called Latent Diffusion Models (LDMs). In the following subsections, I will go through the key techniques used in LDMs.

Perceptual Image Compression

The authors use an encoder-decoder architecture to encode an RGB image x into a latent representation z, and then reconstruct the image x from z. The encoder downsamples the image by a factor f and they experiment with different downsampling factors. To avoid high-variance latent spaces, they use two different types of regularizations: KL-reg. and VQ-reg. The VQ-reg. uses a vector quantization layer within the decoder. By using mild compression rates, the authors achieve better reconstructions than previous works, which relied on an arbitrary 1D ordering of the latent space.
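As a rough illustration, here is a minimal sketch of this perceptual compression stage (a toy convolutional autoencoder of my own, not the paper's KL- or VQ-regularized architecture): the encoder downsamples an RGB image x by a factor f = 2^m into a latent z, and the decoder reconstructs the image from z.

```python
import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    def __init__(self, in_ch=3, z_ch=4, base_ch=64, num_downs=3):  # f = 2**num_downs = 8
        super().__init__()
        enc, ch = [nn.Conv2d(in_ch, base_ch, 3, padding=1)], base_ch
        for _ in range(num_downs):                              # each block halves H and W
            enc += [nn.SiLU(), nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1)]
            ch *= 2
        enc += [nn.SiLU(), nn.Conv2d(ch, z_ch, 3, padding=1)]
        self.encoder = nn.Sequential(*enc)

        dec = [nn.Conv2d(z_ch, ch, 3, padding=1)]
        for _ in range(num_downs):                              # each block doubles H and W
            dec += [nn.SiLU(), nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1)]
            ch //= 2
        dec += [nn.SiLU(), nn.Conv2d(ch, in_ch, 3, padding=1)]
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        z = self.encoder(x)              # (B, 3, H, W) -> (B, z_ch, H/f, W/f)
        return self.decoder(z), z

x = torch.randn(1, 3, 256, 256)
x_rec, z = ToyAutoencoder()(x)           # z has shape (1, 4, 32, 32) for f = 8
```

In the paper, this autoencoder is additionally trained with a perceptual loss and a patch-based adversarial objective, and either a slight KL penalty or a vector quantization layer keeps the latent space well behaved; none of that is reflected in the toy model above.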

Latent Diffusion Models

Diffusion models are probabilistic models that learn a data distribution p(x) by gradually denoising a normally distributed variable. They rely on a reweighted variant of the variational lower bound on p(x), which mirrors denoising score matching. For image synthesis, the most successful models rely on an equally weighted sequence of denoising autoencoders ε_θ(x_t, t), trained to predict a denoised variant of their input x_t with the following objective:
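$$L_{DM} = \mathbb{E}_{x,\,\epsilon \sim \mathcal{N}(0,1),\,t}\Big[\,\big\lVert \epsilon - \epsilon_\theta(x_t, t)\big\rVert_2^2\,\Big]$$

where x_t is a noised version of the input x at timestep t, ε is the added Gaussian noise, and t is sampled uniformly from {1, …, T}.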

With the trained perceptual compression model consisting of the encoder E and decoder D, the diffusion model now operates in an efficient, low-dimensional latent space in which imperceptible high-frequency details have been abstracted away. Unlike previous work that modeled this space with complex, autoregressive attention-based models over a discretized latent, this approach can exploit the image-specific inductive biases of the latent: the denoising backbone ε_θ is a time-conditional UNet that matches the 2D structure of z, and generated latents are mapped back to image space with a single pass through the decoder D. Thus, the objective becomes:
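$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\,\epsilon \sim \mathcal{N}(0,1),\,t}\Big[\,\big\lVert \epsilon - \epsilon_\theta(z_t, t)\big\rVert_2^2\,\Big]$$

where z_t is a noised version of the latent z = E(x) at timestep t, so the costly function evaluations now happen in the low-dimensional latent space rather than in pixel space.

To make this concrete, here is a minimal sketch of one LDM training step (illustrative PyTorch, not the official implementation; the autoencoder interface, the unet(z_t, t) signature, and the toy noise schedule are my own simplifications): encode the image with the frozen autoencoder, add noise to the latent, and train the UNet to predict that noise.

```python
import torch
import torch.nn.functional as F

def ldm_training_step(autoencoder, unet, x, num_timesteps=1000):
    # Frozen perceptual compression stage: x -> z (no gradients flow into the autoencoder).
    with torch.no_grad():
        z = autoencoder.encoder(x)

    # Sample a timestep and Gaussian noise for each example in the batch.
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device)
    noise = torch.randn_like(z)

    # Toy linear noise schedule (the paper uses the standard DDPM schedule).
    alpha_bar = 1.0 - (t.float() + 1) / num_timesteps
    alpha_bar = alpha_bar.view(-1, 1, 1, 1)
    z_t = alpha_bar.sqrt() * z + (1.0 - alpha_bar).sqrt() * noise  # forward diffusion on the latent

    # eps_theta(z_t, t): the UNet predicts the noise that was added.
    pred_noise = unet(z_t, t)
    return F.mse_loss(pred_noise, noise)  # the L_LDM objective above
```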

Conditioning Mechanisms

Like other types of generative models, diffusion models are in principle capable of modeling conditional distributions of the form p(z|y), which makes it possible to control the synthesis process through different types of inputs y, such as text or semantic maps. However, in the context of image synthesis, conditioning beyond class labels or blurred input images had not been well explored. To address this, the authors augment the diffusion model's UNet backbone with a cross-attention mechanism, which is effective for learning attention-based models over various input modalities. They also introduce a domain-specific encoder τ_θ that projects the conditioning input y to an intermediate representation, which is then mapped into the intermediate layers of the UNet via cross-attention.
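Concretely, the cross-attention layers use the standard attention formulation, with the query taken from the UNet features and the key and value taken from the encoded conditioning τ_θ(y):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)\cdot V,\qquad Q = W^{(i)}_Q\,\varphi_i(z_t),\quad K = W^{(i)}_K\,\tau_\theta(y),\quad V = W^{(i)}_V\,\tau_\theta(y)$$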

Here, φ_i(z_t) denotes a (flattened) intermediate representation of the UNet implementing ε_θ, and W_Q^(i), W_K^(i), W_V^(i) are learnable projection matrices. Based on image-conditioning pairs (x, y), we then learn the conditional LDM via
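$$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\,y,\,\epsilon \sim \mathcal{N}(0,1),\,t}\Big[\,\big\lVert \epsilon - \epsilon_\theta\big(z_t, t, \tau_\theta(y)\big)\big\rVert_2^2\,\Big]$$

where the conditioning encoder τ_θ and the denoising network ε_θ are optimized jointly.

The following is a minimal sketch of such a cross-attention block (hypothetical layer names and dimensions of my own, single-head for brevity, not the official code): it injects a conditioning sequence τ_θ(y), e.g. text-token embeddings, into a flattened UNet feature map φ_i(z_t).

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, dim_phi, dim_tau, dim_head=64):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim_phi, dim_head, bias=False)  # Q from UNet features
        self.to_k = nn.Linear(dim_tau, dim_head, bias=False)  # K from conditioning
        self.to_v = nn.Linear(dim_tau, dim_head, bias=False)  # V from conditioning
        self.to_out = nn.Linear(dim_head, dim_phi)

    def forward(self, phi, tau_y):
        # phi:   (B, H*W, dim_phi)  flattened spatial features phi_i(z_t) of the UNet
        # tau_y: (B, L,   dim_tau)  encoded conditioning tokens tau_theta(y)
        q, k, v = self.to_q(phi), self.to_k(tau_y), self.to_v(tau_y)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)                           # softmax(QK^T / sqrt(d)) V

phi = torch.randn(1, 32 * 32, 320)      # e.g. a flattened 32x32 latent feature map
tau_y = torch.randn(1, 77, 768)         # e.g. 77 text-token embeddings
out = CrossAttention(320, 768)(phi, tau_y)   # -> (1, 1024, 320)
```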

The paper shows that the proposed latent diffusion models improve the training and sampling efficiency of denoising diffusion models without degrading their quality. Combined with the cross-attention conditioning mechanism, the experiments show that LDMs perform favorably against state-of-the-art methods on a wide range of conditional image synthesis tasks without requiring task-specific architectures.

In this article, I have briefly shared my viewpoints on the paper, and I hope it helps you learn more about it. I also include a link to a video about the paper; I hope you like it!

If you like the article, please give me some 👏, share the article, and follow me to learn more about the world of multi-agent reinforcement learning. You can also contact me on LinkedIn, Instagram, Facebook, and GitHub.


Christian Lin

A CS master's student who used to work at ShangShing as an iOS full-end developer. Now, I am diving into the AI field, especially multi-agent RL and bio-inspired intelligence.