Simple Review: High-Resolution Image Synthesis with Latent Diffusion Models

Jeongwon
4 min read · Apr 7, 2023

Most images in this post are from the High-Resolution Image Synthesis with Latent Diffusion Models paper.

There are four main steps (text quoted from [2]):

  1. The first step is to extract a more compact representation of the image using the encoder E located in the upper left corner of the figure above. Unlike other methods, latent diffusion works in the latent space defined by the encoder rather than in pixel space.
  2. Next, Gaussian noise is added to the image in the upper middle part of the figure as part of the diffusion process that goes from z to zT (in case T steps of noise addition are applied).
  3. The zT representation is then passed through a U-Net located in the middle part at the bottom of the figure. The U-Net has the role of predicting zT-1, and this process is repeated T-1 times until we arrive at z, which is then returned from latent space to pixel space via the decoder D.
  4. Finally, the approach allows for arbitrary conditioning by mapping various input modalities such as semantic maps or text. This is achieved by first transforming the input y with a dedicated encoder τθ and then mapping it to the intermediate layers of the U-Net with the same cross-attention layer used by the Transformer architecture.
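The four steps above can be sketched end to end with toy stand-ins. This is only an illustration of the data flow, not the paper's implementation: `encode`, `decode`, and `denoise_step` are hypothetical placeholders I made up (average pooling, upsampling, and a simple shrink), where a real latent diffusion model would use a trained autoencoder and a trained U-Net.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: a real LDM uses a learned encoder E, decoder D and U-Net.
def encode(x, f=4):
    """Toy encoder E: average-pool by a factor f to get a smaller 'latent'."""
    h, w, c = x.shape
    return x.reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

def decode(z, f=4):
    """Toy decoder D: nearest-neighbour upsampling back to pixel resolution."""
    return z.repeat(f, axis=0).repeat(f, axis=1)

def denoise_step(z_t, t, cond=None):
    """Placeholder for one U-Net step z_t -> z_{t-1}; here it only shrinks z_t."""
    return 0.9 * z_t

T = 10
x = rng.random((64, 64, 3))          # image in pixel space
z = encode(x)                        # step 1: pixel space -> latent space
z_T = z + rng.normal(size=z.shape)   # step 2: crude noising (not a real schedule)
z_t = z_T
for t in range(T, 0, -1):            # step 3: T reverse steps, all in latent space
    z_t = denoise_step(z_t, t)       # step 4 would pass an encoded condition as cond
x_hat = decode(z_t)                  # back to pixel space via D

print(x.shape, z.shape, x_hat.shape)
```

The point to notice is that the loop only ever touches the 16x16 latent, never the 64x64 image, which is where the compute saving comes from.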

========================================================

To compress the image into the latent representation z, the authors use an autoencoder based on VQGAN.

VQGAN Structure

Here is how the latent vectors are looked up in the codebook in VQGAN:
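As a rough sketch of the codebook idea: each encoder output vector is replaced by its nearest codebook entry. The function name `quantize` and the tiny 2-D codebook below are my own illustrative choices, and the training details (such as how gradients pass through the lookup) are left out.

```python
import numpy as np

def quantize(z_e, codebook):
    """Nearest-neighbour codebook lookup.

    z_e:      (N, D) continuous encoder outputs
    codebook: (K, D) learned code vectors
    Returns the chosen indices (N,) and the quantized vectors (N, D).
    """
    # Squared Euclidean distance between every latent and every code: (N, K)
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

# Toy example with K=3 codes of dimension D=2 (values chosen by hand).
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
idx, z_q = quantize(z_e, codebook)
print(idx)  # each latent maps to the closest code: [0 1 2]
```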

========================================================

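Diffusion model: a model that learns to denoise, going from a noisy x_t back to x (the reverse process), after noise has been added step by step from x to x_t (the forward process).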
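The noising direction has a convenient closed form in DDPM-style diffusion: x_t = sqrt(ᾱ_t) · x + sqrt(1 − ᾱ_t) · ε with ε ~ N(0, I). Here is a minimal sketch of it. The linear beta schedule (1e-4 to 0.02 over 1000 steps) is the common DDPM default and is my assumption here, and the helper names `make_alpha_bar` and `q_sample` are made up for this post.

```python
import numpy as np

def make_alpha_bar(T, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Jump straight from x0 to x_t: sqrt(a_bar)*x0 + sqrt(1 - a_bar)*eps."""
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar(T=1000)
x0 = np.ones((4, 4))
x_early, _ = q_sample(x0, 10, alpha_bar, rng)    # still close to x0
x_late, _ = q_sample(x0, 999, alpha_bar, rng)    # almost pure Gaussian noise
```

The denoising model is trained to undo this: given x_t and t, it predicts the noise ε that was added.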

Instead of running diffusion on the full image in pixel space, the authors run it on the latent representation z, and they use a U-Net as the denoising network.
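Written as a training objective, this means the usual noise-prediction loss is computed on the noisy latent z_t instead of on a noisy image. As far as I can reconstruct it from the paper, the unconditional latent objective looks like this, where E is the encoder, ε is the sampled noise, and ε_θ is the U-Net:

```latex
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, 1),\ t}
\left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t) \right\rVert_2^2 \right]
```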

Inside the denoising process, conditioning information is injected through cross-attention layers. The conditioning input y is encoded by τθ, the dedicated encoder mentioned above. y can be a semantic map, text, another representation, or an image. At each step t, the encoded condition τθ(y) is used as shown in Equation (3).
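To make the cross-attention part concrete, here is a small single-head NumPy sketch. The queries come from the (flattened) U-Net features and the keys and values come from the encoded condition τθ(y), then the output is softmax(QKᵀ / sqrt(d)) · V. All shapes and the random projection matrices are arbitrary choices for illustration; a real model learns them and uses multiple heads.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(phi, cond, W_q, W_k, W_v):
    """phi:  (N, d_phi) flattened U-Net features (one row per spatial position)
    cond: (M, d_tau) encoded condition tau_theta(y) (e.g. one row per text token)
    """
    Q = phi @ W_q                                       # (N, d) queries from the U-Net
    K = cond @ W_k                                      # (M, d) keys from the condition
    V = cond @ W_v                                      # (M, d) values from the condition
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (N, M), each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
N, M, d_phi, d_tau, d = 16, 5, 8, 12, 4   # arbitrary illustrative sizes
phi = rng.normal(size=(N, d_phi))
cond = rng.normal(size=(M, d_tau))
W_q = rng.normal(size=(d_phi, d))
W_k = rng.normal(size=(d_tau, d))
W_v = rng.normal(size=(d_tau, d))
out, weights = cross_attention(phi, cond, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Each spatial position of the U-Net feature map ends up as a weighted mix of the condition tokens, which is how text or a semantic map can steer the denoising.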

========================================================

Diffusion Model [5]:

========================================================

I wrote this post to summarize the paper for my own understanding.

If anything in this post is wrong, please let me know and explain what should be corrected.

References:

[1]

[2]

[3]

[4]

[5]
