Stable Diffusion: Theory and Applications

Dhruba Pujary
CJ Express Tech (TILDI)
Jan 11, 2023

Have you heard of Stable Diffusion? If not, you have probably seen the image circulating on the internet captioned "Jeflon Zuckergates" — a face that looks like a blend of Jeff Bezos, Elon Musk, Mark Zuckerberg, and Bill Gates.

This kind of text-to-image creation is powered by an AI model, Stable Diffusion. Stable Diffusion is a text-to-image model built upon the work on Latent Diffusion Models (LDMs) combined with insights from conditional Diffusion Models (DMs). The best part is that the model and the source code were released for public use under a Creative ML OpenRAIL-M license. Stability AI, in collaboration with Runway, CompVis, EleutherAI, and LAION, brought forth the public open-source release of Stable Diffusion. This effort is in line with the push by many industry leaders in the AI ecosystem to democratize AI and make it accessible to everyone. Within a month of the public release announcement on the 22nd of August, 2022, the open-source community contributed to improving the model further and started exploring its full capacity for many other tasks.

However, the release of Stable Diffusion was soon surrounded by a lot of criticism for its ability to generate NSFW images and celebrity likenesses. Further, although the model was open source, it was partly trained on a dataset that is not publicly available. In less than three months, Stability AI released another version, Stable Diffusion v2 [39], addressing the above-mentioned issues and adding more capabilities. A detailed blog on the main differences between the two versions can be found in [40].

In this article, we first point to a few applications made possible by Stable Diffusion v1. In the subsequent sections, we go over some of the theory behind diffusion models, latent diffusion models, and Stable Diffusion, followed by a few example images generated using the model.

Applications:

AI creator tools have already been in use for a long time. Notable among them are tools powered by DALL-E, Imagen, and other proprietary models that are not open source. With the public release of Stable Diffusion, many new useful applications and ideas have been developed by the open-source community. Some of them are the following:

  • Cinematographers, VFX editors, or anyone else can use AI-powered applications to make films [1, 2, 19].
  • Image editing [20, 22] and video editing [9] can be done much faster.
  • Image generation from text can be commercialized quickly, giving users more control over design choices [3, 10] and letting product designers create amazing new designs [11].
  • Extracting aesthetically pleasing designs and color palettes for different applications, such as web pages, mobile apps, and themes [4].
  • Generating an MRI brain image dataset [5].
  • Illustration of stories [6].
  • Tile [7] and texture [8] generation.
  • Collage tools for images [16].
  • Art collections [17].
  • WebUIs with out-painting [25], in-painting, prompt matrices, upscaling, textual inversion [23, 24], and many more features [18, 21].

Theory

Now we will discuss the theory behind diffusion models, and subsequently the core topic of latent diffusion models, on which Stable Diffusion is built.

Diffusion Models

Image synthesis, or image generation, has made a lot of progress in recent years. Deep generative models such as Generative Adversarial Networks (GANs), AutoRegressive Models (ARMs), Flows, and Variational AutoEncoders (VAEs) have synthesized high-quality images. However, this synthesis comes at a high computational cost, especially for high-resolution images of complex, natural scenes. Scaling these models up to potentially billions of parameters has been the recipe for modeling complex, multi-modal distributions, but approaches such as GANs have not been entirely successful at it.

Recently, diffusion probabilistic models [26], which are built from a hierarchy of denoising autoencoders, have been able to generate impressive high-quality images [27, 28]. Diffusion models are inspired by non-equilibrium statistical physics: the essential idea is to systematically and slowly destroy the structure in the data distribution through an iterative forward diffusion process, and then to learn a reverse diffusion process that restores structure in the data, yielding a highly flexible and tractable generative model of the data.

Forward and reverse diffusion process (Source: [26])

In the above figure, the top row shows time slices of the forward trajectory on 2-D swiss-roll data. The data distribution (left) undergoes Gaussian diffusion, which gradually transforms it into an identity-covariance Gaussian (right). The bottom row shows time slices from the trained reverse trajectory: an identity-covariance Gaussian undergoes a Gaussian diffusion process with learned mean and covariance functions and is gradually transformed back into the data distribution (left). (Source: [26])
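To make the forward process concrete, here is a small sketch of my own (not code from [26]) that applies the closed-form Gaussian diffusion x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε, with ᾱ_t = ∏_{s≤t}(1 − β_s), to 2-D swiss-roll data; the linear β schedule and the number of steps are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_swiss_roll

# 2-D swiss-roll data, normalized to roughly unit scale
x0, _ = make_swiss_roll(n_samples=5000, noise=0.5)
x0 = x0[:, [0, 2]] / 10.0

# Linear beta schedule (illustrative values)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(x0, t, rng=np.random.default_rng(0)):
    """Sample x_t ~ q(x_t | x_0) in closed form."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

# Early timesteps still look like a swiss roll; late timesteps approach N(0, I)
for t in [0, 250, 999]:
    xt = q_sample(x0, t)
    print(t, xt.mean(axis=0).round(2), xt.std(axis=0).round(2))
```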

Directed Graphical Model for Denoising Diffusion Probabilistic Models (Source: [27])

A diffusion probabilistic model, as stated in [27], is a parameterized Markov chain trained using variational inference to produce samples matching the data after a finite time. The transitions of this chain, p_θ(x_{t-1} | x_t), are learned to reverse a diffusion process, which is a Markov chain that gradually adds noise to the data in the direction opposite to sampling, q(x_t | x_{t-1}), until the signal is destroyed. When the diffusion adds small amounts of Gaussian noise at each step, it is sufficient to set the sampling chain transitions to conditional Gaussians too, allowing for a particularly simple neural network parameterization. The simplified objective function is:

L_DM = E_{x, ε∼N(0,1), t} [ ‖ε − ε_θ(x_t, t)‖² ]

where ε_θ(x_t, t), t = 1, …, T, is a sequence of denoising autoencoders trained to predict a denoised variant of their input x_t, and x_t is a noisy version of the input x.
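As a minimal sketch of this objective (assuming a generic PyTorch denoising network `model(x_t, t)` that predicts the added noise — an assumption of mine, not a specific architecture from [27]), one training step could look like this:

```python
import torch
import torch.nn.functional as F

def ddpm_loss(model, x0, alphas_bar):
    """One step of the simplified DDPM objective: || eps - eps_theta(x_t, t) ||^2."""
    b = x0.shape[0]
    # Sample a random timestep per example and Gaussian noise
    t = torch.randint(0, alphas_bar.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)
    # Closed-form forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    # The denoising network predicts the noise that was added
    eps_pred = model(x_t, t)
    return F.mse_loss(eps_pred, eps)
```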

Training and evaluating diffusion models requires massive amounts of computational resources, which limits the accessibility and usability of this powerful class of models. Like any likelihood-based model, learning with diffusion models trained in pixel space can be roughly divided into two stages: first, a perceptual compression stage that removes high-frequency details but learns little semantic variation; second, a generative modeling stage in which the model learns the semantic and conceptual composition of the data. [29]

Latent Diffusion Models

The LDM paper [29] illustrates perceptual and semantic compression: most bits in a digital image correspond to imperceptible details. Diffusion models (DMs) can suppress this semantically meaningless information by minimizing the corresponding loss term, but their gradients and neural network backbone still need to be evaluated on all pixels, which leads to high computational cost during both optimization and inference. Latent diffusion models address this by separating training into two phases: first, an autoencoder is trained that provides a lower-dimensional representation space which is perceptually equivalent to the data space but has reduced computational complexity. DMs are then trained in this latent space, which scales better with spatial dimensionality. Because of the reduced complexity, images can be generated efficiently from the latent space with a single forward pass through the decoder. In addition, the universal autoencoding stage needs to be trained only once and can then be reused for training multiple diffusion models or for other tasks.

Perceptual image compression consists of an autoencoder trained with a combination of a perceptual loss [30] and a patch-based adversarial objective [31], which ensures that reconstructions are confined to the image manifold by enforcing local realism and avoids the blurriness introduced by relying solely on pixel-space losses such as L1 or L2 objectives. This universal autoencoder is trained only once and can therefore be reused for training multiple DMs or for completely different tasks. Given an image x ∈ ℝ^{H×W×3} in RGB space, the encoder E encodes x into a latent representation z = E(x), and the decoder D reconstructs the image from the latent, giving x̃ = D(z) = D(E(x)), where z ∈ ℝ^{h×w×c}. The encoder downsamples the image by a factor f = H/h = W/w, where f = 2^m. The high-frequency, imperceptible details are abstracted away by the low-dimensional latent representation z of an image x. This space is better suited to likelihood-based generative modeling than the high-dimensional pixel space, as the models can focus on the important, semantic bits of the data and train in a lower-dimensional, computationally efficient space.
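As a hedged illustration of the f = 8 autoencoder used by Stable Diffusion v1, the sketch below encodes and decodes an image with the pre-trained `AutoencoderKL` weights from the `CompVis/stable-diffusion-v1-4` repository on the Hugging Face Hub; the local file name is a placeholder, and the 0.18215 latent scaling constant is the one used in the v1 configuration.

```python
import torch
import numpy as np
from PIL import Image
from diffusers import AutoencoderKL

# Pre-trained SD v1 autoencoder: downsampling factor f = 8,
# so a 512x512x3 RGB image becomes a 64x64x4 latent (z in R^{h x w x c}).
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")

img = Image.open("input.png").convert("RGB").resize((512, 512))  # placeholder file
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0        # scale to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0)                               # (1, 3, 512, 512)

with torch.no_grad():
    z = vae.encode(x).latent_dist.sample() * 0.18215  # z = E(x), plus SD's latent scaling
    x_rec = vae.decode(z / 0.18215).sample             # x~ = D(z)

print(z.shape)      # torch.Size([1, 4, 64, 64])
print(x_rec.shape)  # torch.Size([1, 3, 512, 512])
```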

Generative modeling of latent representations uses a time-conditioned UNet [32] as the neural backbone ε_θ(∘, t), trained with the objective function:

L_LDM = E_{E(x), ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t)‖² ]

The forward process is fixed, so z_t can be obtained efficiently from the encoder E during training, and samples from p(z) are converted back to image space with a single pass through the decoder D. Diffusion models are in principle capable of modeling conditional distributions of the form p(z|y), which can be implemented with a conditional denoising autoencoder ε_θ(z_t, t, y). The conditioning input y can be text, semantic maps, or the inputs of other image-to-image translation tasks. Specifically, y is pre-processed by a domain-specific encoder τ_θ that projects y to an intermediate representation τ_θ(y) ∈ ℝ^{M×d_τ}, which is then mapped to intermediate layers of the UNet via cross-attention layers. The conditional objective function is:

L_LDM = E_{E(x), y, ε∼N(0,1), t} [ ‖ε − ε_θ(z_t, t, τ_θ(y))‖² ]
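Below is a rough sketch of one conditional training step in the spirit of this objective, assembled from diffusers and transformers building blocks (it mirrors the structure of the official text-to-image training example, but the variable names and preprocessing assumptions are mine):

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

repo = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

def ldm_conditional_loss(pixel_values, captions):
    """One step of the conditional objective || eps - eps_theta(z_t, t, tau_theta(y)) ||^2.

    pixel_values: image batch (B, 3, H, W) already scaled to [-1, 1].
    captions: list of B text prompts.
    """
    with torch.no_grad():
        # z = E(x): encode images into the latent space (frozen autoencoder)
        z = vae.encode(pixel_values).latent_dist.sample() * 0.18215
        # tau_theta(y): CLIP text embeddings used for cross-attention conditioning
        tokens = tokenizer(captions, padding="max_length",
                           max_length=tokenizer.model_max_length,
                           truncation=True, return_tensors="pt")
        cond = text_encoder(tokens.input_ids)[0]
    # Fixed forward process: add noise to the latents at a random timestep
    noise = torch.randn_like(z)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (z.shape[0],))
    z_t = scheduler.add_noise(z, noise, t)
    # Time-conditioned UNet predicts the noise, conditioned on the text embeddings
    eps_pred = unet(z_t, t, encoder_hidden_states=cond).sample
    return F.mse_loss(eps_pred, noise)
```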

Stable Diffusion

Stable Diffusion consists of three main components:

  1. Autoencoder: The variational autoencoder consists of two parts, an encoder and a decoder. The encoder converts an image into a low-dimensional latent representation for the forward diffusion process, and the decoder converts the latent representation produced by the reverse diffusion process back into an image.
  2. U-Net: It has an encoder and a decoder, both composed of ResNet blocks. The encoder compresses an image representation into a lower-dimensional representation, and the decoder decodes that lower-dimensional representation back into a higher-dimensional representation that is supposed to be less noisy. Shortcut connections are added between the downsampling ResNets of the encoder and the upsampling ResNets of the decoder to prevent loss of information during downsampling.
  3. Text encoder: The text encoder encodes the text prompt into an embedding-space representation. Specifically, Stable Diffusion uses a pre-trained CLIP [38] text encoder. Cross-attention layers added to both the encoder and decoder parts of the U-Net condition its outputs on the text embeddings [33]. A minimal sketch of how these three components fit together at inference time is shown below.
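Here is that sketch, closely following the walkthrough in [33]; the 50 denoising steps, guidance scale of 7.5, and 512×512 resolution are common defaults rather than requirements, and a CUDA GPU is assumed.

```python
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler
from transformers import CLIPTextModel, CLIPTokenizer
from PIL import Image

repo, device = "CompVis/stable-diffusion-v1-4", "cuda"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet").to(device)
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder").to(device)
scheduler = LMSDiscreteScheduler.from_pretrained(repo, subfolder="scheduler")

prompt, steps, guidance = ["A fire breathing cat fighting a knight dog"], 50, 7.5

def embed(texts):
    tok = tokenizer(texts, padding="max_length", max_length=tokenizer.model_max_length,
                    truncation=True, return_tensors="pt")
    return text_encoder(tok.input_ids.to(device))[0]

# 1. Text encoder: prompt + empty prompt for classifier-free guidance
text_emb = torch.cat([embed([""] * len(prompt)), embed(prompt)])

# 2. U-Net: start from random Gaussian latents (64x64x4 for 512x512 images)
#    and iteratively denoise them, conditioned on the text embeddings
latents = torch.randn(len(prompt), unet.config.in_channels, 64, 64, device=device)
scheduler.set_timesteps(steps)
latents = latents * scheduler.init_noise_sigma
for t in scheduler.timesteps:
    inp = scheduler.scale_model_input(torch.cat([latents] * 2), t)
    with torch.no_grad():
        noise_pred = unet(inp, t, encoder_hidden_states=text_emb).sample
    uncond, cond = noise_pred.chunk(2)
    noise_pred = uncond + guidance * (cond - uncond)
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 3. Autoencoder: decode the final latents back to pixel space
with torch.no_grad():
    image = vae.decode(latents / 0.18215).sample
image = ((image / 2 + 0.5).clamp(0, 1)[0] * 255).permute(1, 2, 0).byte().cpu().numpy()
Image.fromarray(image).save("out.png")
```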

Examples:

  • Image Generation from Text
    Prompt: “A fire breathing cat fighting a knight dog taken by a wildlife professional photographer”
    Latent representation:

Generated Image after decoding latent:
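To reproduce this kind of text-to-image example, the high-level `StableDiffusionPipeline` from diffusers is enough; here is a minimal sketch, assuming a CUDA GPU and the v1-4 checkpoint (the output file name is a placeholder):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the full text-to-image pipeline (VAE + U-Net + CLIP text encoder + scheduler)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = ("A fire breathing cat fighting a knight dog "
          "taken by a wildlife professional photographer")
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("cat_vs_knight_dog.png")
```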

  • Image generation from text: the text embedding is the average of the two prompt embeddings.
    Prompt 1 : “A Dog”
    Prompt 2: “A Bird”
    Latent representation:

Generated image:
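One way to reproduce this prompt-mixing example is to encode both prompts with the pipeline's CLIP text encoder, average the embeddings, and pass the result via the `prompt_embeds` argument; note that `prompt_embeds` is only available in more recent diffusers releases, so treat this as a sketch under that assumption.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

def clip_embed(text):
    # Tokenize and encode a single prompt with the pipeline's CLIP text encoder
    tok = pipe.tokenizer(text, padding="max_length",
                         max_length=pipe.tokenizer.model_max_length,
                         truncation=True, return_tensors="pt")
    return pipe.text_encoder(tok.input_ids.to(pipe.device))[0]

# Average the embeddings of the two prompts and condition the denoising on the mix
mixed = (clip_embed("A Dog") + clip_embed("A Bird")) / 2.0
image = pipe(prompt_embeds=mixed, num_inference_steps=50).images[0]
image.save("dog_bird_mix.png")
```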

  • Image generation from text: Using a reference image
    Reference image:

Prompt: “A colorful dancer, nat geo photo”
Latent representation:

Generated image:
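Image generation guided by a reference image maps to diffusers' `StableDiffusionImg2ImgPipeline`; a minimal sketch is below, where the reference file name and the strength value of 0.75 are placeholders (older diffusers versions named the image argument `init_image` instead of `image`).

```python
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("reference_dancer.png").resize((512, 512))  # placeholder reference image

# strength controls how much the reference image is noised before denoising:
# lower values stay closer to the reference, higher values follow the prompt more.
image = pipe(prompt="A colorful dancer, nat geo photo",
             image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("dancer_img2img.png")
```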

  • In-painting: Replacing an object with another
    Original image and the mask:

Prompt: “A cat on a bench”
Generated image:
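For in-painting, a dedicated checkpoint such as `runwayml/stable-diffusion-inpainting` with `StableDiffusionInpaintPipeline` works well; in the sketch below the image and mask file names are placeholders, and white pixels in the mask mark the region to be replaced.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = load_image("bench.png").resize((512, 512))       # original image (placeholder path)
mask_image = load_image("bench_mask.png").resize((512, 512))  # white = region to repaint

# The masked region is re-generated to match the prompt; the rest of the image is kept.
image = pipe(prompt="A cat on a bench",
             image=init_image, mask_image=mask_image).images[0]
image.save("cat_on_bench.png")
```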

For more details on the code execution steps, see [33, 35, 36, 37].

Conclusion

In conclusion, image generation techniques have improved a lot in the last few years. Stable Diffusion, in particular, generates good, aesthetically pleasing images. Using these AI models, numerous creator tools are being developed every day that make image and video editing much faster. However, these models can only generate images. That limitation is also being addressed: Meta AI released the first text-to-video model, followed by Google with its own text-to-video model. We can now look forward to open-source versions of these models, which will allow creators to build exciting products and applications.

References:

  1. Tweet: video 1
  2. Tweet: video 2
  3. Tweet: commerce 1
  4. Tweet: design
  5. Tweet: brain dataset
  6. https://github.com/sharonzhou/long_stable_diffusion
  7. https://replicate.com/tommoore515/material_stable_diffusion
  8. Tweet: texture
  9. Tweet: video edit
  10. Tweet: commerce 2
  11. Tweet: commerce 3
  12. https://github.com/basujindal/stable-diffusion
  13. https://github.com/huggingface/diffusers/pull/399
  14. Reddit: M1 Mac
  15. Tweet: M1/M2 Mac
  16. Tweet: Collage making
  17. https://lexica.art/
  18. https://github.com/AUTOMATIC1111/stable-diffusion-webui
  19. Tweet: video 3
  20. Tweet: image edit 1
  21. https://github.com/sd-webui/stable-diffusion-webui
  22. Tweet: image edit 2
  23. https://github.com/hlky/sd-enable-textual-inversion
  24. https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion
  25. https://github.com/lkwq007/stablediffusion-infinity
  26. https://arxiv.org/pdf/1503.03585.pdf
  27. https://arxiv.org/pdf/2006.11239.pdf
  28. https://arxiv.org/pdf/2011.13456.pdf
  29. https://arxiv.org/pdf/2112.10752.pdf
  30. https://arxiv.org/abs/1801.03924
  31. https://arxiv.org/abs/1602.02644
  32. https://arxiv.org/pdf/1505.04597.pdf
  33. https://huggingface.co/blog/stable_diffusion
  34. https://www.fast.ai/posts/part2-2022.html
  35. https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/stable_diffusion.ipynb
  36. https://colab.research.google.com/drive/1dlgggNa5Mz8sEAGU0wFCHhGLFooW_pf1?usp=sharing
  37. Reddit: Beginners guide
  38. https://openai.com/blog/clip/
  39. https://stability.ai/blog/stable-diffusion-v2-release
  40. https://www.assemblyai.com/blog/stable-diffusion-1-vs-2-what-you-need-to-know/
