Unlocking the Power of Stable Diffusion: A Journey from Prompt to Visual Feast

MUJ ACM


~ ACM Student Chapter

If you haven’t been living under a rock for the past year, you’ve probably heard about the recent advances in AI that have schools, corporations, and governments scrambling to keep up with all of the powerful new technology available to us. Stable Diffusion, a model designed to generate images from text prompts, has produced works capable of winning art contests. With Stable Diffusion’s growing mainstream popularity, it is vital to understand this emerging technology better. In short, Stable Diffusion is more than just a tool; it’s a catalyst for the next wave of AI-driven creativity and functionality.

EXAMPLES OF SIMILAR TEXT-TO-IMAGE MODELS

1. OpenAI’s DALL·E, DALL·E 2, and DALL·E 3 are text-to-image models that use deep learning techniques to produce digital images from “prompts”, natural-language descriptions of the desired picture.

2. Adobe Firefly uses straightforward text prompts to produce stunning, artistic images, drawing on content whose copyright has expired and is in the public domain, openly licensed content, and Adobe Stock.

3. Midjourney is a generative artificial intelligence program and service developed and hosted by Midjourney, Inc., an independent research lab based in San Francisco.

SIGNIFICANCE OF STABLE DIFFUSION

Stable Diffusion is significant because it is both accessible and simple to use. It runs on consumer-grade graphics cards, so for the first time anyone can download the model and generate images. You also have control over important hyperparameters, such as the number of denoising steps and the amount of noise applied. No further information is required for image creation, and because Stable Diffusion has a large community, there is plenty of documentation and there are plenty of how-to tutorials, making it one of the most useful tools available.

PREREQUISITES FOR STABLE DIFFUSION

Stable Diffusion requires certain computing resources to work well:

  • RAM: at least 8 GB of system memory
  • Graphics card: a dedicated NVIDIA or AMD graphics card
  • GPU memory: at least 4 GB of VRAM

You must install:

  • Python
  • The model’s user interface
  • Stable Diffusion model

Below is an example of the code and the output it produces.

We have used runwayml/stable-diffusion-v1-5.

# Install the required libraries (in a notebook/Colab environment)
!pip install diffusers transformers torch

from diffusers import StableDiffusionPipeline
import torch

# Load the pretrained pipeline in half precision and move it to the GPU
model_id = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# Generate an image from a text prompt and save it to disk
prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]
image.save("astronaut_rides_horse.png")

# Display the saved image inside the notebook
from IPython.display import Image
Image("astronaut_rides_horse.png")
(output of the code)

WHAT ARCHITECTURE DOES STABLE DIFFUSION USE?

The main architectural components of Stable Diffusion include a variational autoencoder, forward and reverse diffusion, a U-Net noise predictor, and text conditioning.

The variational autoencoder has separate encoder and decoder networks. The encoder compresses a 512x512-pixel image into a much smaller 64x64 representation in latent space, which is easier to manipulate. The decoder converts that latent representation back into a full-sized 512x512-pixel picture.
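As a rough sketch of this compression (assuming the diffusers library and the same runwayml/stable-diffusion-v1-5 checkpoint used earlier), the VAE can be loaded on its own:

import torch
from diffusers import AutoencoderKL

# Load only the VAE component of the checkpoint
vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

# A dummy 512x512 RGB image tensor standing in for a real picture
image = torch.randn(1, 3, 512, 512)

with torch.no_grad():
    # Encoder: 512x512 pixels -> a 4x64x64 latent, scaled by the SD v1 factor
    latents = vae.encode(image).latent_dist.sample() * 0.18215
    # Decoder: latent -> full-sized 512x512 image
    reconstruction = vae.decode(latents / 0.18215).sample

print(latents.shape)         # torch.Size([1, 4, 64, 64])
print(reconstruction.shape)  # torch.Size([1, 3, 512, 512])

The 4x64x64 latent holds 48 times fewer values than the 512x512x3 pixel grid, which is why running diffusion in latent space is so much cheaper.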

Forward diffusion gradually adds Gaussian noise to an image until only random noise remains; the original image can no longer be recognised in the final noisy result. All training images undergo this process. At inference time, forward diffusion is only needed for image-to-image conversions.
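As a minimal illustration of the idea (plain PyTorch, with an assumed constant noise level per step rather than a real schedule):

import torch

x = torch.randn(1, 3, 64, 64)   # stand-in for an image (or latent)
beta = 0.02                      # assumed constant noise level per step

for step in range(1000):
    noise = torch.randn_like(x)
    # Each step keeps most of the existing signal and mixes in a little Gaussian noise
    x = (1 - beta) ** 0.5 * x + beta ** 0.5 * noise

# After enough steps the original content is gone: x is statistically just noise
print(round(x.mean().item(), 3), round(x.std().item(), 3))  # roughly 0.0 and 1.0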

Reverse diffusion is an iterative process that undoes the forward diffusion step by step, using learned parameters. In practice, the model is trained on billions of images and prompts so that it can generate unique visuals.

A CLIP tokenizer analyses each word in a textual prompt and embeds it as a 768-value vector. You can use up to 75 tokens per prompt. A text transformer passes these embeddings from the text encoder to the U-Net noise predictor.
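To see these shapes concretely, here is a sketch using the tokenizer and text encoder shipped with the same checkpoint (the 77-token length is the 75 usable tokens plus the start and end tokens):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="text_encoder")

prompt = "a photo of an astronaut riding a horse on mars"
tokens = tokenizer(prompt, padding="max_length", max_length=77, return_tensors="pt")

with torch.no_grad():
    # One 768-value embedding per token, ready to be passed to the U-Net
    embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(embeddings.shape)  # torch.Size([1, 77, 768])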

A latent array is a special way of representing an image: instead of using pixels (tiny colored dots), it uses numbers that describe more abstract features of the image.

(Step-by-Step working of Stable Diffusion)

APPLICATIONS OF STABLE DIFFUSION

Text-to-image generation is the most typical use: Stable Diffusion generates an image from a textual prompt. You can produce alternative images by changing the seed of the random number generator, or adjust the denoising schedule to achieve different effects.
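With the pipeline loaded in the earlier example, the seed and the number of denoising steps can be varied like this (parameter names follow the diffusers API):

import torch

# A fixed seed makes the result reproducible; change it to get a different image
generator = torch.Generator("cuda").manual_seed(42)

image = pipe(
    "a watercolor painting of a lighthouse at dawn",
    num_inference_steps=30,   # fewer steps are faster, more steps add detail
    guidance_scale=7.5,       # how strongly the image should follow the prompt
    generator=generator,
).images[0]
image.save("lighthouse_seed42.png")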

You can also use an input image together with a text prompt to produce new images based on that input; a rough sketch plus an appropriate prompt is a common example.
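A sketch of this image-to-image mode with diffusers, using a hypothetical sketch.png as the input (the strength parameter controls how much of the original image is kept):

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Your own rough drawing, resized to the model's native resolution
init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))

image = pipe(
    prompt="a detailed oil painting of a mountain cabin",
    image=init_image,
    strength=0.75,  # lower values stay closer to the sketch
).images[0]
image.save("cabin_from_sketch.png")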

You can also use it to edit and retouch photographs: load a picture into an AI editor and mask the area to be edited with an eraser brush, or use a prompt that describes the change you want to make.
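Masked editing can be sketched the same way with the inpainting pipeline (photo.png and mask.png are hypothetical files; white areas of the mask are repainted):

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB").resize((512, 512))
mask = Image.open("mask.png").convert("RGB").resize((512, 512))  # white = area to repaint

result = pipe(prompt="a bouquet of sunflowers", image=image, mask_image=mask).images[0]
result.save("edited_photo.png")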

Stable Diffusion also lets you produce short video clips and animations using tools like Deforum (available on GitHub). Another application is applying different styles to a video.

What is CLIP?

CLIP stands for Contrastive Language-Image Pre-training. It is a pre-trained model that tells you how well a given image and a text caption match. It is valuable because the pre-trained model can be used for a variety of downstream tasks; for instance, it can associate an image with a caption containing arbitrary English-language terms.

(Overview of how CLIP works during training)
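A small sketch of how CLIP scores image-caption matches, using the publicly available openai/clip-vit-base-patch32 checkpoint from the transformers library and the image generated earlier:

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("astronaut_rides_horse.png")
captions = ["an astronaut riding a horse", "a bowl of soup"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# One similarity score per caption; a softmax turns them into match probabilities
probs = outputs.logits_per_image.softmax(dim=1)
print(probs)  # the first caption should score far higher than the second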

IMAGE INFORMATION GENERATOR:

Stable Diffusion has taken the AI image-generation world by storm with its image information creator, the very foundation of its impressive performance. Let’s delve deeper into this fascinating process.

The Powerhouse: U-Net and the Scheduler

The image information creator isn’t a single entity; it’s a masterful collaboration between two key elements:

U-Net Neural Network: Imagine a highly skilled artist working on a canvas. The U-Net acts similarly, but on a digital canvas within the computer’s memory. It progressively refines an image, meticulously removing noise and adding details based on the information it receives.

Scheduling Algorithm: Think of a conductor leading an orchestra. The scheduling algorithm plays this crucial role, dictating the pace and order of the U-Net’s operations during the image generation process. It ensures a controlled and gradual refinement, preventing the image from becoming distorted or nonsensical.

(Architecture of Stable Diffusion)
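The collaboration between the two can be sketched as the core denoising loop (simplified: a real pipeline would use the actual text embeddings and classifier-free guidance):

import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
scheduler = DDIMScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
scheduler.set_timesteps(30)  # the scheduler decides how many steps to take and in what order

latents = torch.randn(1, 4, 64, 64)        # start from pure noise in latent space
text_embeddings = torch.randn(1, 77, 768)  # stand-in for the CLIP text embeddings

with torch.no_grad():
    for t in scheduler.timesteps:
        # The U-Net predicts the noise present in the latents at this step
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
        # The scheduler removes a controlled amount of that noise
        latents = scheduler.step(noise_pred, t, latents).prev_sample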

IMAGE DECODER

The “image decoder” is like an artist who takes all the detailed instructions from the image information creator and paints the final picture. It runs only once, at the very end, to create the final image that we see. It works in the following way:

Input: The image decoder receives a compressed version of image information (latent space).

Process: It uses an autoencoder to transform this information back into a full-sized image.

Output: The final, detailed image that matches the text description.

It’s a crucial part of the Stable Diffusion process. Without it, we would only have a bunch of numbers in compressed form. It makes the entire process faster by letting the model work efficiently with compressed data first and only expand it into a detailed image at the end, and it ensures that the final image is accurate and of high quality.

For example, imagine you’re baking and have all the ingredients for a cake (flour, sugar, eggs, etc.). The image decoder is like the oven that bakes those ingredients into the final cake. Without the oven, you just have a pile of raw ingredients.

(Image reconstruction through latent space)

THE MATHS BEHIND STABLE DIFFUSION

The objective function is a contrastive function that adjusts the model’s weights so that correct image-caption pairs receive a high similarity score and incorrect pairs receive a low one.

Note that during training, the model is given a large number of image-text pairs at once. With a batch of 20,000 pairs, there are 20,000 × 20,000 = 400,000,000 possible pairings, of which only 20,000 are correct. For efficient processing, the similarity scores of all possible pairings are computed at once to form a 20,000 × 20,000 matrix, with the values on the diagonal representing the scores of the correct image-text pairs. The objective function can therefore be set to maximise the scores on the diagonal while minimising all the others.
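A toy version of this contrastive objective in PyTorch, with a tiny batch of 4 instead of 20,000 (the correct pairs sit on the diagonal of the similarity matrix):

import torch
import torch.nn.functional as F

batch = 4
image_embeds = F.normalize(torch.randn(batch, 512), dim=-1)  # stand-ins for CLIP image embeddings
text_embeds = F.normalize(torch.randn(batch, 512), dim=-1)   # stand-ins for CLIP text embeddings

# batch x batch similarity matrix; entry (i, j) compares image i with caption j
logits = image_embeds @ text_embeds.T / 0.07  # 0.07 is a typical temperature value

labels = torch.arange(batch)  # the correct pairs lie on the diagonal
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
print(loss.item())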

As we know, diffusion models iterate through forward and reverse processes, adding and subtracting noise at each step.

Let’s treat the image as a variable; the original image will be x_0.

A noisier version of this image is called x_t, whereas x_{t-1} denotes a less noisy one.

With this notation fixed, we can define the forward and reverse processes:

p(x_{t-1} | x_t) → reverse diffusion process

q(x_t | x_{t-1}) → forward diffusion process

(Representation of Forward Diffusion process)

The forward diffusion process can be represented mathematically by the expression:

q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) · x_{t-1}, β_t · I)

where,

N = Normal (Gaussian) distribution

x_t = the output (the noisier image)

√(1 - β_t) · x_{t-1} = the mean

β_t · I = the variance

We keep the variance bounded (within finite limits) through linear scheduling: β_t increases with t while √(1 - β_t) shrinks towards 0, which removes inconsistencies and lets us fix the value of β_t at each step in advance.

In q(x_t | x_{t-1}) = N(x_t; √(1 - β_t) · x_{t-1}, β_t · I), let us substitute α_t = 1 - β_t, so that a sample from this distribution can be written as

x_t = √(α_t) · x_{t-1} + √(1 - α_t) · ε

where ε is standard Gaussian noise.

We can do the same for reverse diffusion, which can be represented similarly:

p(x_{t-1} | x_t) = N(x_{t-1}; μ_θ(x_t, t), σ_t² · I)

Through linear scheduling we have already fixed the variance within a certain bound, ensuring that the values in the above equation are well defined.

Knowing both processes, we can now find the loss function, which is the negative log-likelihood. As with variational autoencoders, we can use the variational lower bound and simplify it to the point where we only need to predict the mean, which ultimately reduces to predicting the noise (exactly what we need):

Simplified loss: L = ||ε − ε_θ(x_t, t)||²
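A rough sketch of how this objective is computed during training, with a tiny placeholder network standing in for the U-Net ε_θ (a real implementation also feeds the timestep t to the network):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder noise predictor standing in for eps_theta; the real U-Net also receives t
model = nn.Conv2d(4, 4, kernel_size=3, padding=1)

x_prev = torch.randn(8, 4, 64, 64)   # a batch of stand-ins for x_{t-1}
beta_t = 0.02                        # one value from the linear schedule
eps = torch.randn_like(x_prev)       # the true Gaussian noise added at this step

# Forward step: x_t = sqrt(alpha_t) * x_{t-1} + sqrt(1 - alpha_t) * eps
alpha_t = 1.0 - beta_t
x_t = alpha_t ** 0.5 * x_prev + (1 - alpha_t) ** 0.5 * eps

# Training objective: predict the noise and minimise the squared error
loss = F.mse_loss(model(x_t), eps)
loss.backward()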

EMERGING TRENDS IN STABLE DIFFUSION

Stable Diffusion has become a revolutionary path to image generation, and its implementation across many industries would mean better output. Some of these fields could be:

Medical imaging: sharper X-rays and MRIs.

Entertainment: faster content creation for VR, movies, and video games.

Personalization: content tailored to your needs, always at your fingertips.

Education & training: simulations would create realistic training environments, giving trainees a feel for real, first-hand experience.

CONCLUSION

Stable diffusion is a game-changer in AI image generation, combining advanced algorithms with a deep understanding of noise reduction and pattern recognition. Think of it as a digital artist that turns chaotic noise into stunning images through a carefully controlled process.

In today’s world, where visual content reigns supreme, stable diffusion addresses critical challenges like consistency, reliability, and quality. This is vital not only for creative fields like digital art, gaming, and media production but also for practical areas such as medical imaging, scientific visualization, and autonomous systems.
