Intro to Stable Diffusion — A Game Changing Technology for Art

Robby Boney · Published in Short Bits · 7 min read · Sep 7, 2022

Stable What Now?

Over the last few months a groundbreaking technology has hit the public that I believe will have an era-changing effect on the art and graphic design industry. AI art generators such as DALL-E, Disco Diffusion, Dream Studio, Midjourney, and an increasing number of others have recently been released publicly. These models take text input and generate incredibly creative results. In this article I want to dive into Stable Diffusion (https://github.com/CompVis/stable-diffusion), an open source model by Stability AI (https://stability.ai/), and some of my experiences using it over the last few weeks.

While I am not a professional artist, I have done personal and some paid work in sketching, logo design, and album art over the past 7 years, and have learned a bit about products like Blender, Krita, Affinity Photo, Procreate, and others as a pseudo-hobbyist. From my amateur perspective, there are a number of impacts AI art generation will have on the industry.

But what is Stable Diffusion? What makes this one interesting?

Latent Diffusion

The answer to that question is latent diffusion: an algorithm for generating images (similar in spirit to GAN models) that trains on known images by iteratively introducing noise until the image is approximately pure noise. The process is then inverted at generation time: starting from text (or an image) and pure noise, the model iteratively removes noise until a complete picture is produced.

I will admit I am no expert here, though. It's a deep topic and I'm still learning a lot.

https://github.com/CompVis/latent-diffusion

Interesting, but what is the latent space? Latent space can be thought of as the compressed layers of the model that represent the input. These compressed layers are the result of wider network layers being reduced to smaller ones, often in the middle of the neural network (e.g., a U-Net) after the encoding phase. This compression discards noise and keeps only the information the model needs to reconstruct the image into a "valid" result, as far as the model is concerned. For a deeper understanding of latent diffusion I would recommend this article by Louis Bouchard (https://www.louisbouchard.ai/latent-diffusion-models/) or this one by Ekin Tiu (https://towardsdatascience.com/understanding-latent-space-in-machine-learning-de5a7c687d8d).
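To make the idea a little more concrete, here is a toy sketch of the forward "noising" process described above. This is not the actual Stable Diffusion code; the array shapes and noise schedule are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))            # stand-in for a training image in [0, 1]
noise_levels = np.linspace(0.0, 1.0, 10)   # toy noise schedule from "clean" to "pure noise"

noisy_steps = []
for level in noise_levels:
    noise = rng.standard_normal(image.shape)
    # blend: at level 0 this is the clean image, at level 1 it is essentially pure noise
    noisy = np.sqrt(1.0 - level) * image + np.sqrt(level) * noise
    noisy_steps.append(noisy)

# Generation runs in reverse: start from pure noise and let a trained model
# predict and strip the noise away step by step until an image remains.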

Latent diffusion is ultimately able to generate results like those seen below. If you watch closely you will see some results start off as noisy images before being refined into clarity; that is the effect of diffusion in progress.

https://github.com/CompVis/latent-diffusion/raw/main/assets/results.gif

Stable Diffusion

Now that we know a bit about how latent diffusion works, let's look at the open source implementation so you can start generating art. Stable Diffusion is a project available on GitHub (https://github.com/CompVis/stable-diffusion) and written in Python and PyTorch. It has two primary modes: "txt2img" and "img2img". These scripts operate as you might expect: one takes text as input and generates an image, while the other takes an image (plus text) and generates an image.

Setup

NOTE: Stable Diffusion is usable online for a low cost at Dream Studio (https://beta.dreamstudio.ai/home), but this article focuses on local use from the command line with Python.

While it's not too difficult to set up Stable Diffusion (assuming a bit of Python knowledge), here are a few suggestions to help avoid issues.

  • It's advised to have a GPU capable of Nvidia's CUDA framework to run Stable Diffusion. If you have less than 10 GB of VRAM on the device like I do (an RTX 2070 Super), there are workarounds for generating higher-resolution images; consider looking at Basu Jindal's optimized fork: https://github.com/basujindal/stable-diffusion.
  • Another memory optimization you can easily add (assuming you have PyTorch 1.12+ installed) is to set the PyTorch environment variable shown below. Simply put this at the top of the txt2img or img2img script.
import os  # the txt2img/img2img scripts may already import os; add it if not
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # cap CUDA allocation splits to reduce memory fragmentation
  • Model checkpoint (.ckpt) files can get large. Stable Diffusion's v1.4 checkpoint (sd-v1-4.ckpt) is around 4 GB, which is why it's worth storing the checkpoint file on a drive separate from your primary drive and pointing to it with alias (bash) or Set-Alias (PowerShell).
  • Use Anaconda. Anaconda or another environment manager for Python will give you the ability to spin up different environments for various forks of the project and avoid dependency conflicts.

Resolutions

Stable Diffusion can require a fair amount of memory. Below is a chart I made showing all the resolution combinations I've tried that work on my RTX 2070 Super using Basu Jindal's optimized fork; it has been a handy reference. Depending on your memory you may be able to generate higher resolutions, with or without the optimized fork or with a newer model release. Since the model seems to work best with 256 x 256 pixel increments, I have found this diagram useful.
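As a small illustration of that rule of thumb, here is a hypothetical helper (the function name and default are my own, not from the chart) that snaps a requested dimension down to the nearest 256-pixel increment before passing it to the scripts.

def snap_to_increment(pixels: int, increment: int = 256) -> int:
    """Round a requested width or height down to the nearest increment (minimum one increment)."""
    return max(increment, (pixels // increment) * increment)

print(snap_to_increment(600))   # 512
print(snap_to_increment(900))   # 768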

Modes

Stable Diffusion can be run from the command line with Python scripts or from a Gradio app, and many forks are introducing other ways to use the project, such as Docker containers. This article focuses on the Python script approach.

Images From Text

https://github.com/basujindal/stable-diffusion

The first primary use case you will likely try out is providing some text and receiving images as output. Here is a description of a few parameters:

  • prompt —the text description of your creation
  • W — the image width resolution (e.g., 512)
  • H — the image height resolution (e.g., 512)
  • seed — each seed generates the same image when run with the same settings, providing an easy way to recreate images you've already made; note that changing the prompt text or resolution will change the result
  • n_iter — the number of batches of photos run
  • n_samples — the number of photos generated per batch
  • turbo — (optional) if using Basu Jindal's optimized fork, this improves memory management for batches of ~3–4 images
  • device — (optional) specify the GPU or CPU to use on your machine; you can usually omit this or set it to "cuda" if using an RTX GPU
python .\optimizedSD\optimized_txt2img.py `
--prompt "<TEXT>" `
--H 512 --W 512 `
--seed 1 `
--n_iter 10 `
--n_samples 2 `
--ddim_steps 50 `
--turbo `
--out "<PATH>" `
--device <cuda | ....>
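If you want to explore variations while keeping every result reproducible, one approach is to script the call and vary only the seed. This is just a sketch: it assumes the optimized fork's script layout shown above, and the prompt is my own placeholder.

import subprocess

prompt = "a lighthouse at dusk, oil painting"   # placeholder prompt, not from the article
for seed in (1, 2, 3):
    # the same seed with the same prompt and settings reproduces the same image
    subprocess.run([
        "python", "optimizedSD/optimized_txt2img.py",
        "--prompt", prompt,
        "--H", "512", "--W", "512",
        "--seed", str(seed),
        "--n_iter", "1",
        "--n_samples", "2",
        "--ddim_steps", "50",
        "--turbo",
    ], check=True)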

Images From Images + Text

https://github.com/basujindal/stable-diffusion

This mode takes a path to an image of any kind (a drawing, a blank canvas, or an existing photo) plus a text prompt, and generates a new image. The script is very similar to txt2img but has a few new parameters.

  • init-img — the source image you would like to base a new image off of
  • strength — how much of the source versus newly generated content ends up in the result (values near 0 keep mostly the source, values near 1 are mostly new)
python .\optimizedSD\optimized_img2img.py `
--prompt "<TEXT>" `
--init-img "<PIC PATH>" `
--W 512 --H 512 `
--seed 1 `
--n_iter 10 `
--n_samples 2 `
--strength 0.6
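To get a feel for the strength parameter, a similar sketch can sweep several values against the same source image. The prompt and file paths below are placeholders, and the script path again assumes the optimized fork.

import subprocess

for strength in (0.3, 0.5, 0.7):
    # lower strength stays closer to the source image, higher strength diverges more
    subprocess.run([
        "python", "optimizedSD/optimized_img2img.py",
        "--prompt", "watercolor landscape",   # placeholder prompt
        "--init-img", "sketch.png",           # placeholder source image
        "--W", "512", "--H", "512",
        "--seed", "1",
        "--strength", str(strength),
    ], check=True)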

Generate With Custom Weights

https://github.com/basujindal/stable-diffusion

Basu Jindal's optimized fork also lets you weight parts of the prompt using the format <TEXT>:<WEIGHT>, which makes it easy to control how much impact each part of the prompt has on the generated image.

prompt = "cute dog:0.95 cyborg:0.85 hybrid character in style of anton fadeev"

Up-scaling Resolution

https://github.com/jquesnelle/txt2imghd

If you're using a GPU with less than 10 GB of VRAM, you will likely start wondering how to get higher-resolution versions of your images. One project that works really well, which I came across after trying multiple options, is jquesnelle's "txt2imghd". Using this project you can upscale a 768x768 image to 1536x1536, or upscale it twice to get a 3072x3072 image. The images I've generated at 3072x3072 pixels sometimes need a bit of Photoshop cleanup but overall look good enough to make a Spotify standard cover at 3000x3000 pixels.

python txt2imghd.py `
--prompt "<TEXT>" `
--img "<path to png/jpg>" `
--H 256 --W 256 `
--seed 1 `
--steps 50
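Since a double upscale lands at 3072x3072 rather than exactly 3000x3000, one quick way to finish a cover is a center crop with Pillow. This is only a sketch and the file names are placeholders.

from PIL import Image

img = Image.open("upscaled_3072.png")            # placeholder path to an upscaled result
left = (img.width - 3000) // 2
top = (img.height - 3000) // 2
cover = img.crop((left, top, left + 3000, top + 3000))   # center-crop to Spotify's 3000x3000
cover.save("cover_3000.png")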

Conclusion

Stable Diffusion is a big deal. I intend to keep exploring it and recording my research, with references, studies, and prompt examples, in this GitHub repo (https://github.com/HarmonicHemispheres/stable-diffusion-research). This post has left a lot out, as this is a quickly evolving topic, but I will revisit it in more specific depth as time moves on.

Thanks for reading and stay tuned for more!

