An introduction to Stable Diffusion

Human creativity in a few gigabytes

Cas van den Bogaard
Sogeti Data | Netherlands
7 min read · Aug 31, 2022


The past year has seen incredible advances in the field of generative AI, specifically in text-to-image models. The year 2022 saw the release of OpenAI’s DALL-E 2, quickly followed by two models published by Google: Imagen and Parti. Google has chosen not to release its models, citing concerns about biases in the generated content. Image generation with DALL-E 2 has been made available through the OpenAI website, but the model itself remains unavailable to the public.

A new challenger…

On August 22nd, StabilityAI open-sourced their model: Stable Diffusion. They have released the code to train and run the model, as well as a trained checkpoint that has seen a dataset of over 5 billion image/text pairs (LAION-5B). And the best part? It fits on consumer hardware, allowing everyone with a beefy PC to run it locally!

Stable Diffusion’s open-source nature allows for massive customizability, and running the code locally gives you the ability to generate tens or hundreds of images for free. Because of that, exploring Stable Diffusion is absolutely worth it, even though its single-image performance is not as good as DALL-E 2’s.

In this blog post I will give a high-level explanation of the model, show some of the cool things it can do, and point you to resources so that you can try it out yourself.

Diffusion models

Stable Diffusion — just like DALL-E 2 and Imagen — is a diffusion model. Diffusion models are taught to remove noise from an image. The model is fed an image with noise and is asked to produce a slightly less noisy version of that image. By repeating that process, you can turn a noisy image into a perfectly sharp picture. The model is not just guessing what is in the image: A textual description of the content is provided to the model.

Now how does a denoising tool produce images from text? By starting with nothing but noise! The diffusion model has learned what images are supposed to look like, and slowly starts creating structures in the noise. Of course, the noise that we started with has no underlying structure, but the model doesn’t know that. By guiding it using a text prompt, the model starts looking for things it expects underneath the noise. “I’m looking for a human on the beach. Maybe this bit of noise is an arm, and then there is supposed to be some sand here?” And before you know it, the model has produced a holiday picture for a non-existent human.

The trick that allows Stable Diffusion to run on relatively cheap hardware is that it has not learned to denoise actual images. Instead, it first compresses the image into something smaller, called the ‘latent space’. This does away with unnecessary information and leaves only the important parts of the image in place. The denoising then happens on that compressed version of the image. Finally, the denoised latent image is decompressed again, leaving us with that image of a surfing sloth we were asking for!

Architecture diagram for Stable Diffusion. It shows the input prompt (text embeddings created with CLIP), the diffusion model itself (a UNet) and the decompression model (the decoder of a variational autoencoder). [source]
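
To make that diagram a bit more concrete, below is a heavily simplified sketch of the same pipeline built from the individual components in Hugging Face’s diffusers library. It is a conceptual sketch rather than a production script: it omits classifier-free guidance and image post-processing, and the model id, prompt and step count are just examples.

    import torch
    from transformers import CLIPTextModel, CLIPTokenizer
    from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

    model_id = "CompVis/stable-diffusion-v1-4"  # assumes you have accepted the model license

    # The building blocks from the diagram: text encoder, UNet, VAE decoder, noise scheduler.
    tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
    text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
    unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
    vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
    scheduler = PNDMScheduler.from_pretrained(model_id, subfolder="scheduler")

    prompt = "a surfing sloth, photograph"

    with torch.no_grad():
        # 1. Turn the prompt into CLIP text embeddings.
        tokens = tokenizer(prompt, padding="max_length",
                           max_length=tokenizer.model_max_length, return_tensors="pt")
        text_embeddings = text_encoder(tokens.input_ids)[0]

        # 2. Start from pure noise in the compressed latent space (4x64x64 instead of 3x512x512).
        latents = torch.randn((1, unet.config.in_channels, 64, 64)) * scheduler.init_noise_sigma

        # 3. Let the UNet remove a little noise at a time, guided by the text embeddings.
        scheduler.set_timesteps(50)
        for t in scheduler.timesteps:
            noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample

        # 4. Decompress the denoised latents back to pixel space with the VAE decoder.
        image = vae.decode(latents / 0.18215).sample  # tensor in [-1, 1]; convert to PIL to view

In practice you would not write this loop yourself: the library ships a ready-made pipeline that wraps all of these steps, which is what the examples in the rest of this post use.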

The fun stuff, part 1: text to image

Let’s start generating some images! One thing that SD is particularly good at is generating fantasy landscapes. Perhaps you want to see what it would look like when humanity has disappeared, and nature has reclaimed our cities.

Prompt: ‘Dystopian cityscape, nature reclaiming buildings, digital art’
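
If you want to generate something like this yourself, the StableDiffusionPipeline from the diffusers library wraps the whole process in a single call. A minimal sketch, assuming a CUDA GPU and that you have accepted the model license and logged in to Hugging Face; the output file name is just an example:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "Dystopian cityscape, nature reclaiming buildings, digital art"
    # 50 denoising steps and a guidance scale of 7.5 are the usual defaults.
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("dystopian_cityscape.png")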

Or perhaps you want to create a birthday card for your sloth-loving friend.

Prompt: ‘Cute sloth blowing out candles on a birthday cake, realistic, photograph’

As you can see, we get very different images for the same prompt, because each one starts from a different bit of noise. We can also flip this around and use the same starting noise with slightly different prompts.

Prompts: ‘Portrait of a {dwarf who lives in the jungle, blue-green clothing, leafs | young dwarf who lives on the beach, blue-green clothing, shells | dwarf king, red-gold clothing, gems} . digital art, trending on artstation’

As you can see, different prompts cause a lot of changes in the small-scale structures (details in face, clothing, color scheme). However, there is still a lot of similarity in the large-scale structures (posture, subject size) per row, due to the starting noise being the same.
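
The “same starting noise” trick comes down to fixing the random seed. Reusing the pipe object from the earlier sketch, re-seeding a torch.Generator before every call gives each prompt the exact same initial noise (the seed value and file names are arbitrary):

    prompts = [
        "Portrait of a dwarf who lives in the jungle, blue-green clothing, leafs. digital art, trending on artstation",
        "Portrait of a young dwarf who lives on the beach, blue-green clothing, shells. digital art, trending on artstation",
        "Portrait of a dwarf king, red-gold clothing, gems. digital art, trending on artstation",
    ]

    for i, prompt in enumerate(prompts):
        # Same seed -> same starting noise -> similar large-scale composition.
        generator = torch.Generator(device="cuda").manual_seed(42)
        image = pipe(prompt, generator=generator).images[0]
        image.save(f"dwarf_{i}.png")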

Another trick is used in the above prompts: adding ‘trending on artstation’. This is an example of “prompt engineering”, in which you try to find words or phrases that steer the AI in the direction you want it to go. In this case, we’re asking for an image that looks like those that are popular on the ArtStation platform. That way, the AI knows that we’re not just looking for any image, we’re looking for something that a lot of people would like!

The fun stuff, part 2: image to image

We’ve seen that SD can generate images that match our prompt, while starting from nothing but noise. That’s great, but it’s not the only way that it can be used. Let’s see what happens if we don’t start with just noise, but instead give it an image to start off with.

Perhaps you really liked the composition of the third birthday sloth that we generated, but you didn’t like some of the details. We can use the same prompt as before, but this time start from the sloth image itself with just a bit of noise added on top. That bit of noise is what allows the output to differ from the original image.

Prompt: ‘Cute sloth blowing out candles on a birthday cake, realistic, photograph’, starting from previous output.

The starting image isn’t limited to a previous output of the model. We can start with any picture and ask the model to turn it into something else. Take this watercolor of a panda eating some bamboo, for example, which can be turned into a photograph. By varying the strength of the noise that we add to the image, we can control how much the original image changes.

Prompt: ‘Panda eating bamboo, photograph 4k’, Noise strength {40/50/60}%.
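
In the diffusers library this mode is available as a separate image-to-image pipeline, where the “noise strength” from the caption corresponds to the strength argument (0 leaves the input untouched, 1 behaves almost like pure text-to-image; older library versions called this argument init_image). A sketch, with the input file name as a placeholder for your own picture:

    import torch
    from PIL import Image
    from diffusers import StableDiffusionImg2ImgPipeline

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    # Placeholder file: the watercolor panda, resized to the model's native 512x512 resolution.
    init_image = Image.open("panda_watercolor.png").convert("RGB").resize((512, 512))

    prompt = "Panda eating bamboo, photograph 4k"
    for strength in (0.4, 0.5, 0.6):
        image = pipe(prompt=prompt, image=init_image, strength=strength).images[0]
        image.save(f"panda_strength_{int(strength * 100)}.png")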

Another approach is to sketch out a composition and ask SD to fill in the details. We can then chain multiple image-to-image steps, picking the outputs we like best and generating new variations of those. In each iteration it is possible to tweak the prompt or model settings. Below is the process of creating a fantastical landscape, starting with a very rough sketch.

Starting prompt: ‘A fantasy landscape. digital art, trending on artstation’, Prompt after second generation: ‘A fantasy landscape, stars in the sky. digital art, trending on artstation’
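
Chaining these steps is nothing more than a loop that feeds each output back in as the next input, optionally tweaking the prompt along the way. A sketch of that workflow, reusing the image-to-image pipe from above (the sketch file, prompts and strength value are illustrative):

    # Start from a very rough sketch of the composition you want.
    image = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))

    prompts = [
        "A fantasy landscape. digital art, trending on artstation",
        "A fantasy landscape, stars in the sky. digital art, trending on artstation",
        "A fantasy landscape, stars in the sky. digital art, trending on artstation",
    ]

    for i, prompt in enumerate(prompts):
        # Each iteration starts from the previous output; keep the variations you like best.
        image = pipe(prompt=prompt, image=image, strength=0.55).images[0]
        image.save(f"fantasy_landscape_step_{i}.png")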

Joining the fun

With SD being open source, anyone can develop their own products around it. It is only a matter of time before all kinds of services built on top of it start to appear. For now, there are a couple of ways in which you can start creating your own images:

  1. DreamStudio, a service hosted by StabilityAI themselves. The first couple of image generations are free, but afterwards you are asked to pay a fee, since you are using their GPUs.
  2. Google Colab, an online notebook environment where you can use cloud GPUs for free! The official SD notebook can be found here, but you can also write your own code to run the models.
  3. Run it locally. The model is available through a Hugging Face library called Diffusers, and through the stable-diffusion repository. These repositories work best with GPUs that have 10GB or more VRAM. For those of us with smaller GPUs, this fork reduces the VRAM required at the cost of inference speed.
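
For option 3, the Diffusers route comes down to a pip install and a few lines of Python. The snippet below is a minimal sketch; the half-precision weights and attention slicing are what make it workable on GPUs with less than 10GB of VRAM:

    # pip install diffusers transformers accelerate
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4",
        torch_dtype=torch.float16,   # half precision roughly halves the memory footprint
    ).to("cuda")
    pipe.enable_attention_slicing()  # lower peak VRAM usage, at some cost in speed

    image = pipe("Cute sloth blowing out candles on a birthday cake, realistic, photograph").images[0]
    image.save("sloth_birthday.png")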

Responsible use

Thanks to these generative AI models, it’s incredibly easy to create interesting images. However, that also introduces a whole host of problems, some of which are more subtle than others.

Of course, bad actors might use these models to create problematic content such as ‘deep fakes’, or images that are deliberately offensive or discriminatory. SD includes an offensive-content filter by default, but it is easily disabled by those who want to do so.

Tools can always be used to do harm. There are also more subtle issues, though, which well-meaning users might skip over. Biases in the training dataset will lead to biased generated content, such as stereotypical imagery. Using the results of these models without thought can reinforce those negative stereotypes. OpenAI tries to circumvent this issue by silently adjusting prompts to force diversity in the generated images. This works to an extent, but only for those cases of bias that the developers have programmed in. With SD, no such safeguards exist, and the onus is on the user to consider what impact their creations will have. Not using it to do harm is the gist of the licensing terms, so adhere to that and you’ll be good to go!
