Disco Diffusion: Comparing ViT-B-32 weights (Part 1)

Adi
4 min read · Jul 15, 2022


[Update 2022-07-21: I wrote part 2, comparing weights without the secondary_model]

The open_clip project from mlfoundations provides a number of CLIP models trained on a variety of datasets. I was interested to see how much of an impact the choice of weights had on the final image produced with Disco Diffusion.

Here I focus on ViT-B-32, which is a Vision Transformer with around 86m parameters. The different weights I looked at were:

  • openai
  • laion2b_e16
  • laion400m_e31
  • laion400m_e32

as well as these quickgelu variants:

  • openai-quickgelu
  • laion400m_e31-quickgelu
  • laion400m_e32-quickgelu
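For anyone who wants to load these weights directly, here is a minimal sketch of how the seven variants could be enumerated. It assumes open_clip's naming convention, where the quickgelu variants are exposed as a separate model name (`ViT-B-32-quickgelu`) that reuses the same pretrained tags. The helper names `weight_variants` and `load_variant` are my own and are not part of Disco Diffusion or open_clip.

```python
from typing import List, Tuple

# Pretrained tags taken from the two lists above.
BASE_TAGS = ["openai", "laion2b_e16", "laion400m_e31", "laion400m_e32"]
QUICKGELU_TAGS = ["openai", "laion400m_e31", "laion400m_e32"]


def weight_variants() -> List[Tuple[str, str]]:
    """Return (model_name, pretrained_tag) pairs for all seven variants."""
    pairs = [("ViT-B-32", tag) for tag in BASE_TAGS]
    # Assumption: quickgelu variants live under their own model name.
    pairs += [("ViT-B-32-quickgelu", tag) for tag in QUICKGELU_TAGS]
    return pairs


def load_variant(model_name: str, pretrained: str):
    """Load one variant. Requires the open_clip package and downloads weights."""
    import open_clip  # imported lazily so the listing works without it installed

    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained
    )
    return model.eval(), preprocess


if __name__ == "__main__":
    for name, tag in weight_variants():
        print(f"{name:20s} {tag}")
```

Calling `load_variant("ViT-B-32", "laion2b_e16")` should fetch the laion2b weights, provided that tag is available in the installed open_clip version.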

I ran Disco Diffusion with mostly standard settings:

  • 250 steps
  • image size of 512x384

And the following prompt: “A beautiful matte painting of a steampunk Kashin elephant in the style of Max Ernst at dawn, soft light shining.”

Elephant Celebes — Max Ernst

I also chose the following values for clip_guidance_scale: 1, 10, 100, 1000, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, and 45000
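Taken together, the weights, settings and scale values describe a grid of 7 × 13 = 91 runs. The sketch below enumerates that grid. The dictionary keys mirror parameter names from the Disco Diffusion notebook (`steps`, `width_height`, `clip_guidance_scale`, `text_prompts`), but `build_runs` and the `weights` key are hypothetical conveniences of mine, and this is not necessarily how the runs were actually launched.

```python
from itertools import product

# Labels for the seven weight variants, as written in the article.
WEIGHTS = [
    "openai", "laion2b_e16", "laion400m_e31", "laion400m_e32",
    "openai-quickgelu", "laion400m_e31-quickgelu", "laion400m_e32-quickgelu",
]
SCALES = [1, 10, 100, 1000, 5000, 10000, 15000,
          20000, 25000, 30000, 35000, 40000, 45000]
PROMPT = ("A beautiful matte painting of a steampunk Kashin elephant "
          "in the style of Max Ernst at dawn, soft light shining.")


def build_runs(weights=WEIGHTS, scales=SCALES):
    """Return one settings dict per (weights, clip_guidance_scale) pair."""
    return [
        {
            "weights": w,                  # hypothetical key, for bookkeeping
            "clip_guidance_scale": s,
            "steps": 250,
            "width_height": [512, 384],
            "text_prompts": {0: [PROMPT]},
        }
        for w, s in product(weights, scales)
    ]


if __name__ == "__main__":
    runs = build_runs()
    print(len(runs), "runs")
```

Iterating with `product` keeps all thirteen scales for one set of weights together, which matches a matrix with one row per set of weights.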

Below is a matrix of all the test runs. I have ordered the weights by increasing quality of output (in my opinion).

Comparison of different weights for ViT-B-32

You can see the full-sized image here (160 MB).

Unsurprisingly, at low clip_guidance_scale values the images were not at all related to the prompt. I’m curious, though, why all variants agreed on the same image at a value of 1 and showed only slight variations at 10.

I think the best images were produced at a clip_guidance_scale value of around 35000.

Let’s take a closer look:

quickgelu-laion400m_e31 and quickgelu-laion400m_e32

quickgelu-laion400m_e31
quickgelu-laion400m_e32

The outputs were quite similar, with e32 seemingly slightly more coherent, although barely any features in these images resemble an elephant. There is a suggestion of dawn in the background, but the colours and textures don’t really make sense.

quickgelu-openai and openai

quickgelu-openai
openai

The openai weights and their quickgelu equivalent produce very similar images. They both feature some sort of creature, and the lighting looks like the sun might be rising. Overall, better than the previous two but not by much.
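For context on the name: as far as I understand it, "quickgelu" refers to the QuickGELU activation, x · sigmoid(1.702x), which OpenAI's CLIP uses as a cheaper approximation of the exact GELU. The sketch below only compares the two activation functions numerically. It assumes that this activation is what distinguishes the variants, and it says nothing about how the weights themselves were trained.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def quick_gelu(x: float) -> float:
    """QuickGELU approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + math.exp(-1.702 * x))


def max_abs_gap(lo: float = -6.0, hi: float = 6.0, steps: int = 1201) -> float:
    """Largest absolute difference between the two on an evenly spaced grid."""
    xs = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(gelu(x) - quick_gelu(x)) for x in xs)


if __name__ == "__main__":
    print(f"max |gelu - quick_gelu| on [-6, 6]: {max_abs_gap():.4f}")
```

The two curves stay close but are not identical, so weights trained with one activation are generally meant to be run with that same activation.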

laion400m_e31 and laion400m_e32

laion400m_e31
laion400m_e32

e31 seems to have smoother shading than e32, and I also prefer its sun rays. Neither variant succeeds in producing a creature that looks anything like an elephant. e31 does have a slight edge in that the body shape of the creature is closer to the round boiler shape of Ernst’s elephant.

laion2b_e16

laion2b_e16

My favourite: the sun is rising in the background, the colours are golden, and the elephant has a round body, a trunk, tusks and at least two legs. There is an unwanted golden elephant in the clouds, but to me laion2b takes the prize by a long shot.

1024x768

CLIP models are often sensitive to image dimensions. I tried the same settings and prompt but changed the output image size to 1024x768. Here are the outputs of openai, laion400m, and laion2b with a clip_guidance_scale of 30000.
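As a quick sanity check on the two sizes, the sketch below confirms that 1024x768 keeps the same 4:3 aspect ratio as 512x384 while quadrupling the pixel count. It also assumes, from my reading of the Disco Diffusion notebook, that each side is rounded down to a multiple of 64; the helper names are mine.

```python
from fractions import Fraction


def round_to_multiple(side: int, multiple: int = 64) -> int:
    """Round a side length down to the nearest multiple (assumed DD behaviour)."""
    return (side // multiple) * multiple


def aspect_ratio(width: int, height: int) -> Fraction:
    """Aspect ratio as an exact fraction, e.g. 4/3."""
    return Fraction(width, height)


def pixel_ratio(large, small) -> Fraction:
    """How many times more pixels the large size has than the small one."""
    return Fraction(large[0] * large[1], small[0] * small[1])


if __name__ == "__main__":
    small, large = (512, 384), (1024, 768)
    print("aspect ratios:", aspect_ratio(*small), aspect_ratio(*large))
    print("pixel ratio:", pixel_ratio(large, small))
    print("rounded sides:", [round_to_multiple(s) for s in small + large])
```

Both sizes are already multiples of 64, so under that assumption neither would be altered by the rounding.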

openai

I think openai does far better at this larger size. The textures are great, the colours are soft, and there are some forms in the image. Unfortunately, the only thing that looks at all like an elephant is the strange tusk in the background in the red light.

laion400m

I’m not quite sure what laion400m was getting at. It could be partway to some artwork, but it doesn’t really hang together. There is a rising sun in the background, but the rest is hard to read.

laion2b

Again, I think that laion2b produced the best quality output. There are once more elephants in the clouds, which is a little surreal, but overall this is the clearest image by far.

Conclusion

Objectively, none of the images adequately represented the prompt, but my goal was not to find the best image but rather to explore the different variants of ViT-B-32.

Two things worth noting:

  1. I am probably going to choose laion2b over openai in future. While openai manages to produce great-looking textures at 1024x768, one gets the feeling that elephants don’t feature much in the training data, and Max Ernst certainly doesn’t.
  2. Output resolution matters, a lot. Bear that in mind if you’re struggling to produce the image that you want.

Full resolution images and partials are available here: https://drive.google.com/drive/folders/16K0km2eOYDiHPB8F9qgfc05A6pIQ_Tmb?usp=sharing


Adi

Data nerd. Dabbler in data journalism. Coder. Full-time data investigator.