Disco Diffusion: Comparing ViT-B-32 weights (Part 2)

4 min read · Jul 21, 2022

In Part 1, I compared different weights for the ViT-B-32 CLIP model. There I concluded that laion2b_e16 gave the most coherent results, followed by laion400m_e31/e32 and openai. That study used the secondary model. Here I repeat the same methodology as Part 1, except that the secondary model is switched off.

I used standard settings as per discoart 0.7.4. You can find the complete settings here.

As with part 1, the prompt I used was:

A beautiful matte painting of a steampunk Kashin elephant in the style of Max Ernst at dawn, soft light shining.
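For anyone who wants to reproduce the comparison, here is a minimal sketch of how the grid of runs could be set up. It is not my exact script. The `ViT-B-32::<weights>` spelling and the parameter names passed to discoart's `create` are assumptions to check against the discoart documentation, `build_runs` is a hypothetical helper, and the scale list only covers the values named in this post (the study also went above 1000).

```python
from itertools import product

PROMPT = (
    "A beautiful matte painting of a steampunk Kashin elephant "
    "in the style of Max Ernst at dawn, soft light shining."
)

# Weight names come from the post; the "ViT-B-32::<weights>" spelling is an
# assumption about how discoart expects CLIP models to be named.
WEIGHTS = ["openai", "laion400m_e31", "laion2b_e16"]

# Only the scales named in the post; the study also went above 1000.
SCALES = [1, 10, 100, 1000]


def build_runs(weights=WEIGHTS, scales=SCALES, prompt=PROMPT):
    """Return one dict of keyword arguments per (weights, scale) pair."""
    return [
        {
            "text_prompts": [prompt],
            "clip_models": [f"ViT-B-32::{w}"],
            "clip_guidance_scale": scale,
            "use_secondary_model": False,  # the one change from Part 1
        }
        for w, scale in product(weights, scales)
    ]


if __name__ == "__main__":
    for run in build_runs():
        print(run["clip_models"][0], run["clip_guidance_scale"])
        # With discoart installed and a GPU available, each run could be
        # started with something like:
        #   from discoart import create
        #   create(**run)
```

Everything else stays at the defaults, so the only things that vary between images are the weights and the clip_guidance_scale.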

Here are the results:

Comparing different weights for the ViT-B-32 clip model

Let’s break it down:

At very low clip_guidance_scales (1 and 10), we get an image which is totally unrelated to our prompt. All weights return approximately the same image of a duck.

At a clip_guidance_scale of 100, laion2b and openai are surprisingly similar. Of the three images, laion400m looks like it presents the best depiction of an elephant.

laion400m with a clip_guidance_scale setting of 100

Looking at the laion2b results from a clip_guidance_scale of 1000 and higher, the round shape of Max Ernst’s Celebes elephant starts to appear. It isn’t perfect, but it is clearly moving in the direction of a rounded creature rather than a traditional elephant. In terms of coherence, we can see two legs and a trunk, and possibly a tail.

laion400m also attempts to create a round-bodied elephant. Its image feels more like an illustration, whereas laion2b’s, with better settings, might resemble a 3D render.

laion400m’s elephant parts are less convincing. The trunk ends in a sharp hook. There are only three legs, all bending backwards. The elephant also looks like it’s flying, whereas laion2b’s elephant is firmly anchored to the ground.

The openai weights flummoxed me. While the images produced by the laion models converged to a single image as the clip_guidance_scale was increased, openai produced almost completely unrelated images at every setting. It has some cool ideas, like an elephant walking towards a building,

or an elephant in a field.

At higher levels it ditches the idea of an elephant altogether and begins to create monsters.

In some cases we don’t see the entire creature in the frame:

In some of the images, most of the elephant’s parts are visible (legs, tail, and trunk), which is good to see.
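I judged convergence by eye, but the observation that the laion images settle on a single image while the openai ones keep jumping could be checked numerically. The sketch below is one hypothetical way to do it, not something I ran for this study: it takes the images for one set of weights in order of increasing clip_guidance_scale, measures how much each one differs from the next, and asks whether those differences shrink. The function names and the toy pixel data are invented for illustration.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized pixel sequences."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def step_differences(images):
    """Difference between each image and the next one in scale order."""
    return [mean_abs_diff(a, b) for a, b in zip(images, images[1:])]


def is_converging(images):
    """True if each step changes the image no more than the previous step."""
    diffs = step_differences(images)
    return all(later <= earlier for earlier, later in zip(diffs, diffs[1:]))


# Toy 4-pixel "images", invented purely to exercise the functions.
settling = [[0] * 4, [80] * 4, [100] * 4, [104] * 4]
jumping = [[0] * 4, [20] * 4, [200] * 4, [10] * 4]

print(step_differences(settling))  # [80.0, 20.0, 4.0]
print(is_converging(settling), is_converging(jumping))  # True False
```

Raw pixel difference is a crude measure and would miss two images that show the same subject in different positions, so treat it as a starting point rather than a verdict.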

Conclusion

Overall, the results without the secondary model are head and shoulders above the results in part 1 with the secondary model. This isn’t surprising but worth noting. Every set of weights is improved.

As with part 1, I’m going to go with openai, laion400m, and laion2b in order of increasing preference. Some of openai’s images get an honourable mention but its lack of stability would make it hard to predict what a change in clip_guidance_scale will achieve.

As always, we might not be able to generalise these results across all settings. Another study with different settings might reach a different conclusion.

Original Images

You can see the full-resolution images here.

Written by Adi