Disco Diffusion: Model Scheduling

Adi
11 min read · Jul 12, 2022

[update 2022–07–15 — Added additional model combinations in the additional results section at the bottom of the blog]

If you’re one of the thousands of enthusiasts playing with AI-generated art, you’ve probably heard of DALL·E 2, Midjourney, NightCafe Studio, and the many other commercial art-generation services.

There is, however, another world of open source tools largely powered by Google Colab or expensive home set-ups. The tool I’ve been experimenting with over the past few weeks is called Disco Diffusion (DD). In contrast to the polished commercial offerings with a handful of parameters to adjust, DD gives you access to dozens of dials to set, knobs to pull and toggles to… er, toggle. The amount of fine-tuning possible is mind-boggling.

While much has been written about prompt engineering, there are fewer guides on how to tune your settings to get the image that you want. My favourite is Ethan Smith’s “A Traveler’s Guide to the Latent Space”, a labour of love and a definite must-read if you are experimenting with DD.

He and a few others have painstakingly explored various combinations of settings and how they affect your final output. To give you a sense of how much work is involved, suppose you want to test the interaction between two parameters, say clamp_max (used to balance how excited the algorithm gets) and skip_steps (a way to manage how much noise is introduced into your images at the early stages of generation). Let’s choose a few values to test: clamp_max (0.05, 0.1, 0.2, 0.3) and skip_steps (0, 10, 30, 50).

To test each combination you’ll need to run the algorithm 16 times while keeping all the other parameters constant. Depending on your settings, the GPU you have to process the image, the weather and the time of day, each generation can take between 5 minutes and an hour. My average image generations run for around 25 minutes. That’s almost 7 hours of running time. Of course, the original prompt may have a dramatic impact on this experiment, so you should try at least two different prompts, e.g. one describing a landscape and another a figure. 14 hours. You may want to choose another random seed with the same settings to validate whether what you’re seeing is really a trend or only a coincidence. 28 hours.
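
As a rough sketch of the bookkeeping (the 25-minute average and the grid values are the ones mentioned above; nothing here is DD-specific):

```python
from itertools import product

# Rough bookkeeping for the parameter sweep described above (illustrative only).
clamp_max_values = [0.05, 0.1, 0.2, 0.3]
skip_steps_values = [0, 10, 30, 50]
prompts = 2   # e.g. one landscape, one figure
seeds = 2     # a second random seed to check the trend isn't a fluke

grid = list(product(clamp_max_values, skip_steps_values))  # 16 combinations
total_runs = len(grid) * prompts * seeds                   # 64 runs
avg_minutes = 25                                           # my observed average
print(f"{total_runs} runs ≈ {total_runs * avg_minutes / 60:.0f} hours of GPU time")
# 64 runs ≈ 27 hours, close to the ~28 hours estimated above
```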

An image from one of Ethan Smith’s experiments.

Bear in mind, computer time is not free … well mostly not. Google Colab offers a free version, but it’s pretty limited and not at all useful for this sort of experiment. For what you get, their Pro+ product is super cheap at $50 / month, but there is absolutely no guarantee that you will have a GPU waiting for you when you want it. If you use it too much then you might end up in GPU jail, i.e. Colab basically tells you that you’ve been hogging the machines and you should give someone else a turn. My recent experience in jail only lasted 12 hours. Frustrating as that was, I’ve heard reports of days and weeks.

To keep it interesting, Google doesn’t actually tell you what your limit is, or that you should slow down, or even how long until you’re out on parole. They simply flip the switch, leaving you to play the Chrome dinosaur game for hours while you wait for the Google bouncer to let you back into the party. To be fair, you’re getting access to some expensive hardware. The GPU I tend to get is the Nvidia Tesla P100, which Amazon is advertising for $1,000 right now, although some have mentioned spotting A100s, which go for around $12,000.

If you want more reliability, you might want to spring for your own RTX 3090 for around $2,000. You can even rent someone else’s GPU on vast.ai. You bid for access to a GPU, either one lying underutilised in a data centre or one in a teenager’s gaming box. If you’re the highest bidder, you’re welcome to use their hardware, electricity and bandwidth to process your images. As I understand it, though, if you are outbid, your process will be kicked off to make way for someone with deeper pockets. At this moment, the cheapest GPU (not necessarily the most cost-efficient) costs just over $0.08 per hour. Using that GPU full-time for a month will cost around $60. Of course, if you’re paying bottom-dollar prices, expect the supply and demand curve to be against you.

vast.ai — a snapshot of some GPUs for rent.

One benefit of an established data centre such as Google’s is that they purchase carbon offsets. 24 hours of compute time on a P100 produces 3.42 kg of carbon, equivalent to driving 13.2 km in an average car according to this calculator.

Whatever your poison, you’re going to need to consume fossil fuels in order to generate that weird elderly clock monster walking with a cane, or the colourful alien desert landscape reminiscent of Tatooine or Arrakis.

You can see some more of my images here: https://www.deviantart.com/desertplanet.

Experimenting with model schedules

Following in others’ footsteps, here is my small contribution to experimenting with settings in order to coax the algorithm into generating something resembling what you have in mind. One of the important toggles you can play with is which models to use in your image generation. I won’t explain how the models work here; for that it’s best to go to Ethan’s guide. In short, each model will impact the style of your image in different ways. You can use them individually or in combination. Which combination will work for your image still requires a little black magic and a lot of trial and error.

Until recently, your only choice was which models to use. In the past few days I have been testing a new feature called model scheduling (clip_modules_schedules in the algorithm). This lets you schedule at which stage of the process a particular model is enabled or disabled. For instance, use RN50 for the first 20% and then ViT-L-14 for the remaining time.

This is an interesting new development which could give us even more control over the completed image (or, at the very least, a few orders of magnitude more settings to explore). For others interested in this new feature, below is a study testing schedules using RN50 and ViT-L-14 in various combinations.

For those who care, here are my settings (except clip_modules_schedules which I will be varying).

Settings for the experiment

My prompt is:

A rampaging African Elephant baring its tusks and raising its trunk. Steampunk. Trending on artstation, Greg Rutkowski.

The experiment is as follows.

Run the RN50 model for the first portion of the steps and then switch it off, and run ViT-L-14 from some point onward until the end of the generation.

My code looks something like this; the important bit is where the model schedule is defined.
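
Below is a minimal sketch of that cell, assuming Disco Diffusion’s usual “[value]*steps” schedule-string format; the dictionary keys and exact syntax are illustrative rather than lifted verbatim from the notebook.

```python
steps = 1000  # total diffusion steps in the run

# s1: RN50 is enabled for the first s1 steps, then switched off.
# s2: ViT-L-14 is enabled for the last s2 steps, i.e. it stays off
#     for the first (1000 - s2) steps.
s1 = 400
s2 = 800

# Assumed format: one boolean schedule string per CLIP model.
clip_modules_schedules = {
    "RN50":   f"[True]*{s1} + [False]*{steps - s1}",
    "ViTL14": f"[False]*{steps - s2} + [True]*{s2}",
}
```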

For example, if s1 is set to 400 and s2 to 800, RN50 runs for the first 40% and is then switched off, while ViT-L-14 stays off for the first 20% and is switched on for the remaining 80% of the run. Using this, we can test every combination of running these two models together. Note that s1 + s2 must be at least 1000; otherwise there will be a portion of the run with no model enabled.
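
For reference, here is a small helper (not part of the notebook) that lists the (s1, s2) pairs satisfying that constraint, assuming the 200-step increments used in the results below.

```python
# Enumerate all (s1, s2) pairs with s1 + s2 >= 1000, in 200-step increments.
total = 1000
values = range(0, total + 1, 200)

pairs = [(s1, s2) for s1 in values for s2 in values if s1 + s2 >= total]
print(len(pairs), pairs[:5])
# 21 [(0, 1000), (200, 800), (200, 1000), (400, 600), (400, 800)]
```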

Note, I’m not looking at generating a perfect image but rather testing the effect of scheduling models at different times.

Let’s have a look at (0, 1000), i.e. run only ViT-L-14 and keep RN50 switched off, as well as (1000, 0), i.e. RN50 without ViT-L-14.

(0, 1000) — only ViT-L-14
(1000, 0) — only RN50

Notice that the elephant in (0, 1000) is pretty well-formed. The elephant isn’t charging and it looks a little weird, but you can recognise it as an elephant. (1000, 0) looks nothing like an elephant; it is perhaps a poorly-drawn cloud of smoke with a propeller attached to it. We can also compare the colours. ViT-L-14 is dark and a little grainy, whereas RN50 has smoother colours reminiscent of a Disney animation.

(0, 1000) becomes grainy only towards the end of the run. Here it is at 80% complete. There is less definition but almost no grain.

An 80% partial of (0, 1000)
(1000, 1000) — Both models switched on for the entire run

Above is the image that results from running both models from start to finish. In other words, this is the image you would have generated prior to the introduction of the model scheduling feature. We can see the influence of RN50: the colours are less saturated and the general composition is similar to (1000, 0). The figure seems to have better coherence, possibly due to the influence of ViT-L-14. The steampunk aspect of the prompt is finally coming through.

Let’s look at the sequence where RN50 is always switched on and we enable ViT-L-14 earlier and earlier.

(1000, 0) — Only RN50 is run
(1000, 200) — RN50 runs throughout and ViT-L-14 runs for the last 20%
(1000, 400) — RN50 runs throughout and ViT-L-14 runs for the last 40%
(1000, 600) — RN50 runs throughout and ViT-L-14 runs for the last 60%
(1000, 800) — RN50 runs throughout and ViT-L-14 runs for the last 80%
(1000, 1000) — Both models run from start to end.

The early influence of RN50 on the composition continues throughout the sequence. The steampunk details become increasingly intricate the longer ViT-L-14 runs. Interestingly, the background has a fantasy look to it in all of the images apart from (1000, 1000), where both models are always on. In that last image, the greenery looks more photo-realistic than in the previous images.

I find this stark transition from (1000, 800) to (1000, 1000) unexpected. The difference between these two runs is that ViT-L-14 is not enabled for the first 20% in (1000, 800) and is always enabled in (1000, 1000). I would have expected these fine details to only come through in the final steps, as the algorithm works on smaller cuts. This result suggests that the initial 20% can actually impact detail in the final image. My guess is that the cartoony feel of RN50 is set in the first few steps and carries through the rest of the run, and only activating ViT-L-14 from the start can balance that out.

Now let’s look at the impact of keeping ViT-L-14 switched on and switching RN50 on for more and more steps.

(0, 1000) — RN50 never runs
(200, 1000) — RN50 runs for the first 20%, ViT-L-14 is always on
(400, 1000) — RN50 runs for the first 40%, ViT-L-14 is always on
(600, 1000) — RN50 runs for the first 60%, ViT-L-14 is always on
(800, 1000) — RN50 runs for the first 80%, ViT-L-14 is always on
(1000, 1000) — Both models run from start to finish

There is of course a dramatic change between (0, 1000) where RN50 isn’t on at all, and (200, 1000) where RN50 is on for 20% then switched off. The pose changes from a side profile to what might almost be a front profile. This is clearly the influence of RN50. This pattern persists as we keep RN50 switched on for longer.

Let’s focus on what happens in the generation when the RN50 stops.

This is (200, 1000) at around 16% where both models are working.
And this is (200, 1000) at 25% where ViT-L-14 gets the image all to itself.
(200, 1000) from 9% — 100% complete

Above is the entire generation for (200, 1000). From 0–20% both models are working; from 20% onward only ViT-L-14 is enabled. What we see is the initial composition cemented early on, after which ViT-L-14 adds the fine background details that don’t feature in RN50-heavy generations while avoiding the fantasy-like colouring.

Here are all of the combinations of model schedules. In my opinion, for every setting of ViT-L-14, the background becomes less realistic the longer RN50 is run.

All combinations of model schedules

You can see all the full-resolution images with their partials here.

Conclusion

I think that tweaking model schedules has promise and can give us insight into how each model works. From the images above, I observe the following:

  1. RN50 has a very smooth and cartoony feel to it
  2. In contrast, ViT-L-14 opts for more realism, especially in the background. If you want more realism, run RN50 for fewer steps; for more fantasy, run it longer.
  3. ViT-L-14 tends to become grainy and it might over-saturate. Introducing RN50 counters this.
  4. The first few steps are very important. Models that are active in the first 20% literally set the scene for what is to come. This makes sense in the context of biasing overview cuts earlier in the generation, but it also impacts fine details that you would expect to only feature later on.
  5. ViT-L-14 is far better at fine details than RN50. In every sequence, the longer ViT-L-14 runs, the more detailed the steampunk brass and dials become.

Perhaps point 4 suggests that it is worth generating a large number of partials to around 20% with various model combinations, to get a sense of the final composition before committing to the remaining 80% of the image.
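
As a sketch of what that two-phase workflow could look like (do_run() here is a hypothetical placeholder for kicking off a single Disco Diffusion generation, not a real DD function):

```python
# Hypothetical two-phase workflow: preview compositions cheaply, then
# commit GPU time only to the schedule whose 20% partial looks best.
def do_run(s1, s2, stop_at=1.0, seed=42):
    # Placeholder: in practice this would start a DD run with the given
    # model schedule, stopping at the given fraction of the steps.
    print(f"RN50 first {s1} steps, ViT-L-14 last {s2} steps, "
          f"stop at {stop_at:.0%}, seed {seed}")

candidates = [(0, 1000), (200, 1000), (1000, 800), (1000, 1000)]

# Phase 1: generate ~20% partials for each candidate schedule.
for s1, s2 in candidates:
    do_run(s1, s2, stop_at=0.2)

# Phase 2: run the schedule with the best-looking partial to completion.
do_run(200, 1000, stop_at=1.0)
```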

Once you have the composition right and have generated the final image, you can fine-tune the details by slightly increasing or decreasing how long each model is enabled.

Limitations

This study is really only scratching the surface. Not only should this be tested with different models and prompts, but cut scheduling will also become even more important. This study also assumed that RN50 runs for the first x steps and then stops, while ViT-L-14 is only enabled from a later point onward. I started doing it the other way round, i.e. starting with ViT-L-14 and ending with RN50, but I encountered the ominous black-image bug, which I’m still trying to work around.

I’ll continue experimenting with other model combinations and report back in future blogs. Hit me up if you have any thoughts or suggestions.

Links to additional results

RN50/ViT-B-32
RN50/RN101
ViT-B-32/RN50

