Stable Diffusion: Trending on Art Station and other myths

Adi
12 min read · Sep 1, 2022


[Update 1: 2022–09–01 — Added a section exploring the effect of guidance scale]

[Update 2: 2022–09–01 — Added some musings about the information content in a modifier]

[Update 3: 2022–09–01 — Explored whether adding the modifier at the beginning of the prompt changes anything]

[Update 4: 2022–09–01 — Hat tip to @muskoxnotverydirty, who provided an alternative suggestion for why experimentation pre-SD was costly]

[Update 5: 2022–10–03 — I wrote a follow-up to this post here]

It’s been just over a week since the public launch of Stable Diffusion, and the derivative tools built around it already number in the dozens. Whatever you think of AI art, there is no denying that the speed of adoption is astonishing. Considering Everett Rogers’ technology adoption cycle, I think we’re nearing the end of the early adopters phase and entering the early majority. I see an increase in the number of views on beginner tutorials for the various tools, which suggests a large influx of new users. Headlines like Stable Diffusion Goes Public — and the Internet Freaks Out, while bombastic, do seem to capture the current mood.

Credit: Author unknown but first found on creativegeneralist.blogspot.com

Stable Diffusion and other guided diffusion models are largely driven by text or image prompts, although text is much more common. The prompt tells the algorithm what the user wants to see. This so-called prompt engineering is not trivial. While the algorithms understand natural language input, some tweaking is often needed. Adding the name of an artist can change your output significantly. See the images below, which use the same prompt but with slight variations:

Stable Diffusion: Fluffy white clouds (left) and Fluffy white clouds by van Gogh (right)

You can also mention a medium or style:

Stable Diffusion: A child’s drawing of fluffy white clouds (left) or Fluffy white clouds by van Gogh in charcoal (right).

Much of the discussion on the Discord servers and subreddit communities revolves around improving your prompts: which words to add and which to remove. Here is a link to one of many Google Colab notebooks that helps you create new prompts by providing various modifier words. The raw list of words can be found here.

Some examples:

Extracted from https://github.com/WASasquatch/noodle-soup-prompts
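
If you’d rather not run the notebook, the underlying idea is simple enough to sketch. Below is a minimal, hypothetical version in Python: the category lists are tiny stand-ins (the real noodle-soup-prompts tables contain hundreds of entries per category), but the mechanism is roughly this — sample a modifier from each category and append it to a base subject:

import random

# Hypothetical, heavily truncated modifier lists; the real
# noodle-soup-prompts tables are far larger
ARTISTS = ["van gogh", "alphonse mucha", "greg rutkowski"]
STYLES = ["concept art", "oil painting", "charcoal sketch"]
SITES = ["trending on artstation", "trending on cgsociety"]

def random_prompt(subject: str) -> str:
    """Append one randomly chosen modifier from each category to a subject."""
    return ", ".join([
        subject,
        f"by {random.choice(ARTISTS)}",
        random.choice(STYLES),
        random.choice(SITES),
    ])

print(random_prompt("fluffy white clouds"))
# e.g. "fluffy white clouds, by van gogh, concept art, trending on artstation"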

It’s worth understanding what is going on under the hood and why prompts are so important. These algorithms can produce a batch of dog images from the prompt “dog”, or cats from “moody cat”, because of the way they learn. Given millions of image-caption pairs harvested online, a hyper-dimensional latent space is created which maps similar images and captions close together. This simply means that the algorithms convert an image into a series of numbers and cluster like images together.
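
To make this concrete, here is a minimal sketch using the openly available CLIP model via Hugging Face transformers (Stable Diffusion v1 uses a text encoder from the same family; the exact checkpoint here is just a convenient public one). It embeds a few captions and shows that related captions land close together, measured by cosine similarity:

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a photo of a dog", "a moody cat", "a grumpy cat", "a stock chart"]
inputs = processor(text=captions, return_tensors="pt", padding=True)

with torch.no_grad():
    emb = model.get_text_features(**inputs)  # one vector per caption
emb = emb / emb.norm(dim=-1, keepdim=True)   # unit-normalise

# Cosine similarity matrix: the two cat captions should score higher
# with each other than either does with the stock chart
print(emb @ emb.T)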

Below is a 2-dimensional example of what I mean. Pictures of watches are at the top, men to the left of that, and women below. On the far right are shoes. In this diagram, an image becomes more “shoe-y” and less “man-ly” as you move from left to right.

Credit: J. Rafid Siddiqui, PhD

Of course, you can’t really do this in 2 dimensions. In reality there are hundreds of dimensions, and you can move in one dimension without changing your position in another. Say you are in the middle of a cluster of clothing images. Moving in the direction of men, you will find images that more closely resemble the types of clothes that men wear. Move toward women and you will find fewer images of ties and more of dresses and skirts.

In this way, you can perform hyper-dimensional arithmetic. A fun example of this is King - man + woman = Queen. You can explore the LAION-5B (5 billion image) dataset using this nifty CLIP retrieval tool to get a sense of how words are associated with images.
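
Reusing the model and processor from the CLIP sketch above, you can try the arithmetic yourself. The King - man + woman example originally comes from word2vec-style word embeddings, so treat this as illustrative; CLIP’s text encoder may not reproduce it exactly:

# Reusing `model` and `processor` from the previous CLIP sketch
words = ["king", "man", "woman", "queen", "car"]
inputs = processor(text=words, return_tensors="pt", padding=True)
with torch.no_grad():
    e = model.get_text_features(**inputs)
e = e / e.norm(dim=-1, keepdim=True)

target = e[0] - e[1] + e[2]      # king - man + woman
target = target / target.norm()

# Which word's embedding is closest to the target vector?
for word, vec in zip(words, e):
    print(word, round(float(target @ vec), 3))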

A sort of folk wisdom has started to develop in the AI art communities. The most common tip is to add “trending on ArtStation”. Here are a few more:

Extracted from https://github.com/WASasquatch/noodle-soup-prompts

The idea is that images on Art Station look better than the average image. If you add “trending on artstation” to your prompt, you are pulling your image in the hyper-dimensional space toward images that “look good”.

Stable Diffusion: Fluffy white clouds trending on artstation.
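
One way to see this “pulling” is to measure how far a modifier moves the prompt’s embedding. A rough sketch is below; note that Stable Diffusion actually conditions on the full sequence of per-token encodings rather than this single pooled vector, so this is only a proxy:

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

base = "fluffy white clouds"
modified = "fluffy white clouds, trending on artstation"
inputs = processor(text=[base, modified], return_tensors="pt", padding=True)
with torch.no_grad():
    e = model.get_text_features(**inputs)
e = e / e.norm(dim=-1, keepdim=True)
# 1.0 would mean the modifier did not move the prompt at all
print("cosine similarity:", float(e[0] @ e[1]))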

Prior to Stable Diffusion’s public release, it was only available to beta users on a Discord server. Users would enter a prompt and the image would be generated in Discord for all to see. Some anonymous hero scraped this data and another created this amazing tool that lets you search for prompts to see what others have done. It is really useful to learn from others when creating your own prompts.

I downloaded the database and wondered how common “trending on artstation” actually is. The complete database contains 1.5 million prompts. Of those, here is a count of how many times various “trending” modifiers have been used:

trending on artstation: 316,936
trending on cgsociety: 31,910
trending on deviantart: 19,545
trending on behance: 12,753
trending on pinterest: 4,777
trending on conceptartworld: 1,578

Yes, your calculator isn’t lying: over 21% of all prompts used some form of “trending on artstation”. Clearly that’s why Stable Diffusion produces such amazing images, right?
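
For anyone who wants to repeat the tally, here is a sketch. I’m assuming a hypothetical prompts.csv with one scraped prompt per row in a prompt column; the actual database has its own layout:

import pandas as pd

# Hypothetical layout: one scraped prompt per row in a "prompt" column
df = pd.read_csv("prompts.csv")

sites = ["artstation", "cgsociety", "deviantart",
         "behance", "pinterest", "conceptartworld"]
for site in sites:
    n = df["prompt"].str.contains(f"trending on {site}",
                                  case=False, na=False).sum()
    print(f"trending on {site}: {n}")

# Share of all prompts containing the most popular modifier
share = df["prompt"].str.contains("trending on artstation",
                                  case=False, na=False).mean()
print(f"{share:.1%} of prompts")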

As with everything, if you can, you should test for yourself. For my experiment, I found this prompt on the Stable Diffusion Discord server (apologies to the author, I couldn’t find your username):

prison interior, gloomy, dark, wide angle, humid, unreal engine, daunting, silent hill, by paul chadeisson, atmospherical, concept art, high detail, intimidating, cinematic, octane render

Here are the complete settings:

{
"ddim_eta": 0.75,
"guidance_scale": 7.5,
"init_image": "",
"init_strength": 0,
"num_batch_images": 10,
"prompt": "prison interior, gloomy, dark, wide angle, humid, unreal engine, daunting, silent hill, by paul chadeisson, atmospherical, concept art, high detail, intimidating, cinematic, octane render",
"sampler": "klms",
"samples_per_batch": 1,
"seed": 1454236489,
"steps": 100,
"width": 704,
"height": 384
}
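
If you want to approximate these settings locally, here is a minimal sketch using the Hugging Face diffusers library (not necessarily the tool I used). The klms sampler corresponds to diffusers’ LMSDiscreteScheduler, in which case ddim_eta has no effect, and note that identical seeds generally don’t reproduce identical images across different tools:

import torch
from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")
# "klms" corresponds to the LMS discrete scheduler
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

prompt = ("prison interior, gloomy, dark, wide angle, humid, unreal engine, "
          "daunting, silent hill, by paul chadeisson, atmospherical, concept art, "
          "high detail, intimidating, cinematic, octane render")

generator = torch.Generator("cuda").manual_seed(1454236489)  # fixed seed
image = pipe(prompt, guidance_scale=7.5, num_inference_steps=100,
             width=704, height=384, generator=generator).images[0]
image.save("prison_base.png")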

It produces some really interesting images:

Let’s add various trending terms to compare what impact they have on the final image.

My methodology was as follows:

  1. Choose a random seed so that the experiment is repeatable (1454236489)
  2. Check that the same random seed produces exactly the same image (it does)
  3. Create 10 images without any trending modifier
  4. Create 10 images with the following trending modifiers: artstation, concept art world, flickr, behance, deviantart, pixiv, illustrationx, cgsociety, unsplash, google images, pinterest, sketchfab, artsy, national gallery of art highlights, saatchi art (a sketch of this loop follows the list)
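
A sketch of the experiment loop, reusing the pipeline from the settings sketch above. The key detail is re-seeding the generator before every run so that the modifier is the only thing that changes:

# Reusing `pipe` and `prompt` from the earlier diffusers sketch
modifiers = ["", "artstation", "concept art world", "flickr", "behance",
             "deviantart", "pixiv", "cgsociety", "pinterest"]

for mod in modifiers:
    text = prompt if not mod else f"{prompt}, trending on {mod}"
    # Reset the seed each run so the modifier is the only variable
    generator = torch.Generator("cuda").manual_seed(1454236489)
    img = pipe(text, guidance_scale=7.5, num_inference_steps=100,
               width=704, height=384, generator=generator).images[0]
    img.save(f"prison_{mod or 'no_modifier'}.png")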

Here is a side-by-side comparison of the top modifiers:

Comparison of trending modifiers

It’s clear that the general composition and lighting of each image is pretty much the same. Let’s take a closer look.

Image comparison of four trending modifiers

There are some slight differences, but you need to look really closely to find them. Even then, it isn’t clear which image is better. The other 9 images are similar. My conclusion is that these modifiers do very little, if anything at all. Even the differences that do appear could be due to minor changes in the encoding of the prompt into the hyper-dimensional space. Perhaps adding any gobbledygook to the end of the prompt would have a similar effect.

[Update 1: 2022–09–01]

This is pretty much where the original post ended. After posting, Zealousideal_Pea4679 on r/StableDiffusion suggested that the modifiers may only make a difference at higher guidance scales. To check, I re-ran my experiment with the following guidance scale values: 2, 4, 8, 15, 20, and 25. I also added the Behance modifier for good measure.

Testing various modifiers using lower guidance scale values — Image 1 (open in new tab for full resolution)
Testing various modifiers using lower guidance scale values — Image 2 (open in new tab for full resolution)

For low guidance scales, my results don’t really change. It looks like the magic starts to happen at higher guidance scales.
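
There is a plausible mechanical explanation. Stable Diffusion uses classifier-free guidance: at every denoising step the model makes two predictions, one with the prompt and one without, and extrapolates along the difference between them. The guidance scale multiplies that difference, so a higher scale amplifies everything the prompt contributes, modifiers included. A schematic sketch of the combination step (the tensors here are random stand-ins for the U-Net’s outputs):

import torch

def classifier_free_guidance(noise_uncond: torch.Tensor,
                             noise_text: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    """Push the denoising prediction along the (text - unconditional) direction."""
    return noise_uncond + guidance_scale * (noise_text - noise_uncond)

# Random tensors standing in for the U-Net's two noise predictions
uncond = torch.randn(4, 64, 64)
text = torch.randn(4, 64, 64)
combined = classifier_free_guidance(uncond, text, guidance_scale=20.0)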

Testing various modifiers using a range of guidance scale values — Image 1 (open in new tab for full resolution)
Testing various modifiers using a range of guidance scale values — Image 2(open in new tab for full resolution)

The first thing to notice is that all images, regardless of modifier, start to increase in quality and have a much stronger cinematic feel. This is true even for images without a modifier. It’s interesting to note that the perspective becomes increasingly extreme. At a guidance scale of 4, the images seem to have an isometric-like perspective. At 15, we see a definite 1-point perspective, and at 25 we actually see the horizon at infinity (more or less).

Guidance scale of 4 (left), 15 (middle), and 25 (right) with no modifier

Examining the images more closely, it looks like the artstation modifier produces images similar to those the other modifiers produce at a higher guidance scale.

Consider Image 1 at a guidance scale of 15.

Guidance scale of 15, CG Society (left) and ArtStation (right) — Image 1

Notice that ArtStation has 4 windows, whereas CG Society has only 3. This is the same for the other modifiers.

We don’t see the same pattern in the second image, so perhaps it was a coincidence.

Guidance scale of 15, CG Society (left) and ArtStation (right) — Image 2

At a guidance scale of 20, CG Society now has 4 windows while ArtStation is already at infinity.

Guidance scale of 20, CG Society (left) and ArtStation (right)

By the time we reach 25, No modifier, ArtStation, and DeviantArt have infinite corridors while CG Society and Behance have settled on only three windows.

Behance (top left), CG Society (top middle), DeviantArt (top right), ArtStation (bottom left), No modifier (bottom right) — Image 1

For the second image, Behance looks different to the others, and to my eye, crisper and more interesting.

Behance (top left), CG Society (top middle), DeviantArt (top right), ArtStation (bottom left), No modifier (bottom right) — Image 2

So what do we make of this? I’m not really sure. At lower guidance scales, the modifiers don’t seem to make a difference. At very high guidance scales, there is barely any difference between No modifier and ArtStation. At 8 and 15, Artstation is clearly different in image 1. I quite like the 1-point perspective to infinity, but it isn’t very realistic. Images with 3 and 4 windows hit the sweet spot in my opinion.

In image 2, there isn’t as much difference between ArtStation and the rest as there is in image 1.

My main conclusions are:

  1. Guidance scale has a much bigger influence than the modifiers. For this prompt, high values are far more dramatic than lower values.
  2. The modifiers do change your image, slightly, but in unpredictable ways. If you like your image but want to slightly tweak it, then changing the modifiers might get you closer to what you want.

[Update 2: 2022–09–01]

A slight digression

@morbuto on the Stable Diffusion Discord highlighted what I mention later in the limitations section of this blog: more powerful modifiers trump less powerful ones. Octane render, for instance, is a much more discriminative modifier than Trending on Artstation. When you strip the prompt right down, a trending modifier can make a big difference.

A woman (left), A woman trending on Artstation (right). Images by @morbuto

This got me thinking about what “powerful” actually means in the context of a modifier. I think it corresponds to lower entropy in an information-theoretic sense. In other words, trending on Art Station merely implies that the image is “good”, while Octane render suggests a 3D render; it is easier to predict what an image looks like if you know it is a 3D render than if you only know it was trending on Art Station.

Whether it is useful to know this or not is unclear, but it helps me understand modifiers a little better.

[Update 3: 2022–09–01]

Modifier at the beginning or the end?

@no_witty_username on Reddit suggested that modifiers at the beginning of the prompt have more weight than those at the end. I wanted to check if this was true.

As with placing the modifier at the end of the prompt, placing it at the beginning doesn’t really have any impact when guidance scale is 8 or below.

Things start to change at around 15. Below I compare beginning and end for guidance scale = 20.

Comparing adding the modifier at the beginning and at the end (open in new tab for full resolution) — Guidance Scale: 20

If modifiers at the beginning of the prompt have more influence than those at the end, then we would expect the At start images to be more diverse than the At end images. Indeed, I think this is the case. CG Society and Behance are significantly different to the No modifier example. Art Station At start and At end are quite similar to each other, and to No modifier. DeviantArt is the same.

With my base prompt, I tend to agree with @no_witty_username. The difference might be more pronounced with other prompts with fewer competing modifiers.
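
A quick way to sanity-check the position question without rendering anything is to compare text embeddings directly; CLIP’s text encoder is order-aware, so moving a modifier should shift the embedding by a measurable amount. A sketch, using a truncated base prompt for brevity (and with the same pooled-vector caveat as before):

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

variants = [
    "prison interior, gloomy, dark",                          # no modifier
    "trending on artstation, prison interior, gloomy, dark",  # at start
    "prison interior, gloomy, dark, trending on artstation",  # at end
]
inputs = processor(text=variants, return_tensors="pt", padding=True)
with torch.no_grad():
    e = model.get_text_features(**inputs)
e = e / e.norm(dim=-1, keepdim=True)
print("no modifier vs at start:", float(e[0] @ e[1]))
print("no modifier vs at end:  ", float(e[0] @ e[2]))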

Back from the digression

So why are people using these modifiers if they don’t make much difference? Depending on your point of view, prompt engineering is either more art or more science. Right now, most people treat it as an art, so systematic testing is not widespread. There is, however, amazing work coming out all the time from people testing variations, settings, and techniques. It’s worth looking out for them.

[Update 4: 2022–09–01]

@muskoxnotverydirty on Reddit suggested this:

Before SD, people had a limited number of attempts each day, or paid out of pocket. You might’ve thought, “I don’t know if this keyword really helps, but I’ll throw it in anyway just in case.” It didn’t seem to hurt, and no one wants to waste an attempt. This signal was self-reinforcing, since you’d see others’ results use the prompt, but those people cherry-picked the best results, so you’d associate the keyword with quality. Due to the randomness of diffusion models, it’s very difficult to gauge if something works when you have limited attempts.

I think that certainly partly explains why there is a lot of untested lore out there. I moved from Night Cafe to Disco Diffusion for precisely that reason: experimentation was far too expensive. The best you could do was copy prompts that you liked and tweak them. Disco allowed me to create thousands of images and tweak the settings with each run.

Conclusion

Prompt engineering is really in its infancy; a lot more exploration is needed to figure out how to coax out the best images for a particular topic. An unfortunate complexity is that this knowledge may not even transfer between different algorithms. What works in DALL-E 2 may not work in Stable Diffusion, and vice versa.

I hope this blog makes a slight contribution to the field, or at the very least, saves you from typing the 22 characters needed to add “trending on artstation”.

PS — There is a good discussion about these findings on Reddit.

Limitations

There are a few limitations to this experiment:

  1. I only tested on Stable Diffusion. Much of this lore was born in the Disco Diffusion, Midjourney, Night Cafe, and other communities. Perhaps the modifiers work with their algorithms.
  2. I only added “trending on X” at the end of the prompt; it may be that different positioning has an impact (see Update 3 above).
  3. I only tried one prompt, albeit with 10 variations. Perhaps the modifiers work better for portraits or landscapes.
  4. It may be the case that the other modifiers in the prompt are far more powerful. Without testing, I would guess that cinematic and octane render probably have a much bigger influence on the final output. If these were removed, the trending modifiers might come into play.
  5. suspicious_Jackfruit on Reddit suggested that medium matters. Photographs, for instance, may not be trending on ArtStation. As with everything in this post, YMMV.

Extras

The complete series of images with the modifier at the start of the prompt.
