This blog is a follow-up on my previous blog of a similar name.
I have also written this blog up as a Google Colab if you prefer to follow along there with executable code.
Introduction
Prompt engineering is central to AI text-to-image generators such as Stable Diffusion, MidJourney, Disco Diffusion, or Dall-e 2. Developing good prompts is hard and we are only beginning to understand how to craft them in a way that will produce high-quality images. According to the Dream Studio prompt guide, the high-level components of a prompt are:
- The raw prompt (i.e. the subject of your image)
- Style, e.g. illustration, painting, concept art
- Artist, adding one or more artists can suggest the style of your final output.
You can then add various modifiers to these. Krea.ai has extracted hundreds of modifiers from millions of prompts. For example, their photography modifiers page includes photorealistic, 35 mm photography, cinematic, dslr, volumetric lighting, and many others.
Which modifiers should you add? How important are various terms in a prompt? You’ll often see trending on artstation, 4k, 8k, hyper-detailed, digital art, and any number of other modifiers on advanced prompts. In many cases, people don’t know and simply throw any number of (possibly) arbitrary modifiers into a prompt in the hopes that they will make a difference.
But do they?
In this Colab I created a rudimentary tool that uses CLIP to evaluate how important a particular term is in a prompt. More specifically, it checks how much the prompt changes in latent space once that term is added.
How it works
Most text-to-image generators use CLIP-guided diffusion. The CLIP part of the equation converts your prompt into a vector, i.e. a long array of numbers. That vector represents a coordinate in a high-dimensional (latent) space (think XY coordinates in a 2-d plane, but with many more dimensions). It can also be thought of as a line from the origin, i.e. (0, 0) in 2-d space, to that point.
CLIP’s party trick is to convert prompts into latent space in such a way that prompts referring to similar concepts are placed close together, whereas unrelated topics land further apart.
Consider these prompts
- henry cavill as batman by anthony van dyck, cinematic, filmic, 8 k, cinematic lighting, insanely detailed and intricate, hyper realistic, super detailed
- superman taking a bath at the beach
- japan narrow street with neon signs and a girl with umbrella wearing techwear, digital art, sharp focus, wlop, artgerm, beautiful, award winning
The first two prompts are related through the concept of superheroes, but are different from the last prompt. If CLIP does its job well, then once the prompts are converted to vectors, a good distance measure will show that the first two prompts are closer together (more similar), whereas the third prompt is far away.
Interpreting these vectors as lines from the origin, the angle between them will be small for similar concepts; as the concepts become increasingly unrelated, the angle between their vectors grows. Cosine similarity (cossim) is a similarity measure that calculates exactly that: cossim(A, B) = A · B / (|A| |B|), the cosine of the angle between the two vectors.
The cosine of the angle between two identical vectors is 1. If the angle is 90 degrees (i.e. the vectors are orthogonal), the cosine is 0. In general cossim(A, B) lies in the interval [-1, 1], but for the kinds of prompt embeddings we compare here it falls in [0, 1].
While not a perfect measure, we can use it to rank the relative importance of terms in our prompts.
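To make this concrete, here is a minimal sketch of how the measurement could be implemented in Python. It assumes the Hugging Face transformers library and the CLIP text encoder used by Stable Diffusion v1 (openai/clip-vit-large-patch14); the embed and cossim helpers are names I use for illustration, not necessarily how the Colab does it.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# assumption: the CLIP text encoder that Stable Diffusion v1 uses
MODEL_ID = "openai/clip-vit-large-patch14"
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def embed(prompt: str) -> torch.Tensor:
    """Convert a prompt into its CLIP text embedding (a point in latent space)."""
    inputs = processor(text=[prompt], return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        return model.get_text_features(**inputs)[0]

def cossim(a: torch.Tensor, b: torch.Tensor) -> float:
    """Cosine of the angle between two embeddings: 1 = identical, 0 = orthogonal."""
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()
```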
In the next few examples, we will use
henry cavill as batman by anthony van dyck, cinematic, filmic, 8k, cinematic lighting, insanely detailed and intricate, hyper-realistic, super detailed
as the base prompt. Using this prompt and the following settings on Stable Diffusion (steps: 25, sampler: Euler-a, seed: 1000), here is the generated image:
First, a sanity check
What happens if we add a nonsense term to our prompt? Let’s add abcde to our prompt and calculate the similarity to the original prompt. When calculating cossim(batman_prompt, batman_prompt_with_abcde)
we get a similarity score of 0.96. This means that the nonsense term barely changes the prompt in latent space.
What about testing two very different prompts? Let’s compare our batman prompt with
japan narrow street with neon signs and a girl with umbrella wearing techwear, digital art, sharp focus, wlop, artgerm, beautiful, award winning
cossim(batman_prompt, japan_street_prompt)
gives us a similarity score of 0.22. It’s clear that these two prompts are very far apart in latent space.
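Using the hypothetical embed and cossim helpers from the sketch above, these two checks look roughly like this (exact scores will depend on the CLIP model used):

```python
batman_prompt = (
    "henry cavill as batman by anthony van dyck, cinematic, filmic, 8k, cinematic lighting, "
    "insanely detailed and intricate, hyper-realistic, super detailed"
)
japan_street_prompt = (
    "japan narrow street with neon signs and a girl with umbrella wearing techwear, "
    "digital art, sharp focus, wlop, artgerm, beautiful, award winning"
)

# nonsense term: barely moves the prompt (~0.96 above)
print(cossim(embed(batman_prompt), embed(batman_prompt + ", abcde")))

# unrelated prompt: far apart in latent space (~0.22 above)
print(cossim(embed(batman_prompt), embed(japan_street_prompt)))
```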
Back to testing modifiers
How important are the following terms in the prompt?
- henry cavill
- as batman
- anthony van dyck
- cinematic
- filmic
- 8k
- cinematic lighting
- insanely detailed
- intricate
- hyper-realistic
- super detailed
To test how much each term changes the prompt, we use the following procedure:
- For each term:
  - Create a new prompt without that term.
  - Calculate cossim(prompt, prompt_without_term).
The result of the calculation tells us how much the prompt changes when we remove that term. The lower the number, the more dissimilar prompt is to prompt_without_term, and the more dissimilar they are, the greater the strength of the term.
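As a rough sketch, the procedure could look like the following, reusing the hypothetical embed and cossim helpers from earlier. The naive string replacement used to remove a term is my own simplification; the Colab may handle this differently.

```python
def term_strengths(base_prompt: str, terms: list[str]) -> list[tuple[str, float]]:
    """Rank terms by how far removing each one moves the prompt in latent space."""
    base_vec = embed(base_prompt)
    results = []
    for term in terms:
        # naive removal: delete the first occurrence of the term, then tidy up commas/spaces
        without = base_prompt.replace(term, "", 1)
        without = " ".join(without.split()).replace(" ,", ",").replace(",,", ",").strip(", ")
        results.append((term, cossim(base_vec, embed(without))))
    # lower similarity = bigger change = stronger term
    return sorted(results, key=lambda pair: pair[1])

terms = ["henry cavill", "as batman", "anthony van dyck", "cinematic", "filmic", "8k",
         "cinematic lighting", "insanely detailed", "intricate", "hyper-realistic", "super detailed"]
for term, score in term_strengths(batman_prompt, terms):
    print(f"{score:.2f}  {term}")
```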
As an example, we can calculate the strength of the term batman, by comparing our base prompt to this new one:
henry cavill by anthony van dyck, cinematic, filmic, 8k, cinematic lighting, insanely detailed and intricate, hyper-realistic, super detailed.
The cosine similarity score is 0.75. This means that batman is indeed an important term. Here is the list of terms in increasing order of similarity (or decreasing order of term strength):
- henry cavill: 0.68
- anthony van dyck: 0.73
- batman: 0.75
- cinematic: 0.84
- 8k: 0.88
- filmic: 0.88
- cinematic lighting: 0.91
- insanely detailed: 0.93
- hyper-realistic: 0.94
- intricate: 0.95
- super detailed: 0.98
henry cavill, anthony van dyck, and batman emerge as the most important terms. super detailed, with a similarity of 0.98, has less impact on the prompt than our nonsense term abcde. It seems it barely makes any difference at all.
Let’s check this visually to be sure. Here are the outputs with and without super detailed using Stable Diffusion (same parameters as above).
They’re not identical. Super detailed has smoothed out the furrows in Batman’s forehead. The lips are more realistic and the stare is more determined. These differences are tiny, but I suspect that our brains are attuned to noticing them on a human face. We might be less able to discern subtle differences in pictures of landscapes or architecture.
Here are the images created after removing each term ordered by increasing similarity to the baseline prompt.
When interpreting the images above, each image represents the baseline prompt minus one term. E.g. bp - henry cavill represents the prompt:
batman by anthony van dyck, cinematic, filmic, 8k, cinematic lighting, insanely detailed and intricate, hyper-realistic, super detailed
Subjectively, I think that the images above mostly match the similarity scores, with the exception of the order of the first two terms: bp - anthony van dyck seems the most dissimilar to the baseline prompt, followed by bp - henry cavill.
Changing the order
It turns out that changing the order of the terms has an effect on their relative weights. This time, let’s move super detailed to the beginning of the prompt.
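Re-running the hypothetical term_strengths helper from earlier on the reordered prompt looks something like this:

```python
reordered_prompt = (
    "super detailed, henry cavill as batman by anthony van dyck, cinematic, filmic, 8k, "
    "cinematic lighting, insanely detailed and intricate, hyper-realistic"
)
for term, score in term_strengths(reordered_prompt, terms):
    print(f"{score:.2f}  {term}")
```

Here are the new similarity scores: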
- henry cavill: 0.71
- anthony van dyck: 0.76
- batman: 0.76
- super detailed: 0.77
- cinematic: 0.86
- 8k: 0.89
- filmic: 0.89
- cinematic lighting: 0.92
- insanely detailed: 0.94
- intricate: 0.96
- hyper-realistic: 0.97
super detailed has shot up from 11th position to 4th, simply by moving it to the front of the prompt. This is what the image looks like:
Trending on artstation
Trending on artstation is one of the most commonly used modifiers. The excellent Krea.ai released a database of over 10 million prompts harvested from the Stable Diffusion Discord server. Of those, 2,566,593 (24%) contained the term artstation and 10% included trending on artstation. If it is used so often, it should be good, right?
It’s complicated.
The importance of a modifier depends on the context, the type of image, and other modifiers in the prompt. Two modifiers may contain much of the same information. Individually they may be powerful, but stacking them might not make much of a difference.
Let’s explore this with the following prompt:
Boris Johnson as Grim Reaper in a hood, portrait, highly detailed, digital painting, artstation, concept art, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
What is the relationship between artstation and concept art in this prompt?
For the base prompt, let’s remove both artstation and concept art to form:
Boris Johnson as Grim Reaper in a hood, portrait, highly detailed, digital painting, smooth, sharp focus, illustration, cinematic lighting, art by artgerm and greg rutkowski and alphonse mucha
Then compare this baseline prompt to one which only contains concept art, one that only contains artstation, and one that contains both concept art and artstation.
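A sketch of this comparison, again using the hypothetical helpers from earlier (the modifiers are re-inserted in the same position they occupy in the original prompt):

```python
base = (
    "Boris Johnson as Grim Reaper in a hood, portrait, highly detailed, digital painting, "
    "smooth, sharp focus, illustration, cinematic lighting, "
    "art by artgerm and greg rutkowski and alphonse mucha"
)
variants = {
    "concept art only": base.replace("digital painting,", "digital painting, concept art,"),
    "artstation only":  base.replace("digital painting,", "digital painting, artstation,"),
    "both":             base.replace("digital painting,", "digital painting, artstation, concept art,"),
}
base_vec = embed(base)
for name, prompt in variants.items():
    print(name, round(cossim(base_vec, embed(prompt)), 2))
```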
Here are the results:
- Concept art only: 0.82
- Artstation only: 0.80
- Both concept art and artstation: 0.77
It’s clear that both modifiers individually affect the baseline prompt, with similarity scores of 0.82 and 0.80 respectively. If there were no overlap in the semantic meaning of these two modifiers, we might expect their effects to compound, pushing the similarity score down towards 0.6. Instead, stacking the two modifiers only results in a similarity score of 0.77. In this context, concept art and artstation have similar meanings in latent space.
This can be seen more clearly in the visuals below:
Both concept art and artstation have a big impact on the original image, but there is very little difference when we add both at the same time.
Conclusion
Prompt engineering is not easy. A quick browse through a few prompt database websites such as Krea, PromptHero and Lexica reveals how subtle and technical prompts can actually be. Unfortunately, much of it is black magic. Modifiers are often added without understanding whether they will change the final image at all.
Rather than throwing everything at the wall and seeing what sticks, taking a more scientific approach may help with producing better results more reliably. With quantitative techniques, better tooling can be built to help less experienced AI artists produce better images.
Another takeaway from the examples above is that small movements in latent space may still be important, just in different ways from the stronger modifiers. While super detailed is not a strong modifier, it still changes the fine detail, which can turn an image from somewhat off to spot on.
Perhaps understanding how modifiers affect the final image will improve the prompt engineering workflow. Knowing which modifiers will affect composition vs those that will change facial features can help with incrementally crafting a prompt.
I hope that the basic approach used in this colab might form the basis for more sophisticated tools in future.
Limitations and next steps
- I’m not sure that cosine similarity is the best distance measure to use in this context (in fact, it isn’t even a true metric). Since there are no units, we can only use it to sort terms in order of importance within the current prompt. It isn’t even clear whether distances are linear, i.e. whether the difference between 0.6 and 0.7 is the same as that between 0.7 and 0.8. Perhaps information gain might be better? I would appreciate any input on this.
- A term might have a big impact on the similarity between two prompts, but it still doesn’t tell us anything about whether the final result looks better. This might not be a problem that can be completely solved given the subjective nature of beauty, but perhaps we can make some headway by filtering images that we can generally agree are flops (e.g. poor composition, low coherence, etc).
- I would like to take this analysis further and explore pairs, triples, and larger groups of terms that have similar semantic meaning (e.g. concept art and artstation).