Automatically Selecting the Best Images Generated by Stable Diffusion

Daniel Klitzke
5 min read · Jul 20, 2023


tl;dr

Analyzing your prompt-image pairs with embedding-based clustering and interactive analysis tools can significantly speed up your selection of images and prompts. Check out the following resources:

  1. Spotlight, a tool especially suitable for visualizing image-text pairs and embeddings.
  2. sliceguard, a library to determine clusters in your data where your model (Stable Diffusion) works exceptionally well or badly.
  3. Example notebook containing the code for this post.
  4. Huggingface Space containing an interactive visualization of the results.

Motivation

Recently, Stability AI released the newest version of their Stable Diffusion model, named Stable Diffusion XL 0.9. After playing around with it for a bit, I found the results quite impressive.

Selecting the best pictures generated by Stable Diffusion automatically can be achieved by embedding-based clustering and the CLIP Score metric.

However, I quickly asked myself the following question:

When generating a bunch of candidate images from a collection of different prompts, how do I select the best outputs?

Concretely, I found myself in two different settings with a similar problem:

  1. I wanted to explore a dataset of many very different prompts and check where Stable Diffusion gives acceptable results, either to find promising prompts for my use case as a starting point or to uncover new uses in the first place.
  2. I already had a concrete task in mind and drafted a dataset of several task-specific prompts. Now I wanted to select the best prompts for that task.

I quickly realized that purely manual image selection can be quite tedious, so I looked at techniques with automation potential. Even just narrowing the number of images down from thousands to hundreds would probably save me a lot of work.

Starting Point and Plan

I quickly came across this article on the Huggingface page that mentioned the CLIP Score as a quantitative metric for judging the output quality of generated images.

The CLIP Score is a reference-free evaluation metric for image captioning models. It basically measures the similarity between a text prompt and the contents of an image. (Hessel et al.)
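
For reference, the metric is essentially a rescaled cosine similarity between CLIP embeddings. In the original paper it is defined as

CLIPScore(c, v) = w · max(cos(E_c, E_v), 0)

where E_c and E_v are the CLIP embeddings of the caption and the image and w = 2.5; the torchmetrics implementation used below scales by 100 instead, but in both cases higher means a better prompt-image match.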

So, my goal was to check if the CLIP score is really suitable for selecting the best images from a large set of candidate prompts and images or at least helps narrow down the candidate set.

Therefore, I generated around 1500 images from a prompt dataset I found on the Huggingface hub and went for it.

Recipe for Image Selection on Your Data

I created a notebook that covers everything from generating the images to filtering the best ones according to CLIP Score. If you want to run this yourself, start there. If you just want to interactively explore the results, check out this Huggingface Space.

Roughly my recipe can be divided into the following steps:

  1. Calculate the CLIP Score for all prompt-image pairs to measure generation quality.
  2. Generate CLIP Embeddings to be able to calculate similarities between images (or texts).
  3. Identify clusters with an exceptionally high CLIP Score based on the embeddings.

Step 1: Calculating the CLIP Score

This is fairly easy, as the CLIP Score is implemented in the torchmetrics library.

The CLIP scores can already be used to filter out promising renderings such as this cute Capybara. (Tool: Spotlight)

This step will give you a score for each prompt-image pair that captures how well the image contents match the prompt.
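
As a minimal sketch (assuming `images` is a list of PIL images and `prompts` the matching list of strings; the checkpoint is just my example choice, not necessarily the one from the notebook), the computation with torchmetrics could look like this:

from torchmetrics.multimodal.clip_score import CLIPScore
from torchvision.transforms.functional import pil_to_tensor

# "openai/clip-vit-base-patch16" is just an example checkpoint
metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

clip_scores = []
for image, prompt in zip(images, prompts):
    # torchmetrics expects uint8 image tensors of shape (C, H, W) in the [0, 255] range
    score = metric(pil_to_tensor(image), prompt)
    clip_scores.append(score.item())

For larger datasets you would typically batch this, but the idea stays the same.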

Step 2: Calculating CLIP Embeddings

For calculating CLIP Embeddings, you can use the Huggingface transformers library: just instantiate a CLIP model and extract text and image embeddings in one go.
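
A rough sketch with transformers (again using `images` and `prompts` as above and the same example checkpoint; batching is omitted for brevity) could look like this:

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

with torch.no_grad():
    # One preprocessing pass yields the inputs for both embedding types
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True, truncation=True)
    clip_image_embeddings = model.get_image_features(pixel_values=inputs["pixel_values"]).numpy()
    clip_text_embeddings = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    ).numpy()

The image embeddings are also what gets reused as precomputed embeddings for sliceguard in step 3.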

Ordering the data by their embeddings (Project Embedding on 2D Plot) lets us find some friends for the Capybara that also have a high CLIP Score. (Tool: Spotlight)

Using tools such as Spotlight, or just applying a dimensionality reduction technique, already lets us explore the images by similarity. Coloring by CLIP Score additionally gives you a sense of which types of images seem to work especially well. You could already go through the clusters you identify visually and select promising images and prompts interactively.
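
If you want a quick do-it-yourself version of that 2D projection, a minimal sketch with the umap-learn and matplotlib packages (my choice here, not necessarily what Spotlight uses internally) could be:

import matplotlib.pyplot as plt
import umap

# Assumes clip_image_embeddings is an (N, D) array and clip_scores a list of N floats from step 1
embedding_2d = umap.UMAP(n_components=2, metric="cosine", random_state=42).fit_transform(clip_image_embeddings)

# Color each generated image by its CLIP Score to spot regions that work especially well
plt.scatter(embedding_2d[:, 0], embedding_2d[:, 1], c=clip_scores, cmap="viridis", s=10)
plt.colorbar(label="CLIP Score")
plt.title("CLIP image embeddings projected to 2D")
plt.show()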

Step 3: Automatically Find Clusters of Nice Images

Maybe you just have a large amount of data, or maybe you want to narrow down your choices a little. There are ways to automatically identify clusters of images/prompts in the data where Stable Diffusion worked exceptionally well. One of them is sliceguard, a library for identifying clusters where your model works exceptionally well or badly.

It all boils down to one call here:

from sliceguard import SliceGuard

# return_precomputed_metric simply passes the precomputed CLIP Scores through (defined in the notebook)
sg = SliceGuard()
sg.find_issues(
    df,
    ["clip_image_embedding"],
    "clip_score",
    "clip_score",
    return_precomputed_metric,
    metric_mode="min",
    min_support=2,
    min_drop=4.5,
    precomputed_embeddings={"clip_image_embedding": clip_image_embeddings},
)
sg.report()

This will look for clusters of at least two samples whose average CLIP Score is at least 4.5 higher than the mean of the whole dataset. Check the notebook for more details!

Automatic clustering algorithms help you identify similar groups in the data based on image and text embeddings. In this case, famous people eating or cooking food. (Tools: sliceguard and Spotlight)

Applying sliceguard gives you a list of data clusters that are judged exceptionally good according to the CLIP Score metric. Depending on the thresholds you set, you can adjust how much data you want to filter out and how much you want to keep for manual inspection.

Conclusion

Applying the CLIP Score and embedding-based clustering for selecting promising prompts and images used or generated by Stable Diffusion is certainly an interesting approach. However, I found that the CLIP Score metric seems to be heavily biased toward certain types of images. In particular, two cases that yield high CLIP Scores stood out:

  1. Portraits of people were judged way higher than landscapes, animals, or objects.
  2. Generated images of prominent people, who are probably heavily overrepresented in the training data and easy to associate with certain imagery, were rated even higher.

Therefore, in the notebook, I propose an extension of the approach that first identifies categories of images based on their embeddings. Those categories can then be searched for promising images one by one, which results in a greater variety of image categories being present in the final selection.
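
As a rough sketch of that idea (the number of clusters and the top-k value are arbitrary choices of mine, and the notebook may do this differently), you could cluster the image embeddings first and then pick the highest-scoring images per cluster:

from sklearn.cluster import KMeans

# Assumes df has one row per prompt-image pair with a "clip_score" column,
# and clip_image_embeddings is the matching (N, D) embedding array from step 2
df["category"] = KMeans(n_clusters=20, random_state=42).fit_predict(clip_image_embeddings)

# Keep the top 5 images per embedding-based category instead of a global top-k,
# so the final selection is not dominated by a single image type (e.g., portraits)
selection = df.sort_values("clip_score", ascending=False).groupby("category").head(5)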
