Improving Custom Diffusion on Difficult Combinations of Concepts

Wilton Lam
Crater Labs
Mar 23, 2023

Text-to-image deep learning models such as Stable Diffusion leverage advanced neural networks to generate high-quality visual content from textual descriptions, transforming the creative landscape. A user can control numerous characteristics of the generated image: someone who wants the model to depict a cat may specify its colour, size, and whether it sits indoors or outdoors, among other attributes. However, to generate an illustration of their own specific cat rather than a generic one, fine-tuning is necessary. Fine-tuning, the process of further training a pre-trained model on a specific dataset, enables more precise and contextually relevant outputs, improving the model's effectiveness across diverse applications.

Custom Diffusion is an efficient fine-tuning method for quickly teaching a text-to-image model like Stable Diffusion new concepts from just a few examples, while also allowing joint training and composition of multiple new concepts with existing ones. The method outperforms several baselines and concurrent works such as Textual Inversion and DreamBooth while remaining memory- and computationally efficient. However, Custom Diffusion struggles to compose difficult combinations of concepts, such as a dog and a cat together, and with compositions of three or more concepts. Here we propose an improvement that addresses the dog-and-cat composition.

Background

The authors of Custom Diffusion propose to update only the key and value projection matrices (Wk and Wv) of the cross-attention blocks (attn2.to_k and attn2.to_v) during fine-tuning, since these parameters have relatively higher importance than the other layers and are sufficient to teach the model a new text–image paired concept.

Diagram cited from [1]
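As a minimal sketch of this selective unfreezing, the snippet below freezes every UNet parameter except the cross-attention key and value projections. It assumes a Stable Diffusion UNet loaded through the Hugging Face diffusers library rather than the original latent-diffusion codebase used later in this article; the attn2.to_k / attn2.to_v naming is the same in both.

import torch
from diffusers import UNet2DConditionModel

# Sketch only: load the Stable Diffusion v1.4 UNet via diffusers.
unet = UNet2DConditionModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="unet"
)

trainable = []
for name, param in unet.named_parameters():
    # Keep only the cross-attention key/value projections trainable.
    param.requires_grad = "attn2.to_k" in name or "attn2.to_v" in name
    if param.requires_grad:
        trainable.append(param)

# Only the small set of unfrozen matrices is handed to the optimizer.
optimizer = torch.optim.AdamW(trainable, lr=1e-5)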

The text-to-image cross-attention layer is a type of layer used in neural network architectures for generating images from textual descriptions. It enables the model to attend to relevant parts of the input text while generating the corresponding image. Specifically, the layer computes the cross-attention between the textual input and the visual features, where the visual features represent the image at the current stage of generation. The cross-attention mechanism helps the model to focus on the most relevant parts of the text when generating specific features of the image, and vice versa. This can improve the quality and accuracy of the generated images, as the model is able to better understand the relationship between the textual input and the visual output.
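To make the roles of the query, key, and value projections concrete, here is a stripped-down, single-head cross-attention layer in PyTorch. The to_q/to_k/to_v names mirror the Stable Diffusion convention; the real layers are multi-headed and include dropout, so this is an illustrative sketch rather than the actual implementation.

import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Simplified single-head cross-attention: image features attend to text tokens."""
    def __init__(self, dim_img, dim_txt, dim_head):
        super().__init__()
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(dim_img, dim_head, bias=False)  # queries from image features (Wq)
        self.to_k = nn.Linear(dim_txt, dim_head, bias=False)  # keys from text embeddings (Wk)
        self.to_v = nn.Linear(dim_txt, dim_head, bias=False)  # values from text embeddings (Wv)
        self.to_out = nn.Linear(dim_head, dim_img)

    def forward(self, x, text):
        # x: (batch, pixels, dim_img), text: (batch, tokens, dim_txt)
        q, k, v = self.to_q(x), self.to_k(text), self.to_v(text)
        attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)  # (batch, pixels, tokens)
        return self.to_out(attn @ v), attn  # attn is what the cross-attention maps below visualize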

The self-attention mechanism in the output image of a text-to-image diffusion model is a technique that allows the model to focus on different parts of the image while generating it. This is achieved by computing a set of attention scores for each pixel in the image, which are used to weight the contribution of different pixels to the final output. This helps the model to better capture the complex relationships between different parts of the image and generate more accurate and coherent images that correspond to the input text.
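Self-attention (attn1) uses the same machinery, except that the queries, keys, and values are all projected from the image features themselves, so the attention map relates image regions to other image regions rather than to text tokens. Continuing the sketch above:

# Self-attention: the image features attend to themselves, so dim_txt equals dim_img.
x = torch.randn(1, 64 * 64, 320)                      # a 64x64 latent grid with 320 channels
self_attn = CrossAttention(dim_img=320, dim_txt=320, dim_head=64)
out, attn = self_attn(x, text=x)                      # attn: (1, 4096, 4096), pixel-to-pixel weights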

The original Custom Diffusion model, with the keys and values of the cross-attention layers unfrozen, was evaluated by fine-tuning on a combination of new cat and new dog images released by the authors (<https://www.cs.cmu.edu/~custom-diffusion/assets/data.zip>). Special word tokens, <new1> and <new2>, are employed to denote the particular dog and cat used for training.

Sample training image data of <new1> dog [1]
Sample training image data of <new2> cat [1]
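The <new1> and <new2> placeholders are simply new tokens registered with the text encoder's tokenizer, whose embeddings are learned during fine-tuning (the Custom Diffusion code handles this through its add_token option). A rough sketch of that step with the Hugging Face transformers API, shown here only to illustrate the idea:

from transformers import CLIPTokenizer, CLIPTextModel

# Stable Diffusion v1 uses the CLIP ViT-L/14 text encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register the placeholder tokens and grow the embedding table to match.
tokenizer.add_tokens(["<new1>", "<new2>"])
text_encoder.resize_token_embeddings(len(tokenizer))

# During fine-tuning, the embeddings of the new tokens are optimized alongside
# the selected attention weights.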

Upon completion of the fine-tuning process, the model has acquired the capability to generate images of the designated cat or dog subject within a novel environment and from a unique perspective.

A generated sample of the <new1> dog
A generated sample of the <new2> cat
Example cross-attention visualization for a generated sample with the prompt “the <new1> dog”, where the intensity of each square indicates its level of importance, with brighter squares representing higher importance

Now we test whether the model can depict the two animals together, using the prompt “the <new2> cat, and the <new1> dog”. The following are six outputs generated with this prompt, each using a different random seed:

Image samples generated by the model with the prompt “the <new2> cat, and the <new1> dog” with different random seeds
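For reference, a seed sweep like this can be reproduced with a short loop. The sketch below uses the diffusers StableDiffusionPipeline and assumes the fine-tuned weights have already been merged into a pipeline checkpoint; the path is hypothetical, and the official repository loads its delta weights slightly differently.

import torch
from diffusers import StableDiffusionPipeline

# Hypothetical path to a checkpoint containing the fine-tuned weights.
pipe = StableDiffusionPipeline.from_pretrained("path/to/finetuned-model").to("cuda")

prompt = "the <new2> cat, and the <new1> dog"
images = []
for seed in range(6):
    # A fixed generator per seed makes each sample reproducible.
    generator = torch.Generator(device="cuda").manual_seed(seed)
    images.append(pipe(prompt, num_inference_steps=50, generator=generator).images[0])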

The model exhibits a failure to generate distinct representations of both a cat and a dog, instead producing two identical cats.

Cross-attention map for the prompt and one of the outputs showing two identical cats, visualized with [2]

As the cross-attention map shows, the model is unable to associate each subject word with the correct region of the image, which may explain its failure to draw two different subjects.
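Conceptually, each square in such a visualization is a per-token attention map: the pixels-by-tokens attention matrix from a cross-attention layer is sliced at the token of interest and reshaped back onto the latent grid. A sketch of that reduction, using the attn tensor returned by the CrossAttention module sketched earlier (tools such as [2] instead hook the corresponding tensors inside the UNet during sampling):

import torch

def token_heatmap(attn: torch.Tensor, token_index: int, grid: int = 64) -> torch.Tensor:
    """Reduce a (batch, pixels, tokens) cross-attention tensor to a 2D heatmap for one text token."""
    maps = attn[..., token_index]    # attention paid to this token by every pixel: (batch, pixels)
    maps = maps.mean(dim=0)          # average over the leading batch/head dimension: (pixels,)
    return maps.reshape(grid, grid)  # back onto the latent grid for display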

Method

We speculated that fine-tuning additional parameters in the attention blocks could help the model isolate each subject. Hence, the entire self-attention block (attn1) was also unfrozen so that the model could adapt its self-attention; we refer to this configuration as self-crossattn-kv, in contrast to the original frozen-model configuration, while the remaining model components stay unchanged. The resulting output images look similar to those of the original model, but the cat itself is drawn with higher precision.

class CustomDiffusion(LatentDiffusion):
    def __init__(self,
                 freeze_model='crossattn-kv',
                 cond_stage_trainable=False,
                 add_token=False,
                 *args, **kwargs):

        self.freeze_model = freeze_model
        self.add_token = add_token
        self.cond_stage_trainable = cond_stage_trainable
        super().__init__(cond_stage_trainable=cond_stage_trainable, *args, **kwargs)

        # Maps each fine-tuning mode to the layer-name keywords that stay trainable.
        self.layer_dict = {'crossattn-kv': ['attn2.to_k', 'attn2.to_v'],
                           'crossattn': ['attn2'],
                           'self-crossattn-kv': ['attn2.to_k', 'attn2.to_v', 'attn1'],
                           'self-crossattn': ['attn2', 'attn1']}

        # Freeze every UNet parameter, then unfreeze those inside the transformer
        # blocks whose names match the keywords of the selected mode.
        for x in self.model.diffusion_model.named_parameters():
            x[1].requires_grad = False
            if 'transformer_blocks' in x[0]:
                for item in self.layer_dict[self.freeze_model]:
                    if item in x[0]:
                        x[1].requires_grad = True

    def configure_optimizers(self):
        lr = self.learning_rate
        params = []
        # Collect only the unfrozen attention parameters for the optimizer.
        for x in self.model.diffusion_model.named_parameters():
            if 'transformer_blocks' in x[0]:
                for item in self.layer_dict[self.freeze_model]:
                    if item in x[0]:
                        params += [x[1]]
                        print(x[0])
        # ......

Modified code snippet of the constructor of the CustomDiffusion object and the configure_optimizers() method in model.py of Custom Diffusion, showing how the corresponding attention layers are unfrozen. The crossattn-kv and crossattn modes were written by the original authors. self.layer_dict maps each fine-tuning mode to the layer-name keywords whose parameters are unfrozen during fine-tuning. The link to the code is at the end of this article.
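As a quick sanity check, the set of layers each mode unfreezes can be inspected directly from the named parameters; a sketch, assuming model is an instantiated CustomDiffusion object as defined above:

# Sketch: summarize which parameters the selected freeze_model mode left trainable.
def trainable_summary(model):
    named = list(model.model.diffusion_model.named_parameters())
    names = [n for n, p in named if p.requires_grad]
    count = sum(p.numel() for n, p in named if p.requires_grad)
    print(f"{len(names)} trainable tensors, {count / 1e6:.1f}M parameters")
    return names

# In self-crossattn-kv mode the returned list should contain attn1.* weights in
# addition to the attn2.to_k and attn2.to_v projections.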

Initially, we unfroze the self-attention blocks in addition to the cross-attention keys and values, the configuration denoted self-crossattn-kv in the source code. After fine-tuning, the output for the same prompt is:

Output image for the prompt “the <new2> cat, and the <new1> dog” in self-crossattn-kv mode, where the self-attention blocks are unfrozen. Note that the cats are drawn with higher precision than in the original mode.

Evidently, the model still fails to depict both animals and again produces two cats, albeit with a marginal improvement in visual quality over its previous outputs.

The corresponding cross-attention map shows that the model still cannot associate the correct animal with the relevant text, as both cats are highlighted in response to the ‘dog’ keyword.

Next, we unfroze the cross-attention attn2 layers entirely, including the query projection matrix (Wq, attn2.to_q), together with the self-attention blocks; this configuration is referred to as the self-crossattn mode in this study. Out of six generated images, we obtained a single accurately synthesized sample and one sample that is almost correct but fragmented, while the remaining samples contain at least two distinct cat instances.
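For clarity, matching on the ‘attn2’ keyword picks up all of the block's projection matrices, not just the keys and values; a quick check against the diffusers UNet loaded earlier (the naming is analogous in the original codebase):

# List the cross-attention parameters of the first down block as an example.
for name, _ in unet.named_parameters():
    if "down_blocks.0" in name and "attn2" in name:
        print(name)
# Expected: the attn2.to_q, attn2.to_k, attn2.to_v, and attn2.to_out projections.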

Image samples generated by the model using self-crossattn mode.

The cross-attention map corresponding to the accurate image indicates that the model has effectively identified and isolated the appropriate region of the image corresponding to the keyword “dog”.

Attention map for self-crossattn mode.

We then provided a more complex prompt that emphasizes the requirement for two distinct animals: “1 cat and 1 dog together, the <new2> cat, and the <new1> dog, two different animals”. This yielded one accurate result and one that was nearly correct but fragmented, while the remainder again showed two different kinds of cats.

Output images of new prompt “1 cat and 1 dog together, the <new2> cat, and the <new1> dog, two different animals”.
Attention map of the correct output image of the new prompt.

In conclusion, our team has made progress in enhancing Custom Diffusion by unfreezing both the self-attention and cross-attention layers. This has enabled the model to generate difficult combinations of concepts, including a cat and a dog together, with somewhat greater accuracy and precision. While the model has shown promising results in generating correct samples, a thorough analysis is still needed to understand why it is more proficient at drawing cats than dogs. The cat-and-dog combination examined here is only a single instance, so it would be beneficial to expand this study to a broader range of subjects and concepts.

References

[1] Kumari, Nupur, et al. “Multi-Concept Customization of Text-to-Image Diffusion.” arXiv preprint arXiv:2212.04488 (2022).

[2] benkyoujouzu, “stable-diffusion-webui-visualize-cross-attention-extension,” GitHub Repository, 2023. [Online]. Available: https://github.com/benkyoujouzu/stable-diffusion-webui-visualize-cross-attention-extension

Code

Link to the code: https://github.com/lamwilton/custom-diffusion
