Textual Inversion: A Method to Fine-tune a Stable Diffusion Model

How textual inversion works and application of textual inversion in image synthesis

Onkar Mishra
9 min read · Jun 13, 2023

Recent large text-to-image diffusion models have demonstrated an unprecedented ability to synthesize novel scenes from text prompts. These text-to-image models offer the freedom to guide creation through natural language. Their use is, however, restricted by a user's ability to describe a particular or unique scene, an artistic creation, or a new physical product: the user is often unable to exercise artistic freedom when it comes to generating images of specific, unique, or new concepts. Moreover, it is very difficult and expensive to retrain a model for each novel concept with a new dataset.

The paper An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion provides a simple approach that allows this creative freedom.

In my last blog, we discussed how Stable Diffusion works. In this blog, we discuss an approach to turn our own cat into a painting, or to imagine a new product based on our favorite toy, using Stable Diffusion. We can learn such new user-provided concepts from only 3–5 images of an object or a style, through new “words” in the embedding space of a frozen text-to-image model.

Figure 1: Text guided personalized generation result using textual inversion

How does textual inversion work?

In our previous blog on Stable Diffusion, we saw that a text encoder turns an input prompt into embeddings, which are fed to the diffusion model as guidance or conditioning. As outlined in Figure 2 below, this process involves tokenizing the input prompt into a set of tokens, each of which is an index into a predefined dictionary, and then passing these tokens through the text encoder to get embeddings. Each token is linked to a unique embedding vector that can be retrieved through an index-based lookup.
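As a concrete illustration, here is a minimal sketch of this tokenize-and-lookup step, assuming the Hugging Face transformers CLIP classes used by Stable Diffusion v1.5 (the model ID, prompt, and shapes are illustrative):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Tokenizer and text encoder of a Stable Diffusion checkpoint
tokenizer = CLIPTokenizer.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="text_encoder")

prompt = "A photo of a cat"

# Tokenization: each token becomes an index into a predefined dictionary (the vocabulary)
token_ids = tokenizer(
    prompt,
    padding="max_length",
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
).input_ids

# Index-based lookup: every token id maps to a unique embedding vector
token_embeddings = text_encoder.get_input_embeddings()(token_ids)

# The full text encoder turns these into the contextual embeddings used as conditioning
text_embeddings = text_encoder(token_ids).last_hidden_state  # shape: [1, 77, 768]
```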

These text embeddings are passed to the downstream UNet model as guidance, along with the latent image input and the timestep t. We can also swap the token embedding of one object for that of another to get a different image. Similarly, we could learn a new token embedding for a specific object or concept.
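Continuing the sketch above, the text embeddings condition the UNet together with the noisy latent and the timestep. This is a hedged illustration using diffusers' UNet2DConditionModel; the tensors are dummy values:

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, 4, 64, 64)   # latent image input
t = torch.tensor([10])                # diffusion timestep

# The text embeddings guide the denoising step
noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
```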

In textual inversion, we choose this embedding space as the target for inversion. The idea behind “Textual Inversion” is to use a few example images to teach the text model a new word, training its embedding to lie close to some visual representation. This is achieved by adding a new token to the vocabulary and training it on a few representative images. Thus, we try to find new embedding vectors that represent our new, specific concepts. Textual inversion can therefore be used to add a trained token to the vocabulary and use it with a pre-trained Stable Diffusion model.

Thus, we designate a placeholder string, which we call a pseudo-word and denote by S∗ (as can be seen in Figure 1), to represent the new concept we wish to learn. We intervene in the embedding process and replace the vector associated with the tokenized string with a new, learned embedding v∗. In this way, we inject our new concept into the vocabulary. This pseudo-word is then treated like any other word and can be used to compose novel textual queries or new sentences for the generative model. One can therefore ask for “a photograph of S∗ on the beach”, “an oil painting of S∗ hanging on the wall”, or even compose two concepts, such as “a drawing of S1∗ in the style of S2∗”. We make no change to the generative model in this process, so the base generative model can still be used with our new concept.
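Building on the tokenizer and text encoder from the snippets above, a minimal sketch of injecting such a pseudo-word into the vocabulary could look as follows (the token name and the initializer word are illustrative):

```python
placeholder_token = "<my-concept>"   # the pseudo-word S*

num_added = tokenizer.add_tokens(placeholder_token)
assert num_added == 1, "token already exists in the vocabulary"

# Give the new token its own row in the embedding matrix and initialize it
# from a coarse descriptor of the concept (e.g. "toy"); this row becomes v*
text_encoder.resize_token_embeddings(len(tokenizer))
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder_token)
initializer_id = tokenizer.convert_tokens_to_ids("toy")

embeddings = text_encoder.get_input_embeddings().weight.data
embeddings[placeholder_id] = embeddings[initializer_id].clone()
```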

The authors of the paper provide the following schematic representation of the textual inversion process.

Figure 2: Outline for text embeddings and inversion process

We can see that we can generate an image from a prompt such as “A photo of S*”, where S* is a new object or a new style, in the same way that we generate an image from “A photo of a cat”.

Text Embeddings

The task of finding embeddings for these pseudo-words is framed as one of inversion. Here, we use a fixed, pre-trained text-to-image model and a small set of images depicting the concept. We aim to find a single word embedding such that sentences of the form “A photo of S∗” lead to the reconstruction of images from our small set. This embedding is found through an optimization process (Figure 4), which we refer to as “Textual Inversion”.

In the realm of Latent Diffusion Models, inversion is usually performed by adding noise to an image and then denoising it through the network, but this process significantly changes the image content. In textual inversion, we instead invert a user-provided concept and represent it as a new pseudo-word in the model’s vocabulary. The embedding for the new token is stored in a small PyTorch pickled dictionary whose key is the new text token that was trained. Since the text encoder of our pipeline does not know about this token, we need to manually append it.
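A hedged sketch of this manual step, assuming the embedding was saved as a small learned_embeds.bin dictionary (the file and token names are illustrative):

```python
import torch

learned = torch.load("learned_embeds.bin")          # e.g. {"<my-concept>": tensor of shape [768]}
trained_token, embedding = next(iter(learned.items()))

# Append the trained token to the tokenizer and copy its embedding into the text encoder
tokenizer.add_tokens(trained_token)
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids(trained_token)
text_encoder.get_input_embeddings().weight.data[token_id] = embedding
```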

Training

The purpose of textual inversion is to enable prompt-guided generation of new, user-specified concepts. We try to encode these novel concepts into an intermediate representation of a pre-trained text-to-image model, so we seek to represent them in the word-embedding stage of the text encoder employed by text-to-image models. This embedding space has been found to be expressive enough to capture basic image semantics.

Textual Inversion is implemented on top of a specific class of generative models, Latent Diffusion Models (LDMs): a recently introduced class of Denoising Diffusion Probabilistic Models (DDPMs) that operate in the latent space of an autoencoder. LDMs consist of two core components:

  1. Autoencoder — The encoder E learns to map images x into a spatial latent code z = E(x). The decoder D then maps the latents back to images such that D(E(x)) ≈ x.
  2. Diffusion Model — It is trained to denoise a normally distributed variable. The diffusion model can be conditioned on class labels, texts, semantic maps, or inputs of other image-to-image translation tasks. Let cθ(y) be a model that maps a conditioning input y into a conditioning vector. The LDM loss is given by:
Figure 3: Loss in a Latent Diffusion Model
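For readers who cannot see the figure, the loss from the paper can be written as:

$$
L_{LDM} := \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \varepsilon \sim \mathcal{N}(0,1),\, t}\Big[\, \big\lVert \varepsilon - \varepsilon_\theta(z_t, t, c_\theta(y)) \big\rVert_2^2 \,\Big]
$$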

where t is the time step, zt is the latent image representation noised to time t, ε is the unscaled noise sample, and εθ is the denoising network. We aim to correctly remove the noise added to a latent representation of an image. During training, cθ and εθ are jointly optimized to minimize the LDM loss. I will be providing the code for training and inference in a future blog.
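Until then, here is a minimal, hedged sketch of how this loss is computed in practice, reusing the UNet and text embeddings from the snippets above and assuming diffusers’ DDPMScheduler (all tensors are dummy values):

```python
import torch.nn.functional as F
from diffusers import DDPMScheduler

noise_scheduler = DDPMScheduler(num_train_timesteps=1000)

latents = torch.randn(1, 4, 64, 64)          # z: latent of a training image
noise = torch.randn_like(latents)            # ε: unscaled noise sample
t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,))

noisy_latents = noise_scheduler.add_noise(latents, noise, t)   # z_t
noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample

loss = F.mse_loss(noise_pred, noise)          # the LDM objective
```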

Textual Inversion

To find these new embeddings, we use a small set of images (typically 3–5) which depict our target concept across multiple settings, such as varied backgrounds or poses. We find v∗ through direct optimization, by minimizing the LDM loss of Figure 3 over images sampled from this small set. To condition the generation, we randomly sample neutral context texts such as “A photo of S∗”, “A rendition of S∗”, and so on.
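In code, these neutral context texts are typically kept as a small list of format strings; the list below is an illustrative subset in the spirit of the templates used by the paper’s and diffusers’ training scripts:

```python
import random

prompt_templates = [
    "a photo of a {}",
    "a rendition of a {}",
    "a cropped photo of the {}",
    "the photo of a {}",
    "a dark photo of the {}",
]

# A random neutral context sentence containing the pseudo-word, used for one training step
training_prompt = random.choice(prompt_templates).format(placeholder_token)
```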

Our optimization goal can then be defined as:

Figure 4: Optimization of Latent Diffusion Model for learned embedding to capture a concept
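Written out, the optimization objective from the paper is:

$$
v_* = \arg\min_{v} \; \mathbb{E}_{z \sim \mathcal{E}(x),\, y,\, \varepsilon \sim \mathcal{N}(0,1),\, t}\Big[\, \big\lVert \varepsilon - \varepsilon_\theta(z_t, t, c_\theta(y)) \big\rVert_2^2 \,\Big]
$$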

This goal is realized by re-using the same training scheme as the original Latent Diffusion Model, while keeping both cθ and εθ fixed. This is a reconstruction task, and the learned embedding is expected to capture fine visual details unique to the novel concept.
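Putting the pieces together, here is a hedged sketch of the core optimization loop, modeled loosely on the diffusers textual-inversion training example: the UNet (εθ) and text encoder (cθ) stay frozen, and only the embedding row of the new token is updated (training_batches and the hyperparameters are illustrative):

```python
# Freeze everything except the token embedding table
unet.requires_grad_(False)
text_encoder.requires_grad_(False)
token_embeds = text_encoder.get_input_embeddings()
token_embeds.weight.requires_grad_(True)

optimizer = torch.optim.AdamW([token_embeds.weight], lr=5e-4)
orig_embeds = token_embeds.weight.data.clone()

for latents, token_ids in training_batches:   # latents of the 3–5 concept images + tokenized prompts
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    text_embeddings = text_encoder(token_ids).last_hidden_state
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeddings).sample

    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # Restore every embedding row except the placeholder's, so only v* actually changes
    with torch.no_grad():
        keep = torch.ones(token_embeds.weight.shape[0], dtype=torch.bool)
        keep[placeholder_id] = False
        token_embeds.weight.data[keep] = orig_embeds[keep]
```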

Application of Textual Inversion

Textual inversion can be used in a range of applications, mainly comprising the following:

  1. Image Variation
  2. Text Guided Synthesis
  3. Style Transfer
  4. Concept Composition
  5. Bias Reduction
  6. Downstream Application

We will discuss these applications and their use cases in the following sections:

  1. Image Variation
Figure 5: Object Variation generated using Textual Inversion

Textual Inversion can be used to create variations of an object using a single pseudo-word as can be seen in Figure 5. It is able to capture finer details of the object using a single word embedding.

2. Text guided synthesis

Figure 6: Text guided personalized generation results using textual inversion

Textual Inversion can be used to create novel scenes by incorporating learned pseudo-words into new conditioning texts. We can see from the generated images in Figure 6 that new scenes can be created by combining the semantic concepts captured by the pseudo-words with new conditioning texts. Since the method is built on a pre-trained, large-scale text-to-image synthesis model, a single pseudo-word can be reused across many generations.
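For example, using diffusers’ convenience loader with a publicly shared concept (the repository and its <cat-toy> token are just an illustration, not part of the paper):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.load_textual_inversion("sd-concepts-library/cat-toy")   # registers the <cat-toy> pseudo-word

image = pipe("an oil painting of <cat-toy> hanging on the wall").images[0]
image.save("cat_toy_painting.png")
```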

3. Style Transfer

Figure 7: Representation of abstract concepts including styles using textual inversion

This is one of the most typical use cases of textual inversion, wherein a user can capture the unique style of a specific artist and apply it to a new creation. The textual inversion model can also find pseudo-words representing a specific, unknown style. We provide the model with a small set of images sharing a style and replace the training texts with prompts such as “A painting in the style of S*”. It should be noted that this differs from traditional style transfer methods.

4. Concept Composition

Figure 8: Concept Composition generation using two learned pseudo-words

Textual Inversion can also be used for compositional synthesis, wherein multiple learned concepts are used in the guiding text. It is seen that the model can reason over multiple novel concepts or pseudo-words at the same time. However, it fails to place two concepts side by side. The reason for this may be that the training samples include images of only a single concept at a time.

5. Bias Reduction

Figure 9: Bias Reduction using textual inversion

Images generated by text-to-image models are biased because of the training data used to train them, and these biases appear in the generated samples, as can be seen in Figure 9. Textual inversion can reduce the bias of the training dataset by learning the pseudo-word from a more inclusive set of images, leading to more inclusive generations.

6. Downstream Applications

Figure 10: Application of textual inversion with downstream models built with Latent Diffusion Models

The pseudo-words learned by textual inversion can also be used in downstream models that build on the same Latent Diffusion Model. For example, Blended Latent Diffusion, which performs localized, text-based editing of images via a mask-based blending process, can be conditioned on learned pseudo-words without requiring much change to the original model.

The concept of textual inversion is important for the following two reasons:

  1. We can do personalized text-to-image generation by synthesizing novel scenes of user-provided concepts, guided by natural language instructions.
  2. The idea of “Textual Inversions” in the context of generative models aims to find new pseudo-words in the embedding space of a text encoder that can capture both high-level semantics and fine visual details of a new concept.

Note — Most of the images in the blog are taken from the paper.

References:

  1. Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2022). An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
  2. https://huggingface.co/docs/diffusers/
