Diffusion mini-summaries #2: An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Gowthami Somepalli
ML Summaries
3 min read · Jan 10, 2023


Textual Inversion: Learn a new “word” for a given subject by optimizing only that word’s embedding in the text encoder’s input space, which lets us generate new images of the subject in new settings.

Paper: https://arxiv.org/abs/2208.01618

Suppose someone shows you an image of a teapot and asks what it is; you say “teapot”. If pressed for more detail, maybe you say “colorful teapot”. But the colorful teapot I imagine from those words will be completely different from the one you saw. Text-to-image models face the same problem: a short caption underdetermines the subject.

This paper tries to learn a “word” that encodes more information about the subject, so that we can generate images of that subject in many new scenarios.

In typical T2I models, there is a text encoder and an image generation module. In this paper, the authors freeze both and learn only the embedding of a single new pseudo-word in the text encoder’s vocabulary, using the “subject” images.

For the given “subject” images, we create template captions like “A photo of S*”. We then learn the embedding of S* by optimization, such that when we generate from the text “A photo of S*”, we get back the training images.
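Concretely, the training captions can be built by dropping a placeholder token into a small bank of neutral templates. A minimal sketch (the template strings below are illustrative, not the paper’s exact list):

```python
# Build training captions by inserting a placeholder pseudo-word into
# neutral templates. These strings are illustrative; the paper uses its
# own (CLIP-style) template bank.
templates = [
    "a photo of a {}",
    "a rendering of a {}",
    "a cropped photo of the {}",
    "a dark photo of the {}",
]

placeholder = "S*"  # the pseudo-word whose embedding we will learn
prompts = [t.format(placeholder) for t in templates]
print(prompts[0])  # a photo of a S*
```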

The optimization objective is very simple: the regular diffusion loss on the small set of “subject” images. Let v* be the embedding of S*. We find the embedding that minimizes this loss on the training images, keeping all model weights frozen.
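The idea can be sketched with a toy stand-in for the frozen model: treat the denoiser as a fixed map and run gradient descent on the new embedding alone. This illustrates “optimize one vector against a frozen network”, not the actual diffusion loss:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

W = rng.normal(size=(d, d))    # frozen "model" weights (never updated)
target = rng.normal(size=d)    # stands in for the denoising target
v_star = rng.normal(size=d)    # the only trainable parameter

def loss(v):
    # Squared error of the frozen forward pass — a toy stand-in for the
    # diffusion loss averaged over the subject images.
    return float(np.sum((W @ v - target) ** 2))

initial = loss(v_star)
lr = 0.005
for _ in range(2000):
    grad = 2 * W.T @ (W @ v_star - target)  # gradient w.r.t. v_star only
    v_star -= lr * grad                     # W and target never change

final = loss(v_star)
```

Only `v_star` receives updates; in the real method this is the single row of the text encoder’s embedding table corresponding to S*.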

The authors conducted experiments on Latent Diffusion models, optimizing for 5,000 steps with this objective. They also initialize v* with the embedding of a coarse descriptor of the object, like “teapot” in the example we saw before.
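The initialization step amounts to copying an existing word’s embedding into the new slot. A toy sketch (the lookup-table names and values here are hypothetical, not a real tokenizer API):

```python
# Toy embedding table: in a real text encoder this would be the token
# embedding matrix; the values here are made up for illustration.
embedding_table = {
    "teapot": [0.21, -0.07, 0.33, 0.10],
    "dog":    [0.05,  0.44, -0.12, 0.28],
}

# Initialize the new pseudo-word S* from the coarse descriptor "teapot",
# as the authors do, rather than from random noise. Copy the vector so
# that later optimization of S* leaves "teapot" itself untouched.
embedding_table["S*"] = list(embedding_table["teapot"])
```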

We can now use S* as vocabulary to describe the subject and generate it in different settings, like the following example.
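At inference time, S* is just another entry in the embedding table, so it can be dropped into arbitrary prompts. A toy sketch of that lookup (the whitespace “tokenizer” and 1-D values are illustrative only):

```python
# Once v* is learned, the placeholder behaves like any other vocabulary
# word: the text encoder simply looks up its embedding. Toy 1-D values.
embedding_table = {"a": [0.1], "photo": [0.2], "of": [0.3],
                   "on": [0.4], "the": [0.5], "beach": [0.6],
                   "S*": [0.42]}  # 0.42 stands in for the learned v*

def embed_prompt(prompt):
    # Whitespace "tokenizer" for illustration only.
    return [embedding_table[tok] for tok in prompt.split()]

vectors = embed_prompt("a photo of S* on the beach")
```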

We can also use this technique to learn a word that represents a style, and then generate more images in that style!

Some useful links:
Official implementation — https://github.com/rinongal/textual_inversion
Hugging Face Colab — https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb


Ph.D. student in Computer Science at University of Maryland. Find me at: https://somepago.github.io/