Personalized Stable Diffusion with Few-Shot Fine-Tuning

Create Your Own Stable Diffusion on a Single CPU

Intel® Neural Compressor
Intel Analytics Software
3 min read · Nov 1, 2022


Haihao Shen, Wenhua Cheng, Kaokao Lv, Weiwei Zhang, and Huma Abidi, Intel Corporation

Stable Diffusion is a state-of-the-art latent text-to-image diffusion model that generates photorealistic images from text. It has recently become popular for creating realistic and futuristic images from the user’s imagination. However, training Stable Diffusion from scratch is expensive, taking 150K hours on 256 Nvidia A100 GPUs, as described in the model card.

In this blog, we describe how to create personalized Stable Diffusion models through few-shot fine-tuning. We use just one image to fine-tune Stable Diffusion on a single CPU and then demonstrate text-to-image inference. To the best of our knowledge, this is the first demonstration of an end-to-end Stable Diffusion workflow, from fine-tuning to inference, on a CPU.

We plan to introduce low-precision optimizations using Intel Neural Compressor in future articles to boost inference performance.

Personalized Stable Diffusion adds new concepts (also called objects) to be recognized by the model while maintaining the capabilities of the pretrained model on text-to-image generation. Here is a sample concept “dicoo” from the Stable Diffusion concepts library:

Sample image for “dicoo”

Textual Inversion is a technique for learning new concepts from a small number of images so that they can later be used in text-to-image pipelines. The new text objects are learned in the embedding space of the pipeline’s text encoder while all other model parameters remain frozen. This makes it a natural fit for creating a personalized Stable Diffusion.
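The core idea can be sketched in a few lines of PyTorch (a toy illustration, not the actual diffusers training script: the vocabulary size, dimensions, and target are made up, and the real objective is the diffusion denoising loss rather than this placeholder):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy sketch of Textual Inversion: the vocabulary gains one extra row for the
# placeholder token, and only that vector is optimized; everything else stays frozen.
vocab_size, dim = 100, 16
embedding = torch.nn.Embedding(vocab_size + 1, dim)  # extra row for "<dicoo>"
new_token_id = vocab_size

embedding.weight.requires_grad_(False)               # freeze the whole table
frozen_before = embedding.weight[:vocab_size].clone()

# Train only the new token's vector (toward a made-up target here; the real
# objective is the diffusion model's denoising loss).
learned = embedding.weight[new_token_id].clone().requires_grad_(True)
target = torch.zeros(dim)
optimizer = torch.optim.AdamW([learned], lr=0.1)

loss_before = F.mse_loss(learned.detach(), target).item()
for _ in range(50):
    optimizer.zero_grad()
    loss = F.mse_loss(learned, target)
    loss.backward()
    optimizer.step()
loss_after = F.mse_loss(learned.detach(), target).item()

# Write the learned vector back into the (still frozen) embedding table.
with torch.no_grad():
    embedding.weight[new_token_id] = learned
```

Because only one small vector is trained, the optimization is cheap enough to run on a CPU, and the pretrained model’s behavior on all other tokens is untouched.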

Next, we use “dicoo” as an example to demonstrate few-shot fine-tuning to create our own Stable Diffusion. First, we prepare the training images, as in other fine-tuning tasks. Since we use only one image, we select a random image from the concepts library. We then leverage the fine-tuning script provided by Textual Inversion, passing the prepared image under train_data_dir and “dicoo” as the placeholder_token (the new object):

Few-shot fine-tuning using BFloat16
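Such a launch might look roughly like the following (a sketch based on the flag names of the diffusers textual_inversion.py example script; the model path, hyperparameter values, and directories are illustrative placeholders, not the exact values we used):

```shell
# Sketch of a Textual Inversion launch; paths and hyperparameters are illustrative.
accelerate launch textual_inversion.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --train_data_dir="./dicoo_images" \
  --learnable_property="object" \
  --placeholder_token="<dicoo>" \
  --initializer_token="toy" \
  --resolution=512 \
  --train_batch_size=1 \
  --max_train_steps=3000 \
  --learning_rate=5e-4 \
  --mixed_precision="bf16" \
  --output_dir="./dicoo_model"
```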

Hugging Face’s diffusers library provides a high-level device abstraction, so it is relatively easy to migrate the default model script and run the fine-tuning on a CPU. Moreover, we enable automatic mixed precision with BFloat16 to accelerate fine-tuning on Intel processors. With this setup, we generate our own Stable Diffusion on a single CPU in less than three hours. Now, let’s test it!
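As a minimal sketch of what the mixed-precision change looks like (using a toy layer, not the actual training script), CPU autocast runs supported ops in BFloat16 while the parameters stay in Float32:

```python
import torch

# Minimal sketch of CPU automatic mixed precision: ops inside the autocast
# region run in BFloat16 where supported, while parameters remain Float32.
model = torch.nn.Linear(8, 4)   # toy stand-in for a model layer
x = torch.randn(2, 8)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

# y is computed in BFloat16, while model.weight is still Float32.
```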

Inference

We use the prompt “a lovely <dicoo> in red dress and hat, in the snowy and brightly night, with many brightly buildings” and generate the image below:

Image generated from the given prompt

As you can see, the generated image looks quite good. The new object “dicoo” is well captured, while the other objects and the relevant text semantics are preserved. We consider this a successful personalized Stable Diffusion!

We released all the code and scripts under Intel® Neural Compressor. Please check it out now and create your own Stable Diffusion!

As mentioned earlier, our next step is to enable low-precision optimizations to accelerate inference on Intel platforms. Please star the project if you like it, so you will receive notifications about our latest optimizations.

Please visit the Intel Neural Compressor and AI Kit pages, and feel free to reach out to us if you have any questions!

Updates:

2022/11/18: We used Stable Diffusion to create a nice Christmas image for dicoo. Have fun!

2022/11/22: We provided an online demo of the dicoo diffusion model. Try it out: https://huggingface.co/spaces/Intel/dicoo_diffusion :)
