Training a Stable Diffusion Model with Textual Inversion Using Diffusers

Artsem Holub
5 min read · Sep 9, 2022


While browsing the internet for interesting technologies at the intersection of neural networks and art, I came across a Twitter post in which Suraj Patil announced that a Stable Diffusion model can be taught a new concept via textual inversion using only 3-5 images.

The news spread very quickly through the English-speaking community (although not everyone there knows what it is), while in the Russian-speaking community there was not a single mention of it even several days later. So I decided to write about it and share code that you can test yourself. Some common questions are answered at the end of the article.

What is textual inversion?

I will try to explain it in plain terms.

Textual inversion lets you teach the model a new "concept" and associate it with a "word" without changing the model's weights; instead, only new text embedding vectors are fine-tuned.

For example, say I add a handful of photos of my plush penguin. I can then ask the model to draw that particular plush penguin, or to generate a photo of it sitting on top of a mountain. The goal is to let you import a specific idea or concept from photographs or images the model has never generated and teach it how to represent them. It can also be used to represent styles: upload a bunch of drawings by a certain artist, and you can then ask the model to generate images in that particular style.
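
To make this more concrete, here is a rough sketch (my own illustration, not the notebook's exact code) of what happens under the hood: a new token is registered in the tokenizer, its embedding is initialized from an existing word, and during training only that single embedding row is optimized while everything else stays frozen. The checkpoint name is the one used later in this article; the placeholder name and "penguin" are just example values.

```python
from transformers import CLIPTextModel, CLIPTokenizer

# Load the tokenizer and text encoder that Stable Diffusion uses for prompts
tokenizer = CLIPTokenizer.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

# Register the new "word" as an extra token and make room for its embedding
placeholder_token = "<my-plush-penguin>"   # example placeholder
tokenizer.add_tokens(placeholder_token)
text_encoder.resize_token_embeddings(len(tokenizer))

# Start the new embedding from a related existing word ("penguin" here)
initializer_id = tokenizer.convert_tokens_to_ids("penguin")
placeholder_id = tokenizer.convert_tokens_to_ids(placeholder_token)
embeddings = text_encoder.get_input_embeddings().weight.data
embeddings[placeholder_id] = embeddings[initializer_id]

# Training then optimizes only embeddings[placeholder_id];
# the diffusion U-Net and the rest of the text encoder stay frozen.
```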

Using textual inversion

Now I will show how you can train Stable Diffusion with textual inversion. The same code can also be found in the GitHub repository, or you can use this colab.

Before starting, I advise you to register on Hugging Face and get an access token with the "write" permission. Now you can get started.

In the colab, the code is split into cells that can be run one by one, but I will walk through the main points that often cause problems.

  • First, you need to install the necessary libraries and log in to Hugging Face; this is where the access token is needed. Signing in to Hugging Face lets you save your trained model and, if you wish, share it with the Stable Diffusion concepts library (a short sketch of this step is shown after this list).
  • In the "Settings for teaching your new concept" section, you must select a checkpoint on which all training will be tied. By default, there is a checkpoint "CompVis/stable-diffusion-v1-4". You can specify any other.
  • Next comes the dataset. The official textual inversion colab lets you use direct links to images, but for my project, inheriting the style of the artist Ilya Kuvshinov, I used a Google Drive folder containing about 1000 pieces of this artist's art (of which I used only 30). To do this, I mounted my Drive in the colab and copied the folder with the shutil module (see the sketch after this list).
  • Next, I had to slightly modify the code so that the model could use my local images instead of downloaded ones (a rough example of this is also shown after the list).
  • Now we can move on to setting up the model. In the "what_to_teach" field you can choose object (teaches the model a new object) or style (as in my case, the style of the artist's images). In the "placeholder_token" field you specify the name you will later use in prompts for the trained model; mine is <kuvshinov-style>. In the "initializer_token" field you specify a single word that describes what your concept is about; mine is kuvshinov. (My settings are collected in a snippet after this list.)
  • Now we can start training the model. To do this, simply run the cells in the section "Teach the model a new concept (fine-tuning with textual inversion)". You can also see what parameters are used there and, if you are interested, edit them. Sometimes, while running the cells in this section, an error occurs complaining about a wrong initializer_token. Most often it is thrown by one and the same cell, so you can comment out the offending check, as I did. The cell is called "Get token ids for our placeholder and initializer token. This code block will complain if initializer string is not a single token". After that, you can continue running the code.
  • Training the model takes about 3 hours on average. After that, you can test the resulting model and publish it to the Stable Diffusion concepts library.
  • To test and publish the model, go to the "Run the code with your newly trained model" section. In the "Save your newly created concept to the library of concepts" cell you can specify the name of your concept and choose whether or not to publish the model. If you decide to publish it, put the access token you created earlier into the hf_token_write field. Be sure to check what access this token has: there are only two kinds, read and write, and we need a write token.
  • After that, you can run all the remaining cells one by one. In the last cell, in the prompt line, specify your text prompt in English and include your placeholder_token, for example: a grafitti in a wall with a <kuvshinov-style> on it. (A minimal inference sketch is shown below.)
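
For the first step, a minimal sketch of what the installation and login cells boil down to (the package list is approximate; the official colab may pin different versions):

```python
# Install the libraries the notebook relies on (colab-style shell command)
!pip install -qq diffusers transformers accelerate ftfy

# Log in to Hugging Face with your "write" access token
from huggingface_hub import notebook_login
notebook_login()
```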
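For the dataset steps, here is roughly what I did; the folder names are just examples from my setup, not something the official colab defines:

```python
import shutil
from pathlib import Path
from google.colab import drive

# Mount Google Drive and copy the folder with the artist's images into the colab filesystem
drive.mount('/content/gdrive')
shutil.copytree('/content/gdrive/MyDrive/kuvshinov_art', '/content/my_concept')

# Instead of downloading images from URLs, point the training code at the local files
images_path = Path('/content/my_concept')
image_files = [p for p in images_path.iterdir()
               if p.suffix.lower() in ('.jpg', '.jpeg', '.png')]
print(f"Found {len(image_files)} training images")
```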
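And these are the settings fields mentioned above, filled in the way I used them:

```python
pretrained_model_name_or_path = "CompVis/stable-diffusion-v1-4"  # base checkpoint
what_to_teach = "style"                  # "object" or "style"
placeholder_token = "<kuvshinov-style>"  # the new "word" you will use in prompts
initializer_token = "kuvshinov"          # a single existing word close to your concept
```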
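Finally, a minimal sketch of the inference step with the diffusers StableDiffusionPipeline; the output directory is a placeholder for wherever the training cells saved your fine-tuned model:

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline from the directory produced by the training cells
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/your-trained-model",      # placeholder path
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a grafitti in a wall with a <kuvshinov-style> on it"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("kuvshinov_grafitti.png")
```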

Questions and answers

  • How many resources are required to train the model?

To train my model, I used 30 images at 1024x1024, but you can also use fewer. The developers of this technique chose 3 to 5 images for their own tests.

  • Does the training speed depend on the size of the dataset?

No, it doesn't. I tried both 5 and 30 images, and on average the training time stayed about the same.

  • "To save this concept for reuse, download the learned_embeds.bin file or save it to the concept library." Does this mean I can use this on my local diffusion stable that I'm already running? How do I do this, where will this .bin file go, and how do I tell the program to use it?

Yes, you can run your model on your local computer, but make sure you have enough video memory: underneath all of this is the Stable Diffusion model, which needs about 8 GB of VRAM. To see how it works, use this colab.
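
As a rough sketch (my own example, not the notebook's code): the learned_embeds.bin file typically maps your placeholder token to its trained embedding, and loading it into a local pipeline can look something like this. The file path and checkpoint name are assumptions you should adjust to your own setup.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import StableDiffusionPipeline

# Load the base tokenizer and text encoder
tokenizer = CLIPTokenizer.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="text_encoder")

# learned_embeds.bin is assumed to map the placeholder token to its embedding tensor
learned_embeds = torch.load("learned_embeds.bin", map_location="cpu")
token, embedding = next(iter(learned_embeds.items()))

# Register the token and copy its trained embedding into the text encoder
tokenizer.add_tokens(token)
text_encoder.resize_token_embeddings(len(tokenizer))
token_id = tokenizer.convert_tokens_to_ids(token)
text_encoder.get_input_embeddings().weight.data[token_id] = embedding

# Build the pipeline around the patched tokenizer/text encoder and use the token in prompts
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
).to("cuda")
```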

  • Can this be used to add more or new pop-culture material, both to get better results for things that need more data in the existing set and to add new things that aren't in the dataset, so that SD can generate the specified things?

Yes. You can.

Thank you for reading this article. This is my first post, so don't judge it too harshly. If you want to read more about this, check out the diffusers GitHub and the notebook. You can also ask me questions; I am not a qualified specialist and can't tell you everything exactly, but I will help where I can.
