Using Long Prompts with the Diffusers Package with Prompt Embeddings
Overcoming the 77-token CLIP limit in diffusers.
In a previous article, I demonstrated how to use the diffusers
package to generate synthetic images from text prompts. It turns out that generating highly detailed images requires the use of extremely detailed prompts.
Unfortunately, diffusers
by default truncates tokenized prompts that are longer than 77 tokens! This greatly restricts the amount of information we can include in a prompt, and therefore the amount of detail we can generate in the image.
Thankfully, there are tricks to overcome this restriction, and in this article we will explore how prompt embeddings can be used to get past the 77-token limit!
Diffusers’ CLIP Truncates Long Prompts
For example, this image uploaded by wnwntk511 on CivitAI uses an extremely long prompt, which CLIP tokenizes to 256 tokens. If this prompt is passed into the diffusers
pipeline directly, the following warning message is displayed.
images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    width=width,
    height=height,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    num_images_per_prompt=1,
    generator=torch.manual_seed(seed),
).images
The following part of your input was truncated because CLIP can
only handle sequences up to 77 tokens:
['cinematic lighting, delicate illustration, official art,
aesthetic : 1. 4 ), ( golden - ratio face, perfect proportioned face,
perfectly beautiful ), glossy and red lips, ( brown eyes ),
senior high school student, ( k - pop idol, miss korea, korean beauty ),
( short torso, long legs, slim waist, huge hips, huge naturally
sagging breasts : 1. 4 ), ( 1 girl, solo girl : 1. 3 ),
( cinematic view of a wonderful burning castle built in a middle of
a dark wood, wide view, fire and ashes, cloudy night : 1. 3 ),
( medieval plate armour, full body armor suit with chain mail : 1. 3 ),
( crusader knight, cape, hold a shield and very long sword, fighting,
wield a sword : 1. 4 ), ( carrying a big bow on back : 1. 2 ),
chastity belt,']
As the warning indicates, a large part of the prompt is truncated and never reaches the diffusion model. As a result, the generated images are missing the truncated elements, such as the medieval plate armour, the full body armor suit with chain mail, and the crusader knight.
One option is to shorten the prompt by removing words; however, as mentioned, this also reduces the amount of detail we can generate.
Fortunately, there are tricks we can use with the diffusers
package to get around this restriction and feed long prompts into the pipeline. One such trick is to use prompt embeddings.
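The core idea behind the prompt-embedding trick is simple: instead of handing CLIP the whole token sequence at once, we split it into windows of at most 77 tokens, encode each window separately, and concatenate the results. Here is a minimal, dependency-free sketch of just the splitting step (the token ids are stand-ins, not real CLIP tokens):

```python
# Sketch of the chunking idea: split a token sequence into windows
# that the CLIP text encoder can accept (77 tokens at a time).
MAX_LENGTH = 77  # CLIP's sequence limit

def chunk_token_ids(token_ids, max_length=MAX_LENGTH):
    """Split a flat list of token ids into windows of at most max_length."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]

# A 256-token prompt (like the CivitAI example) splits into 4 windows:
chunks = chunk_token_ids(list(range(256)))
print([len(c) for c in chunks])  # [77, 77, 77, 25]
```

The real implementation below does exactly this slicing, but on a PyTorch tensor of token ids rather than a plain list.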
Use Prompt Embeddings to Encode Long Prompts
Patrick von Platen suggests using prompt embeddings with the diffusers StableDiffusionPipeline.
The first step is to tokenize the prompt and the negative prompt, assuming that the prompt is longer than the negative prompt.
input_ids = pipe.tokenizer(
    prompt,
    return_tensors="pt",
    truncation=False,
).input_ids.to("cuda")

negative_ids = pipe.tokenizer(
    negative_prompt,
    truncation=False,
    padding="max_length",
    max_length=input_ids.shape[-1],
    return_tensors="pt",
).input_ids.to("cuda")
For example, the long prompt provided by wnwntk511 on CivitAI is tokenized into a tensor of 256 elements (which would have been truncated to 77 tokens had it been passed to the pipeline directly):
input_ids
tensor([[49406, 263, 1125, 268, 16157, 267, 1532, 1125, 267, 12066,
267, 949, 3027, 267, 279, 330, 267, 84, 4414, 267,
6323, 1125, 267, 21682, 912, 620, 267, 1994, 9977, 267,
8157, 4353, 267, 8118, 12609, 3575, 267, 16157, 3575, 267,
1215, 22984, 28215, 267, 16575, 4605, 12609, 267, 11444, 537,
29616, 2353, 267, 3077, 27897, 267, 1860, 3073, 267, 3049,
3073, 267, 664, 23752, 3575, 267, 18105, 2812, 21339, 267,
949, 6052, 267, 949, 7068, 267, 25602, 5799, 267, 18768,
6052, 267, 1868, 794, 267, 16179, 281, 272, 269, 275,
2361, 263, 3878, 268, 15893, 1710, 267, 1878, 35016, 9083,
1710, 267, 8037, 1215, 2361, 29802, 537, 736, 8847, 267,
263, 2866, 3095, 2361, 3415, 1400, 1228, 2556, 267, 263,
330, 268, 2852, 8884, 267, 1526, 5915, 267, 8034, 2488,
2361, 263, 3005, 937, 706, 267, 1538, 7072, 267, 12364,
18459, 267, 2699, 30191, 267, 2699, 12995, 6609, 3905, 37637,
281, 272, 269, 275, 2361, 263, 272, 1611, 267, 5797,
1611, 281, 272, 269, 274, 2361, 263, 25602, 1093, 539,
320, 2582, 8405, 3540, 3874, 530, 320, 3694, 539, 320,
3144, 1704, 267, 3184, 1093, 267, 1769, 537, 12587, 267,
13106, 930, 281, 272, 269, 274, 2361, 263, 10789, 5135,
18503, 267, 1476, 1774, 16167, 3940, 593, 3946, 2614, 281,
272, 269, 274, 2361, 263, 40171, 7355, 267, 6631, 267,
3254, 320, 8670, 537, 1070, 1538, 11356, 267, 4652, 267,
820, 5684, 320, 11356, 281, 272, 269, 275, 2361, 263,
9920, 320, 1205, 4040, 525, 893, 281, 272, 269, 273,
2361, 42176, 696, 7373, 267, 49407]], device='cuda:0')
After tokenizing the prompt and negative prompt, the next step is to encode the tokens into embeddings, 77 tokens at a time. The result for the prompt is a tensor of shape [1, 256, 768].
max_length = pipe.tokenizer.model_max_length  # 77 for CLIP

concat_embeds = []
neg_embeds = []
for i in range(0, input_ids.shape[-1], max_length):
    # encode each window of at most 77 tokens separately
    concat_embeds.append(
        pipe.text_encoder(input_ids[:, i: i + max_length])[0]
    )
    neg_embeds.append(
        pipe.text_encoder(negative_ids[:, i: i + max_length])[0]
    )

# concatenate the per-window embeddings along the sequence dimension
prompt_embeds = torch.cat(concat_embeds, dim=1)
negative_prompt_embeds = torch.cat(neg_embeds, dim=1)
prompt_embeds.shape
torch.Size([1, 256, 768])
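To see why the shape comes out to [1, 256, 768], here is a dependency-free sketch of the concatenation step. The fake_encode function below is a stand-in for pipe.text_encoder, and the hidden size of 768 matches the CLIP text encoder used by Stable Diffusion v1 models:

```python
# Each 77-token window is encoded to a [1, window_len, 768] block;
# concatenating the blocks along the sequence dimension (dim=1)
# restores the full 256-token length.
HIDDEN = 768  # hidden size of the Stable Diffusion v1 text encoder

def fake_encode(window):
    # stand-in encoder: one 768-dim vector per token, batch dimension of 1
    return [[[0.0] * HIDDEN for _ in window]]

windows = [list(range(77)), list(range(77)), list(range(77)), list(range(25))]
embeds = [fake_encode(w) for w in windows]

# equivalent of torch.cat(embeds, dim=1) on nested lists
prompt_embeds = [sum((e[0] for e in embeds), [])]

print(len(prompt_embeds), len(prompt_embeds[0]), len(prompt_embeds[0][0]))  # 1 256 768
```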
The prompt and negative prompt embeddings can then be fed into the pipeline without being truncated, and the diffusion model will use the full prompt to generate images!
new_img = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    width=width,
    height=height,
    guidance_scale=guidance_scale,
    num_inference_steps=num_inference_steps,
    num_images_per_prompt=1,
    generator=torch.manual_seed(seed),
).images
This time round, with the prompt and negative prompt embeddings, the diffusion model generates the details that were originally truncated by CLIP, such as the medieval plate armour, the full body armor suit with chain mail, and the crusader knight!
However, the solution provided by Patrick von Platen above does not consider what happens if the negative prompt is longer than the prompt. An improved solution was posted by Andre van Zuydam on GitHub, which we recommend readers refer to (we will not copy the code here!). A modified version of this improved solution was used to generate the image above!
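The idea behind the symmetric fix is straightforward, even without reproducing the GitHub code: right-pad whichever token sequence is shorter so that both the prompt and negative prompt reach the same length before chunking. A minimal sketch on plain Python lists (the default pad token id of 49407 is CLIP's end-of-text token, visible at the end of the tensor above):

```python
# Sketch: pad BOTH token id sequences to the longer of the two, so the
# chunk-and-encode loop produces embeddings of matching sequence length
# regardless of which prompt is longer. pad_token_id is a stand-in for
# pipe.tokenizer.pad_token_id.
def pad_to_same_length(ids_a, ids_b, pad_token_id=49407):
    """Right-pad the shorter of two token id lists so both have equal length."""
    target = max(len(ids_a), len(ids_b))
    pad = lambda ids: ids + [pad_token_id] * (target - len(ids))
    return pad(ids_a), pad(ids_b)

prompt_ids = list(range(100))    # pretend 100-token prompt
negative_ids = list(range(240))  # pretend 240-token negative prompt
a, b = pad_to_same_length(prompt_ids, negative_ids)
print(len(a), len(b))  # 240 240
```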
References
- https://civitai.com/images/1321964?modelVersionId=105253&prioritizedUserIds=587535&period=AllTime&sort=Most+Reactions&limit=20
- https://github.com/huggingface/diffusers/issues/2136