Using Long Prompts with the Diffusers Package with Prompt Embeddings

Overcoming diffusers CLIP 77 token limit.

Y. Natsume
5 min readJul 27, 2023

In a previous article, I demonstrated how to use the diffusers package to generate synthetic images from text prompts. It turns out that generating highly detailed images requires extremely detailed prompts.

Unfortunately, by default diffusers truncates tokenized prompts longer than 77 tokens! This greatly restricts the amount of information we can include in the prompt, and therefore the amount of detail we can generate in the image.

Thankfully, there are tricks to overcome this restriction, and in this article we will explore how to use prompt embeddings to get past the 77 token limit!

Diffusers’ CLIP Truncates Long Prompts

For example, this image uploaded by wnwntk511 on CivitAI uses an extremely long prompt of 56 words, which CLIP tokenizes to 256 tokens. If this prompt is passed directly to the diffusers pipeline, the following warning message is displayed.

images = pipe(
    prompt = prompt,
    negative_prompt = negative_prompt,
    width = width,
    height = height,
    guidance_scale = guidance_scale,
    num_inference_steps = num_inference_steps,
    num_images_per_prompt = 1,
    generator = torch.manual_seed(seed),
).images



The following part of your input was truncated because CLIP can
only handle sequences up to 77 tokens:
['cinematic lighting, delicate illustration, official art,
aesthetic : 1. 4 ), ( golden - ratio face, perfect proportioned face,
perfectly beautiful ), glossy and red lips, ( brown eyes ),
senior high school student, ( k - pop idol, miss korea, korean beauty ),
( short torso, long legs, slim waist, huge hips, huge naturally
sagging breasts : 1. 4 ), ( 1 girl, solo girl : 1. 3 ),
( cinematic view of a wonderful burning castle built in a middle of
a dark wood, wide view, fire and ashes, cloudy night : 1. 3 ),
( medieval plate armour, full body armor suit with chain mail : 1. 3 ),
( crusader knight, cape, hold a shield and very long sword, fighting,
wield a sword : 1. 4 ), ( carrying a big bow on back : 1. 2 ),
chastity belt,']

As implied by the warning, a good part of the prompt is truncated and never reaches the diffusion model. As a result, the generated images do not contain the truncated elements such as the medieval plate armour, full body armor suit with chain mail, or crusader knight.

Images generated with the truncated prompt do not contain truncated elements. Image generated by the author.

One option is to reduce the prompt’s length by removing words; however, as mentioned, this also reduces the amount of detail we can generate.

Fortunately, there are some tricks we can use with the diffusers package to get around this restriction and input long prompts into the pipeline. One trick is to use prompt embeddings.
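The core idea behind the prompt embedding trick can be sketched in plain Python before touching the real pipeline: split the token sequence into chunks of at most 77 tokens, encode each chunk, and concatenate the results. The `dummy_encoder` below is a hypothetical stand-in for the real CLIP text encoder, used only to illustrate the chunking logic:

```python
MAX_LENGTH = 77  # CLIP's context window

def dummy_encoder(chunk):
    # Hypothetical stand-in for the CLIP text encoder:
    # maps each token id to a fake 3-dimensional "embedding".
    return [[tid] * 3 for tid in chunk]

def encode_long_prompt(token_ids):
    # Encode the token sequence in chunks of at most MAX_LENGTH,
    # then concatenate the chunk embeddings along the sequence axis.
    embeds = []
    for i in range(0, len(token_ids), MAX_LENGTH):
        embeds.extend(dummy_encoder(token_ids[i:i + MAX_LENGTH]))
    return embeds

long_prompt = list(range(256))          # pretend 256-token prompt
embeddings = encode_long_prompt(long_prompt)
print(len(embeddings))                  # 256: nothing was truncated
```

Because each chunk fits within CLIP’s context window, no part of the prompt is dropped; the full 256-token sequence survives as 256 embedding vectors.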

Use Prompt Embeddings to Encode Long Prompts

Patrick von Platen suggests using prompt embeddings with the diffusers StableDiffusionPipeline.

The first step is to tokenize the prompt and the negative prompt. Note that this snippet assumes the prompt is longer than the negative prompt; we will return to the opposite case later.

input_ids = pipe.tokenizer(
    prompt,
    return_tensors="pt",
    truncation=False,
).input_ids.to("cuda")

negative_ids = pipe.tokenizer(
    negative_prompt,
    truncation=False,
    padding="max_length",
    max_length=input_ids.shape[-1],
    return_tensors="pt",
).input_ids.to("cuda")

For example, the long prompt provided by wnwntk511 on CivitAI is tokenized to a tensor with 256 elements (which CLIP would otherwise truncate to 77):

input_ids


tensor([[49406, 263, 1125, 268, 16157, 267, 1532, 1125, 267, 12066,
267, 949, 3027, 267, 279, 330, 267, 84, 4414, 267,
6323, 1125, 267, 21682, 912, 620, 267, 1994, 9977, 267,
8157, 4353, 267, 8118, 12609, 3575, 267, 16157, 3575, 267,
1215, 22984, 28215, 267, 16575, 4605, 12609, 267, 11444, 537,
29616, 2353, 267, 3077, 27897, 267, 1860, 3073, 267, 3049,
3073, 267, 664, 23752, 3575, 267, 18105, 2812, 21339, 267,
949, 6052, 267, 949, 7068, 267, 25602, 5799, 267, 18768,
6052, 267, 1868, 794, 267, 16179, 281, 272, 269, 275,
2361, 263, 3878, 268, 15893, 1710, 267, 1878, 35016, 9083,
1710, 267, 8037, 1215, 2361, 29802, 537, 736, 8847, 267,
263, 2866, 3095, 2361, 3415, 1400, 1228, 2556, 267, 263,
330, 268, 2852, 8884, 267, 1526, 5915, 267, 8034, 2488,
2361, 263, 3005, 937, 706, 267, 1538, 7072, 267, 12364,
18459, 267, 2699, 30191, 267, 2699, 12995, 6609, 3905, 37637,
281, 272, 269, 275, 2361, 263, 272, 1611, 267, 5797,
1611, 281, 272, 269, 274, 2361, 263, 25602, 1093, 539,
320, 2582, 8405, 3540, 3874, 530, 320, 3694, 539, 320,
3144, 1704, 267, 3184, 1093, 267, 1769, 537, 12587, 267,
13106, 930, 281, 272, 269, 274, 2361, 263, 10789, 5135,
18503, 267, 1476, 1774, 16167, 3940, 593, 3946, 2614, 281,
272, 269, 274, 2361, 263, 40171, 7355, 267, 6631, 267,
3254, 320, 8670, 537, 1070, 1538, 11356, 267, 4652, 267,
820, 5684, 320, 11356, 281, 272, 269, 275, 2361, 263,
9920, 320, 1205, 4040, 525, 893, 281, 272, 269, 273,
2361, 42176, 696, 7373, 267, 49407]], device='cuda:0')

After tokenizing the prompt and negative prompt, the next step is to encode the tokens into embeddings, in chunks of max_length = 77 tokens (CLIP’s maximum). The result for the prompt is a tensor of shape [1, 256, 768].

max_length = pipe.tokenizer.model_max_length  # 77 for CLIP

concat_embeds = []
neg_embeds = []
for i in range(0, input_ids.shape[-1], max_length):
    # Encode each chunk of at most 77 tokens separately.
    concat_embeds.append(
        pipe.text_encoder(input_ids[:, i: i + max_length])[0]
    )
    neg_embeds.append(
        pipe.text_encoder(negative_ids[:, i: i + max_length])[0]
    )

# Concatenate the chunk embeddings along the sequence dimension.
prompt_embeds = torch.cat(concat_embeds, dim=1)
negative_prompt_embeds = torch.cat(neg_embeds, dim=1)

prompt_embeds.shape


torch.Size([1, 256, 768])

The prompt and negative prompt embeddings can then be fed into the pipeline without being truncated, and the diffusion model will use the full prompt to generate images!

new_img = pipe(
    prompt_embeds = prompt_embeds,
    negative_prompt_embeds = negative_prompt_embeds,
    width = width,
    height = height,
    guidance_scale = guidance_scale,
    num_inference_steps = num_inference_steps,
    num_images_per_prompt = 1,
    generator = torch.manual_seed(seed),
).images
With prompt embeddings, long prompts will be input into the model without truncation and the image will be generated according to the prompt. Image created by the author.

This time round, with prompt and negative prompt embeddings, the diffusion model generates the details which were originally truncated by CLIP, such as the medieval plate armour, full body armor suit with chain mail, and crusader knight!

The solution provided by Patrick von Platen above does not, however, take into consideration what happens if the negative prompt is longer than the prompt. An improved solution was posted by Andre van Zuydam on GitHub, which we recommend readers refer to (we will not copy the code here!). A modified version of this improved solution was used to generate the image above!
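While we will not reproduce that code, the general idea behind handling both cases can be sketched simply: tokenize both sequences without truncation, then pad the shorter one to the longer one’s length before encoding. A minimal sketch of this padding step (not the GitHub code; the lists below are pretend token sequences):

```python
# CLIP's end-of-text token id, which also serves as padding.
# It appears as the final id (49407) in the tokenized tensor above.
PAD_TOKEN_ID = 49407

def pad_to_same_length(ids_a, ids_b, pad_id=PAD_TOKEN_ID):
    # Pad whichever sequence is shorter so both have equal length,
    # regardless of which of the two is longer.
    target = max(len(ids_a), len(ids_b))
    pad = lambda ids: ids + [pad_id] * (target - len(ids))
    return pad(ids_a), pad(ids_b)

prompt_ids = list(range(100))     # pretend 100-token prompt
negative_ids = list(range(256))   # pretend 256-token negative prompt

a, b = pad_to_same_length(prompt_ids, negative_ids)
print(len(a), len(b))             # 256 256
```

With both sequences the same length, the chunked encoding loop from earlier works unchanged no matter which prompt is longer.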

References

  1. https://civitai.com/images/1321964?modelVersionId=105253&prioritizedUserIds=587535&period=AllTime&sort=Most+Reactions&limit=20
  2. https://github.com/huggingface/diffusers/issues/2136
