How to create fancy artistic text effects using Stable Diffusion

With inpainting

XQ · Published in The Research Nest

Before we get started, I assume that you are already familiar with the following:

  • You know how to run and use various features of Stable Diffusion, locally or on Google Colab.
  • You know how to debug stuff when things don’t work correctly.
  • You have decent experience in programming and understand how to read/write code.
  • You already know all the terminology related to Stable Diffusion.

That said, I will focus only on the critical parts of an approach that can get this done. I won’t cover the environment setup, installations, etc. This is not a beginner-level tutorial. However, you can still read along casually to get a sense of what’s possible with Stable Diffusion.

Problem Statement

Image Source: Adobe

Recently, Adobe revealed Firefly, its new suite of generative AI tools. One of its applications is creating stylized text like the image above. Can we create similar effects using Stable Diffusion?

When I fed the same prompt directly to SD, here’s the result I got.

Stable Diffusion has never been good with text. One workaround is prompt engineering: you could prompt for objects in the shape of letters as a proxy for the letters themselves and get creative with it. But figuring out a prompt that works can be tricky. Is there another method that’s more precise and controllable?
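For instance (a hypothetical prompt of my own, not one from these experiments), “an aerial photo of a winding river shaped like the letter S” describes the shape through an object instead of asking the model to render the glyph directly.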

The Approach

There is a straightforward approach to try.

  1. Create a plain text image programmatically using PIL.
  2. Create a mask of this image.
  3. Use the Stable Diffusion inpainting model to fill the text regions with whatever we imagine.

Let’s try.

First, I created a few helper functions, create_image, create_mask, and create_text.

import torch
from diffusers import StableDiffusionInpaintPipeline, DPMSolverMultistepScheduler
from PIL import Image, ImageDraw, ImageFont
import numpy as np

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
)

pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Returns a PIL image of text in black at the center of a 512x512 white background
# Note: the font size needs to change with the length of the text
# (roughly font_size=500 for single letters, font_size=150 for 5-letter words)
def create_text(text, font_size):
    font = ImageFont.truetype("arialbd.ttf", font_size)

    # Create a new image with a white background
    image = Image.new("RGB", (512, 512), (255, 255, 255))

    # Get the size of the text and calculate its position in the center of the image
    text_size = font.getbbox(text)
    text_x = (image.width - text_size[2]) / 2
    text_y = (image.height - text_size[3]) / 2

    # A method to create thicker text on the image:
    # define the number of shadow layers to draw and the offset distance
    num_layers = 5
    offset = 2

    # Draw the text multiple times with a slight offset to thicken the strokes
    draw = ImageDraw.Draw(image)
    for i in range(num_layers):
        x = text_x + (i * offset)
        y = text_y + (i * offset)
        draw.text((x, y), text, font=font, fill=(0, 0, 0))

    # Draw the final text layer on top of the shadow layers
    draw.text((text_x, text_y), text, font=font, fill=(0, 0, 0))

    return image

# Returns a PIL image of the mask of the given image
# param = "letter" masks the letters for inpainting
# param = "background" masks the background
def create_mask(image, param):
    # Convert the image to grayscale
    gray_image = image.convert('L')

    # Convert the grayscale image to a numpy array
    gray_array = np.array(gray_image)

    # Threshold the array to create a binary mask
    threshold = 128

    # letter -> the letters are painted white
    # background -> the background is painted white
    # The white area is what gets inpainted
    if param == 'letter':
        mask_array = np.where(gray_array > threshold, 0, 255).astype(np.uint8)
    elif param == 'background':
        mask_array = np.where(gray_array > threshold, 255, 0).astype(np.uint8)
    else:
        raise ValueError("param must be 'letter' or 'background'")

    # Convert the mask array back to a PIL image
    mask_image = Image.fromarray(mask_array)

    return mask_image

# Sometimes, outputs are flagged as NSFW (false positives). Assigning this stub
# as the pipeline's safety_checker disables that filter.
# Returning a list of flags (one per image) matches what the pipeline expects.
def safety_checker(images, clip_input):
    return images, [False] * len(images)

# Saves and returns a PIL image generated by Stable Diffusion
def create_image(prompt, image, mask, file_name):
    # image and mask should be PIL images
    # The mask is white where we inpaint and black where we keep the original
    pipe.safety_checker = safety_checker
    output = pipe(
        prompt=prompt,
        image=image,
        mask_image=mask,
        num_inference_steps=50).images[0]
    output.save(file_name + ".png")
    return output

Let’s test with the Stable Diffusion 2 inpainting model, starting with a few single letters to see what we get.

prompt_1 = "A beautiful blue ocean"
prompt_2 = "Colorful outer space, starry sky"

letters = create_text("V", 500)
letter_mask = create_mask(letters, "letter")
letter_out = create_image(prompt=prompt_1, image=letters, mask=letter_mask, file_name="ocean2")
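For the second prompt, the call is the same apart from the prompt and the output file name (the file name below is my placeholder):

letter_out_2 = create_image(prompt=prompt_2, image=letters, mask=letter_mask, file_name="space")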

Here are the outputs with both prompts.

[Images: the text image, its mask, and the inpainted outputs for Prompt 1 and Prompt 2]

So, it works! However, it doesn’t give good results all the time. Sometimes, it outputs just a blank image for some reason. I am unsure what kind of prompt structure works best in this context.
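One mitigation worth trying (a minimal sketch of my own, not part of the original workflow): diffusers pipelines accept a generator argument, so we can sweep a few fixed seeds and keep the first output that isn’t near-blank. The create_image_with_retries helper and its standard-deviation blank check below are assumptions, reusing the pipe, torch, and np from the setup above.

# Sketch: retry generation across a few fixed seeds and skip near-blank outputs
def create_image_with_retries(prompt, image, mask, file_name, seeds=(0, 1, 2)):
    pipe.safety_checker = safety_checker
    output = None
    for seed in seeds:
        generator = torch.Generator(device="cuda").manual_seed(seed)
        output = pipe(
            prompt=prompt,
            image=image,
            mask_image=mask,
            num_inference_steps=50,
            generator=generator).images[0]
        # Heuristic: a near-uniform image (tiny pixel variance) is likely blank
        if np.array(output).std() > 10:
            break
    output.save(file_name + ".png")
    return output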

Let’s try a complete word, “HELLO.”

Not bad. I want to try one final experiment for this blog post.

Double Inpainting

What if we do inpainting twice? Once for the letter design and once for the empty background. Can we effectively create a coherent picture with text on it?

Let’s find out.

To do this, I simply create two masks, one for the letters and one for the background, then run inpainting twice, once with each mask.

prompt_1 = "Colorful rainbow paints splashed around"
prompt_2 = "A beautiful and calm sky scenery"

letters = create_text("HELLO", 150)
bg_mask = create_mask(letters, "background")
letter_mask = create_mask(letters, "letter")
letter_out = create_image(prompt=prompt_1, image=letters, mask=letter_mask, file_name="output_letter")
bg_out = create_image(prompt=prompt_2, image=letter_out, mask=bg_mask, file_name="output_bg")

Here’s the result.

Interesting.

I tried a few more prompts and combinations. Sometimes it gives a good result, but often it makes incomplete or blank images. I think the thickness of the font and mask needs to be improved further.
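One quick fix to try (a hedged sketch of my own, not something from these experiments): dilate the letter mask with PIL’s MaxFilter, which grows the white inpainting region past the glyph edges. The kernel size here is a guess to tune per font size.

from PIL import ImageFilter

# Sketch: thicken a binary mask by dilating its white regions
# MaxFilter takes an odd kernel size; larger values grow the mask further
def thicken_mask(mask_image, size=9):
    return mask_image.filter(ImageFilter.MaxFilter(size))

thick_letter_mask = thicken_mask(letter_mask)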

After many different iterations, this is probably one of the best pics I got. However, I think with the right set of prompts, you can get much better results.

Things to explore further

  1. How are the results when newer or more fine-tuned inpainting models are used?
  2. Can we create more complex images with text with the help of object localization and detection? Think, “An image of a signpost in a beautiful forest that says Hello.” To do this, we can create a background image of a beautiful forest with an empty signpost. We can then create a text image with PIL superimposing the coordinates of the signpost. Those coordinates can be obtained by object detection. We can then create a mask of the text in that image and perform an inpainting as required. What kind of results can we get?
  3. Can we create such images at other resolutions and aspect ratios?
  4. Can colors direct the model better? Instead of creating the text in black, what if we use colors relevant to the prompt? Will that affect the inpainting? (See the sketch after this list.)
  5. Is it possible to create better results by integrating with any other model?
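For the fourth idea, here is a speculative sketch: a create_colored_text variant (a hypothetical helper, not code from this post) that draws the letters in a prompt-relevant color instead of black. Note that create_mask thresholds a grayscale copy at 128, so very light fill colors may need a different threshold.

# Sketch: like create_text, but with a configurable fill color
# The default color here is an arbitrary example
def create_colored_text(text, font_size, fill=(30, 90, 200)):
    font = ImageFont.truetype("arialbd.ttf", font_size)
    image = Image.new("RGB", (512, 512), (255, 255, 255))
    bbox = font.getbbox(text)
    x = (image.width - bbox[2]) / 2
    y = (image.height - bbox[3]) / 2
    draw = ImageDraw.Draw(image)
    draw.text((x, y), text, font=font, fill=fill)
    return image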

I may explore the second and fourth ideas in a future tutorial. #StayTuned
