Making an AI-Assisted Game in 14 Days: Stable Diffusion (Part 3/4)
Unlimited Power, Sometimes
I made progress with my little game, Chibi Toss. After getting user interaction and collision working with ChatGPT's help, I needed images to replace my basic shapes.
To do that, I installed the Automatic1111 web UI on my computer, opened it in my browser, and ran Stable Diffusion v1.5.
Day 7
Image Generation: The First Attempts
When you first start Automatic1111, you will get a screen like the following:
The basic and intuitive use of Stable Diffusion is text-to-image (txt2img), which is the default tab. Put in a text prompt, get out an image.
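(A quick aside for the curious: the web UI is point-and-click, but it's driving the same kind of pipeline you could call from Python. Here's a rough sketch using Hugging Face's diffusers library; the model ID and settings are just illustrative, not what Automatic1111 uses internally.)

```python
# Rough txt2img sketch with the diffusers library (not Automatic1111 itself).
# The model ID and settings here are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # Stable Diffusion v1.5 weights
    torch_dtype=torch.float16,
).to("cuda")

# Put in a text prompt, get out an image.
image = pipe("a cute panda eating bamboo", num_inference_steps=30).images[0]
image.save("panda.png")
```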
But when you run it, you can get funky results like the following:
Sure, it’s essentially what you want, which is already very amazing, but what about those eyes? And those claws?
And, more importantly in my case, what if this photorealistic style is not what I want for my game?
Stable Diffusion was trained on a lot of data. A lot. As a result, it can handle an impressive range of different visual styles, but it does not focus on any one of them. If you ask for something too specific, like an uncommon subject (a panda here), a game character, or a particular digital artist's style, you may not get the result you want.
Using Stable Diffusion’s default model, I could get an overall concept correct, but not quite the results I hoped for.
Let’s Improve: Models and Image Websites
In my case, I wanted chibi characters, which meant more of an anime style.
Lucky for me, there are many AI image websites which produce anime images, because Internet. You might have already seen a bunch.
What you might not know is that some websites provide not just AI images, but also checkpoint models.
Checkpoint models are pre-trained Stable Diffusion models fine-tuned to make certain kinds of images.
The terminology is a little confusing: technically they are "weights" which are loaded into the Stable Diffusion model.
A deep learning model like Stable Diffusion is built from nodes arranged in layers, and each node's inputs are multiplied by learned values known as weights. Since the weights are what actually define the model's behavior, people call the weight files themselves "models" as a shorthand.
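(If you want a concrete picture of "inputs multiplied by weights", here's a toy example; it has nothing to do with Stable Diffusion's actual architecture, it just shows the idea for one node in one layer.)

```python
# Toy illustration of "inputs multiplied by weights": a single node.
# Stable Diffusion has roughly a billion of these weights, and a checkpoint
# file is essentially all of those numbers saved to disk.
inputs  = [0.2, 0.7, 0.1]    # values coming into the node
weights = [1.5, -0.3, 0.8]   # learned numbers stored in the checkpoint

output = sum(x * w for x, w in zip(inputs, weights))
print(output)  # 0.2*1.5 + 0.7*(-0.3) + 0.1*0.8 is roughly 0.17
```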
Ok, whatever, I hear you say. Where can I find such models?
There are a lot of websites that provide thousands of models for free, but the most famous is CivitAI. It hosts not only AI images, but also models for Stable Diffusion, and even the prompts other people have used in their own images.
I tried three different checkpoint models.
In the end, I settled on AnyLoRA.
It's an anime model, and, planning ahead, I thought it would work well later when I needed the same chibi to have different facial expressions. I had a feeling that would not be easy, since AI in general has problems with consistency.
Prompts: Unlimited Power
I was surprised by how much the results varied and how difficult it was to get even a single image that showed precisely what I wanted.
My goal was mainly a not-blurry anime person, in plain clothes, standing in a neutral pose, with a big head and little body.
Before using the AnyLoRA model, I was getting blurry images with other models, people in odd, distorted poses, or even multiple people in the same image. I found a simple prompt was best.
My 1st chibi came from the following prompt:
girl, chibi, game sprite, brown hair, brown eyes, blue shirt, blue pants, standing
Not bad! Now, I just needed to remove the background, make it transparent, and crop the sprite.
Photopea: The Clutch
I knew my art background was weak. I needed something like Photoshop, but wanted something free and online that had plenty of help in case I got stuck on something basic.
Enter a very clutch tool: Photopea.
I think a digital artist would have a much easier time, but I ended up spending a lot of time in Photopea. This was simply me struggling with basic image manipulation: remove the background with the magic wand, erase any remaining parts, crop, and so on.
Result:
Eh. It wasn't perfect. There was still a faint outline, but my overall goal was to make something quicker than usual and offload tasks to the AI.
I made 2 more chibis, for 3 total.
I kept the default settings in Automatic1111 and used no "negative" prompt, which is for when you want to de-emphasize certain things. I only used the top "positive" text box you normally use. After generating, I repeated my editing steps in Photopea.
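(If you're wondering what a negative prompt looks like in practice, here's a hypothetical example reusing the diffusers sketch from earlier; in the Automatic1111 UI it's just the second text box under the positive one.)

```python
# Hypothetical negative prompt; the terms listed are common things people
# try to steer away from, not anything I actually used.
image = pipe(
    "girl, chibi, game sprite, brown hair, brown eyes, blue shirt, blue pants, standing",
    negative_prompt="blurry, extra limbs, photorealistic",
).images[0]
```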
2nd chibi prompt:
girl, chibi, game sprite, brown hair, brown eyes, blue shirt, blue pants, standing, pure white background
Result:
3rd chibi prompt:
girl, chibi, game sprite, pink hair, red eyes, white shirt, blue pants, standing, pure white background
Result:
Then I added the images into the game’s assets folder and my code. This required a lot of talking with ChatGPT and then modifying snippets of code it gave me.
Along the way, I decided the sprites should be resized smaller for the game. However, I wanted to keep them proportional to their original cropped size in pixels to avoid distortion, so I used an aspect ratio calculator.
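(An aspect ratio calculator is really just doing one line of arithmetic. A hypothetical Pillow snippet that does the same proportional resize, with a made-up target width, would look like this.)

```python
# Proportional resize with Pillow: pick a new width, derive the height from
# the original aspect ratio so the sprite doesn't get distorted.
from PIL import Image

sprite = Image.open("chibi1.png")   # hypothetical cropped sprite
new_width = 120                     # made-up target size for the game
new_height = round(sprite.height * new_width / sprite.width)

sprite.resize((new_width, new_height)).save("chibi1_small.png")
```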
You will notice there are some nonsense words on the shirts; that's from the AI. You could hand-edit them out. You will also notice the AI did not follow my specified clothing colors exactly. I was fine with it, but that could be something you need to edit as well.
As it was, I was already spending way more time image editing than I liked. It wasn’t all manual though. In fact, to get those result images I ended up using another feature of Stable Diffusion.
Inpaint Time: Fixing Images
Take this originally generated image straight from Stable Diffusion:
If you look very closely at the top, you can see the chibi has its left ear slightly cut off.
I wasn’t sure how to fix this.
However, I had Stable Diffusion. One of its features other than txt2img is called inpaint. Inpaint attempts to fill in missing or distorted parts of an image based on surrounding pixel values.
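(The UI handles all of this with a brush and a button, but for a sense of what's going on underneath, here's a rough diffusers inpainting sketch. The file names and the dedicated inpainting checkpoint are my own assumptions for illustration.)

```python
# Rough inpainting sketch with diffusers: supply the image, a black-and-white
# mask marking the area to regenerate, and a prompt for what should go there.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # illustrative inpainting checkpoint
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("chibi_cut_ear.png")  # hypothetical file names
mask_image = Image.open("ear_mask.png")       # white = area to repaint

fixed = inpaint(
    prompt="pink fox ears",
    image=init_image,
    mask_image=mask_image,
).images[0]
fixed.save("chibi_fixed.png")
```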
I went to the img2img->inpaint tab in Automatic1111 and set the option "inpaint area -> only masked". Otherwise I kept the default settings. Then I marked the area to fix by painting a mask over it.
You still need to provide a prompt. I put
pink fox ears
This seems great. When it works. I was disappointed to see inpaint was very much hit or miss.
Sometimes it would fix an image. Sometimes it would in fact make an image even worse. It would add smudges, or even an extra person if I put in a bad text prompt. I attempted to fix the shirts with inpaint, to remove the nonsense words, but the edges of the words were never fully removed.
Still, I needed to proceed. Well, now that I had chibis, it was time to put them to bed.
Img2Img: Making a Bed
Other than text-to-image (txt2img), you can also generate images with image-to-image (img2img).
You give the model not only a text prompt but also your own image. To do it, I went to img2img->Sketch and uploaded a plain white background image. I drew a scribble of a blue bed with a white pillow. Something like:
I ended up changing prompts many times and re-coloring or erasing my sketch.
The results were not great, to be blunt.
I got partial beds, beds that still kept the scribbles, beds that were more photorealistic than cartoonish, and beds that had people in them.
To be fair, the model was meant for people, not household items.
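(Under the hood, the Sketch tab is just img2img with a drawing canvas on top. A rough diffusers equivalent, with the file name, prompt, and strength value as my own guesses, would be:)

```python
# Rough img2img sketch with diffusers: the scribble goes in as the starting
# image and the prompt describes what it should turn into.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

scribble = Image.open("bed_scribble.png")  # hypothetical rough sketch

result = img2img(
    prompt="a blue bed with a white pillow, side view, pure white background",
    image=scribble,
    strength=0.75,  # how far the result is allowed to stray from the sketch
).images[0]
result.save("bed.png")
```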
Eventually, I got a half-decent result with the prompt:
A comfy bed, side view, white pillow and blue blanket, pure white background
For whatever reason, the model decided to add an extra pillow. I awkwardly removed it in Photopea, cropped the sprite, and added it to my code.
Result:
Day 8–12
Background and Menus
Next was generating a background. I adjusted the output image size in Stable Diffusion to 1080 px wide by 1920 px high and used the prompt:
The interior of an empty mansion, side view, comfy and cozy
A lot of the images didn't work out. They also didn't visually match my sprites, which were more cartoonish than realistic. I eventually settled on one.
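(If you're following along in code rather than the UI, a custom output size is just two extra parameters on the same txt2img call from before.)

```python
# Same txt2img idea as earlier, with an explicit portrait output size.
background = pipe(
    "The interior of an empty mansion, side view, comfy and cozy",
    width=1080,
    height=1920,
).images[0]
background.save("background.png")
```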
The next parts I needed were assets for the score, a help menu, and a win menu. I wanted these to be AI-generated too, but the results were bad. Eventually, I chose to build those assets from online images and by combining existing ones.
Concept LoRAs: Facial Expressions
Now came the most challenging and frustrating part of working with AI. I wanted each chibi to have the same appearance, but different facial expressions. These were to inform the user of what was happening in the game without needing any text to say so.
I decided to use LoRAs. LoRA stands for Low-Rank Adaptation; it's a small model that applies targeted changes to a checkpoint model. You can find them on sites like CivitAI, and they're used to steer images toward a particular concept, character, or style.
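(In the web UI, you drop the LoRA file into the models/Lora folder and reference it in the prompt with the <lora:name:weight> syntax you'll see below. In code, a rough diffusers sketch looks like this; the checkpoint ID and LoRA file name are placeholders.)

```python
# Rough sketch of applying a LoRA on top of a checkpoint with diffusers.
# The scale plays the same role as the weight in <lora:name:weight>.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "Lykon/AnyLoRA",                   # placeholder checkpoint ID
    torch_dtype=torch.float16,
).to("cuda")

pipe.load_lora_weights(".", weight_name="comic_expression.safetensors")  # placeholder file

image = pipe(
    "1girl, chibi, smile, open mouth",
    cross_attention_kwargs={"scale": 1.0},  # LoRA strength
).images[0]
```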
To get a silly face, I picked a set of comic-expression LoRAs.
At least 3 facial expressions were needed for the game:
- Happy face when selected
- Surprised face when hitting a bed
- Sleeping face when resting on a bed
After trial and error, eyes-1-1, eyes-4-4, and eyes-2-2 were chosen.
Then I needed to run inpainting on my cropped chibi sprites, applying each LoRA to the image during the inpaint.
This turned out to be tricky. Often the expression failed to appear in a recognizable way.
Worse, a lot of dark smudging appeared near the areas where I inpainted, usually below the eyes. Parts of the eyes and mouth would vanish entirely, and parts would blur together. I spent a lot of time attempting to figure out why this was happening. The smudging especially bothered me.
The results improved slightly when I made sure to mark a mask that went beyond the eyes and mouth. This gave Stable Diffusion more context from the rest of the image beyond just the masked part.
The prompts were:
Happy face, when selected:
<lora:hotarueye_comic1_v100:1> (chibi:1.4), smile, open mouth
Surprised face, when hit:
<lora:hotarueye_comic4_v200:1> (chibi:1.4), smile, open mouth
Sleep face, when resting:
<lora:hotarueye_comic2_v100:1> (chibi:1.4), smile, open mouth
Notice I had to put emphasis on "chibi". In Stable Diffusion, each word in a prompt is weighted so the model can decide which parts of your prompt to focus on more. The default is 1.0.
Increasing the weight, up to around 2.0, makes it more likely you'll get that word. Decreasing it below 1.0, toward 0.0, makes it less likely that word will show up in the result.
It helped when I increased the weight of the word to 1.4, indicated with the parentheses syntax. The number was based on other people's prompts posted on the expression LoRA's page.
If you look carefully, the appearance is still not quite right: smudging and a partly missing mouth, for example.
I think in this case, since the change is only simple black lines, a digital artist could do the job more easily and much more cleanly. But a LoRA may be helpful when dealing with a more complex expression.
Day 13
Character LoRAs
This is all nice, but what if I wanted a sprite to resemble a particular character more closely?
You can use LoRAs for this too. I wanted 2 more chibis based on existing characters. For this game I wasn't picky, so I chose popular figures that would likely already have plenty of chibi-style art of them. I went to the CivitAI website and chose a LoRA for the Vocaloid Hatsune Miku and a LoRA for the VTuber Gawr Gura.
Back to txt2img for generation.
<lora:gura:0.6> gawr gura, 1girl, (chibi:1.4), big head, small body, game sprite, black hair, blue eyes, white shirt, blue pants, standing, pure white background
I repeated the process from my previous sprites: editing in Photopea, then applying the expression LoRAs to the cropped sprite using inpainting.
Result:
One more chibi.
<lora:hatsunemiku1:0.6> mikudef, 1girl, (chibi:1.4), big head, small body, game sprite, standing, pure white background
Because her hair might get in the way and look strange when the sprite bounced around the screen, I gave Miku a bad haircut.
Result:
Day 14
The clock was ticking. I made some last edits in Photopea, stuck in the new sprite, and called it a day.
Success?
Well, I finished.
The results were a little underwhelming, but I think a lot of that was on me and not the tools.
Using AI tools effectively is a skill. You need to understand the latest and greatest approaches people are taking, which requires paying attention to other people's work and then trying those approaches out yourself. And then making mistakes. Like any skill, it takes time and experience to acquire.
On the plus side, I was able to whip something out in a short amount of time that I'm not sure I'd have been able to finish at all if I had not used AI. It put me over humps where I might otherwise have gotten discouraged. Life happens. You get sick, you get busy. In my opinion, it definitely helped me be more productive.
Thanks for reading. The last post of this series will summarize my thoughts. It’ll be highlights and future considerations.