Stable Diffusion + ControlNet + Texture Projection Workflow with Blender

mattzq
5 min read · May 17, 2023


Workflow for Stable Diffusion + ControlNet

I’ve been experimenting lately with image generation in Dreamstudio and Midjourney, which really inspired me to explore game development again. I’m more of a coder than a graphics artist, so I feel these advances in AI could finally let me express myself in a visual medium such as game development. It also feels more ethical in a hobby-programmer context: I could never afford to hire artists myself, so I’m not taking away anyone’s livelihood. Although I suppose that is a bigger area of discussion.

Diving deeper into the subject, I realized that while Midjourney produces some amazing images, its utility in game development is limited by a lack of consistency and by limited control over perspective and viewing angle.

Stable Diffusion models, as well as paid services such as Midjourney and Dreamstudio, are given a text prompt and generate an image from it. While you may describe your “barrels and crates” in such prompts as “isometric, orthographic perspective”, this lacks the fine-grained control you would need if you ever hoped to use these models to create game assets.

This lack of control in Stable Diffusion has very recently been addressed by ControlNet. This new model extends Stable Diffusion and provides a level of control that is exactly the missing ingredient for solving the perspective issue when creating game assets.

ControlNet offers a collection of models to add such control, for instance Canny edges, scribbles, depth and normal maps, and human poses.

Workflow

This article documents a workflow I’ve been experimenting with. In this example I’m working on a background image in a fixed isometric, orthographic perspective (54.736 degrees by 45 degrees), similar to the pre-rendered backgrounds you might have seen in older CRPGs.
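For reference, a camera with this kind of perspective can be set up from Blender’s Python console roughly like this. It is only a sketch: the object name “Camera” and the orthographic scale are placeholders for whatever your scene actually uses.

```python
import bpy
import math

# Assumes the scene already has a camera object named "Camera".
cam = bpy.data.objects["Camera"]

cam.data.type = 'ORTHO'       # orthographic projection, no perspective distortion
cam.data.ortho_scale = 10.0   # placeholder; adjust until the scene fills the frame

# Classic "true isometric" angles: ~54.736 degrees down from vertical,
# rotated 45 degrees around the Z axis.
cam.rotation_euler = (math.radians(54.736), 0.0, math.radians(45.0))
```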

Through some searching I found a reference for something I wanted to create, a medieval tavern-type interior. I quickly sketched it in Blender:

Rough sketch in Blender.

Next I rendered a depth map of the scene. I decided to render the crates and barrels separately, since it didn’t work so well when everything was together. This also has the benefit of giving you more fine-grained control over the prompting.

For the depth maps I used this compositing node setup; you also need to enable View Layer Properties → Passes → Data → Z:

Blender compositing nodes for depth maps.
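If you prefer scripting over wiring nodes by hand, something along these lines builds a common depth-map chain (Render Layers → Normalize → Invert → Composite). This may not match my node graph exactly, and whether you want the Invert depends on whether your ControlNet depth model expects near surfaces to be white:

```python
import bpy

scene = bpy.context.scene
bpy.context.view_layer.use_pass_z = True   # View Layer Properties -> Passes -> Data -> Z
scene.use_nodes = True

tree = scene.node_tree
tree.nodes.clear()

rl = tree.nodes.new("CompositorNodeRLayers")      # provides the "Depth" output ("Z" in older versions)
norm = tree.nodes.new("CompositorNodeNormalize")  # squashes raw Z distances into 0..1
inv = tree.nodes.new("CompositorNodeInvert")      # flip so near = white, far = black
comp = tree.nodes.new("CompositorNodeComposite")

tree.links.new(rl.outputs["Depth"], norm.inputs[0])
tree.links.new(norm.outputs[0], inv.inputs["Color"])
tree.links.new(inv.outputs["Color"], comp.inputs["Image"])
```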

I scaled up the barrels and crates, since the important information is their orientation relative to the camera:

Depth maps of the scene, crates and barrels.

Next I baked the normal maps as well; for this I used the following material setup:

Blender material nodes for normal maps.

The nodes are necessary to render the normal maps in the format expected by ControlNet. I’m not 100% sure if this is a correct setup. I found the node setup mentioned in this Blender Artists thread.
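For the curious, my reading of that setup can also be expressed in Python: take the geometry normal, transform it from world to camera space, remap it from [-1, 1] to [0, 1], and emit it as color. This is only a sketch, and the channel conventions ControlNet expects may still require flipping individual axes:

```python
import bpy

# Hypothetical override material that renders camera-space normals as colors.
mat = bpy.data.materials.new("NormalMapOverride")
mat.use_nodes = True
nodes, links = mat.node_tree.nodes, mat.node_tree.links
nodes.clear()

geo = nodes.new("ShaderNodeNewGeometry")        # world-space surface normal
xform = nodes.new("ShaderNodeVectorTransform")  # world -> camera space
xform.vector_type = 'NORMAL'
xform.convert_from = 'WORLD'
xform.convert_to = 'CAMERA'

remap = nodes.new("ShaderNodeVectorMath")       # remap [-1, 1] to [0, 1]
remap.operation = 'MULTIPLY_ADD'
remap.inputs[1].default_value = (0.5, 0.5, 0.5)
remap.inputs[2].default_value = (0.5, 0.5, 0.5)

emit = nodes.new("ShaderNodeEmission")
out = nodes.new("ShaderNodeOutputMaterial")

links.new(geo.outputs["Normal"], xform.inputs["Vector"])
links.new(xform.outputs["Vector"], remap.inputs[0])
links.new(remap.outputs["Vector"], emit.inputs["Color"])
links.new(emit.outputs["Emission"], out.inputs["Surface"])
```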

Normal maps of the scene, crates and barrels.

To use Stable Diffusion and ControlNet I used AUTOMATIC1111’s amazing stable-diffusion-webui and Mikubill’s sd-webui-controlnet extension, with the SD model AyoniMix v6 and the ControlNet 1.1 weights.
To run the webui with Docker I found this great Docker Compose setup.

I did not spend a lot of time tinkering with the hyperparameters: I used the Euler sampling method, 15 sampling steps, a width/height of 768, and a CFG scale of 7.
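I drove everything through the web UI, but for completeness: the same settings can also be sent to the webui’s API (when it is started with --api). The sketch below is not the exact request I used, and the ControlNet field names (“input_image”, “module”, “model”) have shifted between extension versions, so double-check them against your install:

```python
import base64
import requests

URL = "http://127.0.0.1:7860"  # default webui address

# Hypothetical file name for one of the rendered depth maps.
with open("tavern_depth.png", "rb") as f:
    depth_b64 = base64.b64encode(f.read()).decode()

payload = {
    "prompt": "medieval tavern, support beams, stone floor, ...",
    "negative_prompt": "shadows, torch, fire, lamp, light, ...",
    "sampler_name": "Euler",
    "steps": 15,
    "width": 768,
    "height": 768,
    "cfg_scale": 7,
    "alwayson_scripts": {
        "controlnet": {
            "args": [{
                "input_image": depth_b64,
                "module": "none",  # the depth map is already rendered, so no preprocessor
                "model": "control_v11f1p_sd15_depth",  # name as it appears in your model list
            }]
        }
    },
}

resp = requests.post(f"{URL}/sdapi/v1/txt2img", json=payload, timeout=600)
resp.raise_for_status()
with open("tavern_generated.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))
```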

Stable Diffusion + ControlNet generated images.

I’m no prompt expert, but just for reference’s sake, these are the prompts I used:

Building:
medieval tavern, support beams, stone floor, wood walls, building interior, interior, diagram overlook, thunderstorm, isometric cutaway, art by artist, 3d render, stylized, intricate, 4k uhd, gradients, (centered:1.5), ambient occlusion, (soft shading:0.7), view from above, angular, isometric, orthographic, greg rutowski, square enix, unreal engine 5, FXAA

Barrels:
wood barrels, wooden barrels, oak barrel, vertical wood, diagram overlook, thunderstorm, isometric cutaway, art by artist, 3d render, stylized, intricate, 4k uhd, gradients, (centered:1.5), ambient occlusion, (soft shading:0.7), view from above, angular, isometric, orthographic, greg rutowski, square enix, unreal engine 5, FXAA

Crates:
old, wooden crates, metal handles, storage crate, dark oak crate, storage boxes, ((metal frames)), rusty metal, diagram overlook, thunderstorm, isometric cutaway, art by artist, 3d render, stylized, intricate, 4k uhd, gradients, (centered:1.5), ambient occlusion, (soft shading:0.7), view from above, angular, isometric, orthographic, greg rutowski, square enix, unreal engine 5, FXAA

Negative Prompt:
shadows, torch, fire, lamp, light, cartoon, zombie, disfigured, deformed, b&w, black and white, duplicate, morbid, cropped, out of frame, clone, photoshop, tiling, cut off, patterns, borders, (frame:1.4), symmetry, signature, text, watermark, fisheye, harsh lighting

Finally, we can use texture projection to create UV mappings from the generated images directly, which gives us a textured scene (from the camera angle).
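One way to do this kind of camera projection in Blender is a UV Project modifier that uses the render camera as the projector. This is only a sketch of that approach; the object, camera, and file names are placeholders, and it assumes the mesh already has a UV map:

```python
import bpy

obj = bpy.data.objects["Tavern"]     # placeholder mesh object
cam = bpy.data.objects["Camera"]     # the camera the scene was rendered from
img = bpy.data.images.load("//tavern_generated.png")  # placeholder generated image

# Project the mesh's existing UV layer from the camera's point of view.
mod = obj.modifiers.new("CameraProject", type='UV_PROJECT')
mod.uv_layer = obj.data.uv_layers.active.name
mod.projectors[0].object = cam
mod.aspect_x = mod.aspect_y = 1.0    # the 768x768 renders are square

# Simple material that samples the projected image.
mat = bpy.data.materials.new("ProjectedTexture")
mat.use_nodes = True
nodes, links = mat.node_tree.nodes, mat.node_tree.links
tex = nodes.new("ShaderNodeTexImage")
tex.image = img
bsdf = nodes["Principled BSDF"]
links.new(tex.outputs["Color"], bsdf.inputs["Base Color"])
obj.data.materials.append(mat)
```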

Composite render of the final scene.

It’s even possible to light the scene:

Composite render using a “Sun” light.

I clearly didn’t spend much time on prompt engineering or on lighting in Blender (none at all, actually), but you get the picture. I think this produces some very interesting results, although in this particular example it could be argued that it would have been easier to just drag and drop some texture materials from an asset library.


mattzq

programmer super nerd | write me on discord: mattzq#8981