Generate Images with Depth Guided Stable Diffusion and Rerun

How to generate images with enhanced depth perception using Depth Guided Stable Diffusion

Andreas Naoum
2 min read · May 13, 2024


Image Generation using Depth Guided Stable Diffusion | Image by Author

This tutorial focuses on visualization and provides complete code for visualizing the image generation process of Depth Guided Stable Diffusion with the open-source visualization tool Rerun.


Depth Guided Stable Diffusion enriches the image generation process by incorporating depth information, providing a unique way to control the spatial composition of generated images. This approach allows for more nuanced and layered creations, making it especially useful for scenes requiring a sense of three-dimensionality.

Logging and visualizing with Rerun

The visualizations in this example were created with the Rerun SDK, demonstrating the integration of depth information in the Stable Diffusion image generation process. Here is the code for generating the visualization in Rerun.


Visualizing the prompt and negative prompt

rr.log("prompt/text", rr.TextLog(prompt))
rr.log("prompt/text_negative", rr.TextLog(negative_prompt))


Visualizing the text input ids, the text attention mask and the unconditional input ids

rr.log("prompt/text_input/ids", rr.BarChart(text_input_ids))
rr.log("prompt/text_input/attention_mask", rr.BarChart(text_inputs.attention_mask))
rr.log("prompt/uncond_input/ids", rr.Tensor(uncond_input.input_ids))
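The ids and attention mask come from the tokenizer: the prompt is split into tokens, padded to a fixed length, and the mask marks which positions hold real tokens. A minimal numpy sketch of that shape, using a hypothetical toy vocabulary (the real pipeline uses the CLIP tokenizer):

```python
import numpy as np

def toy_tokenize(prompt, vocab, max_length=8):
    """Toy stand-in for the CLIP tokenizer: map words to ids,
    pad to max_length, and build the matching attention mask."""
    ids = [vocab[w] for w in prompt.split()]
    pad = max_length - len(ids)
    input_ids = np.array(ids + [0] * pad)            # 0 = padding id (assumption)
    attention_mask = np.array([1] * len(ids) + [0] * pad)
    return input_ids, attention_mask

vocab = {"a": 1, "castle": 2, "on": 3, "hill": 4}    # hypothetical vocabulary
text_input_ids, attention_mask = toy_tokenize("a castle on a hill", vocab)
```

Logged as bar charts, the ids and mask make it easy to spot where the padding begins.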

Text embeddings

Visualizing the text embeddings. The text embeddings are generated from the specific prompt, while the unconditional text embeddings represent a neutral baseline state without specific input conditions.

rr.log("prompt/text_embeddings", rr.Tensor(text_embeddings))
rr.log("prompt/uncond_embeddings", rr.Tensor(uncond_embeddings))
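The unconditional embeddings are typically obtained by encoding an empty prompt, and for classifier-free guidance the two embedding batches are concatenated so the model can process both conditions in one pass. A numpy sketch of those shapes (the dimensions are illustrative, chosen to match CLIP's text encoder):

```python
import numpy as np

seq_len, dim = 77, 768                                 # CLIP-like dimensions (assumption)
text_embeddings = np.random.randn(1, seq_len, dim)     # from the prompt
uncond_embeddings = np.random.randn(1, seq_len, dim)   # from the empty prompt ""

# Concatenate for classifier-free guidance: one batch, two conditions.
embeddings = np.concatenate([uncond_embeddings, text_embeddings], axis=0)
```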

Depth map

Visualizing the preprocessed input pixel values, the estimated depth image, the interpolated depth image, and the normalized depth image

rr.log("depth/input_preprocessed", rr.Tensor(pixel_values))
rr.log("depth/estimated", rr.DepthImage(depth_map))
rr.log("depth/interpolated", rr.DepthImage(depth_map))
rr.log("depth/normalized", rr.DepthImage(depth_map))
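Before it conditions the diffusion model, the depth map is interpolated down to the latent resolution and normalized; min-max normalization to [-1, 1] is what the Stable Diffusion depth2img pipeline uses. A numpy sketch of that normalization step:

```python
import numpy as np

def normalize_depth(depth_map: np.ndarray) -> np.ndarray:
    """Min-max normalize a depth map to the range [-1, 1]."""
    d_min, d_max = depth_map.min(), depth_map.max()
    return 2.0 * (depth_map - d_min) / (d_max - d_min) - 1.0

depth = np.array([[0.5, 1.0], [1.5, 2.5]])  # hypothetical depth values
normalized = normalize_depth(depth)
```

Logging the map before and after this step makes it easy to verify in the Rerun viewer that the normalization preserved the relative depth structure.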


Log the latents, the representation of the images in the format used by the diffusion model.

rr.log("diffusion/latents", rr.Tensor(latents, dim_names=["b", "c", "h", "w"]))
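The latents start as Gaussian noise at the latent resolution, which in Stable Diffusion is 1/8 of the image size per side with 4 channels, scaled by the scheduler's initial noise sigma. A sketch of those shapes (the sigma value is scheduler-dependent and assumed here):

```python
import numpy as np

batch, channels = 1, 4
height, width = 512 // 8, 512 // 8   # latent space is 8x smaller per side
init_noise_sigma = 1.0               # scheduler-dependent (assumption)

# Initial latents: pure noise, refined over the denoising loop.
latents = np.random.randn(batch, channels, height, width) * init_noise_sigma
```

The `dim_names=["b", "c", "h", "w"]` argument labels these axes in the Rerun tensor view.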

Denoising loop

For each step in the denoising loop, we set a time sequence with the step and timestep and log the latent model input, the noise predictions, the latents, and the image. This makes it possible to scrub through every denoising step in the Rerun viewer.

rr.set_time_sequence("step", i)
rr.set_time_sequence("timestep", t)
rr.log("diffusion/latent_model_input", rr.Tensor(latent_model_input))
rr.log("diffusion/noise_pred", rr.Tensor(noise_pred, dim_names=["b", "c", "h", "w"]))
rr.log("diffusion/latents", rr.Tensor(latents, dim_names=["b", "c", "h", "w"]))
rr.log("image/diffused", rr.Image(image))
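Inside the loop, the noise prediction typically comes back as one batch holding the unconditional and text-conditioned halves, which are then combined using the guidance scale (classifier-free guidance). A numpy sketch of that combination, with an assumed guidance scale of 7.5:

```python
import numpy as np

guidance_scale = 7.5                          # typical default (assumption)
noise_pred = np.random.randn(2, 4, 64, 64)    # [uncond, text] halves in one batch

# Split the batch and apply classifier-free guidance:
# push the prediction away from the unconditional one, toward the prompt.
noise_pred_uncond, noise_pred_text = noise_pred[0], noise_pred[1]
guided = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```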

Diffused image

Finally, we log the diffused image generated by the model.

rr.log("image/diffused", rr.Image(image_8))
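The name image_8 suggests the final 8-bit image. After the VAE decodes the latents, the decoded pixels (in [-1, 1]) are typically mapped to [0, 1], clipped, and quantized to uint8; a sketch of that conversion with a hypothetical decoded array:

```python
import numpy as np

image = np.array([[-1.0, 0.0], [0.5, 1.0]])  # hypothetical VAE output in [-1, 1]

# Map [-1, 1] -> [0, 1], clip, and quantize to 8-bit.
image_8 = ((image / 2 + 0.5).clip(0, 1) * 255).round().astype(np.uint8)
```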


