Diffusion models are zero-shot 3D character generators, too

memmaptensor
10 min read · Jun 13, 2023

Generating 3D characters by marrying ControlVideo, GroundingDINO, SegmentAnything, and nvdiffrec

Abstract

Ever since the release of the seminal paper Denoising Diffusion Probabilistic Models (https://arxiv.org/abs/2006.11239), image generators of this class have improved to the point where the generated images beat GANs on multiple quality metrics and are all but indistinguishable from real images.

Along with NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis (https://arxiv.org/abs/2003.08934) and the subsequent release of Instant Neural Graphics Primitives with a Multiresolution Hash Encoding (https://arxiv.org/abs/2201.05989), there now exists a way to turn a sparse set of images of an object, captured from multiple views, into high-quality renderings of that object in 3D.

However, as promising as the radiance fields obtained by training a NeRF model are (either using the original implementation or the InstantNGP backbone for fast training), extracting a usable mesh from them is extremely resource-intensive, yields noisy results, and destroys all lighting and material data. This is because NeRF and its derivatives “cheat” their way through novel view synthesis by only parameterizing the RGB color and density of a point in the 3D scene given some camera pose.

While representing a scene as a neural volume does have the advantage of essentially “baking in” lighting data, it does not perform any explicit calculations to even approximate the BRDF (bidirectional reflectance distribution function) of 3D surfaces. In practice, this means the lighting conditions and surface properties remain ambiguous, as NeRFs overlook this very important part of traditional PBR (physically based rendering).

Luckily, work has been done to address the problem of not being able to extract meshes and materials from NeRF and NeRF-derived models. Extracting Triangular 3D Models, Materials, and Lighting From Images (https://paperswithcode.com/paper/extracting-triangular-3d-models-materials-and), i.e. nvdiffrec, is one such work. The technique first trains a neural volume with NeRF, reconstructs the 3D surface using DMTet (https://nv-tlabs.github.io/DMTet/), and applies differentiable rendering to the model via nvdiffrast (https://nvlabs.github.io/nvdiffrast/). Since both DMTet and nvdiffrast are differentiable stages, the two are optimized jointly with gradient descent. This results in high-quality 3D meshes along with PBR materials for relightable 3D objects, without requiring any modification to the outputs.
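As a mental model of what this joint optimization looks like, here is a minimal sketch only; differentiable_render is a hypothetical stand-in for the DMTet + nvdiffrast stages, and none of this is nvdiffrec's actual code. The point is that geometry, materials, and lighting are all plain parameters updated from a single image-space loss:

import torch

# Minimal sketch of nvdiffrec-style joint optimization; NOT the actual implementation.
# `differentiable_render` is a hypothetical stand-in for DMTet mesh extraction followed
# by nvdiffrast rasterization and shading (both stages are differentiable in nvdiffrec).
def differentiable_render(sdf, material, light, pose):
    # Stand-in that just mixes the parameters so gradients can flow end to end.
    return (sdf.mean() + material.mean() + light.mean()) * pose.mean() + torch.zeros(3, 64, 64)

sdf_values = torch.randn(128 ** 3, requires_grad=True)        # geometry: SDF on a tet grid
material_tex = torch.rand(3, 2048, 2048, requires_grad=True)  # e.g. an albedo texture
env_light = torch.rand(6, 512, 512, 3, requires_grad=True)    # environment lighting

optimizer = torch.optim.Adam([sdf_values, material_tex, env_light], lr=0.03)
dataset = [(torch.rand(3, 64, 64), torch.eye(4))]             # posed reference images (dummy)

for reference_image, camera_pose in dataset:
    optimizer.zero_grad()
    rendered = differentiable_render(sdf_values, material_tex, env_light, camera_pose)
    loss = torch.nn.functional.l1_loss(rendered, reference_image)  # image-space loss
    loss.backward()  # gradients reach geometry, materials, and lighting jointly
    optimizer.step()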

Leveraging the work of ControlVideo: Training-free Controllable Text-to-Video Generation (https://arxiv.org/abs/2305.13077), GroundingDINO, SegmentAnything, and nvdiffrec, an attempt is made to outperform NeRF-DDIM models (such as stable-dreamfusion, https://github.com/ashawkey/stable-dreamfusion) in the generation of asset-quality 3D characters.

Problem statement

Let’s first explore the need for generated 3D characters.

3D models are used to portray real-world and conceptual visuals for art, entertainment, simulation and drafting and are integral to many different industries, including virtual reality, video games, 3D printing, marketing, TV and motion pictures, scientific and medical imaging and computer-aided design
https://www.techtarget.com/whatis/definition/3D-model

Right, so the use of 3D models and 3D characters is tied to multiple fields in the digital world. Let’s take video games as our main focus. (https://truelist.co/blog/gaming-statistics/)

Number of Video Gamers in the World From 2015 to 2024
  • Approximately 3 billion people worldwide play video games. (Marketer)
  • 83% of video game sales happen in the digital world. (Global X ETFs)
  • In 2021, consumer spending in the video gaming sector was $60.4 billion in the US. (PR Newswire, Newzoo)
  • Around 85% of all gaming revenue comes from free-to-play games. (WePC, TweakTown)
  • There were around 14.1 billion mobile game downloads in Q1 2021. (Statista)
  • By 2025, the PC gaming sector alone will accumulate $46.7 billion. (Statista)
  • People between 18 and 34 comprise 38% of gamers globally. (Statista)

Obviously, characters are an important part of most video games.

And consider this,

Based on type, the 3D segment accounted for 84.19% of the game engines market share in 2019 and is estimated to witness significant growth through 2027. The 3D type is extensively leveraged in games or role modeling, scenario modeling, 3D engine, and particle system.
https://www.businesswire.com/news/home/20201223005239/en/Global-Game-Engines-Market-Report-2020-3D-Segment-Accounted-for-84.19-of-the-Market-in-2019---Forecasts-to-2027---ResearchAndMarkets.com

There is a massive demand for character designers and 3D artists alike.

3D Character Workflow (https://www.3dart.it/en/3d-character-workflow-for-beginners-tutorial/)

Also consider the workflow/pipeline of 3D character creation (https://stepico.com/blog/guide-to-3d-character-modeling/):

  • Concepting
  • Blocking
  • Sculpting
  • Retopology
  • UV unwrapping
  • Baking
  • Texturing
  • Rigging & Skinning
  • Animation
  • Rendering

Each step takes hard work by specialists to complete. If we can somehow automate this process, or at least a part of it, that would save a lot of development resources and open up 3D character creation to more people.

Metrics and baselines

We’ve seen how traditional 3D character creation has multiple stages. Even though this ultimately results in very high-quality assets used in films, video games, marketing, VR, etc., completing them takes anywhere from multiple workdays to weeks.

How about existing Text-to-3D options?

DreamFusion, released by Google Research in 2022, utilizes Imagen as a prior for optimizing a NeRF MLP. However, we cannot test it directly, as it is closed-source. A faithful reimplementation would also be difficult: Imagen is a pixel-space diffusion model, so it requires huge amounts of compute to run.

What are its results?

--text "masterpiece, best quality, 1girl, slight smile, white hoodie, blue jeans, blue sneakers, short blue hair, aqua eyes, bangs" \
--negative "worst quality, low quality, logo, text, watermark, username" \
--hf_key rossiyareich/abyssorangemix3-popupparade-fp16 \
--iters 5000
(implementation from https://github.com/ashawkey/stable-dreamfusion)

Shap-E, released by OpenAI in 2023, utilizes an encoder-decoder architecture, where the encoder is trained to encode a 3D point cloud along with spatial coordinates into the parameters of an implicit function. The decoder, then, is an implicit MLP that can be queried as a NeRF or as a signed distance field.

What are its results?

a figurine of a girl
a girl

Publicly available Text-to-3D models are either too experimental or don’t produce great results.

The stable-dreamfusion README (https://github.com/ashawkey/stable-dreamfusion) explains these failure cases.

Data collection and cleaning

Our character generation pipeline can be separated into two steps: text-to-video and video-to-3D. Since we’re attempting to synthesize training data for a NeRF model, our best option is a Diffusion Probabilistic Model; Stable Diffusion (implementation from huggingface/diffusers) is one such model.
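For reference, a minimal text-to-image call with huggingface/diffusers looks like the following sketch; the checkpoint name matches the merged model used later, the prompt mirrors the stable-dreamfusion comparison above, and the step count is illustrative:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "rossiyareich/abyssorangemix3-popupparade-fp16",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="masterpiece, best quality, 1girl, slight smile, white hoodie, "
           "blue jeans, blue sneakers, short blue hair, aqua eyes, bangs",
    negative_prompt="worst quality, low quality, logo, text, watermark, username",
    num_inference_steps=20,
).images[0]
image.save("preview.png")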

We also utilize a ControlNet-like technique for image conditioning. To generate the ControlNet conditioning images, we render them from a base mesh.

OpenPose conditioning image
SoftEdge HED conditioning image

We render 100 frames in Blender of the same character in the same A-pose, with a different camera view for each frame.
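These renders can be scripted. The sketch below uses hypothetical object names and orbit parameters; it assumes a scene with the character at the origin and a camera named "Camera" aimed at it (e.g. via a Track To constraint), and can be run with blender --background --python render_views.py:

# render_views.py: minimal sketch of orbiting a camera and rendering one frame per view
import math
import bpy

scene = bpy.context.scene
camera = bpy.data.objects["Camera"]
num_views = 100
radius, height = 2.5, 1.0  # illustrative orbit radius and camera height

for i in range(num_views):
    angle = 2.0 * math.pi * i / num_views
    camera.location = (radius * math.cos(angle), radius * math.sin(angle), height)
    scene.render.filepath = f"//renders/view_{i:03d}.png"
    bpy.ops.render.render(write_still=True)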

Then, to synthesize the dataset required for training nvdiffrec, we utilize a novel consistent generation technique (which also happens to be our main contribution).

To understand the functioning principle of the techniques used, we should first take a step back and look at how pixel-space diffusion models work:

DDPM — Diffusion Models Beat GANs on Image Synthesis (Machine Learning Research Paper Explained) by Yannic Kilcher

Latent diffusion models apply the diffusion process in latent space: images are first encoded into the latent space via a VAE encoder, and the resulting output of the diffusion process gets decoded via the VAE decoder. Note that when running text-to-image inference, only the VAE decoder is used.
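In diffusers terms, the hand-off between pixel space and latent space is just the VAE's encode/decode pair. A minimal sketch, using the standard SD 1.x VAE as an example (the 0.18215 scaling factor is the Stable Diffusion 1.x convention):

import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.rand(1, 3, 512, 512) * 2 - 1  # pixel-space image in [-1, 1]
latents = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
# ... the diffusion process operates on `latents` (shape 1x4x64x64) ...
decoded = vae.decode(latents / vae.config.scaling_factor).sample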

In an attempt to better control the results outputted by the diffusion model, we take inspiration from ControlVideo: Training-free Controllable Text-to-Video Generation (https://arxiv.org/abs/2305.13077) and adopt a similar technique.

Like the original paper, we apply temporal inflation to the Conv2D and self-attention layers within Stable Diffusion’s UNet noise prediction model so that conditions from other frames can be fed in. Our deviation from the original implementation is to instead feed in a fixed number of latent codes (a maximum of 3, in our case) to alleviate memory restrictions.

Only the first frame and the immediately preceding frame are used in our cross-frame attention mechanism. We found both our results and those of the original sparse-causal-attention implementation to be consistent overall, while still struggling with finer details in the images; this, however, is an expected limitation of DPMs that has yet to be solved.
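A minimal sketch of what this cross-frame (sparse-causal) attention amounts to: for frame n, queries come from frame n itself, while keys and values are gathered from frame 0 and frame n-1 only. The real implementation differs in details such as batching and multi-head splitting; the function below is illustrative only.

import torch
import torch.nn.functional as F

def sparse_causal_attention(hidden, to_q, to_k, to_v):
    """hidden: (frames, tokens, dim); to_q/to_k/to_v: linear projection layers."""
    frames, tokens, dim = hidden.shape
    out = torch.zeros_like(hidden)
    for n in range(frames):
        # Keys/values come from the first frame and the previous frame only
        ref = [0] if n == 0 else [0, n - 1]
        kv_source = torch.cat([hidden[r] for r in ref], dim=0)  # (len(ref) * tokens, dim)
        q = to_q(hidden[n])  # (tokens, dim)
        k = to_k(kv_source)
        v = to_v(kv_source)
        attn = F.softmax(q @ k.transpose(0, 1) / dim ** 0.5, dim=-1)
        out[n] = attn @ v
    return out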

Exploratory data analysis

The base ControlNets used are:

  • lllyasviel/control_v11p_sd15_openpose
  • lllyasviel/control_v11e_sd15_ip2p
  • lllyasviel/control_v11p_sd15_softedge

We found this combination to be the best for consistency (though there is still room for improvement).
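Loading these three ControlNets with diffusers looks roughly like the sketch below; our actual pipeline is the inflated ControlVideo variant, but the checkpoints are the same:

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnets = [
    ControlNetModel.from_pretrained(repo, torch_dtype=torch.float16)
    for repo in (
        "lllyasviel/control_v11p_sd15_openpose",
        "lllyasviel/control_v11e_sd15_ip2p",
        "lllyasviel/control_v11p_sd15_softedge",
    )
]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "rossiyareich/abyssorangemix3-popupparade-fp16",
    controlnet=controlnets,  # multiple ControlNets are applied together
    torch_dtype=torch.float16,
).to("cuda")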

We also merged AOM3 (AbyssOrangeMix3) with Pop Up Parade at a ratio of 0.5 for our base model, and utilize NAI-derived VAE weights (anything-v4.0-vae) for our VAE. The important part is that the VAE must produce no NaNs, or the entire generation is wasted.
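The merge itself is a plain 50/50 weighted average of the two checkpoints' weights; the file names below are placeholders, and the sketch is not the exact tooling we used:

import torch

# Minimal 0.5/0.5 checkpoint merge (placeholder file names); integer buffers such as
# position ids are copied from the first model rather than averaged.
a = torch.load("AOM3.ckpt", map_location="cpu")["state_dict"]
b = torch.load("PopUpParade.ckpt", map_location="cpu")["state_dict"]

merged = {}
for k in a:
    if k in b and torch.is_floating_point(a[k]):
        merged[k] = 0.5 * a[k] + 0.5 * b[k]
    else:
        merged[k] = a[k]

torch.save({"state_dict": merged}, "abyssorangemix3-popupparade.ckpt")

The anything-v4.0-vae weights are then loaded over the merged model's VAE (for example by assigning a separately loaded AutoencoderKL to pipe.vae in diffusers).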

Modeling, validation, and error analysis

Going into more depth on our modifications and deviations from the ControlVideo implementation:

Firstly, we adapt a DPMSolverMultistepScheduler with order=2 to work with the implementation. This cuts generation time by 60%, as we only need 20 sampling steps (as opposed to the 50 DDIM sampling steps used in the paper).
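In diffusers, the scheduler swap itself is a one-liner (sketch below, continuing from a pipeline object like the one shown earlier; solver_order=2 selects the second-order multistep solver):

from diffusers import DPMSolverMultistepScheduler

# Swap in a second-order multistep DPM-Solver and sample with 20 steps
# instead of the 50 DDIM steps used in the ControlVideo paper.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, solver_order=2
)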

Secondly, we removed the RIFE (Real-Time Intermediate Flow Estimation for Video Frame Interpolation) model. Though it slightly reduces flickering, it does more harm than good by making the images blurry and desaturated.

Lastly, we modified the denoising loop to only attend to the first latent code and the latent code of the previous frame:

for i, t in enumerate(timesteps):
    torch.cuda.empty_cache()

    # Expand latents for CFG
    latent_model_input = torch.cat([latents] * 2)
    latent_model_input = self.scheduler.scale_model_input(
        latent_model_input, t
    )
    noise_pred = torch.zeros_like(latents)
    pred_original_sample = torch.zeros_like(latents)

    for frame_n in range(video_length):
        torch.cuda.empty_cache()

        if frame_n == 0:
            frames = [0]
            focus_rel = 0
        elif frame_n == 1:
            frames = [0, 1]
            focus_rel = 1
        else:
            frames = [frame_n - 1, frame_n, 0]
            focus_rel = 1

        # Inference on ControlNet
        (
            down_block_res_samples,
            mid_block_res_sample,
        ) = self.controlnet(
            latent_model_input[:, :, frames],
            t,
            encoder_hidden_states=frame_wembeds[frame_n],
            controlnet_cond=[
                cnet_frames[:, :, frames]
                for cnet_frames in controlnet_frames
            ],
            conditioning_scale=controlnet_scales,
            return_dict=False,
        )
        block_res_samples = [
            *down_block_res_samples,
            mid_block_res_sample,
        ]
        block_res_samples = [
            b * s
            for b, s in zip(block_res_samples, controlnet_block_scales)
        ]
        down_block_res_samples = block_res_samples[:-1]
        mid_block_res_sample = block_res_samples[-1]

        # Inference on UNet
        pred_noise_pred = self.unet(
            latent_model_input[:, :, frames],
            t,
            encoder_hidden_states=frame_wembeds[frame_n],
            cross_attention_kwargs=cross_attention_kwargs,
            down_block_additional_residuals=down_block_res_samples,
            mid_block_additional_residual=mid_block_res_sample,
            inter_frame=False,
        ).sample

        # Perform CFG
        noise_pred_uncond, noise_pred_text = pred_noise_pred[
            :, :, focus_rel
        ].chunk(2)
        noise_pred[:, :, frame_n] = noise_pred_uncond + guidance_scale * (
            noise_pred_text - noise_pred_uncond
        )

        # Compute the previous noisy sample x_t -> x_t-1
        step_dict = self.scheduler.step(
            noise_pred[:, :, frame_n],
            t,
            latents[:, :, frame_n],
            frame_n,
            **extra_step_kwargs,
        )
        latents[:, :, frame_n] = step_dict.prev_sample
        pred_original_sample[:, :, frame_n] = step_dict.pred_original_sample
We then train nvdiffrec with the following parameters:

{
    "ref_mesh": "data/ngp",
    "random_textures": true,
    "iter": 5000,
    "save_interval": 100,
    "texture_res": [2048, 2048],
    "train_res": [1024, 768],
    "batch": 2,
    "learning_rate": [0.03, 0.01],
    "ks_min": [0, 0.08, 0.0],
    "dmtet_grid": 128,
    "mesh_scale": 2.1,
    "laplace_scale": 3000,
    "display": [{"latlong": true}, {"bsdf": "kd"}, {"bsdf": "ks"}, {"bsdf": "normal"}],
    "background": "white",
    "out_dir": "output"
}
From left to right: combined, ground truth, envmap, albedo, depth, normals
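Assuming the config above is saved as configs/character.json (a placeholder name) and the synthesized frames with their camera poses are placed under data/ngp in nvdiffrec's expected dataset layout, training is launched with nvdiffrec's training script, e.g. python train.py --config configs/character.json.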

After 5000 iterations, we get the following results (a quick sanity check on the metrics follows below):

  • MSE: 0.00283534
  • PSNR: 25.590
  • 5504 vertices
  • 9563 texcoords
  • 5504 normals
  • 11040 faces
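As a quick sanity check, the PSNR value is consistent with the reported MSE (assuming pixel values normalized to [0, 1]); the small gap is plausibly because PSNR is averaged per validation frame rather than derived from the aggregate MSE:

import math

mse = 0.00283534
psnr_from_mean_mse = -10 * math.log10(mse)  # PSNR = -10 * log10(MSE) for a peak value of 1.0
print(f"{psnr_from_mean_mse:.2f} dB")       # ~25.47 dB, close to the reported 25.590 dB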

Next, we tried training InstantNGP for 4000 iterations.

As expected, novel view synthesis with InstantNGP yields better results at expected angles; however, when viewed from extreme angles, the results tend to be inconsistent.

Computed R-Precision scores with sentence-transformers/clip-ViT-B-32 confirm our findings (a sketch of the computation follows the scores). For the prompt “a 3d model of a girl”:

  • 0.34226945: output from the LDM
  • 0.338451: InstantNGP
  • 0.3204362: nvdiffrec
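A minimal sketch of how such prompt-to-render CLIP scores can be computed with sentence-transformers; the file names are placeholders, and this shows only the cosine-similarity part of the metric rather than a full R-Precision ranking against distractor prompts:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("clip-ViT-B-32")

text_emb = model.encode(["a 3d model of a girl"])
image_emb = model.encode([Image.open(f"frames/{i:03d}.png") for i in range(8)])

# Mean cosine similarity between the prompt and the rendered views
print(util.cos_sim(text_emb, image_emb).mean())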

Deployment

A Colab notebook is available. Running with 3 ControlNet modules results in a peak VRAM usage of 14 GiB, and generation takes 2.5 hours in total.
