Meta 3D Gen: Text to 3D in One Minute

A 3D Asset Generator that really follows the prompts!

Elmo
Antaeus AR
9 min read · Jul 3, 2024


Some creations with Meta 3D Gen

In the rapidly evolving landscape of 3D content creation, the demand for efficient and high-quality asset generation tools has never been greater. It is in this context that Meta 3D Gen (3DGen) arrives: a novel pipeline that leverages artificial intelligence to transform textual descriptions into compelling 3D assets. Can it beat its competitors in this field? Let’s find out together!

  1. Reimagining 3D Asset Creation
  2. Meta 3D Gen: A Two-Stage Approach
  3. Stage I: Meta 3D AssetGen — Sculpting the 3D Form
  4. Stage II: Meta 3D TextureGen — Painting the Canvas of Reality
  5. A Unified Pipeline: Advantages of the Two-Stage Approach
  6. Evaluating Performance: Outperforming Current SoTA Models
  7. Conclusion

Reimagining 3D Asset Creation

The creation of 3D assets, encompassing characters, props, and intricate environments, remains a time-intensive and technically demanding endeavor within the realms of video game development, augmented and virtual reality experiences, and special effects in filmmaking. This process traditionally necessitates a high level of artistic skill and technical expertise, often proving to be a bottleneck in content creation pipelines.

Recognizing this challenge, Meta 3D Gen emerges as a potential game-changer, empowering creators with an AI-powered assistant capable of rapidly generating high-fidelity 3D assets from simple text prompts. This transformative technology holds the potential to democratize 3D content creation, opening up new avenues for personalized user-generated experiences and fueling the development of immersive virtual worlds within the metaverse.

Meta 3D Gen: A Two-Stage Approach

The magic of 3DGen lies in its two-stage pipeline, a carefully orchestrated collaboration between two foundational generative models: Meta 3D AssetGen and Meta 3D TextureGen. This synergistic approach allows 3DGen to achieve an unprecedented level of quality and efficiency.

Meta 3D Gen pipeline

Let’s explore each stage in detail:

Stage I: Meta 3D AssetGen — Sculpting the 3D Form

The first stage, driven by Meta 3D AssetGen (AssetGen), takes the textual prompt as its guiding star and embarks on the creation of the initial 3D asset.

Meta 3D AssetGen overview

This stage is a marvel of multi-view consistent image generation and 3D reconstruction. It operates on the principle that a 3D object can be represented by a set of consistent views from different angles.

From Text to Multi-View Images:

AssetGen’s journey begins by harnessing a pre-trained text-to-image diffusion model. This model, trained on billions of captioned images, is fine-tuned to specifically generate a grid of four images, each showcasing the object described in the text prompt from a different, predetermined viewpoint. These canonical viewpoints are carefully chosen to provide a comprehensive 360-degree representation of the object.
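To make this step concrete, here is a minimal sketch of what calling such a fine-tuned multi-view model could look like, assuming a Hugging Face diffusers-style pipeline and a hypothetical checkpoint name (the real AssetGen model and its exact grid layout are not available in this form):

```python
# Minimal sketch of the multi-view step, assuming a diffusers-style pipeline and a
# hypothetical fine-tuned checkpoint that emits a 2x2 grid of canonical views
# in a single image. Checkpoint name and grid layout are illustrative only.
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("your-org/multiview-grid-diffusion")  # hypothetical
pipe = pipe.to("cuda")

prompt = "a battle-scarred pirate spaceship, game asset"
grid = pipe(prompt).images[0]            # one image containing four views in a 2x2 grid

# Split the grid back into the four canonical viewpoints.
w, h = grid.size
views = [
    grid.crop((0,      0,      w // 2, h // 2)),   # view 1
    grid.crop((w // 2, 0,      w,      h // 2)),   # view 2
    grid.crop((0,      h // 2, w // 2, h)),        # view 3
    grid.crop((w // 2, h // 2, w,      h)),        # view 4
]
```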

Shaded and Albedo: A Key to Realism and Material Control:

Here, AssetGen introduces a novel concept: generating both shaded and albedo versions of the object for each view. The shaded images depict the object with full lighting effects, capturing the interplay of light and shadow on its surface, providing information about its 3D form. The albedo images, on the other hand, represent the object’s base color without any lighting, revealing its intrinsic surface properties. This dual-channel output is crucial for the next stage, enabling the accurate prediction of physically-based rendering (PBR) materials.
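A toy example helps to see why the pair is informative. Under a simple Lambertian assumption (an illustration only, not AssetGen’s actual rendering model), a shaded pixel is roughly the albedo modulated by a cosine term that depends on geometry and lighting, so having both channels lets the reconstruction separate intrinsic color from shape cues:

```python
import numpy as np

# Toy Lambertian illustration (not the actual AssetGen outputs): the shaded pixel
# is the albedo modulated by a cosine term that carries geometry/lighting information.
albedo = np.array([0.8, 0.2, 0.1])            # intrinsic base color (RGB)
normal = np.array([0.0, 0.0, 1.0])            # surface normal at this pixel
light_dir = np.array([0.0, 0.577, 0.816])     # unit light direction

shading = max(float(normal @ light_dir), 0.0) # geometry-dependent term
shaded = albedo * shading                     # what the "shaded" channel would show

print("albedo:", albedo)                      # [0.8  0.2  0.1]
print("shaded:", shaded)                      # [0.653 0.163 0.082] (approximately)
```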

PBR Decomposition: Unlocking Realistic Lighting Interactions:

PBR is a rendering technique that aims to simulate the way light interacts with real-world materials. To achieve this, PBR decomposes materials into their fundamental properties: albedo (base color), metalness (how metallic the material is), and roughness (how smooth or rough the surface is). This decomposition allows for realistic relighting of the 3D object in different virtual environments, as the shader can accurately calculate how light should reflect off the surface based on these properties.
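As a rough illustration of how these three properties drive shading (a deliberately simplified model, not the Cook-Torrance-style BRDF used by real PBR renderers), the sketch below blends a diffuse and a specular term according to metalness, with roughness controlling how tight the highlight is:

```python
import numpy as np

def shade_pbr_like(albedo, metalness, roughness, n, l, v):
    """Deliberately simplified metallic-roughness shading (illustration only, not a
    full Cook-Torrance BRDF): metalness blends dielectric vs. metallic behaviour,
    roughness widens or tightens the specular highlight."""
    n, l, v = (x / np.linalg.norm(x) for x in (n, l, v))
    h = (l + v) / np.linalg.norm(l + v)                        # half vector

    diffuse = albedo * max(float(n @ l), 0.0)                  # Lambertian body color
    shininess = 2.0 / max(roughness ** 2, 1e-4)                # rougher -> broader highlight
    spec_tint = metalness * albedo + (1.0 - metalness) * 0.04  # metals tint their specular
    specular = spec_tint * max(float(n @ h), 0.0) ** shininess

    return (1.0 - metalness) * diffuse + specular

rgb = shade_pbr_like(albedo=np.array([0.8, 0.2, 0.1]), metalness=0.9, roughness=0.3,
                     n=np.array([0.0, 0.0, 1.0]),
                     l=np.array([0.0, 0.577, 0.816]),
                     v=np.array([0.0, 0.0, 1.0]))
print(rgb)
```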

Reconstructing in 3D: From Images to Signed Distance Field:

The multi-view images, now enriched with both shaded and albedo information, are passed to MetaILRM, a powerful 3D reconstruction network. This network predicts the object’s shape in the form of a signed distance field (SDF). An SDF is a volumetric representation where each point in 3D space is assigned a value that represents its distance to the nearest surface of the object. Positive values indicate points outside the object, negative values indicate points inside, and zero values represent points on the surface.
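The concept is easy to see with the simplest possible SDF, a unit sphere centered at the origin (a toy shape, unrelated to the learned MetaILRM field):

```python
import numpy as np

# Minimal SDF example: a sphere of radius 1 centered at the origin.
# Negative inside, zero on the surface, positive outside (the convention above).
def sphere_sdf(points, radius=1.0):
    """points: (N, 3) query locations -> (N,) signed distances."""
    return np.linalg.norm(points, axis=-1) - radius

queries = np.array([[0.0, 0.0, 0.0],    # center of the sphere
                    [1.0, 0.0, 0.0],    # exactly on the surface
                    [2.0, 0.0, 0.0]])   # outside the sphere
print(sphere_sdf(queries))              # [-1.  0.  1.]
```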

The Advantages of SDF:

The choice of SDF over other representations like occupancy fields offers several advantages:

  1. Well-Defined Surfaces: SDFs naturally lead to smoother, more well-defined surfaces, reducing artifacts during the mesh extraction process.
  2. Direct Depth Supervision: SDFs allow for direct supervision using ground truth depth maps, leading to more accurate geometric reconstruction.
  3. Scalable Rendering: AssetGen leverages the efficient Lightplane kernels, specifically designed for SDF rendering, enabling larger training batch sizes and higher-resolution renders.

The Output of Stage I:

The culmination of Stage I is a 3D mesh, generated by tracing the zero level set of the predicted SDF with Marching Tetrahedra, an algorithm that extracts a polygonal surface from a volumetric representation such as an SDF. This mesh represents the object’s 3D shape and comes with an initial texture, serving as the foundation for further refinement in the next stage.
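AssetGen’s Marching Tetrahedra implementation is not something we can call directly, but the same zero-level-set extraction can be sketched with the closely related Marching Cubes algorithm from scikit-image, applied to the toy sphere SDF from earlier:

```python
import numpy as np
from skimage import measure

# Sketch of zero-level-set extraction. AssetGen uses Marching Tetrahedra; here we use
# Marching Cubes from scikit-image as a readily available stand-in, applied to the
# toy sphere SDF sampled on a regular grid.
res = 64
axis = np.linspace(-1.5, 1.5, res)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 1.0                 # sphere SDF on the grid

verts, faces, normals, _ = measure.marching_cubes(sdf, level=0.0)
print(verts.shape, faces.shape)                         # mesh vertices and triangle indices
```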

Text-to-3D meshes generated by Meta 3D AssetGen along with their PBR decomposition

Stage II: Meta 3D TextureGen — Painting the Canvas of Reality

While Stage I focuses on shaping the 3D form, Stage II, driven by Meta 3D TextureGen (TextureGen), specializes in refining the texture and PBR materials. This last stage can be divided into two phases: generative 3D texture refinement and generative 3D retexturing.

Generative 3D Texture Refinement:

The initial texture produced in Stage I by AssetGen often lacks the sharpness and detail needed for truly high-fidelity assets. TextureGen addresses this by employing a specialized text-to-texture generator, similar to the one in Stage I, but with a focus on UV space.

This generator analyzes the 3D mesh and the original text prompt, generating multiple consistent views of the textured object. These views are then projected onto the UV map of the mesh, a 2D representation of the 3D surface that defines how the texture should be mapped onto the object. This results in multiple partial texture maps, each capturing information from a specific viewpoint.

Meta 3D TextureGen overview
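A heavily simplified sketch of that projection step is shown below: for every texel whose world-space position and normal are known from the mesh, we project it into one camera, sample the rendered view, and record a confidence based on how directly the surface faces that camera. The function, its arguments, and the omission of occlusion testing are all illustrative assumptions, not TextureGen’s actual implementation:

```python
import numpy as np

def bake_partial_texture(view_img, K, R, t, texel_pos, texel_normal, tex_hw):
    """Toy back-projection of one rendered view into UV space (illustration only;
    real baking also needs visibility/occlusion tests, which are omitted here).

    view_img:     (H, W, 3) rendered view of the object from one camera.
    K, R, t:      camera intrinsics (3x3) and extrinsics (3x3 rotation, 3 translation).
    texel_pos:    (tex_h*tex_w, 3) world-space position of every texel.
    texel_normal: (tex_h*tex_w, 3) world-space normal of every texel.
    Returns a partial texture (tex_h, tex_w, 3) and a per-texel confidence map."""
    tex_h, tex_w = tex_hw

    cam_pts = texel_pos @ R.T + t                       # world -> camera coordinates
    proj = cam_pts @ K.T
    uv = proj[:, :2] / proj[:, 2:3]                     # perspective divide -> pixel coords

    # Confidence: how directly each texel faces the camera (0 if it faces away).
    cam_center = -R.T @ t                               # camera position in world space
    view_dir = cam_center - texel_pos
    view_dir /= np.linalg.norm(view_dir, axis=1, keepdims=True)
    conf = np.clip((texel_normal * view_dir).sum(axis=1), 0.0, 1.0)

    h, w = view_img.shape[:2]
    px = np.clip(uv[:, 0].round().astype(int), 0, w - 1)
    py = np.clip(uv[:, 1].round().astype(int), 0, h - 1)

    partial = view_img[py, px].reshape(tex_h, tex_w, 3)
    return partial, conf.reshape(tex_h, tex_w)
```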

A dedicated UV-space generator network takes these partial texture maps and the text prompt as input, meticulously fusing them into a single, consolidated texture. This fusion process enhances the texture’s quality and detail, ensuring consistency across different views and preserving the semantic alignment with the original text prompt.
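TextureGen performs this fusion with a learned, prompt-conditioned UV-space network. As a crude baseline that at least conveys the idea, one could take a per-texel confidence-weighted average of the partial maps (this ignores the text prompt entirely, which is exactly the gap the dedicated network closes):

```python
import numpy as np

def fuse_partial_textures(partials, confidences, eps=1e-6):
    """Crude stand-in for TextureGen's learned UV-space fusion: a per-texel
    confidence-weighted average of the partial maps baked from each view.
    partials: (V, H, W, 3), confidences: (V, H, W)."""
    w = confidences[..., None]                          # (V, H, W, 1)
    return (partials * w).sum(axis=0) / (w.sum(axis=0) + eps)
```

Unlike the learned generator, this average cannot invent detail in texels that no view covered well, which is precisely where a prompt-conditioned network pays off.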

Generative 3D (Re)Texturing:

TextureGen goes beyond mere refinement; it has the power to generate entirely new textures from scratch for any 3D mesh. This retexturing capability allows users to explore different material appearances for the same 3D shape, significantly expanding creative possibilities. Imagine taking the spaceship from Stage I and retexturing it to resemble a sleek, chrome-plated cruiser, or a rugged, battle-scarred vessel, all based on new textual descriptions.

Retexturing results: each column shows the same mesh textured with a different prompt

TextureGen achieves this by combining multi-view image generation, UV space inpainting, and an optional texture enhancement network to produce high-resolution, detailed textures that are both visually appealing and semantically aligned with the user’s vision.

A Unified Pipeline: Advantages of the Two-Stage Approach

The brilliance of 3DGen lies not just in the individual capabilities of AssetGen and TextureGen, but in their integration into a unified pipeline. This integration allows the two models to complement each other, overcoming individual limitations and achieving a higher level of overall performance.

Harnessing Synergy and Complementarity:

The two-stage approach allows 3DGen to work with three complementary representations of the 3D object:

  1. View Space: Generating multiple consistent views provides a comprehensive understanding of the object’s appearance from various angles.
  2. Volumetric Space: The SDF representation in AssetGen excels in producing high-quality 3D shapes with smooth, well-defined surfaces.
  3. UV Space: TextureGen leverages the UV map to generate high-resolution, detailed textures that accurately wrap around the 3D mesh.

Addressing Limitations:

This collaborative approach addresses specific limitations of each individual model:

  1. Enhanced Texture Quality: TextureGen, as a specialized text-to-texture generator, significantly improves the quality and detail of the textures compared to AssetGen’s initial output.
  2. Geometric Conditioning: TextureGen’s ability to leverage the 3D shape information from AssetGen allows for the generation of more consistent multi-view images, resulting in superior textures.
  3. UV Map Compatibility: By incorporating a dedicated network from AssetGen for texture fusion, 3DGen handles the inconsistencies that arise from automatically generated UV maps, ensuring clean, high-quality textures.

Evaluating Performance: Outperforming Current SoTA Models

To assess the capabilities of 3DGen, extensive evaluations were conducted, pitting it against leading industry solutions for text-to-3D generation. These evaluations were structured around two key aspects:

1. Prompt Fidelity:

The ability to accurately translate a text prompt into a 3D asset is paramount. User studies, involving both general audiences and seasoned 3D artists, consistently ranked 3DGen higher than competitors in terms of prompt fidelity. This means 3DGen demonstrates a superior ability to capture the essence of the textual description and generate 3D assets that closely align with the user’s intent.

2. Visual Quality:

Beyond fidelity, the visual appeal of the generated assets is crucial. A/B testing, a method for comparing two versions of something to see which performs better, revealed that 3DGen consistently outperforms its rivals in overall visual quality, geometry accuracy, and texture detail. This superior performance was particularly noticeable when handling complex prompts, showcasing 3DGen’s prowess in tackling intricate designs.

Quantitative and Qualitative Comparisons:

The following table provides a comparative overview of 3DGen and industry baselines, highlighting key features and approximate generation times, reported both for the first stage alone and for the overall generation.

Overview of Industry Baselines for Text-to-3D Generation

The following table showcases 3DGen’s superiority in prompt fidelity.

User Studies: Prompt Fidelity (Higher is Better)

A qualitative comparison is shown in the following image, where Meta 3D Gen’s strong prompt adherence stands out.

Qualitative comparison of text prompt fidelity

Beyond its superior quality, 3DGen distinguishes itself through its remarkable speed. With it, generating a complete 3D asset, including texture and PBR materials, typically takes under a minute, which is significantly faster than many existing solutions that often require several minutes or even hours to produce comparable results. This speed advantage makes 3DGen particularly well-suited for interactive design workflows and rapid prototyping, enabling creators to iterate on ideas quickly and efficiently.

Conclusion

So, what can we conclude? The ability of Meta 3D Gen to generate high-quality, customizable 3D assets from simple text descriptions, coupled with its remarkable speed and efficiency, positions it as a groundbreaking tool for a wide range of applications.

That said, I think we can expect the following enhancements in the future:

  • Increased Realism: Pushing the boundaries of visual fidelity, incorporating even more intricate details and nuanced lighting effects.
  • Enhanced Control: Providing users with finer-grained control over the generative process, enabling them to shape specific aspects of the 3D asset.
  • Expanded Functionality: Exploring new capabilities such as animation, rigging, and seamless integration with existing 3D modeling software.

This is just the beginning of a transformative journey that promises to reshape the landscape of 3D design! Stay tuned!!!

Ah! This article was originally written for my website: feel free to visit it and, if you want, follow me. Many thanks!
