Beyond the Surface: Advanced 3D Mesh Generation from 2D Images in Python

Abdul Rehman · Published in Red Buffer · Feb 16, 2024

Reconstructing a 3D object from 2D images is a fascinating task with many real-world applications, but it is also challenging, especially for beginners, to achieve high quality and to map 2D image points into 3D world space accurately. Before we dive in, check out our earlier article, Mastering 3D Spaces: A Comprehensive Guide to Coordinate System Conversions in OpenCV, COLMAP, PyTorch3D, and OpenGL, to understand the basics of coordinate systems across different 3D frameworks. In this article, we will reconstruct a 3D mesh of the Stanford bunny from its rendered images in Python.

Overview

Before we start the implementation, let's walk through some of the major concepts we need for 3D reconstruction.

Camera Intrinsics and Extrinsics

  • Intrinsic Parameters: These are internal camera parameters such as focal length, principal point, and lens distortion.
  • Extrinsic Parameters: These parameters define the position and orientation of the camera in 3D space.

The extrinsic matrix is usually a 4x4 arrangement containing a 3x3 rotation matrix and a translation vector. The intrinsic matrix, which conventionally has a 3x3 format describing the focal lengths and principal point, takes on a 4x4 shape in PyTorch3D.

Camera Intrinsic and Extrinsic matrices

For an in-depth understanding of camera parameters, consult this article by Aqeel Anwar.
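
For reference, the 4x4 intrinsics layout that PyTorch3D expects can be assembled from the usual focal lengths and principal point as sketched below. The values here are placeholders; the actual matrix we use, K_pt3d, is defined later in the article.

import torch

# Placeholder intrinsics: focal lengths (fx, fy) and principal point (px, py).
fx, fy, px, py = 0.7, 0.7, 0.5, 0.5

# PyTorch3D-style 4x4 intrinsic matrix; the extra row and column handle
# homogeneous coordinates in the projective transform.
K = torch.tensor([
    [fx, 0.0, px, 0.0],
    [0.0, fy, py, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])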

Depth Estimation

Depth estimation is the process of figuring out how far objects are from a camera or observer. We will use the PyTorch3D fragments zbuf to get the z-coordinate (depth) of the rendered bunny images. You can also use dedicated models, such as IronDepth and MiDaS, to predict the depth of an image.

MiDaS Depth Estimation
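
If you want to experiment with a learned depth estimator instead of rendered depth, the snippet below is a minimal sketch of running MiDaS through torch.hub; the model and transform names follow the MiDaS README, and the image path is a placeholder.

import numpy as np
import torch
from PIL import Image

# Load a small MiDaS model and its matching input transform via torch.hub.
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

# Placeholder image path: any RGB rendering of the bunny would do here.
img = np.array(Image.open("bunny_render.png").convert("RGB"))
input_batch = midas_transforms.small_transform(img)

with torch.no_grad():
    prediction = midas(input_batch)  # relative (inverse) depth map

print(prediction.shape)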

Back Projection

Back projection is a process in computer vision where a 2D point from an image is transformed into 3D coordinates in the world space. This involves using information like camera intrinsic parameters (focal length, principal point), camera extrinsic parameters (rotation and translation), and depth information of the point in the 2D image. By considering these parameters, back projection helps map a pixel in a 2D image back into its corresponding 3D position in the real-world space.

In our case, we will be performing back-projection using the PyTorch3D unproject_points function.
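
As a quick illustration of the API (with a placeholder camera and made-up values; the full usage appears in the reconstruction loop later), unproject_points takes per-pixel (x, y, depth) triplets and returns 3D points:

import torch
from pytorch3d.renderer import PerspectiveCameras

# A default camera just to illustrate the call; the real pipeline builds
# cameras from each rendered view's R, T, and K.
cameras = PerspectiveCameras()

# Each row is (x, y, depth) for one pixel.
xy_depth = torch.tensor([
    [0.25, 0.50, 1.0],
    [0.50, 0.50, 1.2],
    [0.75, 0.50, 0.9],
])

world_points = cameras.unproject_points(xy_depth)  # three points in world space
print(world_points)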

Mesh

Think of a polygon mesh in 3D computer graphics as the virtual skin of an object. It is made up of points called vertices, connected by lines called edges, and together they form flat surfaces known as faces. Imagine these faces as tiny triangles covering the object.

Example of a low poly triangle mesh representing a dolphin.

To create faces in Python, we will use a triangulation method that forms upper and lower triangles from the list of vertices.

Example of upper and lower triangle: here a, b, c, and d represent the vertices.
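
Concretely, for one quad of neighbouring pixels with vertex indices a, b, c, and d (the layout below is an assumption for illustration), the two faces could be built like this; the full grid version appears later in get_mesh:

import torch

# Assumed layout of one pixel quad (a: top-left, b: top-right,
# c: bottom-left, d: bottom-right):
#   a---b
#   |   |
#   c---d
a, b, c, d = 0, 1, 2, 3

upper_triangle = torch.tensor([a, c, b])   # upper-left triangle
lower_triangle = torch.tensor([c, d, b])   # lower-right triangle
faces = torch.stack([upper_triangle, lower_triangle])  # shape (2, 3)
print(faces)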

Environment Setup

To get started, download the YML file I have created, which lists all the necessary packages required to follow along with this project.

We utilized an AWS SageMaker notebook instance equipped with a GPU. Depending on your system requirements, you may need to adjust certain packages. For instance, if you are working with a CPU, install the CPU-compatible version of PyTorch3D instead of the GPU-enabled variant.

Creating the Environment: After downloading, open your command line interface and navigate to the directory containing the YML file. Create a new environment with this command:

conda env create -f environment.yml

Activating the Environment: Once created, activate your new environment:

conda activate 3d_env

Verifying the Installation: To ensure all packages are installed correctly, list them with:

conda list

Code Workflow

Our approach involves the following key steps:

  • Load the bunny obj file and define several camera locations around the bunny with a fixed radius.
  • Render the mesh to generate 2D images.
  • Estimate depth for each image.
  • Unproject the 2D image into world space using depth values.
  • Triangulate the points in the world space into vertices and faces.

Imports

import os
import json
import torch
import trimesh
import numpy as np
import open3d as o3d
import matplotlib.pyplot as plt  # used later by plot_pointcloud
from tqdm import tqdm
from PIL import Image
from typing import Callable, List, Optional, Tuple
from pytorch3d.io import load_objs_as_meshes, load_obj

import pytorch3d
from pytorch3d.structures import Meshes
import pytorch3d.utils
from pytorch3d.renderer import (
    FoVPerspectiveCameras,
    PointLights,
    Materials,
    RasterizationSettings,
    MeshRenderer,
    MeshRasterizer,
    HardPhongShader,
    TexturesUV,
    TexturesVertex,
    Textures,
)

Set device

if torch.cuda.is_available():
    device = torch.device("cuda:0")
    torch.cuda.set_device(device)
else:
    device = torch.device("cpu")

Rendering Setup

In this code snippet, we’re setting up essential parameters for rendering our 3D model:

# Configuring Rendering Parameters
img_resolution = (256, 256)

raster_settings = RasterizationSettings(
    image_size=img_resolution,
    bin_size=None,
    blur_radius=0.0,
    faces_per_pixel=1,
)

lights = PointLights(device=device, location=[[-2.0, -2.0, -5.0]])

materials = Materials(
    device=device,
    specular_color=[[0.0, 0.0, 0.0]],
    shininess=0.0
)

rasterizer = MeshRasterizer(raster_settings=raster_settings)

# Set up a renderer.
renderer = MeshRenderer(
    rasterizer=rasterizer,
    shader=HardPhongShader(device=device, lights=lights)
)
  • Image Resolution: Defines the size of the rendered images.
  • Rasterization Settings: Determines how the mesh is converted into a 2D image, including details like blur radius and faces per pixel.
  • Lighting and Materials: Configures the lighting and material properties, which affect how light interacts with the surfaces of the 3D model.
  • Rasterizer and Renderer: The rasterizer converts the 3D mesh into 2D images, and the renderer applies lighting and shading effects based on the defined settings.

Loading and Preparing the Mesh

After setting up the rendering environment, the next step is to load and prepare the 3D mesh of the Stanford Bunny. This process involves reading the mesh data, normalizing it, and setting up its texture.

def load_mesh(obj_file_path, device="cuda"):
    mesh = load_objs_as_meshes([obj_file_path], device=device)
    verts, faces = mesh.get_mesh_verts_faces(0)
    texture_rgb = torch.ones_like(verts, device=device)
    texture_rgb[:, 1:] *= 0.0  # red, by zeroing G and B
    mesh.textures = Textures(verts_rgb=texture_rgb[None])

    # Normalize mesh
    verts = verts - verts.mean(dim=0)
    verts /= verts.max()

    # This updates the pytorch3d mesh with the new vertex coordinates.
    mesh = mesh.update_padded(verts.unsqueeze(0))
    verts, faces = mesh.get_mesh_verts_faces(0)

    return mesh, verts, faces
  • Loading the Mesh: The load_objs_as_meshes function reads the OBJ file containing the Stanford Bunny mesh and loads it onto the specified device (GPU or CPU).
  • Texture Assignment: A uniform texture is created and assigned to the mesh. In this case, it’s a red texture, achieved by zeroing the green and blue channels.
  • Mesh Normalization: Normalizing the vertices of the mesh ensures that it’s centered and scaled appropriately, which is crucial for consistent rendering and reconstruction.
  • Updating the Mesh: The mesh object is updated with the new vertex coordinates post-normalization.
mesh_file_path = "stanford-bunny.obj"
mesh, verts, faces = load_mesh(mesh_file_path)

print(
    f"Loaded Mesh from : {mesh_file_path}"
    f"\nVertices: {verts.shape}"
    f"\nFaces: {faces.shape}"
)
  • Loading the Mesh: The load_mesh function is called with the path to the Stanford Bunny OBJ file. This function returns the mesh along with its vertices and faces.
  • Output Information: The print statement confirms the successful loading of the mesh. It also provides information about the number of vertices and faces in the mesh.

Visualizing the Mesh as a Point Cloud

After loading the mesh, a helpful next step is to visualize it to confirm its structure and integrity. This is done by plotting the vertices as a point cloud, providing a visual representation of the mesh.

def plot_pointcloud(
    vertices,
    alpha=.5,
    title=None,
    max_points=10_000,
    xlim=(-1, 1),
    ylim=(-1, 1),
    zlim=(-1, 1)
):
    """Plot a pointcloud tensor of shape (N, coordinates)"""
    vertices = vertices.cpu()

    assert len(vertices.shape) == 2
    N, dim = vertices.shape
    assert dim == 2 or dim == 3

    if N > max_points:
        # Subsample to keep the plot readable
        vertices = np.random.default_rng().choice(vertices, max_points, replace=False)

    fig = plt.figure(figsize=(6, 6))
    if dim == 2:
        ax = fig.add_subplot(111)
    elif dim == 3:
        ax = fig.add_subplot(111, projection='3d')
        ax.set_zlabel("z")
        ax.set_zlim(zlim)
        ax.view_init(elev=120., azim=270)

    ax.set_xlabel("x")
    ax.set_ylabel("y")

    ax.set_xlim(xlim)
    ax.set_ylim(ylim)

    ax.scatter(*vertices.T, alpha=alpha, marker=',', lw=.5, s=1, color='black')
    plt.show()


plot_pointcloud(verts)
  • Function Definition: plot_pointcloud takes the vertices of the mesh and plots them. Parameters like alpha, max_points, and axis limits (xlim, ylim, zlim) control the appearance of the plot.
  • Handling Large Point Sets: If the number of vertices exceeds max_points, a random subset is selected to keep the plot manageable and clear.
  • Plotting: The function plots in 2D or 3D based on the dimensionality of the input vertices. It uses Matplotlib for visualization.
  • Visualizing the Bunny Mesh: Finally, plot_pointcloud is called with the vertices (verts) of the Stanford Bunny mesh, resulting in a visual representation of the point cloud.
3D Point Cloud Visualization of the Stanford Bunny: A graphical representation showcasing the detailed structure and spatial complexity of the mesh model.

Camera Locations and Combined Visualization

With the mesh successfully visualized, the next code snippet adds simulated camera positions to the scene. These camera points are crucial for rendering the mesh from various angles for our 3D reconstruction.

mesh_center = torch.tensor([0.0, 0.0, 0.0], device=device)

points = generate_camera_locations(mesh_center, 3, 100)

plot_pointcloud(torch.cat((points, verts), dim=0), xlim=(-3, 3), ylim=(-3, 3), zlim=(-3, 3))
  • Center of the Mesh: We define a point, mesh_center, which acts as a reference for generating camera locations around the mesh.
  • Camera Locations: The generate_camera_locations function creates a set of points around the mesh_center at the given radius (3 here). The number of points is specified (in this case, 100), each one representing a different camera view; a sketch of this helper follows the figure caption below.
  • Combined Plot: The plot_pointcloud function is then used to plot both the vertices of the Stanford Bunny mesh (verts) and the camera locations (points). This combined plot helps to visualize not just the mesh, but also the relative positions of the cameras in 3D space.
3D Visualization of Stanford Bunny and Camera Positions: The plot illustrates the Stanford Bunny mesh in black, with surrounding camera points indicating potential viewpoints for image capture.
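
The generate_camera_locations helper itself lives in the accompanying repository; the following is a minimal sketch, assuming the cameras sit on a horizontal ring of the given radius around the center (which matches the side-only coverage we observe in the final mesh):

import math
import torch

def generate_camera_locations(center: torch.Tensor, radius: float, num_points: int) -> torch.Tensor:
    # Sketch: evenly spaced camera centers on a horizontal circle around `center`.
    angles = torch.arange(num_points, device=center.device) * (2.0 * math.pi / num_points)
    x = center[0] + radius * torch.cos(angles)
    y = center[1] + torch.zeros_like(angles)   # fixed height -> side views only
    z = center[2] + radius * torch.sin(angles)
    return torch.stack([x, y, z], dim=1)       # (num_points, 3)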

Setting Up Camera Views

This snippet is crucial for setting up camera views that ‘look at’ the mesh from the generated camera locations, ensuring that our virtual cameras are aimed at the mesh center for optimal rendering.

R_pt3d, T_pt3d = get_look_at_views(points, mesh_center.repeat(points.shape[0], 1))
  • Rotation and Translation Matrices: The get_look_at_views function computes the rotation (R_pt3d) and translation (T_pt3d) matrices for each camera position, so that each camera is oriented to face the center of the mesh.
  • Target Point Replication: mesh_center.repeat(points.shape[0], 1) replicates the mesh center coordinates for each camera point, ensuring that all cameras are directed towards the same central point of the mesh.
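
get_look_at_views is also defined in the repository; a minimal sketch built on PyTorch3D's look_at_view_transform could look like this:

from pytorch3d.renderer import look_at_view_transform

def get_look_at_views(camera_positions, targets):
    # Sketch: world-to-view rotations and translations so that each camera
    # placed at camera_positions[i] looks at targets[i] (PyTorch3D convention).
    R, T = look_at_view_transform(eye=camera_positions, at=targets, device=camera_positions.device)
    return R, T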

Defining Camera Intrinsics

The following code snippet specifies the intrinsic parameters of the cameras, which are essential for rendering the 2D images from the 3D mesh. These parameters include focal length, principal point, and other camera-specific characteristics.

K_pt3d = torch.tensor([[0.7, 0., 0.5, 0.],
[0., 0.7, 0.5, 0.],
[0., 0., 0., 1.0],
[0., 0., 1., 0.]], device=device)

Matrix K_pt3d: This tensor represents the camera's intrinsic matrix, which includes:

  • Focal lengths along the x and y axes (0.7 in this case).
  • The optical center of the camera, often assumed to be at the image’s center (0.5, 0.5 here).
  • The last row is specific to homogeneous coordinates used in projective transformations.

Normalized Pixel Coordinates Generation

Before moving forward to the main reconstruction pipeline, we need to prepare the pixel coordinates that will be used for back-projecting the 2D image pixels into the 3D space.

def get_normalized_pixel_coordinates_pt3d(
    y_resolution: int,
    x_resolution: int,
    device: torch.device = torch.device('cpu')
):
    """For an image with y_resolution and x_resolution, return a tensor of pixel coordinates
    normalized to lie in [0, 1], with the origin (0, 0) in the bottom left corner,
    the x-axis pointing right, and the y-axis pointing up. The top right corner
    being at (1, 1).

    Returns:
        xy_pix: a meshgrid of values from [0, 1] of shape
            (y_resolution, x_resolution, 2)
    """
    xs = torch.linspace(1, 0, steps=x_resolution)  # Inverted the order for x-coordinates
    ys = torch.linspace(1, 0, steps=y_resolution)  # Inverted the order for y-coordinates
    x, y = torch.meshgrid(xs, ys, indexing='xy')
    return torch.cat([x.unsqueeze(dim=2), y.unsqueeze(dim=2)], dim=2).to(device)
  • Function Definition: get_normalized_pixel_coordinates_pt3d creates a tensor of normalized pixel coordinates for an image with a given resolution. The coordinates are normalized to the range [0, 1].
  • Inversion of Axes: The function inverts the order of the x and y coordinates because in many image processing contexts, the origin (0,0) is at the top left corner, but we set it at the bottom left with the top right being (1, 1) following the PyTorch3D format.
  • Following are some major PyTorch3D format requirements that we have to follow throughout the project:
    1: The intrinsics matrix should be of shape [4, 4] instead of [3, 3].
    2: PyTorch3D takes inputs as row vectors for multiplication, so transpose pose matrices before passing them into PyTorch3D's predefined methods.
  • Discover more about the main 3D tools and how to transform our inputs into the right format by reading “Mastering 3D Spaces”.
  • Meshgrid Creation: It uses torch.meshgrid to create a grid of x and y values that correspond to the normalized pixel locations across the image.
  • Tensor Concatenation: The x and y coordinates are concatenated and expanded into a two-dimensional grid, forming a 3D tensor where each pixel location is represented by a pair of values (x, y).
  • Device Assignment: The resulting tensor is moved to the specified device (CPU by default, but can be set to GPU for faster computations).
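
A quick sanity check of the helper's output shape (a tiny 4x4 grid, just for illustration):

xy = get_normalized_pixel_coordinates_pt3d(4, 4)
print(xy.shape)  # torch.Size([4, 4, 2]), one (x, y) pair per pixel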

Mesh Cleaning Function

The clean_mesh function is a comprehensive utility that cleans and refines the mesh generated from the 3D reconstruction process. This step is crucial for ensuring that the final mesh is of high quality and free of common geometric artifacts.

def clean_mesh(vertices: torch.Tensor, faces: torch.Tensor, edge_threshold: float = 0.1, min_triangles_connected: int = -1, fill_holes: bool = True) -> trimesh.Trimesh:
    """
    Performs the following steps to clean the mesh:

    1. edge_threshold_filter
    2. remove_duplicated_vertices, remove_duplicated_triangles, remove_degenerate_triangles
    3. remove small connected components
    4. remove_unreferenced_vertices
    5. fill_holes

    :param vertices: (3, N) torch.Tensor of type torch.float32
    :param faces: (3, M) torch.Tensor of type torch.long
    :param edge_threshold: maximum length per edge (otherwise removes that face). If <=0, will not do this filtering
    :param min_triangles_connected: minimum number of triangles in a connected component (otherwise removes those faces). If <=0, will not do this filtering
    :param fill_holes: If true, will perform the trimesh fill_holes step, otherwise not.

    :return: the cleaned mesh as a trimesh.Trimesh
    """
    if edge_threshold > 0:
        # remove long edges
        faces = edge_threshold_filter(vertices, faces, edge_threshold)

    # cleanup via open3d
    mesh = torch_to_o3d_mesh(vertices, faces)
    mesh.remove_duplicated_vertices()
    mesh.remove_duplicated_triangles()
    mesh.remove_degenerate_triangles()

    if min_triangles_connected > 0:
        # remove small components via open3d
        triangle_clusters, cluster_n_triangles, cluster_area = mesh.cluster_connected_triangles()
        triangle_clusters = np.asarray(triangle_clusters)
        cluster_n_triangles = np.asarray(cluster_n_triangles)
        triangles_to_remove = cluster_n_triangles[triangle_clusters] < min_triangles_connected
        mesh.remove_triangles_by_mask(triangles_to_remove)

    # cleanup via open3d
    mesh.remove_unreferenced_vertices()

    # misc cleanups via trimesh (also gives a consistent return type)
    mesh = o3d_to_trimesh(mesh)
    mesh.process()
    if fill_holes:
        mesh.fill_holes()
    return mesh
  • Edge Thresholding: Removes faces with edges longer than a specified threshold, helping to eliminate unrealistic triangles.
  • Remove Duplicate Triangles: Using Open3D, the function removes duplicated vertices and triangles and also degenerate triangles, which are triangles with very small areas that can cause rendering issues.
  • Removing Small Components: Eliminates small, isolated groups of triangles that might be floating away from the main mesh or represent noise.
  • Unreferenced Vertices: Gets rid of vertices that are not part of any triangle, cleaning up the data structure.
  • Filling Holes: An optional step using Trimesh to fill any holes in the mesh, making it watertight and more suitable for certain applications like 3D printing or simulation.
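
clean_mesh relies on three small helpers that are not shown in the article (they come from the project repository). The sketches below show plausible implementations, assuming the (3, N) vertex and (3, M) face layouts from the docstring above:

import numpy as np
import torch
import trimesh
import open3d as o3d

def torch_to_o3d_mesh(vertices: torch.Tensor, faces: torch.Tensor) -> o3d.geometry.TriangleMesh:
    # Sketch: convert (3, N) vertices / (3, M) faces tensors into an Open3D mesh.
    mesh = o3d.geometry.TriangleMesh()
    mesh.vertices = o3d.utility.Vector3dVector(vertices.T.detach().cpu().numpy().astype(np.float64))
    mesh.triangles = o3d.utility.Vector3iVector(faces.T.detach().cpu().numpy().astype(np.int32))
    return mesh

def o3d_to_trimesh(mesh: o3d.geometry.TriangleMesh) -> trimesh.Trimesh:
    # Sketch: wrap the Open3D buffers in a Trimesh object.
    return trimesh.Trimesh(
        vertices=np.asarray(mesh.vertices),
        faces=np.asarray(mesh.triangles),
        process=False,
    )

def edge_threshold_filter(vertices: torch.Tensor, faces: torch.Tensor, edge_threshold: float) -> torch.Tensor:
    # Sketch: keep only faces whose three edges are all shorter than edge_threshold.
    p0, p1, p2 = vertices[:, faces[0]], vertices[:, faces[1]], vertices[:, faces[2]]
    keep = (
        ((p0 - p1).norm(dim=0) < edge_threshold)
        & ((p1 - p2).norm(dim=0) < edge_threshold)
        & ((p0 - p2).norm(dim=0) < edge_threshold)
    )
    return faces[:, keep]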

Mesh Triangulation from Point Cloud

This function is key to the 3D reconstruction process as it creates a mesh from a point cloud derived from depth information. It triangulates the points into faces to form a mesh using the image structure.

def get_mesh(world_space_points, depth, H, W):
    # define vertex_ids for triangulation
    '''
    00---01
    |     |
    10---11
    '''
    vertex_ids = torch.arange(H*W).reshape(H, W).to(depth.device)
    vertex_00 = remapped_vertex_00 = vertex_ids[:H-1, :W-1]
    vertex_01 = remapped_vertex_01 = (remapped_vertex_00 + 1)
    vertex_10 = remapped_vertex_10 = (remapped_vertex_00 + W)
    vertex_11 = remapped_vertex_11 = (remapped_vertex_00 + W + 1)

    # triangulation: upper-left and lower-right triangles from image structure
    faces_upper_left_triangle = torch.stack(
        [remapped_vertex_00.flatten(), remapped_vertex_10.flatten(), remapped_vertex_01.flatten()],  # counter-clockwise orientation
        dim=0
    )
    faces_lower_right_triangle = torch.stack(
        [remapped_vertex_10.flatten(), remapped_vertex_11.flatten(), remapped_vertex_01.flatten()],  # counter-clockwise orientation
        dim=0
    )

    # filter faces with -1 vertices and combine
    mask_upper_left = torch.all(faces_upper_left_triangle >= 0, dim=0)
    faces_upper_left_triangle = faces_upper_left_triangle[:, mask_upper_left]
    mask_lower_right = torch.all(faces_lower_right_triangle >= 0, dim=0)
    faces_lower_right_triangle = faces_lower_right_triangle[:, mask_lower_right]
    faces = torch.cat([faces_upper_left_triangle, faces_lower_right_triangle], dim=1)

    # clean mesh
    mesh = clean_mesh(
        vertices=world_space_points,
        faces=faces,
    )

    return mesh

3D Mesh Reconstruction

In the main reconstruction pipeline, several key utility functions (get_normalized_pixel_coordinates_pt3d, clean_mesh, and get_mesh) are used for preparing the data, refining the mesh quality, and ensuring accurate triangulation. The main reconstruction script integrates these utilities to transform 2D images into a cohesive 3D mesh.

# Reconstruction

image_size = torch.tensor([img_resolution])
K = K_pt3d.unsqueeze(0)
all_meshes = list()

for idx in tqdm(range(len(points)), desc="Processing meshes"):
    # Initialize matrices
    R = R_pt3d[idx].unsqueeze(0)
    T = T_pt3d[idx].unsqueeze(0)

    # Define camera
    cam = pytorch3d.renderer.cameras.PerspectiveCameras(
        R=R,
        T=T,
        K=K,
        in_ndc=False,
        image_size=[(1, 1)],
        device=device
    )

    # Render image
    images = renderer(mesh, cameras=cam, lights=lights)
    image = images[0]

    # Depth
    fragments = rasterizer(mesh, cameras=cam)
    depths = fragments.zbuf
    depth = depths[0]

    # Back-projection
    xy_pix = get_normalized_pixel_coordinates_pt3d(img_resolution[0], img_resolution[1], device=device)
    xy_pix = xy_pix.flatten(0, -2)
    depth = depth.flatten(0, -2)
    xyz = torch.cat((xy_pix, depth), dim=1)
    world_points = cam.unproject_points(xyz)

    # Replace NaNs with zeros and drop invalid-depth (background) pixels
    world_points = torch.where(torch.isnan(world_points), torch.zeros_like(world_points), world_points)
    world_points = world_points[depth.squeeze() != -1, :]

    num_points, _ = world_points.shape
    H = int((num_points + 1) ** 0.5)
    W = int(num_points / H)

    # Triangulation
    triangulated_mesh = get_mesh(world_space_points=world_points.T, depth=depth, H=H, W=W)

    # Append mesh
    all_meshes.append(triangulated_mesh)
  • image_size: Defines the resolution for image rendering.
  • K: Expands the camera intrinsic matrix to include a batch dimension, preparing it for multiple views.
  • all_meshes: A list initialized to store the reconstructed meshes from each camera viewpoint.

Looping through Camera Points:

  • R, T: Retrieve and reshape the rotation and translation matrices for each camera viewpoint to fit the batch processing format.
  • cam: Initialize the PerspectiveCameras object from PyTorch3D with the given rotation, translation, and camera intrinsics, defining a virtual camera.
  • images: Render the mesh for each camera view, applying the pre-defined lighting and camera settings.
  • image: Select the first image from the rendered batch for depth extraction.
Initial Rendered View of the Stanford Bunny: This image represents the first of the series of 2D renderings used in the 3D reconstruction process, capturing the silhouette and basic form of the iconic model.
  • fragments: Rasterize the mesh to get fragments, which include depth buffers among other things.
  • depths: Extract the z-buffer (depth values) from the fragments.
  • depth: Retrieve the depth buffer of the first image for back-projection.
Depth Map Rendering of the Stanford Bunny: This grayscale image depicts the depth information of the model, with varying shades representing the distance from the camera’s perspective.

Back-Projection process:

  • Normalized Pixel Coordinates (xy_pix): Create and standardize a grid of pixel coordinates using get_normalized_pixel_coordinates_pt3d, ensuring uniform mapping from 2D to 3D.
  • Flattening Coordinates (xy_pix and depth): Flatten the 2D coordinate grid and align it with the depth values for efficient data pairing.
  • Creating 3D Coordinates (xyz): Combine 2D coordinates with depth to form 3D points in the camera’s coordinate frame.
  • World Space Transformation (world_points): Using the unproject_points function, we convert these 3D points from camera coordinates to world coordinates, placing each point within a real-world 3D context.

Handling invalid points (Optional):

  • world_points: Process the data to replace NaN values and remove points with invalid depth information. This removal is critical because a z-buffer value of -1 marks pixels where no face was rasterized, i.e. background. By filtering out these values, we focus only on the target object's points, enhancing the accuracy of the reconstructed mesh and ensuring a clearer distinction between the object and its background.

Triangulation Preparation:

  • H, W: Calculate the height and width of the point cloud image based on the number of points.
  • triangulated_mesh: Use the get_mesh function to triangulate the point cloud and construct a mesh.

Finalize:

  • all_meshes: Append each newly reconstructed mesh to the list of meshes.

This code completes the 3D reconstruction by rendering 2D images from different camera angles, extracting depth information, converting it to 3D world coordinates, and then reconstructing a mesh from these points. Each step is crucial for the creation of a detailed and accurate 3D mesh from 2D images.

Mesh Simplification Function

The following function is used to simplify a 3D mesh using vertex clustering. This process is crucial for reducing the complexity of a mesh while preserving its overall shape and features.

def simplify_mesh(mesh):
    voxel_size = 0.02
    device = "cuda"
    v = mesh.vertices
    f = mesh.faces
    dtype_v = v.dtype
    dtype_f = f.dtype
    m = o3d.geometry.TriangleMesh()
    m.vertices = o3d.utility.Vector3dVector(v.astype(np.float64))
    m.triangles = o3d.utility.Vector3iVector(f.astype(np.int32))
    m = m.simplify_vertex_clustering(voxel_size=voxel_size)
    v = np.asarray(m.vertices).astype(dtype_v)
    f = np.asarray(m.triangles).astype(dtype_f)
    v = torch.from_numpy(v).to(device=device)
    f = torch.from_numpy(f).to(device=device)

    mesh = trimesh.Trimesh(
        vertices=v.cpu(),
        faces=f.cpu()
    )
    return mesh
  • Voxel Size: Sets the size of the voxel used for vertex clustering, determining the degree of simplification.
  • Device Setting: Specifies the device (e.g., CUDA for GPU) for computation.
  • Vertices and Faces: Extracts the vertices (v) and faces (f) of the input mesh, along with their data types.
  • Open3D Mesh Conversion: Converts the mesh into Open3D’s TriangleMesh format for processing.
  • Mesh Simplification: Applies the vertex clustering simplification using the specified voxel_size. This technique groups vertices within each voxel, effectively reducing the number of vertices and simplifying the mesh.
  • Reconversion to Original Format: Converts the simplified mesh back into numpy arrays and then to PyTorch tensors, maintaining the original data types.
  • Trimesh Reconstruction: Rebuilds the simplified mesh using Trimesh for further use or visualization, ensuring compatibility with other parts of the pipeline.

Mesh Merging Function

This code snippet provides a function to merge two separate 3D meshes into a single mesh. This is especially useful in scenarios where multiple reconstructed meshes from different viewpoints need to be combined into a unified model.

def merge_mesh(mesh1, mesh2):
    v1 = torch.tensor(mesh1.vertices)
    f1 = torch.tensor(mesh1.faces)

    v2 = torch.tensor(mesh2.vertices)
    f2 = torch.tensor(mesh2.faces)

    v = torch.cat([v1, v2], dim=0)
    f = torch.cat([f1, f2 + len(v1)], dim=0)

    merged_mesh = trimesh.Trimesh(
        vertices=v.cpu().numpy(),
        faces=f.cpu().numpy()
    )

    return merged_mesh

  • Vertex and Face Extraction: Extracts vertices (v1, v2) and faces (f1, f2) from the two input meshes.
  • Concatenation: Combines the vertices of both meshes (v) and adjusts the face indices of the second mesh (f2) to align with the new combined vertex list before concatenating the faces.
  • New Mesh Creation: Constructs a new Trimesh object with the combined vertices and faces, resulting in a merged mesh.
  • Return Merged Mesh: Outputs the final merged mesh.

Using the Function in the Pipeline

After defining the merging function, the code uses it to combine all reconstructed meshes stored in all_meshes:

# save combined mesh
mesh_combined = all_meshes[0]
for current_mesh in all_meshes[1:]:
    mesh_combined = merge_mesh(mesh_combined, current_mesh)
    # Optional simplification step
    # mesh_combined = simplify_mesh(mesh_combined)

save_mesh(mesh_combined, "outputs/combined_mesh.ply")
  • Initial Mesh Setup: Starts with the first mesh as the base for combination.
  • Iterative Merging: Loops through the remaining meshes, merging each one sequentially with the combined mesh.
  • Optional Simplification: There’s an option to simplify the mesh after each merge. This can reduce the overall mesh size but may also affect the mesh’s detail level.
  • Saving the Final Mesh: The final combined mesh is saved to a specified file, providing a complete 3D representation derived from multiple meshes.
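
save_mesh is another small repository helper; a minimal sketch using trimesh's exporter might look like this:

import os
import trimesh

def save_mesh(mesh: trimesh.Trimesh, path: str) -> None:
    # Sketch: make sure the output folder exists, then let trimesh pick the
    # export format from the file extension (.ply here).
    os.makedirs(os.path.dirname(path), exist_ok=True)
    mesh.export(path)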

Final 3D Mesh

Visualization

The final mesh shows a clear view of the Stanford Bunny from the sides, where we placed our cameras. However, a few parts of the bunny's top and bottom are missing because we did not have cameras looking from above or below during imaging.

Final 3D Mesh of the Stanford Bunny

Enhancing 3D Mesh Reconstructions: Addressing Missing Views

To address the missing views at the top and bottom of the bunny, two key solutions can be applied:

  • Add cameras above and below the object to get better coverage.
  • Use advanced techniques like diffusion models or inpainting to estimate and fill in the appearance of unobserved regions, resulting in a more complete 3D reconstruction.

Conclusion

Our 3D mesh reconstruction of the Stanford Bunny highlights the potential of using multiple side-view cameras for accurate model generation. While we achieved impressive results in capturing the bunny’s profile, the absence of views from above and below poses a significant challenge in achieving a fully enclosed 3D model. To address this limitation, we discussed two valuable solutions: expanding camera coverage to include missing angles or leveraging advanced techniques like diffusion models and inpainting. These approaches enhance the accuracy and comprehensiveness of 3D reconstructions.

Common Pitfalls:

  • Scaling Issues: Ensure that all input points, camera intrinsic matrices, pixel grids, and depth data are consistently scaled to prevent scaling issues that may affect the reconstruction quality.
  • Incomplete Coverage: A common pitfall is relying solely on side-view cameras, leading to missing details in the reconstruction.
  • Data Quality: Poor image quality or calibration issues can result in inaccurate depth information and affect the final mesh.
  • Post-processing Complexity: Applying advanced techniques like diffusion models requires expertise and computational resources.
  • Coordinate System Consistency: Make sure you are following the same coordinate convention or format preferred by the framework you are using. Understanding the differences and translations between coordinate systems used by OpenCV, PyTorch3D, COLMAP, and OpenGL is essential for accurate 3D reconstruction.

GitHub Repository

For those interested in exploring the code further, you can find the complete implementation on our GitHub repository: 3D-Mesh-Generation

References and Additional Resources

  • Photogrammetry: A great lecture series by Professor Cyrill Stachniss covering the basic concepts and photogrammetric techniques used for 3D reconstruction.
  • Coordinate Systems and Translations Between Frameworks: Gain a deeper understanding of coordinate systems and how they translate between popular frameworks like OpenCV, PyTorch3D, COLMAP, and OpenGL.
  • Usage of Diffusion Models for Inpainting: Explore advanced techniques like diffusion models for image inpainting and missing view estimation.
    1: Text2Room
    2: RGBD-Diffusion
  • High-fidelity surface reconstruction: Neuralangelo
  • PyTorch3D Cameras
  • PyTorch3D Documentation: Access the official PyTorch3D documentation for in-depth guidance on 3D reconstruction and computer vision tasks.
  • COLMAP Official Website: Visit the official COLMAP website for comprehensive information on structure-from-motion (SfM) and dense reconstruction software.
  • OpenCV Documentation: Explore the official documentation of OpenCV, a widely-used computer vision library.
  • Trimesh Documentation: Access documentation for Trimesh, a powerful library for 3D mesh manipulation and processing.
