Creating XR Content from Text: An Exploration of AI-Generated 3D Models

Raju K · Published in XRPractices
Jan 14, 2023 · 6 min read

In this article, we will explore how AI can be used to prototype XR content by converting text into 3D models. Specifically, we will delve into the capabilities of text-to-image AI and how the generated images can be transformed into 3D for use in XR prototyping.

Stable Diffusion, developed by stability.ai, is a cutting-edge technology that can transform text into photorealistic images. It can create highly detailed and intricate images, even from abstract and complex ideas. stability.ai has open-sourced the Stable Diffusion model for community use.

Here are some examples of the prompts I used to create a concept for my home interior, along with the resulting images.

Prompt:

high resolution photography of a minimalistic white interior kitchen with
wooden floor, beige blue salmon pastel, wide angle, sunlight, contrast,
realistic artstation concept art, hyperdetailed, ultradetail, cinematic 8k,
architectural rendering, unreal engine 5, rtx, volumetric light,
cozy atmosphere

Kitchen generated by Stable Diffusion
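For reference, a prompt like the one above can be run locally through the Hugging Face diffusers library. The snippet below is a minimal sketch; the model ID, precision and device are assumptions and may differ from the setup that actually produced the images in this article.

# Minimal sketch: text-to-image with Stable Diffusion via Hugging Face diffusers.
# The model ID ("runwayml/stable-diffusion-v1-5") and CUDA device are assumptions.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = ("high resolution photography of a minimalistic white interior kitchen "
          "with wooden floor, beige blue salmon pastel, wide angle, sunlight, "
          "contrast, realistic artstation concept art, hyperdetailed, ultradetail, "
          "cinematic 8k, architectural rendering, unreal engine 5, rtx, "
          "volumetric light, cozy atmosphere")

image = pipe(prompt).images[0]  # generate one image from the text prompt
image.save("kitchen.png")       # the saved output is reused in the depth step later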

I kept altering the same prompt to generate similar concepts for my other rooms. Here are the results:

Living Room generated by Stable Diffusion
Bedroom generated by Stable Diffusion

To view these results in a 3D environment such as VR or AR, additional tools and setup are required. The process of preparing the model for viewing in 3D is outlined below.

Image to GLTF Workflow

The key components of the workflow are outlined below.

Depth Detection:

from pathlib import Path

import numpy as np
import torch
from PIL import Image
from transformers import DPTFeatureExtractor, DPTForDepthEstimation

feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large",
                                              ignore_mismatched_sizes=True)

def process_image(image_path):
    image_path = Path(image_path)
    image = Image.open(image_path)
    encoding = feature_extractor(image, return_tensors="pt")

    # forward pass
    with torch.no_grad():
        outputs = model(**encoding)
        predicted_depth = outputs.predicted_depth

    # interpolate to original size
    prediction = torch.nn.functional.interpolate(
        predicted_depth.unsqueeze(1),
        size=image.size[::-1],
        mode="bicubic",
        align_corners=False,
    ).squeeze()
    output = prediction.cpu().numpy()
    depth_image = (output * 255 / np.max(output)).astype('uint8')

    img = Image.fromarray(depth_image)  # depth image generated for the input image
    return img, depth_image, np.array(image)
  • Initialise a feature extractor and a model using the “Intel/dpt-large” pre-trained model. The DPTFeatureExtractor is used to extract features from the input RGB images, and the DPTForDepthEstimation is used to predict depth maps from the extracted features.
  • Input the RGB image to the feature extractor to extract features from it.
  • Pass the extracted features to the DPTForDepthEstimation model to generate a depth map.
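Before moving on, here is a quick usage sketch. The file name is an assumption; use whichever Stable Diffusion output you saved to disk. The returned depth and RGB arrays are what the next steps operate on.

# Usage sketch: run depth estimation on one of the generated images.
# "kitchen.png" is an assumed file name for a saved Stable Diffusion output.
depth_img, depth_image, rgb_image = process_image("kitchen.png")
depth_img.save("kitchen_depth.png")  # save the depth map for visual inspection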

RGBD Image:

Next, we generate an RGBD image from the RGB and depth images using Open3D.

import open3d as o3d

depth_o3d = o3d.geometry.Image(depth_image)
image_o3d = o3d.geometry.Image(rgb_image)
rgbd_image = o3d.geometry.RGBDImage.create_from_color_and_depth(
    image_o3d, depth_o3d, convert_rgb_to_intensity=False)

Point Cloud Generation:

To generate a point cloud from this RGBD image, we need to provide Open3D with the camera intrinsic parameters that were used to capture the image. This will allow Open3D to accurately calculate the depth information from the image, and generate a reliable point cloud.

When working with AI generated images, it is important to keep in mind that there is no physical camera that was used to capture the image. As a result, the intrinsic parameters of the camera are not directly available.

One way to overcome this issue is to use information provided in the AI prompt. For example, if the prompt includes keywords such as “wide angle”, it can be assumed that the desired field of view for the image is wide. A typical wide-angle lens has a field of view ranging from 80 to 110 degrees, and this can be used to estimate the intrinsic parameters needed for the point cloud calculation. Under the pinhole camera model, the focal length in pixels is fx = (width / 2) / tan(FoV / 2); for a 768-pixel-wide image and an 80-degree horizontal field of view, that gives fx = 384 / tan(40°) ≈ 458 pixels.

w = int(depth_image.shape[1])
h = int(depth_image.shape[0])

# np.tan works in radians, so the 80-degree field of view must be converted first
fx = (w / 2) / np.tan(np.deg2rad(80 / 2))  # considering FoV 80 on the horizontal axis
fy = (h / 2) / np.tan(np.deg2rad(80 / 2))  # considering FoV 80 on the vertical axis
camera_intrinsic = o3d.camera.PinholeCameraIntrinsic()
camera_intrinsic.set_intrinsics(w, h, fx, fy, w / 2, h / 2)

The set_intrinsics method of the PinholeCameraIntrinsic class in Open3D is used to set the intrinsic parameters of a pinhole camera. It takes six parameters:

  1. width: the width of the image in pixels.
  2. height: the height of the image in pixels.
  3. fx: the horizontal focal length of the camera in pixels. This parameter determines the field of view of the camera along the x-axis.
  4. fy: the vertical focal length of the camera in pixels. This parameter determines the field of view of the camera along the y-axis.
  5. cx: the x-coordinate of the principal point (the point where the optical axis of the camera intersects the image plane) in pixels.
  6. cy: the y-coordinate of the principal point in pixels.

The width and height parameters define the resolution of the image, while the fx and fy parameters determine the field of view of the camera. The cx and cy parameters specify the location of the principal point in the image.
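To sanity-check these values, you can print the 3×3 intrinsic matrix that Open3D builds from them. The snippet below is just an inspection aid.

# The intrinsic matrix has the form [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
print(camera_intrinsic.intrinsic_matrix)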

After generating the point cloud from the RGBD image, it’s important to clean it up and construct normals.

pcd = o3d.geometry.PointCloud.create_from_rgbd_image(
    rgbd_image, camera_intrinsic)

# remove_statistical_outlier returns a filtered copy, so capture the result
pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=6, std_ratio=1.5)
pcd.normals = o3d.utility.Vector3dVector(
    np.zeros((1, 3)))  # invalidate existing normals
pcd.estimate_normals(
    search_param=o3d.geometry.KDTreeSearchParamHybrid(radius=0.01, max_nn=30))
pcd.orient_normals_towards_camera_location(
    camera_location=np.array([0., 0., 100.]))
# flip the Y and Z axes so the cloud is not upside down and faces the viewer
pcd.transform([[1, 0, 0, 0],
               [0, -1, 0, 0],
               [0, 0, -1, 0],
               [0, 0, 0, 1]])
# mirror along the X axis
pcd.transform([[-1, 0, 0, 0],
               [0, 1, 0, 0],
               [0, 0, 1, 0],
               [0, 0, 0, 1]])

The call pcd.remove_statistical_outlier(nb_neighbors=6, std_ratio=1.5) is an important one: it removes statistical outliers from the point cloud. For each point it looks at the average distance to its 6 nearest neighbours and discards points whose average distance is more than 1.5 standard deviations away from the mean, which removes noise that may be present in the point cloud. Note that the method returns a filtered copy rather than modifying the cloud in place, so the result has to be captured. The rest of the code invalidates the existing normals, recalculates them, and orients them towards the camera.

The final step in the process is to construct a mesh from the point cloud. We use the Poisson surface reconstruction method, which is known for its ability to handle noisy and incomplete point clouds, making it well suited for AI-generated images.

Next, the mesh is simplified using vertex clustering to reduce the number of triangles. This improves performance and reduces the file size of the final mesh. A bounding box is then calculated around the point cloud and the mesh is cropped to it, which eliminates any parts of the mesh that fall outside the point cloud boundary.

Finally, the mesh is smoothed with a Laplacian filter to remove artifacts and produce a visually pleasing result, and it is saved in the GLTF and PLY file formats.

depth = 10  # Poisson octree depth (assumed value; higher captures more detail)
with o3d.utility.VerbosityContextManager(o3d.utility.VerbosityLevel.Debug) as cm:
    mesh_raw, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
        pcd, depth=depth, width=0, scale=1.1, linear_fit=False)

voxel_size = 0.000005
print(f'voxel_size = {voxel_size:e}')
mesh = mesh_raw.simplify_vertex_clustering(
    voxel_size=voxel_size,
    contraction=o3d.geometry.SimplificationContraction.Average)  # simplify the mesh

bbox = pcd.get_axis_aligned_bounding_box()
mesh_crop = mesh.crop(bbox)
mesh_crop = mesh_crop.filter_smooth_laplacian(number_of_iterations=2)

image_path = Path("kitchen.png")  # assumed input file, reused for the output names
gltf_path = f'./{image_path.stem}.gltf'
o3d.io.write_triangle_mesh(
    gltf_path, mesh_crop, write_triangle_uvs=True)
ply_path = f'./{image_path.stem}.ply'
o3d.io.write_triangle_mesh(
    ply_path, mesh_crop, write_triangle_uvs=True)

The GLTF file format is widely used in the VR and AR industry and can be easily imported into various VR and AR development platforms and libraries. The PLY file format, on the other hand, is commonly used for 3D scanning and modelling software and can be used to further refine and edit the mesh if needed.
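Before importing the mesh into an XR engine, you can give it a quick look locally. Below is a minimal sketch using Open3D's built-in viewer; the file name assumes the kitchen example from earlier, and if your Open3D build does not read glTF, the PLY output works the same way.

# Quick local preview of the exported mesh with Open3D's viewer.
# "kitchen.gltf" is the file written in the previous step (assumed name).
mesh = o3d.io.read_triangle_mesh("./kitchen.gltf")
mesh.compute_vertex_normals()  # needed for shaded rendering
o3d.visualization.draw_geometries([mesh])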

Here are the GLTF models for the above images:

Living Room
Bed Room
Kitchen

The visual quality isn’t that great, right? Hold on. We have yet to discuss one additional step: texturing. Texturing is the process of adding colors and textures to a 3D model, which can greatly improve its visual appeal. When applied correctly, it can make a 3D model look more realistic and lifelike. It’s a subject worth exploring further in the next article, if I get enough claps and shares for this one :)

As promised, here is the next article.
