# Create a 3D model from a single 2D image in PyTorch

How to efficiently train a Deep Learning model to construct a 3D object from a single RGB image.

In recent years, Deep Learning (DL) has demonstrated outstanding capabilities in solving 2D image tasks such as image classification, object detection, and semantic segmentation. DL has also made tremendous progress on 3D graphics problems. In this post we will explore a recent attempt at extending DL to the **single-image 3D reconstruction** task, one of the most important and profound challenges in the field of 3D computer graphics.

### The task

A single image is only a projection of a 3D object onto a 2D plane, so some information from the higher-dimensional space is necessarily lost in the lower-dimensional representation. A single-view 2D image therefore never contains enough data to reconstruct the full 3D object on its own.

A method that recovers 3D structure from a single 2D image therefore requires **prior knowledge** of the 3D shape itself.

In 2D Deep Learning, a convolutional autoencoder is a very efficient method for learning a compressed representation of input images. Extending this architecture to learn compact *shape* knowledge is a promising way to apply Deep Learning to 3D data.

### Representation of 3D data

Unlike a 2D image, which has one universal digital representation (a grid of pixels), 3D data can be represented digitally in many ways. Each comes with its own advantages and disadvantages, so the choice of data representation directly affects which approaches can be used.

#### Rasterized form (Voxel grids): Can directly apply CNN

A **voxel**, short for volumetric pixel, is the direct extension of a pixel on a spatial grid to a cell on a volume grid. The locality of the voxels together defines the structure of the volumetric data, so the locality assumption of ConvNets still holds in the volumetric format.

However, this representation is sparse and wasteful. The density of useful voxels decreases as the resolution increases, because an object's surface occupies a number of voxels that grows roughly with the square of the resolution while the grid itself grows with its cube.

**Advantage:** Can directly apply CNNs, extending them from 2D to the 3D representation.

**Disadvantage:** Wasteful representation, with a steep tradeoff between detail and resources (computation, memory).
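To make the sparsity claim concrete, here is a small sketch that voxelizes the surface of a unit sphere at several resolutions and reports the fraction of occupied voxels. The sphere, the sample count, and the helper name `occupied_fraction` are illustrative assumptions, not part of the original method.

```python
import torch


def occupied_fraction(resolution, n_samples=200_000, seed=0):
    """Fraction of voxels in a resolution^3 grid touched by a sphere's surface."""
    g = torch.Generator().manual_seed(seed)
    pts = torch.randn(n_samples, 3, generator=g)
    pts = pts / pts.norm(dim=1, keepdim=True)  # project samples onto the unit sphere
    # Map coordinates from [-1, 1] to integer voxel indices in [0, resolution - 1]
    idx = ((pts + 1) / 2 * resolution).long().clamp(0, resolution - 1)
    # Flatten the 3D index so that unique() counts distinct occupied voxels
    flat = (idx[:, 0] * resolution + idx[:, 1]) * resolution + idx[:, 2]
    return flat.unique().numel() / resolution ** 3


for r in (16, 32, 64):
    print(r, f"{occupied_fraction(r):.4f}")
```

The printed fraction shrinks as the resolution grows: most of the memory in a fine voxel grid is spent on empty space.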

#### Geometric forms: Cannot directly apply CNN

**Polygonal mesh**: A collection of *vertices, edges, and faces* that defines an object's surface in three dimensions. It can capture granular details in a fairly compact representation.

**Point cloud**: A *collection* of points with 3D coordinates (x, y, z). Together these points form a cloud that resembles the shape of the object in three dimensions. The larger the collection of points, the more detail it captures. The same set of points in a different order still represents the same 3D object.

**Advantage:** Compact representation that focuses on the detailed surface of 3D objects.

**Disadvantage:** Cannot directly apply CNNs.

```python
# point_cloud1 and point_cloud2 represent the same 3D structure
# even though they are represented differently in memory
point_cloud1 = [(x1, y1, z1), (x2, y2, z2),..., (xn, yn, zn)]
point_cloud2 = [(x2, y2, z2), (x1, y1, z1),..., (xn, yn, zn)]
```
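The order invariance can be checked numerically. The sketch below compares a random point cloud with a shuffled copy of itself using the Chamfer distance, a common set-to-set metric chosen here purely for illustration; the post itself does not prescribe it. The function name `chamfer_distance` is hypothetical.

```python
import torch


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a [N, 3] and b [M, 3]."""
    # Pairwise Euclidean distances via broadcasting: [N, M]
    d = (a[:, None, :] - b[None, :, :]).norm(dim=-1)
    # Average nearest-neighbour distance in both directions
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()


cloud = torch.rand(100, 3)
shuffled = cloud[torch.randperm(100)]  # same points, different order in memory

# Zero up to floating-point error, because both tensors describe the same set
print(chamfer_distance(cloud, shuffled).item())
```

A loss that compares tensors element by element would see these two clouds as different, which is one reason an ordinary CNN cannot be applied to point clouds directly.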

### Approach

We will show an implementation that combines the advantage of the point cloud's **compact representation** with a traditional **2D ConvNet** to learn the prior shape knowledge.

#### 2D Structure Generator

We will build a standard 2D CNN structure generator that learns the prior shape knowledge of an object. The *voxel approach* is not desirable because it is inefficient, and a CNN cannot directly learn a point cloud. Instead, we will learn the mapping from a single image to *multiple 2D projections* of a point cloud, where the 2D projection at a viewpoint is defined as: `2D projection == 3D coordinates (x,y,z) + binary mask (m)`

- Input: Single RGB image
- Output: 2D projections at *predetermined* viewpoints.

```python
# --------- PyTorch pseudo-code for the Structure Generator ---------
import torch.nn as nn


class Structure_Generator(nn.Module):
    """Two modules in sequence: an encoder followed by a decoder."""

    def __init__(self):
        super().__init__()  # required before registering submodules
        # Encoder and Decoder stand for 2D conv networks defined elsewhere
        self.encoder = Encoder()
        self.decoder = Decoder()

    def forward(self, RGB_image):
        # The encoder takes one RGB image and
        # outputs an encoded deep shape embedding
        shape_embedding = self.encoder(RGB_image)
        # The decoder takes the embedding and outputs
        # multiple 2D projections (XYZ + mask)
        XYZ, maskLogit = self.decoder(shape_embedding)
        return XYZ, maskLogit
```
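The pseudo-code above leaves `Encoder` and `Decoder` undefined. As a rough sketch of what they could look like, the block below uses a small conv stack and a small transposed-conv stack. The 64x64 input size, the layer widths, the latent size of 128, and the 8 viewpoints are all assumptions made for illustration, not the architecture of the referenced paper or repositories.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    """Hypothetical encoder: 64x64 RGB image -> latent shape embedding."""

    def __init__(self, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
        )
        self.fc = nn.Linear(128 * 8 * 8, latent_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))


class Decoder(nn.Module):
    """Hypothetical decoder: embedding -> XYZ maps and mask logits for n_views."""

    def __init__(self, latent_dim=128, n_views=8):
        super().__init__()
        self.n_views = n_views
        self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.Conv2d(32, n_views * 4, 1),  # 4 channels per view: x, y, z, mask
        )

    def forward(self, z):
        out = self.deconv(self.fc(z).view(-1, 128, 8, 8))
        B, _, H, W = out.shape
        out = out.view(B, self.n_views, 4, H, W)
        # XYZ: [B, V, 3, H, W], maskLogit: [B, V, H, W]
        return out[:, :, :3], out[:, :, 3]


encoder, decoder = Encoder(), Decoder()
XYZ, maskLogit = decoder(encoder(torch.randn(2, 3, 64, 64)))
print(XYZ.shape, maskLogit.shape)
```

Every operation here is an ordinary 2D convolution or linear layer; the 3D information lives only in the meaning assigned to the output channels.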

#### Point Cloud Fusion

Fuse the predicted *2D projections* into native 3D point cloud data. This is possible because the viewpoints of these predictions are fixed and known beforehand.

- Input: 2D projections at *predetermined* viewpoints.
- Output: Point cloud
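A minimal sketch of the fusion step, assuming each viewpoint is described by a known camera-to-world rotation `R` and translation `t`, and that the XYZ maps are expressed in each camera's own frame. The function name `fuse_point_cloud` and the tensor layouts are hypothetical; the referenced code may organize this differently.

```python
import torch


def fuse_point_cloud(xyz, mask_logit, R, t):
    """
    xyz:        [B, V, 3, H, W] predicted coordinates, one map per viewpoint
    mask_logit: [B, V, H, W]    predicted mask logits
    R:          [V, 3, 3]       fixed camera-to-world rotations (assumed known)
    t:          [V, 3]          fixed camera-to-world translations (assumed known)
    Returns points [B, V*H*W, 3] in a shared frame and logits [B, V*H*W].
    """
    B, V, _, H, W = xyz.shape
    pts = xyz.reshape(B, V, 3, H * W)
    # Apply each view's rigid transform to all of its pixels at once
    world = torch.einsum("vij,bvjn->bvin", R, pts) + t.view(1, V, 3, 1)
    # Merge the view and pixel axes into one unordered list of points
    world = world.permute(0, 1, 3, 2).reshape(B, V * H * W, 3)
    return world, mask_logit.reshape(B, V * H * W)


xyz = torch.randn(2, 4, 3, 5, 6, requires_grad=True)
mask_logit = torch.randn(2, 4, 5, 6)
R = torch.eye(3).repeat(4, 1, 1)  # identity rotations, just for the demo
t = torch.zeros(4, 3)
points, logits = fuse_point_cloud(xyz, mask_logit, R, t)
print(points.shape, logits.shape)
```

The transform contains only fixed matrices, so the step adds no trainable weights, yet gradients still flow from `points` back to `xyz`.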

#### Pseudo-Renderer

We reason that if the point cloud fused from the predicted 2D projections is any good, then 2D projections rendered from it at *new* viewpoints should also resemble the projections of the ground-truth 3D model.

- Input: Point cloud
- Output: Depth images at *novel* viewpoints.
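The rendering idea can be sketched as a simple z-buffer: project every point into the image plane of the new camera and keep the smallest depth per pixel. This is a simplified illustration and not necessarily how the referenced code implements its pseudo-renderer. The function name `render_depth`, the pinhole intrinsics `K`, and the `far` background value are assumptions.

```python
import torch


def render_depth(points, K, H, W, far=10.0):
    """
    points: [N, 3] points already expressed in the novel camera frame (z forward)
    K:      [3, 3] pinhole camera intrinsics (assumed)
    Returns a depth image [H, W]; pixels hit by no point keep the value `far`.
    """
    points = points[points[:, 2] > 0]          # drop points behind the camera
    z = points[:, 2]
    uvw = points @ K.T                         # homogeneous pixel coordinates
    u = (uvw[:, 0] / uvw[:, 2]).round().long()
    v = (uvw[:, 1] / uvw[:, 2]).round().long()
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    flat_idx = v[inside] * W + u[inside]
    depth = torch.full((H * W,), far, dtype=points.dtype, device=points.device)
    # When several points land on the same pixel, keep the nearest one
    depth = depth.scatter_reduce(0, flat_idx, z[inside], reduce="amin")
    return depth.view(H, W)


K = torch.eye(3)
points = torch.tensor(
    [[2.0, 2.0, 2.0],    # lands on pixel (row 1, col 1) at depth 2
     [4.0, 4.0, 4.0],    # same pixel, farther away, so it is occluded
     [0.0, 0.0, 3.0],    # lands on pixel (row 0, col 0) at depth 3
     [0.0, 0.0, -1.0]],  # behind the camera, ignored
    requires_grad=True,
)
depth = render_depth(points, K, H=4, W=4)
print(depth)
```

The pixel indices are rounded and therefore carry no gradient, but the depth values do, so a loss on the rendered image can still push the visible points nearer or farther.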

#### Training dynamics:

Combining the three modules, we obtain an end-to-end model that learns to generate a *compact point cloud* representation from *a single 2D image*, using only a *2D convolutional* structure generator.

The clever trick of this model is to make the fusion and pseudo-rendering modules purely *differentiable*, **geometric** reasoning:

- Geometric reasoning means there are no learnable parameters in these modules, which makes the model smaller and easier to train.
- Differentiable means we can back-propagate gradients through them, making it possible to use the loss on 2D projections to learn to generate a 3D point cloud.
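Both properties can be seen in a toy module. The sketch below wraps a fixed rotation in an `nn.Module`; the class name `FixedRotation` is hypothetical and stands in for any purely geometric layer.

```python
import math

import torch
import torch.nn as nn


class FixedRotation(nn.Module):
    """Rotates points about the z-axis by a fixed, non-learnable angle."""

    def __init__(self, angle):
        super().__init__()
        c, s = math.cos(angle), math.sin(angle)
        R = torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        # A buffer is stored with the module but is not a trainable parameter
        self.register_buffer("R", R)

    def forward(self, points):  # points: [N, 3]
        return points @ self.R.T


layer = FixedRotation(math.pi / 2)
pts = torch.randn(5, 3, requires_grad=True)
layer(pts).sum().backward()

print(len(list(layer.parameters())))  # count of learnable tensors in the layer
print(pts.grad.shape)                 # gradients still reach the layer's input
```

The optimizer has nothing to update inside such a layer, but whatever network produced `pts` still receives a training signal through it.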

```python
# --------- PyTorch pseudo-code for one training iteration ----------
import torch.nn as nn
import torch.optim as optim

# Create the 2D conv structure generator
model = Structure_Generator()

# Only the structure generator has parameters to learn
# (the learning rate is an illustrative value)
optimizer = optim.SGD(model.parameters(), lr=0.01)

# Clear gradients left over from the previous iteration
optimizer.zero_grad()

# 2D projections from predetermined viewpoints
XYZ, maskLogit = model(RGB_images)

# Fused point cloud; fuseTrans holds the predetermined viewpoint info
XYZid, ML = fuse3D(XYZ, maskLogit, fuseTrans)

# Render new depth images at novel viewpoints;
# renderTrans holds the novel viewpoint info
newDepth, newMaskLogit, collision = render2D(XYZid, ML, renderTrans)

# Compute the loss between the novel views and the ground truth
loss_depth = nn.L1Loss()(newDepth, GTDepth)
loss_mask = nn.BCEWithLogitsLoss()(newMaskLogit, GTMask)
loss_total = loss_depth + loss_mask

# Back-propagate to update the structure generator
loss_total.backward()
optimizer.step()
```
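Because `fuse3D` and `render2D` are not defined in this post, the iteration above cannot run on its own. The following sketch replaces the whole generator, fusion, and rendering chain with a single stand-in convolution so that the loss computation can be executed end to end. The stand-in model, the tensor sizes, and the random ground truth are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Stand-in for generator + fusion + renderer: maps an RGB image to two
# channels, read here as (depth, mask logit). Purely illustrative.
model = nn.Conv2d(3, 2, kernel_size=3, padding=1)
optimizer = optim.SGD(model.parameters(), lr=0.01)

RGB_images = torch.rand(4, 3, 16, 16)
GTDepth = torch.rand(4, 16, 16)
GTMask = (torch.rand(4, 16, 16) > 0.5).float()  # BCE targets must be floats in [0, 1]

optimizer.zero_grad()
out = model(RGB_images)
newDepth, newMaskLogit = out[:, 0], out[:, 1]

loss_depth = nn.L1Loss()(newDepth, GTDepth)
# BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits
loss_mask = nn.BCEWithLogitsLoss()(newMaskLogit, GTMask)
loss_total = loss_depth + loss_mask

loss_total.backward()
optimizer.step()
print(loss_total.item())
```

Keeping the mask as logits until the loss is the reason the pseudo-code carries `maskLogit` through every stage instead of a thresholded binary mask.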

### Results:

- Comparison of a novel depth image from the ground-truth 3D model and the rendered depth image from the learned point cloud model.

- Final result: From a single RGB image → 3D point cloud

With the detailed point cloud representation, it's possible to use MeshLab to convert it to other representations such as voxel or polygonal mesh that are 3D-printer compatible.

### References:

- PyTorch code: https://github.com/lkhphuc/pytorch-3d-point-cloud-generation
- TensorFlow code: https://github.com/chenhsuanlin/3D-point-cloud-generation
- Paper: https://arxiv.org/abs/1706.07036
- Original project website: https://chenhsuanlin.bitbucket.io/3D-point-cloud-generation/