Create a 3D model from a single 2D image in PyTorch.

How to efficiently train a deep learning model to reconstruct a 3D object from a single RGB image.

In recent years, Deep Learning (DL) has demonstrated outstanding capabilities in solving 2D-image tasks such as image classification, object detection and semantic segmentation. 3D graphics is no exception: DL has made tremendous progress there as well. In this post we will explore a recent attempt to extend DL to single-image 3D reconstruction, one of the most important and profound challenges in the field of 3D computer graphics.

The task

A single image is only a projection of a 3D object onto a 2D plane, so some information from the higher-dimensional space must be lost in the lower-dimensional representation. Therefore, a single-view 2D image alone never contains enough data to reconstruct its 3D counterpart.

A method that creates a 3D perception from a single 2D image therefore requires prior knowledge of the 3D shape itself.

In 2D deep learning, a convolutional autoencoder is a very efficient way to learn a compressed representation of input images. Extending this architecture to learn compact shape knowledge is the most promising way to apply deep learning to 3D data.

CNN encodes deep shape prior knowledge.
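To make the autoencoder idea concrete, here is a minimal convolutional autoencoder sketch (an illustrative toy, with layer sizes chosen by me, not the architecture from the original work): the encoder compresses an image into a small latent vector and the decoder reconstructs the image from it.

```python
import torch
import torch.nn as nn

class ConvAutoEncoder(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        # Encoder: image -> compact latent vector
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1),   # 64 -> 32
            nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1),  # 32 -> 16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, latent_dim),
        )
        # Decoder: latent vector -> reconstructed image
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 16 * 16),
            nn.Unflatten(1, (32, 16, 16)),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # 16 -> 32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),   # 32 -> 64
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

x = torch.randn(1, 3, 64, 64)
out = ConvAutoEncoder()(x)
print(out.shape)  # torch.Size([1, 3, 64, 64])
```

The 64-dimensional bottleneck forces the network to keep only the essential structure of the input; the 3D approach below reuses exactly this compression idea for shape knowledge.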

Representation of 3D data

Different representations of 3D data.

Unlike a 2D image, which has essentially one universal digital representation (the pixel grid), 3D data can be represented in many different digital formats. Each comes with its own advantages and disadvantages, so the choice of representation directly affects the approaches that can be used.

Rasterized form (voxel grids): can directly apply CNN

Each blue box is a single voxel; most of the voxels are empty.

Voxel, short for volumetric pixel, is the direct extension of 2D grid pixels into 3D grid volume elements. The locality of the voxels together defines the unique structure of this volumetric data, so the locality assumption of ConvNets still holds true in the volumetric format.

Low density of the voxel representation.

However, this representation is sparse and wasteful: the fraction of useful voxels decreases as the resolution increases.

  • Advantage: Can directly apply CNN from 2D to 3D representation.
  • Disadvantage: Wasteful representation; a high tradeoff between detail and resources (computation, memory).
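To see the sparsity concretely, here is a small NumPy sketch (a toy example of mine, not from the original post) that voxelizes a thin spherical shell at increasing resolutions and measures the fraction of occupied voxels:

```python
import numpy as np

def sphere_surface_occupancy(resolution, thickness=1.5):
    """Voxelize a thin spherical shell and return the fraction of
    occupied voxels in the whole grid."""
    # Coordinates of every voxel center, normalized to [-1, 1]
    axis = np.linspace(-1, 1, resolution)
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    r = np.sqrt(x**2 + y**2 + z**2)
    # A voxel is occupied if it lies on a shell of radius 0.8,
    # about `thickness` voxels wide
    voxel_size = 2.0 / resolution
    occupied = np.abs(r - 0.8) < (thickness * voxel_size / 2)
    return occupied.mean()

for res in (16, 32, 64, 128):
    print(res, sphere_surface_occupancy(res))
```

The occupied fraction drops roughly as 1/resolution: the surface grows quadratically with resolution while the grid grows cubically, which is exactly the waste the voxel representation suffers from.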

Geometric forms: cannot directly apply CNN

Point Cloud representation of a chair.

Polygonal mesh: a collection of vertices, edges and faces that defines the object's surface in three dimensions. It can capture granular details in a fairly compact representation.

Point cloud: a collection of points in 3D coordinates (x, y, z); together these points form a cloud that resembles the shape of an object in three dimensions. The larger the collection of points, the more detail it captures. The same set of points in a different order still represents the same 3D object.

  • Advantage: Compact representation, focused on the surface details of 3D objects.
  • Disadvantage: Cannot directly apply CNN.
# point_cloud1 and point_cloud2 represent the same 3D structure
# even though they are represented differently in memory
point_cloud1 = [(x1, y1, z1), (x2, y2, z2),..., (xn, yn, zn)]
point_cloud2 = [(x2, y2, z2), (x1, y1, z1),..., (xn, yn, zn)]


We will show an implementation that combines the advantage of the point cloud's compact representation with a traditional 2D ConvNet that learns the prior shape knowledge.

2D Structure Generator

We will build a standard 2D CNN Structure Generator that learns the prior shape knowledge of an object. The voxel approach is not desirable because it is inefficient, and it is not possible to learn a point cloud directly with a CNN. Therefore we will instead learn the mapping from a single image to multiple 2D projections of a point cloud, with a 2D projection at a viewpoint defined as: 2D projection == 3D coordinates (x, y, z) + binary mask (m)

  • Input: Single RGB image
  • Output: 2D projections at predetermined viewpoints.
#--------- PyTorch pseudo-code for Structure Generator ---------#
class Structure_Generator(nn.Module):
    # contains two modules in sequence: an encoder and a decoder
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.decoder = Decoder()

    def forward(self, RGB_image):
        # the encoder takes in one RGB image and
        # outputs an encoded deep shape embedding
        shape_embedding = self.encoder(RGB_image)

        # the decoder takes the encoded values and outputs
        # multiple 2D projections (XYZ + mask)
        XYZ, maskLogit = self.decoder(shape_embedding)

        return XYZ, maskLogit
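Encoder and Decoder are left abstract in the pseudo-code. As a rough sketch of what they might look like (the layer sizes, the number of viewpoints V and the projection resolution are my assumptions, not the original implementation):

```python
import torch
import torch.nn as nn

V = 8       # number of predetermined viewpoints (assumption)
H = W = 32  # resolution of each 2D projection (assumption)

class Encoder(nn.Module):
    """Compress a single RGB image into a latent shape embedding."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Expand the embedding into V projections: 3 XYZ channels
    plus 1 mask-logit channel per viewpoint."""
    def __init__(self, latent_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 8 * 8)
        self.net = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(64, V * 4, 4, stride=2, padding=1),           # 16 -> 32
        )

    def forward(self, z):
        out = self.net(self.fc(z).view(-1, 128, 8, 8))
        XYZ = out[:, : V * 3]        # [B, V*3, H, W] coordinate maps
        maskLogit = out[:, V * 3 :]  # [B, V, H, W] mask logits
        return XYZ, maskLogit

z = Encoder()(torch.randn(2, 3, 64, 64))
XYZ, maskLogit = Decoder()(z)
print(XYZ.shape, maskLogit.shape)
```

The key design point is that the decoder's output channels are just stacked 2D maps, so the whole generator stays within cheap, well-understood 2D convolutions.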

Point Cloud Fusion

Fuse the predicted 2D projections into a native 3D point cloud. This is possible because the viewpoints of these predictions are fixed and known beforehand.

  • Input: 2D projections at predetermined viewpoints.
  • Output: Point cloud
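The fusion step can be sketched in NumPy under simplifying assumptions (pure rotations as the known viewpoint transforms and hard mask thresholding instead of logits; the real fuse3D keeps everything differentiable):

```python
import numpy as np

def fuse_point_cloud(xyz_maps, masks, view_rotations):
    """xyz_maps: [V, H, W, 3] per-view 3D coordinate maps (camera frame)
    masks: [V, H, W] boolean foreground masks
    view_rotations: [V, 3, 3] known camera-to-world rotations
    Returns a single [N, 3] point cloud in the world frame."""
    points = []
    for xyz, mask, R in zip(xyz_maps, masks, view_rotations):
        pts = xyz[mask]           # keep only foreground pixels
        points.append(pts @ R.T)  # transform into the world frame
    return np.concatenate(points, axis=0)

# Toy example: 2 views of 4x4 projections
V, H, W = 2, 4, 4
xyz_maps = np.random.randn(V, H, W, 3)
masks = np.random.rand(V, H, W) > 0.5
rotations = np.stack([np.eye(3)] * V)
cloud = fuse_point_cloud(xyz_maps, masks, rotations)
print(cloud.shape)  # (N, 3), N = number of foreground pixels
```

Because the viewpoints are predetermined, the rotations are constants: fusing is pure bookkeeping with no learnable parameters.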


We reason that if the point cloud fused from the predicted 2D projections is any good, then 2D projections rendered from it at new viewpoints should also resemble the projections of the ground-truth 3D model.

  • Input: Point cloud
  • Output: depth images at novel viewpoints.
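The pseudo-renderer can be sketched as a simple z-buffer: project each 3D point to pixel coordinates in the novel view and keep the nearest depth per pixel. This toy version of mine is orthographic and non-differentiable; the real render2D additionally resolves point collisions in a differentiable way:

```python
import numpy as np

def pseudo_render_depth(points, H=32, W=32):
    """Orthographic toy renderer: x, y in [-1, 1] map to pixel
    coordinates, z is the depth. Keeps the nearest point per pixel."""
    depth = np.full((H, W), np.inf)
    u = ((points[:, 0] + 1) / 2 * (W - 1)).astype(int).clip(0, W - 1)
    v = ((points[:, 1] + 1) / 2 * (H - 1)).astype(int).clip(0, H - 1)
    for x, y, z in zip(u, v, points[:, 2]):
        if z < depth[y, x]:  # z-buffer test: keep the nearest point
            depth[y, x] = z
    return depth

points = np.random.uniform(-1, 1, size=(500, 3))
depth = pseudo_render_depth(points)
print(depth.shape)  # (32, 32)
```

Pixels that no point lands on stay at infinity (background); everything else is plain indexing and a minimum, again with no learnable parameters.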

Training dynamics:

Complete architecture from 2D convolution Structure Generator, Fusion and Pseudo-rendering modules.

Combining the three modules together, we obtain an end-to-end model that learns to generate a compact point cloud representation from a single 2D image, using only a 2D convolutional structure generator.

The clever trick of this model is to make the fusion + pseudo-rendering modules purely differentiable geometric reasoning:

  • Geometric algebra means no learnable parameters, making the model smaller and easier to train.
  • Differentiable means we can back-propagate the gradients through it, making it possible to use the loss from 2D projections to learn to generate 3D point cloud.
# --------- PyTorch pseudo-code for training loop ----------#
# create the 2D conv structure generator
model = Structure_Generator()
# only the structure generator's parameters need to be learned
optimizer = optim.SGD(model.parameters(), lr=0.01)

# 2D projections from predetermined viewpoints
XYZ, maskLogit = model(RGB_images)

# fused point cloud
# fuseTrans is the predetermined viewpoints info
XYZid, ML = fuse3D(XYZ, maskLogit, fuseTrans)

# render new depth images at novel viewpoints
# renderTrans is the novel viewpoints info
newDepth, newMaskLogit, collision = render2D(XYZid, ML, renderTrans)

# compute loss between the novel views and the ground truth
loss_depth = L1Loss()(newDepth, GTDepth)
loss_mask = BCEWithLogitsLoss()(newMaskLogit, GTMask)
loss_total = loss_depth + loss_mask

# back-propagation to update the Structure Generator
optimizer.zero_grad()
loss_total.backward()
optimizer.step()


  • Comparison of novel depth image from ground truth 3D model and the rendered depth image from the learned point cloud model.
  • Final result: From one single RGB image → 3D point cloud

With the detailed point cloud representation, it's possible to use MeshLab to convert it into other representations, such as a voxel grid or a polygonal mesh, that are 3D-printer compatible.
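To hand the learned point cloud to MeshLab, one convenient route is writing it out as an ASCII PLY file. A minimal writer sketch (the filename is illustrative):

```python
import numpy as np

def write_ply(points, path):
    """Write an [N, 3] point array as an ASCII PLY file that
    MeshLab can open directly."""
    header = "\n".join([
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ])
    body = "\n".join(f"{x} {y} {z}" for x, y, z in points)
    with open(path, "w") as f:
        f.write(header + "\n" + body + "\n")

points = np.random.rand(100, 3)
write_ply(points, "point_cloud.ply")
```

From MeshLab you can then run surface reconstruction (e.g. screened Poisson) to turn the points into a printable mesh.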