3D Face and Body Reconstruction

Ritik Nandwal
OffNote Labs
Sep 20, 2021


3D face/body reconstruction is the task of recovering the 3D form (a mesh) of a face or body from one or more 2D images. A large number of models and techniques have been proposed to address this problem.

In this article, we discuss some of the prominent models used for the problem: 3DMM, FLAME, SMPL, and SMPL-X.

3D Face Reconstruction

So, how do we reconstruct a 3D model of a face from a given 2D image? This is an important task in computer vision, 3D graphics, and animation. There are two main approaches.

First, we can directly regress 3D voxels (the 3D equivalent of pixels) using a CNN. This is a generic approach to 3D reconstruction: the training data consists of 2D images paired with 3D facial scans of human heads, the CNN predicts the 3D voxels, and the weights are learned by minimizing a loss function measuring the difference between the ground-truth and predicted 3D volumes.
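To make this concrete, here is a toy sketch of such a volumetric regression network in PyTorch. Everything in it (layer sizes, the 32³ output grid, the random target) is an illustrative placeholder, not the architecture of any particular paper:

```python
import torch
import torch.nn as nn

class VoxelRegressor(nn.Module):
    """Toy volumetric regression network: RGB image -> occupancy grid.
    All layer sizes are illustrative placeholders."""
    def __init__(self, grid=32):
        super().__init__()
        self.grid = grid
        self.encoder = nn.Sequential(                   # 3x128x128 -> 256x8x8
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(256 * 8 * 8, grid ** 3)   # flatten -> voxel logits

    def forward(self, img):                             # img: (B, 3, 128, 128)
        feat = self.encoder(img).flatten(1)
        logits = self.head(feat)
        return logits.view(-1, self.grid, self.grid, self.grid)

model = VoxelRegressor()
img = torch.randn(1, 3, 128, 128)
occupancy = torch.sigmoid(model(img))   # per-voxel occupancy probabilities
# Placeholder target; a real pipeline would voxelize a ground-truth 3D scan.
loss = nn.functional.binary_cross_entropy(occupancy, torch.rand(1, 32, 32, 32))
```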

Regressing 3D voxels is usually slow, as the CNN has to predict a large number of values (more than 20,000). So research has moved toward building models specific to the human face, hands, and body, and regressing their parameters from monocular 2D images. We discuss some of the prominent ones here.

3DMM: It stands for 3D Morphable Face Model, which represents a face as a linear combination of shape and texture (the variation in color across different points in 3D). A regression network predicts both identity and expression parameters (the two components of the 3DMM). The input is a 2D image with ground-truth 3DMM parameters, and the output is the predicted parameter values, learned by minimizing a loss function. The 3DMM parameters can be regressed with any backbone, such as ResNet or MobileNet. The number of parameters is much smaller than in direct voxel regression, so this approach is faster and suits real-time applications where speed matters.

3DMM (3D morphable face model)

A face can be represented as a linear combination of shape and texture (variation in color across different points in 3D). 3DMM is a statistical model that embeds the shape and texture of faces in a vector space. The shape vector consists of the (x, y, z) coordinates of the face, and the texture vector consists of the (r, g, b) values at the corresponding shape coordinates.

The neutral face, i.e. the face without any expression, is the basis of every face; it can be represented as a linear combination of shape and texture vectors.

$$S_{model} = \bar{S} + \sum_{i=1}^{m} \alpha_i s_i$$

$$T_{model} = \bar{T} + \sum_{i=1}^{m} \beta_i t_i$$

  • S_bar and T_bar → mean shape and mean texture vectors
  • m → number of eigenvectors of the shape and texture covariance matrices
  • s_i and t_i → i-th eigenvectors of the shape and texture covariance matrices
  • alpha_i and beta_i → shape and texture parameters of S_model and T_model

To add expressions, we add an expression variation term to the neutral-face equation.

$$S_{model} = \bar{S} + \sum_{i=1}^{m} \alpha_i s_i + \sum_{j=1}^{n} \alpha_j^{exp} s_j^{exp}$$

  • alpha_j^exp → expression parameters and s_j^exp → eigenvectors for expressions

The 3DMM parameters are [f, pitch, yaw, roll, t2d, alpha_id, alpha_exp], where:

  • f → a constant used as a scale factor in the equation
  • pitch, yaw, roll → rotations around three perpendicular axes
  • t2d → 2D translation vector
  • alpha_id and alpha_exp → shape (identity) and expression parameters respectively
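To make the pipeline concrete, here is a minimal NumPy sketch of assembling a mesh from regressed 3DMM parameters. The basis matrices, dimensions, and vertex count are hypothetical placeholders; real morphable models ship their own learned bases:

```python
import numpy as np

# Hypothetical model data: mean shape plus identity/expression bases.
n_vertices = 35709                              # placeholder vertex count
S_bar = np.zeros(3 * n_vertices)                # mean shape, flattened (x, y, z)
B_id  = np.random.randn(3 * n_vertices, 80)     # identity basis (eigenvectors)
B_exp = np.random.randn(3 * n_vertices, 64)     # expression basis

def reconstruct_vertices(f, R, t2d, alpha_id, alpha_exp):
    """Linear 3DMM: shape = mean + identity offsets + expression offsets,
    then a weak-perspective projection with scale f, rotation R, translation t2d."""
    shape = S_bar + B_id @ alpha_id + B_exp @ alpha_exp
    verts = shape.reshape(-1, 3)                 # (n_vertices, 3)
    projected = f * (verts @ R.T)[:, :2] + t2d   # scaled rotation, keep x and y
    return verts, projected

alpha_id, alpha_exp = np.zeros(80), np.zeros(64)
verts, proj = reconstruct_vertices(1.0, np.eye(3), np.zeros(2), alpha_id, alpha_exp)
```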

FLAME Model

FLAME is a lightweight and expressive generic head model, learned from over 33,000 accurately aligned 3D scans. To understand it, we first need to understand the SMPL model, so we return to FLAME later in the article.

Most methods for 3D human body reconstruction use the SMPL and SMPL-X models. Before going into them, we need to understand a few key concepts: skinning, blend shapes, and template meshes.

Skinning

To reconstruct a 3D body, we detect its skeletal pose and then construct a 3D mesh on it to capture the surface/shape of the body. Skinning is the method used to attach the mesh to the skeletal structure. The position of a vertex in the mesh can depend on multiple bones: each bone deforms the space around it, and we cannot attach a vertex to just a single bone, because then its motion would be rigid regardless of where it lies. For example, at a joint the skin is deformed by two or more bones, so the vertex position must be a weighted sum of the transformations of those bones, whereas a point in the middle of a limb depends almost entirely on a single bone and moves almost rigidly. Read more in [1]. A minimal skinning sketch follows the list below.

There are different types of skinning:

  • Linear: It has one disadvantage: at joints, it deforms unnaturally.
Collapse and Candy Wrapper Effects in Linear Blend Skinning
  • Dual-Quaternion: Produces almost natural deformation at joints.
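Here is a minimal linear blend skinning sketch in NumPy, as promised above. The matrix sizes are illustrative, and the bone transforms are assumed to come from a separate forward-kinematics step:

```python
import numpy as np

def linear_blend_skinning(vertices, weights, bone_transforms):
    """Linear blend skinning: each skinned vertex is the weighted sum of the
    vertex transformed by every bone it is attached to.

    vertices:        (V, 3) rest-pose positions
    weights:         (V, B) blend weights, rows sum to 1
    bone_transforms: (B, 4, 4) rigid transform of each bone
    """
    V = vertices.shape[0]
    homo = np.hstack([vertices, np.ones((V, 1))])        # (V, 4) homogeneous coords
    # Transform every vertex by every bone: (B, V, 4)
    per_bone = np.einsum('bij,vj->bvi', bone_transforms, homo)
    # Blend the candidate positions with the per-vertex bone weights.
    skinned = np.einsum('vb,bvi->vi', weights, per_bone)
    return skinned[:, :3]
```

With weights that put all their mass on one bone, this reduces to rigid attachment; the candy-wrapper artifact appears when two blended bones twist against each other.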

Blendshapes

A blend shape is a set of vertex offsets that can be linearly interpolated to generate expressions. It can be pictured as a collection of node points connected to each other: when one node is displaced, the nodes connected to it are displaced as well.
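A minimal sketch of this linear interpolation; the shape names and offset arrays are made up for illustration:

```python
import numpy as np

template = np.zeros((100, 3))            # (V, 3) neutral mesh (placeholder)
# Each blend shape stores per-vertex offsets from the neutral mesh.
blendshapes = {
    "smile":      np.random.randn(100, 3) * 0.01,
    "brow_raise": np.random.randn(100, 3) * 0.01,
}

def apply_blendshapes(template, blendshapes, weights):
    """Mesh = neutral template + weighted sum of vertex-offset blend shapes."""
    out = template.copy()
    for name, w in weights.items():
        out += w * blendshapes[name]
    return out

mesh = apply_blendshapes(template, blendshapes, {"smile": 0.7, "brow_raise": 0.2})
```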

Template Mesh/Mean Template Shape(T)

It is a 3D mesh of the human body in a neutral T-pose, representative of the mean human body shape, which is used as a base for reconstructing differently shaped human bodies.

Mean Template Shape

Blend Weights

Blend weights control how strongly the movement of each bone affects different regions of the mesh. The different colors on the template mesh above represent different blend weights.

Pose Dependent Blend Shape

The surface of the body deforms according to the person's pose. To capture this, a function is learned that adds the required pose-dependent changes to the template shape, keeping the identity of the person constant.

Identity Dependent blend Shape

A blend shape function is learned that adds identity-specific shape changes to the template shape, keeping the pose constant.

Now, we discuss two popular 3D human models: SMPL and SMPL-X.

SMPL

It is a skinned, vertex-based model (meaning it constructs a surface on top of the skeleton) that accurately represents a wide variety of human bodies in natural poses. The parameters of the model are learned from data and include the rest-pose template, blend weights, pose-dependent blend shapes, and identity-dependent blend shapes.

Recall how the 3DMM model decomposes the representation of the face into identity and expressions/texture. Similarly, SMPL decomposes the body shape representation into the identity shape and pose-dependent shape.

$$B_{shape+pose} = B_P(\theta) + B_S(\beta)$$

where B_P and B_S are the pose and shape blend shape functions.

identity-dependent shape → captures shape variation across identities, keeping the pose constant.

pose-dependent shape → captures deformation due to pose, keeping the identity constant.

Once these two functions are learned, we can add their blend shape offsets linearly to represent any body shape and pose.

Let us now see how these different components come together.

SMPL Model

The SMPL model breaks down body shape into identity and non-rigid pose-dependent shape. SMPL starts with a neutral template mesh (a), learned from thousands of 3D body scans. A blend shape function B_S takes a vector of shape parameters beta as input and outputs blend shapes (deviations from the template mesh), which are added to the template mesh to produce (b). Similarly, pose blend shapes (deformations due to the pose of the person) are computed by the function B_P and added to the result above to obtain (c). Finally, a standard blend skinning function W(·) (linear or dual-quaternion) rotates the vertices around the estimated joint centers, with smoothing (the surface must stay continuous without breaks) defined by the blend weights, producing (d).
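Putting the pieces together, here is a highly simplified sketch of the SMPL forward pass, reusing the linear_blend_skinning function from the skinning sketch above. All arrays are placeholders, and the pose features and bone transforms are assumed to be precomputed (the real model derives them from theta via a rotation-matrix feature map and a forward-kinematics chain):

```python
import numpy as np

def smpl_forward(T, B_S, B_P, beta, pose_feats, bone_transforms,
                 weights, joint_regressor):
    """Simplified SMPL forward pass (all shapes are placeholders).

    T:               (V, 3) neutral template mesh
    B_S:             (V, 3, S) identity blend shape basis, beta: (S,)
    B_P:             (V, 3, P) pose blend shape basis
    pose_feats:      (P,) pose features, e.g. flattened (R(theta) - I) per joint
    bone_transforms: (B, 4, 4) rigid bone transforms from forward kinematics
    weights:         (V, B) skinning blend weights
    joint_regressor: (J, V) regresses joint locations from vertices
    """
    shaped = T + B_S @ beta              # (a) -> (b): identity offsets
    joints = joint_regressor @ shaped    # joints follow the shaped body
    posed = shaped + B_P @ pose_feats    # (b) -> (c): pose corrective offsets
    # (c) -> (d): blend skinning around the joints (see the LBS sketch above).
    return linear_blend_skinning(posed, weights, bone_transforms), joints
```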

Now that we understand how SMPL works, we can look at the FLAME model for faces, which uses SMPL's formulation as its base and reconstructs a 3D face from a 2D image.

FLAME

The FLAME model consists of a template mesh T for the face in a neutral pose. The shape blend shape function S(·) handles identity-related shape variation on top of the template mesh, and the pose blend shape function P(·) corrects pose-dependent deformations of the head. These functions are derived from SMPL. Additionally, an expression blend shape function E(·) is added to capture facial expressions.

The skinning function is applied to the template mesh combined linearly with the shape, pose, and expression blend shapes to generate the final output.

That is, T + S(·) + P(·) + E(·) is passed to the skinning function, W(T, S, P, E), which generates the 3D face. The skinning function also rotates the vertices of the face around the joints J, linearly smoothed (so there are no gaps in the 3D reconstructed output) by the blend weights W. The final result is a 3D reconstructed face with expressions.
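A minimal sketch mirroring the SMPL one above, with placeholder bases and the expression term added:

```python
import numpy as np

def flame_forward(T, B_shape, B_pose, B_expr, beta, pose_feats, psi,
                  bone_transforms, weights):
    """Simplified FLAME forward pass: SMPL-style skinning plus an
    expression blend shape term (all arrays are placeholders)."""
    morphed = (T
               + B_shape @ beta        # S(.): identity variation
               + B_pose @ pose_feats   # P(.): pose correctives for the head
               + B_expr @ psi)         # E(.): expression offsets
    # Rotate around the head/jaw joints with the blend weights (LBS sketch above).
    return linear_blend_skinning(morphed, weights, bone_transforms)
```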

SMPL-X

SMPL-X is an extended version of SMPL, specialized for capturing hands and facial expressions, and it reconstructs gender-specific 3D bodies.

SMPL-X uses a similar decomposition to SMPL, but additionally breaks the pose parameter theta down into separate pose parameters for the jaw, the fingers, and the remaining body joints. This, in turn, helps it specialize in capturing facial expressions and hand poses.

Left: 3D body output of SMPL. Right: output of SMPL-X.

It also uses a better pose prior, and penalizes impossible poses with a collision penalty computed from the list of colliding triangles in the mesh.

One key difference from SMPL is that SMPL-X breaks the pose theta into three parts (theta_f → jaw joint, theta_h → finger joints, theta_b → remaining body joints); the corresponding terms are then combined again in a weighted-sum loss function.

Additionally, SMPL-X has some important features. It has a separate function for penalizing impossible poses. SMPL used an approximation of the negative log of a Gaussian mixture model trained on a MoCap dataset, but this prior is not very effective and fails in many cases because it makes symmetrical changes to the body: when a person moves only the right hand and keeps the left hand still, deformations still appear in the left hand. This is a drawback of SMPL.

SMPL-X instead uses a variational autoencoder (VPoser) that learns a latent representation of human pose and regularizes the distribution of the latent code to be a normal distribution.
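Here is a minimal sketch of such a VAE pose prior (not the actual VPoser code; the layer sizes are placeholders):

```python
import torch
import torch.nn as nn

class PoseVAE(nn.Module):
    """Toy VAE pose prior: encodes a body pose to a latent code whose
    distribution is pushed toward N(0, I) by a KL term."""
    def __init__(self, pose_dim=63, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, pose_dim))

    def forward(self, pose):
        h = self.enc(pose)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(z), mu, logvar

def vae_loss(pose, recon, mu, logvar):
    rec = nn.functional.mse_loss(recon, pose)
    # KL(q(z|pose) || N(0, I)) regularizes the latent distribution.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + 1e-3 * kl
```

At fitting time, the norm of the latent code can serve as the pose prior term: plausible poses map close to the origin of the regularized latent space.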

SMPL-X uses the pose space and pose-corrective blend shape parameters of the MANO hand model, and the expression space parameters of the FLAME model, because full-body scans have limited resolution for the hands and face. MANO and FLAME are specialized for 3D hand reconstruction and facial expressions respectively, so it is better to reuse their trained parameters than to train hands and expressions from scratch. SMPL-X handles body folds by using separate functions for penalizing self-collisions and physically impossible penetrations of different body parts. For gender classification, a classifier takes input images containing the full body together with OpenPose joints and assigns gender labels to the detected persons.

We now discuss two methods for 3D reconstruction, DECA and PIXIE, which employ one or more of the above models.

DECA

It is a combination of the FLAME model with UV displacement maps.

UV-Displacement Maps: They are helpful in capturing wrinkle-level deformations on a surface.

UV-Map of Face

They are also known as height maps or bump maps, and capture displacement along the surface in 3D: parts at lower height are black, while points at greater height are white. This black-and-white variation can be captured with an encoder such as a ResNet, which generates a latent code for a given input image.
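A minimal sketch of how such a displacement map deforms a mesh: each vertex is pushed along its normal by the scalar sampled from the map at its UV coordinates (nearest-neighbor sampling for brevity; all data is synthetic):

```python
import numpy as np

def apply_displacement(vertices, normals, uvs, disp_map):
    """Offset each vertex along its normal by the displacement value
    sampled at its UV coordinates.

    vertices, normals: (V, 3); uvs: (V, 2) in [0, 1]; disp_map: (H, W)
    """
    H, W = disp_map.shape
    px = np.clip((uvs[:, 0] * (W - 1)).astype(int), 0, W - 1)
    py = np.clip((uvs[:, 1] * (H - 1)).astype(int), 0, H - 1)
    d = disp_map[py, px]                      # (V,) per-vertex scalar offsets
    return vertices + d[:, None] * normals    # black (0) stays, white (1) pops out
```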

DECA Model

DECA mainly consists of two parts: coarse reconstruction and detailed reconstruction. In the coarse stage, the 3D face is reconstructed with the FLAME model, which is not good at capturing wrinkle-level deformations of the face. The detailed stage combines the coarse reconstruction with a detailed UV displacement map (generated from a low-dimensional latent representation produced by an encoder) to add person-specific facial detail.

An encoder E_d is trained to encode the input image into a 128-dimensional latent code delta that represents subject-specific details.

A decoder F_d is trained to combine FLAME's expression and jaw pose parameters with the latent code delta, and outputs the UV displacement map used to produce the final 3D reconstructed face with expressions.
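A rough sketch of how these two modules wire together (the module internals are placeholders, not DECA's actual layers):

```python
import torch
import torch.nn as nn

class DetailBranch(nn.Module):
    """Toy version of DECA's detail path: image -> latent detail code delta,
    then (delta, expression psi, jaw pose theta_jaw) -> UV displacement map."""
    def __init__(self, latent_dim=128, n_expr=50, n_jaw=3, uv=64):
        super().__init__()
        self.E_d = nn.Sequential(               # stand-in for a ResNet encoder
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, latent_dim))
        self.F_d = nn.Sequential(               # stand-in for the detail decoder
            nn.Linear(latent_dim + n_expr + n_jaw, uv * uv))
        self.uv = uv

    def forward(self, img, psi, theta_jaw):
        delta = self.E_d(img)                               # subject detail code
        code = torch.cat([delta, psi, theta_jaw], dim=-1)   # condition on expression
        return self.F_d(code).view(-1, 1, self.uv, self.uv) # UV displacement map
```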

Pixie

PIXIE combines the FLAME face model with the decomposition of the SMPL-X model. It uses the SMPL-X body model, which captures whole-body pose and shape, including facial expressions, and produces a 3D mesh. To improve 3D body reconstruction, it uses a set of expert sub-networks for the body, face/head, and hands, combines these sub-networks into a larger architecture that reconstructs each part separately, and then fuses them with a weighted sum.

Pixie Architecture

PIXIE uses different encoders for feature extraction for the hands, body, and face. It treats {body, head} and {body, hand} as complementary feature pairs and learns a separate moderator for each. A moderator is basically an MLP that takes the body feature (F_b) and a part feature (F_f or F_h) as input and fuses them with a weighted sum (a minimal sketch follows the equation below).

$$M_p(F_b, F_p) = (1 - w_p)\,F_{b \to p} + w_p\,F_p$$

  • M_p (M_f / M_h) → part moderator
  • w_p (w_f / w_h) → expert's confidence
  • F_{b→p} → body feature obtained by transforming the output of the encoder with a linear layer L_p
  • t → learned temperature parameter used when predicting w_p
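Under the assumptions above, a moderator can be sketched as a gating MLP with a temperature-scaled confidence; this is an illustrative design, not PIXIE's exact architecture:

```python
import torch
import torch.nn as nn

class Moderator(nn.Module):
    """Toy PIXIE-style moderator: fuse body and part features with a
    confidence-weighted sum (layer sizes are placeholders)."""
    def __init__(self, dim=256):
        super().__init__()
        self.L_p = nn.Linear(dim, dim)               # maps body feature to part space
        self.gate = nn.Linear(2 * dim, 1)            # predicts expert confidence
        self.log_t = nn.Parameter(torch.zeros(1))    # learned temperature

    def forward(self, F_b, F_p):
        F_b_p = self.L_p(F_b)
        w = torch.sigmoid(self.gate(torch.cat([F_b_p, F_p], -1)) / self.log_t.exp())
        return (1 - w) * F_b_p + w * F_p             # weighted-sum fusion
```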

Different regressors are used to obtain the values of the different parameters (camera, body rotation, pose, albedo, lighting, expressions, head pose, jaw pose, and body shape). For capturing details, PIXIE first computes 3D displacements on top of FLAME's surface, then converts the displacements from FLAME's UV map to SMPL-X's UV map and applies them to the estimated head shape. This is done only when the face image is not too noisy; whether the face is noisy is itself decided by a moderator.

Examples and Applications

The methods discussed above have been used in many papers. Direct Volumetric Regression [Github][Paper] regresses 3D voxels with CNNs, while 3DDFA_v2 [Paper][Github] uses 3DMM parameters with fast optimization techniques for its loss functions.

There are also GAN-based approaches to 3D face reconstruction; for reference, you can read the recent paper Fast-GANFIT [Github][Paper].

To explore more recent work in 3D face reconstruction, you could read To Fit or Not to Fit [Github][Paper] and SADRNet [Github][Paper].

In this article, we discussed different techniques for 3D face reconstruction, the models used for 3D body reconstruction, and how these models can be plugged into the pipelines of different reconstruction techniques to obtain more specialized results.

OffNote Internship Experience

My research internship went smoothly under the guidance of Dr. Nishant Sinha Sir. It was the first time I read so many research papers to understand different methods of performing the same task. At first I found it a little difficult, as I was not able to understand the papers in one go, but after a few iterations, careful study, and proper note-taking, my brain took over and started understanding things on its own. I not only understood the theory but also went through the codebases and logged the input/output values of different layers and functions, which helped me understand how they work.

My writing skills also improved as I wrote this article. When writing an article, we have to think from the reader's perspective, and order and organize the content in such a way that the reader understands it in one go. This was something I learned by writing this article. All thanks to Nishant Sir for his constant guidance.

You can view my other projects on my GitHub, and if you have any questions, you can reach me on LinkedIn.

References

  • Landmark Weighting for 3DMM Shape Fitting [Paper]
  • SMPL: A Skinned Multi-Person Linear Model [Paper] [Video]
  • SMPL-X [Paper]
  • FLAME [Paper]
  • Skinning [1] [2]
  • DECA: Detailed Expression Capture and Animation [Paper] [Code]
  • Pixie: Collaborative Regression of Expressive Bodies using Moderation [Paper]
