Bridging the Academia Gap: An Implementation of PRNet Training

Project done by Weston Ungemach, deep learning intern at BinaryVR

hyprsense · Sep 19, 2018 · 12 min read
An example of sparse and dense feature tracking, extracted from a fitted 3-dimensional mesh generated by PRNet. These were generated in real time from a single frame of a video recorded on a mobile device.

Learning-based approaches in computer vision are revolutionizing the field with lightweight, accurate models for attacking longstanding questions. Authors of research in this field frequently publish their results with only partial implementations. At BinaryVR, we believe that closing this rift between academic research and practical implementation is an essential part of the integration of virtual and augmented reality into our daily lives. Here’s an example of just one project that we have been working on to bridge this gap.

Outline

(1) Project Description: Poses the main questions addressed by PRNet.
(2) Meshes and Position Maps: Introduces the core concepts for understanding our network predictions and using them to model faces.
(3) The 300W-LP Dataset: Describes the dataset we will use to train our neural network, and notes its distinguishing features.
(4) Network Architecture: Summarizes the structure of the network.
(5) Training: Details the training process and hyperparameter choices.
(6) Results: Samples results of our network.
(7) Future Directions: Suggests natural extensions of this work.

Project Description

This past March, Feng et al. released their paper “Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network”, or PRNet, which proposes a new solution for real-time, 3-dimensional face reconstruction, pose alignment, and dense feature tracking from a single RGB image of a (human) face. Pictured below are a sample input and target output along with some applications, and a short animation of the output demonstrating the 3-dimensionality of the mask. Note that the input is just a single JPEG image.

TOP: A sample input jpg to PRNet, along with its target position map. BOTTOM: From left to right, sparse feature point prediction, dense feature point prediction, and the full 3-dimensional mesh modeling the input image, with colors from the input image.

PRNet sets new standards for speed and accuracy on 3D face reconstruction by utilizing position maps, like the colorful gradient above, for training their network. As the authors chose not to publish their training code, we decided to build and train this model ourselves to lay a foundation for future research extending these results.

A short video rotating the 3-dimensional mesh in the previous image.

Meshes and Position Maps

In computer vision, 3-dimensional objects are typically represented by a mesh, which is just a collection of vertices connected by edges and faces. The mesh models an object by assigning (x,y,z) spatial coordinates to each of the vertices. We then fill in edges connecting appropriate pairs of vertices and faces across appropriate tuples of vertices.

For the purposes of this blog post, we will fix a particular mesh — which we will call the face mesh — to use when modeling faces. The face mesh has 53,215 vertices. It is pictured here both modeling a face and flattened.

LEFT/CENTER: A face being modeled by the 300W-LP mesh, shown with the full mesh and wireframe. RIGHT: Top: A close-up of the flattened 300W-LP wireframe. Bottom: The full 300W-LP wireframe, flattened.

Now we know that a predicted facial model is a choice of (x,y,z) coordinates for each vertex in the face mesh, but how should we organize this data so that a neural network can predict it? One way would be to simply list all 53,215 × 3 predicted coordinates in a long column vector and have our network predict that vector. This approach, however, obscures the fact that nearby vertices map to nearby locations, making learning more difficult. Concretely, this deficiency would show up in the network architecture as fully-connected layers that flatten out our input image. We would prefer an approach that uses only convolutional layers, which remember which vertices are “nearby” each other and thus should receive similar predictions.

LEFT: A face modeled by the 300W-LP mesh; vertices are colored by their (x,y,z) values. This mesh “flattens out” to the color gradient on the bottom right. RIGHT: Top: The flattened 300W-LP mesh. Middle: The flattened 300W-LP mesh with vertices colored by their (x,y,z) values. Bottom: The middle image, with color values interpolated across faces.

To arrive at such an approach, we do the following: Consider the initial flattened face mesh. The output of our neural network will specify for each vertex in this flattened picture the (x,y,z) coordinates of that point in the mesh. Let’s paint the flattened picture with these coordinates, where we let the color channels (r,g,b) hold our predicted (x,y,z) location for that point. We can then interpolate across edges and faces to fill in the colors across the entire flattened mesh. Fixing a resolution of [256, 256] for this painted mesh, we can interpret this prediction as an array of shape [256, 256, 3]. This object — called the position map of the fitted mesh — will be the raw output of our neural network.

The fact that this output has two spatial dimensions will allow us to rely solely on convolutional layers in the network architecture described below.
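To make this concrete, here is a minimal sketch of how such a position map could be rasterized, assuming we already have per-vertex uv coordinates for the flattened mesh and fitted (x,y,z) coordinates for each vertex. For simplicity it interpolates over a Delaunay triangulation of the uv points via SciPy rather than over the actual mesh faces, and the function and argument names are ours rather than anything from the released PRNet code.

```python
import numpy as np
from scipy.interpolate import griddata

def rasterize_position_map(uv, xyz, resolution=256):
    """Paint per-vertex (x, y, z) values onto the flattened (uv) mesh.

    uv  : [N, 2] per-vertex uv coordinates in [0, 1]
    xyz : [N, 3] fitted spatial coordinates for the same vertices
    Returns a [resolution, resolution, 3] position map.
    """
    # Pixel-centre sample grid over the unit uv square.
    grid_u, grid_v = np.meshgrid(
        (np.arange(resolution) + 0.5) / resolution,
        (np.arange(resolution) + 0.5) / resolution,
    )

    # Linear interpolation spreads vertex values across the flattened mesh;
    # a Delaunay triangulation of the uv points stands in for the true faces.
    pos_map = np.stack(
        [griddata(uv, xyz[:, c], (grid_u, grid_v), method="linear")
         for c in range(3)], axis=-1)

    # Fill any pixels outside the triangulation with nearest-vertex values.
    holes = np.isnan(pos_map[..., 0])
    if holes.any():
        fill = np.stack(
            [griddata(uv, xyz[:, c], (grid_u[holes], grid_v[holes]),
                      method="nearest") for c in range(3)], axis=-1)
        pos_map[holes] = fill
    return pos_map.astype(np.float32)
```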

One quick technical note: Until now, we have been pretty cavalier about this “flattened” version of the face mesh, but there is actually a bit of subtlety here; the mesh is formally just a combinatorial object designating the vertices, edges, and faces without this additional flattening information. In the paper, a particular flattening — or uv-map, as it’s called in computer graphics — is chosen. There are lots of interesting ways to build uv-maps, but we would like to choose the most symmetric one that, in some sense, preserves the structure of the data. There’s a good mathematical way of doing this using something called the Tutte embedding. A good amount of work in the implementation of this paper involves understanding how to use the Tutte embedding to arrive at a good uv-map, and from there to do the interpolation procedure described above to arrive at the final smoothed-out position maps.
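For the curious, here is a small sketch of the Tutte embedding itself, under the assumptions that the face mesh is topologically a disk with a known, ordered boundary loop and that uniform edge weights are good enough; the function and argument names are our own.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def tutte_embedding(num_vertices, edges, boundary):
    """Flatten a disk-topology mesh into the plane via the Tutte embedding.

    num_vertices : total number of vertices in the mesh
    edges        : [E, 2] integer array of vertex-index pairs
    boundary     : ordered array of boundary vertex indices

    Boundary vertices are pinned to a circle; every interior vertex is placed
    at the average of its neighbors, which reduces to a sparse linear solve.
    """
    boundary = np.asarray(boundary)
    uv = np.zeros((num_vertices, 2))

    # Pin the boundary loop to the unit circle, preserving its cyclic order.
    t = 2.0 * np.pi * np.arange(len(boundary)) / len(boundary)
    uv[boundary, 0] = np.cos(t)
    uv[boundary, 1] = np.sin(t)

    interior = np.setdiff1d(np.arange(num_vertices), boundary)

    # Uniform-weight graph Laplacian L = D - A.
    rows = np.concatenate([edges[:, 0], edges[:, 1]])
    cols = np.concatenate([edges[:, 1], edges[:, 0]])
    adj = sp.coo_matrix((np.ones(len(rows)), (rows, cols)),
                        shape=(num_vertices, num_vertices)).tocsr()
    lap = sp.diags(np.asarray(adj.sum(axis=1)).ravel()) - adj

    # Interior positions solve  L_II * uv_I = -L_IB * uv_B.
    L_II = lap[interior, :][:, interior].tocsc()
    L_IB = lap[interior, :][:, boundary]
    rhs = -(L_IB @ uv[boundary])
    solve = spla.factorized(L_II)
    uv[interior, 0] = solve(rhs[:, 0])
    uv[interior, 1] = solve(rhs[:, 1])
    return uv
```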

Some sample images from the 300W-LP dataset, described below.

The 300W-LP Dataset

If we want our neural network to take in images and predict position maps for the faces in them, we, of course, need to train it on (image, position map) pairs of that form. We work with the 300W-LP dataset, which contains 61,225 in-the-Wild images of human faces, possibly in Large Pose (more on that later), along with face models fitted using the face mesh discussed above. We can apply the above flattening-and-coloring procedure to each mesh in this dataset to generate a position map for each image, which we can use to train the network described in the next section.

Before getting to that, we want to highlight two novel features of the 300W-LP dataset:

(1) 300W-LP includes faces in a variety of poses, many of which are at a large angle relative to the camera. These large pose images were synthesized from small/medium pose ones by creating a 3-dimensional model of the entire scene of the image and then rotating this 3-dimensional object in space. See the image below for an example. Details on this rotation process can be found here.

A sample image from the 300W-LP dataset, along with some of its synthesized rotations.

Note that in these large pose images, much of the face is occluded. The fact that our labels reconstruct the entire face — including the occluded part — means that we will be able to make predictions about unseen portions of the face in addition to the visible parts. This is atypical in facial feature tracking, due to the challenges presented by accurately and efficiently hand-labeling occluded features.

(2) The fitted meshes in 300W-LP not only model the faces in their corresponding images, but also align with the pose of the face in the image. This relates the coordinates of our prediction to those of the original image, which allows for a range of useful applications. Pictured below, the map down onto the image is given by orthogonal projection (i.e. forgetting the z-coordinate), which allows us to, for example, mark feature points on the image. We can also use this projection to pull data from the image back onto the mesh, like color information (see the Project Description section for an example). In the image below, the colors on the mask are given by placing the spatial (x,y,z)-coordinates in the (r,g,b) color channels.

An example from the 300W-LP dataset with its modeled face, along with projections of the facial mask and of feature points onto the image, which lies in the xy-plane.

(Note: The faces are upside down in the images to standardize them for a convention about array indexing in Python. We have left them this way so that you can visualize the correspondence between the (x,y,z) coordinates and the (r,g,b) values. For example, as you move up the green y-axis, the pixels are greener.)
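Here is a quick sketch of that orthogonal projection and color pull-back, assuming the fitted vertices are already aligned so that their (x, y) values are pixel coordinates in the image; the helper below is illustrative rather than part of any released code.

```python
import numpy as np

def project_and_sample(vertices, image):
    """Orthogonally project fitted mesh vertices onto the image plane and
    pull per-vertex colors back from the image.

    vertices : [N, 3] fitted (x, y, z) coordinates, aligned with the image
               so that (x, y) are pixel coordinates
    image    : [H, W, 3] RGB image
    Returns ([N, 2] pixel locations, [N, 3] sampled colors).
    """
    # Orthogonal projection onto the xy-plane: simply forget the z-coordinate.
    xy = vertices[:, :2]

    # Nearest-pixel color lookup, clamped to the image bounds.
    h, w = image.shape[:2]
    cols = np.clip(np.round(xy[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.round(xy[:, 1]).astype(int), 0, h - 1)
    return xy, image[rows, cols]
```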

Network Architecture

So far we have said a lot about data preparation, but nothing about the actual neural network we are going to train with the data. The network has a sleek encoder/decoder architecture, as in the following image from the original paper.

A diagram of the PRNet architecture. Green rectangles each represent residual blocks of convolutions with a skip connection, and blue rectangles each represent single transpose convolutions.

Here, the green rectangles represent residual blocks, each consisting of three successive convolutions with 4x4 filters followed by a skip connection. The blue rectangles represent transpose convolutions, again with 4x4 filters. The strides of these convolutions roughly alternate between being one and two, giving the network its hourglass shape. For details, see section 3.2 of the original paper.
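For reference, a rough Keras sketch of an encoder/decoder in this spirit is below. It follows the paper's description only loosely: the exact layer counts, channel widths, and activations here are our guesses rather than the authors' configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters, stride):
    """Three 4x4 convolutions with a skip connection (a green block)."""
    shortcut = x
    if stride != 1 or x.shape[-1] != filters:
        # Project the shortcut when the spatial size or channel count changes.
        shortcut = layers.Conv2D(filters, 1, strides=stride, padding="same")(x)
    y = layers.Conv2D(filters // 2, 4, strides=1, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters // 2, 4, strides=stride, padding="same", activation="relu")(y)
    y = layers.Conv2D(filters, 4, strides=1, padding="same")(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))

def build_prnet(input_shape=(256, 256, 3)):
    """A simplified PRNet-style encoder/decoder."""
    inp = tf.keras.Input(shape=input_shape)
    x = layers.Conv2D(16, 4, strides=1, padding="same", activation="relu")(inp)
    # Encoder: residual blocks shrink 256x256x16 down to 8x8x512.
    for filters in (32, 64, 128, 256, 512):
        x = residual_block(x, filters, stride=2)
        x = residual_block(x, filters, stride=1)
    # Decoder: 4x4 transpose convolutions grow back to a 256x256 position map.
    for filters in (512, 256, 128, 64, 32, 16):
        x = layers.Conv2DTranspose(filters, 4,
                                   strides=2 if filters != 512 else 1,
                                   padding="same", activation="relu")(x)
    # Final layer; sigmoid assumes position-map values normalized to [0, 1].
    out = layers.Conv2DTranspose(3, 4, strides=1, padding="same",
                                 activation="sigmoid")(x)
    return tf.keras.Model(inp, out)
```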

The loss function, which measures how far our predictions deviate from the ground-truth labels, is a weighted mean-squared error. That is,

The loss function for PRNet.

Here, pos is the predicted position map, \hat{pos} is the ground-truth position map, i and j each range from 0 to 255, the vertical bars denote the L² norm (applied across the channel dimension), and W is the weight mask, a [256, 256] array of integers. This weight mask gives the network a per-pixel measurement of how much we value learning that particular pixel’s location. The weights are given as follows.

The weight mask for the PRNet loss function.

The bright white feature points here carry the greatest weight of 16, the eyes-nose-mouth region carries a weight of 4, the rest of the face carries a weight of 3, and the neck is given zero weight and thus is not learned.
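In TensorFlow, this loss can be written in a few lines, given a weight mask like the one above. Note that the prose calls it a mean-squared error while the formula uses the L² norm; this sketch uses the squared distance, and swapping in a square root recovers the plain-norm version.

```python
import tensorflow as tf

def weighted_position_map_loss(pos_pred, pos_gt, weight_mask):
    """Weighted loss over position maps.

    pos_pred, pos_gt : [batch, 256, 256, 3] predicted / ground-truth maps
    weight_mask      : [256, 256] per-pixel weights (16 / 4 / 3 / 0 as above)
    """
    # Squared L2 distance across the (x, y, z) channels at each pixel.
    # Use tf.sqrt(... + eps) here instead for the plain-norm variant.
    sq_dist = tf.reduce_sum(tf.square(pos_pred - pos_gt), axis=-1)
    # Weight each pixel by the mask and average over pixels and the batch.
    w = tf.cast(weight_mask, sq_dist.dtype)[tf.newaxis, :, :]
    return tf.reduce_mean(sq_dist * w)
```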

Another technical note: The astute reader will notice that the 300W-LP mesh models the neck region, but our network isn’t learning to predict it. Thus, we cannot possibly be learning to predict face models built from the 300W-LP mesh. There were actually several different meshes floating around during this project, some of which we defined ourselves, which all serve different functions. The 300W-LP mesh models the ground-truth labels. Predictions from the network are built from the predicted position map using a different mesh. This new mesh has 43,215 vertices (rather than 300W-LP’s 53,215) and was used in our first image to give a neck-less sample network output.

A related question is: why aren’t we learning to predict the neck region? We did a bit of experimenting with non-zero weightings of the neck region, but were met with poor results; the network wasn’t learning accurate neck predictions and the convergence elsewhere was slower. We initially found this a bit surprising — shouldn’t telling the network to predict this region just be giving it better information from which to make predictions about the rest of the face? How could it hurt? The answer is the following: The neck region is very poorly represented in the data. It is often occluded or absent from the image and the data synthesis process often blurs this region pretty badly. All of this together just makes the prediction task much harder for the network.

Another key step in formatting the data for training was cropping. Different parts of 300W-LP contain images of different resolutions, but the network needs to take in an image with spatial dimensions [256, 256]. For this, we crop each input image by using its target mask to construct a bounding box, which is then resized to [256, 256]. We then need to modify the coordinates of the position map as well, to preserve the alignment for the projection detailed at the end of the previous section.

We note that we chose to take a slightly larger bounding box than the original authors, as this guaranteed that the entire face lay in the crop and provided a bit more room for the face to move around the image.
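A sketch of this cropping step is below, assuming OpenCV for the resampling; the margin value and the helper name are ours, and the bounding box is taken from the (x, y) extent of the fitted mesh.

```python
import cv2
import numpy as np

def crop_to_face(image, vertices, pos_map, margin=0.1, out_size=256):
    """Crop the image around the face and remap the position map to match.

    vertices : [N, 3] fitted mesh vertices used to build the bounding box
    pos_map  : [256, 256, 3] ground-truth position map to remap
    margin   : fractional enlargement of the box, as described above
    """
    left, top = vertices[:, :2].min(axis=0)
    right, bottom = vertices[:, :2].max(axis=0)

    # Enlarge the box, make it square, and resample to out_size x out_size.
    size = max(right - left, bottom - top) * (1.0 + margin)
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    x0, y0 = cx - size / 2.0, cy - size / 2.0
    scale = out_size / size
    M = np.float32([[scale, 0.0, -x0 * scale],
                    [0.0, scale, -y0 * scale]])
    cropped = cv2.warpAffine(image, M, (out_size, out_size))

    # Apply the same similarity transform to the position map so that the
    # orthogonal projection onto the crop stays aligned.
    new_pos = pos_map.copy()
    new_pos[..., 0] = (pos_map[..., 0] - x0) * scale
    new_pos[..., 1] = (pos_map[..., 1] - y0) * scale
    new_pos[..., 2] = pos_map[..., 2] * scale  # keep depth in the same units
    return cropped, new_pos
```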

Training the Network

The network was trained on a single GPU with a batch size of 20 and approximately 300k gradient descent updates, which is about 50 epochs. The learning rate began at 0.001, and was halved every 75k steps. We employed data augmentation as in the paper — including translation, rotation, dilation, color scaling, and occlusion — but with slightly tamer parameters than those outlined in the paper. This helped to speed up convergence and produce more reasonable results.
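The learning-rate schedule above can be expressed as a staircase exponential decay; the optimizer is not specified in this post, so the use of Adam below is an assumption.

```python
import tensorflow as tf

# Start at 1e-3 and halve every 75k gradient updates, as described above.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-3,
    decay_steps=75_000,
    decay_rate=0.5,
    staircase=True,  # step-wise halving rather than a smooth decay
)
# Optimizer choice is an assumption; the post does not name one.
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```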

We held out a validation set containing 5% of the images in 300W-LP before training. As we were not exploring different network architectures but just tuning this one, we did not extract a separate test set.

We observed some interesting behavior during training indicating that the network learned in one of two ways, depending on initialization. Sometimes, the network would first learn the bowl of the face, then start adding features like the nose, lips, and brow line. Other times, it would first collapse the whole mask to a line down the center of the face, then predict a flattened, planar mask, and eventually begin predicting depth information as well. The latter process took much longer to converge, as it seemed at first to be memorizing symmetries rather than actually examining the inputs.

Results

Here we provide a sample of our results. In all cases the various position maps, feature points, and masks were produced by our trained model.

In the position maps below, you can make out the eyes, nose, and mouth. The neck region at the bottom of the predicted position maps seems to display qualitatively different patterns depending on whether the face is facing right or left (remember that this region wasn’t learned).

Each [input, ground truth position map, prediction] triple here was taken from the validation set. Triples like these were logged to TensorBoard during training for debugging.

The following gives some examples of feature point predictions for images synthesized from 300W-LP. Note that minor occlusions are handled well, except for the one case where the occlusion happens to match skin color.

This selection of images comes from the validation set inside of 300W-LP. Note that the results do well across various poses and minor occlusions, unless they happen to match skin color.
Each [sparse feature point, ground truth mask, predicted mask] triple here was taken from the validation set.

As you can see above, the network predicts accurate and detailed masks for faces in a variety of poses. Interestingly, the network actually has the most difficulty with head-on poses, because most images in 300W-LP are at a medium or large pose. The checkerboard pattern that you can make out on some of our predictions is a well-documented consequence of the stride-two transpose convolutions in the network architecture.

The following short videos demonstrate the real-time sparse and dense feature tracking. This raw video was recorded on a mobile device and run frame-by-frame through our network on a single GPU for inference, without any averaging or smoothing. A forward pass through the network takes about 20ms.

A sample of real-time sparse feature tracking.
A sample of real-time dense feature tracking.
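The inference loop behind these videos can be as simple as the following sketch, which assumes a trained Keras-style model and a hypothetical crop_fn helper that produces a 256×256 RGB face crop per frame.

```python
import cv2
import numpy as np

def track_video(model, video_path, crop_fn):
    """Run the trained network frame-by-frame on a recorded video.

    model   : trained position-map regressor (Keras-style, with .predict)
    crop_fn : hypothetical helper returning a 256x256 RGB face crop per frame
    """
    cap = cv2.VideoCapture(video_path)
    position_maps = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        crop = crop_fn(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        # One forward pass per frame; no temporal averaging or smoothing.
        pred = model.predict(crop[np.newaxis].astype(np.float32) / 255.0)
        position_maps.append(pred[0])
    cap.release()
    return position_maps
```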

Future Directions

This work suggests a wealth of interesting applications. Here are just a few that we plan on thinking about:

  • Compress the neural network model with bit-wise quantization, singular value compression, or other methods to enable real-time inference on a mobile device.
  • Integrate features like eye-gaze and tongue tracking using tools and insights from our existing applications.
  • Collect and synthesize more data to improve tracking during large mouth poses.
  • Use the dense tracking for real-time feature transfer from one image to another.

Final Thoughts

PRNet offers a powerful, lightweight tool for real-time, three-dimensional facial reconstruction. While the model itself is not particularly novel, the new technique of position map regression could allow all sorts of applications to move from large machines to laptops and smartphones. At BinaryVR, we believe that this marks an important step in the mainstream adoption of virtual and augmented reality, and we are excited to continue this research and lead the way.

Acknowledgments

Being entrusted with a research project of this scale and impact — especially as an intern — cannot be a common experience. From the start, BinaryVR gave me the freedom to set my own goals and pursue my own solutions, all while having a great team ready to work with me through sticking points. I cannot thank them enough for the opportunity.

References

Below are some relevant references, if you are looking to dig further into the details.

  • “Joint 3D Face Reconstruction and Dense Alignment with Position Map Regression Network”: paper and code
  • “Face Alignment Across Large Poses: A 3D Solution”: paper and code/data
  • “3D Morphable Models as Spatial Transformer Networks”: paper and code
  • The Basel Face Model, which the face mesh discussed above is derived from, can be found here.

Explore open positions: https://angel.co/binaryvr/jobs
Send your resume for the internship: contact@binaryvr.com
Learn about working at BinaryVR: What Made Engineers from Tech Giants Gather at a Small AI Startup?

We are BinaryVR, aiming for seamless interaction between AI and people’s daily lives through computer vision. We develop the world’s top-quality facial motion capture solutions, HyprFace and BinaryFace, keeping constant evolution as our core value.
