Learning to Dress 3D People in Generative Clothing

Mahyar Fardinfar
4 min read · Jan 31, 2024


The paper, "Learning to Dress 3D People in Generative Clothing" (Ma et al., CVPR 2020), can be found here.

Abstract and Introduction

Recently, 3D human body modeling has become a vital task. While many models perform well on body shape and pose, they often lack the ability to render clothing in a realistic way.

This paper aims to solve this problem by combining variational autoencoders, graph neural networks, and generative adversarial networks. The proposed model is a conditional VAE-GAN, conditioned on pose and clothing type. To achieve this, the authors develop a factorized model of clothing, represented relative to the body. Challenges arise, however, for items such as shoes or jackets, which do not conform closely to the body and cannot easily be expressed as offsets from specific body parts (e.g., a shoe tip relative to the toes).

Furthermore, unlike models that work deterministically, this model learns a one-to-many mapping: a single pose can generate many different clothing variations, which greatly widens the range of clothing the model can produce.

Contributions

This paper makes several contributions, including:

  1. A stochastic clothing generation system that produces characters with high clothing variation.
  2. A model called MESH-VAE-GAN, which is a combination of GANs, GNNs, and VAEs.
  3. The ability to deform and generalize clothing naturally across poses, by making the clothing dependent on pose.
  4. The augmentation of the SMPL model with different clothes for each pose.
  5. The creation of a dataset consisting of scanned targets in both minimally clothed and fully clothed forms.

Technical Details

1. Dressing

In this study, the authors denote the body shape parameters as "beta" (β), the pose parameters as "theta" (θ), and the SMPL template as a triangular mesh with 6,890 vertices (referred to as "T"). A first function generates a person in a rest posture with minimal clothing by adding shape- and pose-dependent offsets to the template. Passing this mesh through a second function, the skinning function, rotates the joints (where J(beta) returns the joint locations from the shape parameters) and deforms the mesh into the target posture.

The function is further extended with a displacement term for clothing, conditioned on a latent code "z", the pose, and the clothing type. Clothing types are defined by upper- and lower-body combinations, for example "short-short". By sampling different values from the latent space, different clothing variations can be obtained. The output of this layer is then passed to the skinning layer to pose the clothed body, as sketched below.
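To make the ordering of these steps concrete, here is a minimal sketch of how the pieces could compose. The function names (shape_blend, pose_blend, clothing_displacement, joints, lbs) and their signatures are hypothetical placeholders, not the paper's actual API; the sketch only mirrors the pipeline described above: rest-pose body, clothing displacements, then skinning.

```python
def dress_and_pose(T, beta, theta, z, clothing_type,
                   shape_blend, pose_blend, clothing_displacement,
                   joints, lbs):
    """Compose a clothed, posed mesh in the order described above.

    T : (6890, 3) SMPL template mesh in the rest pose.
    beta, theta : shape and pose parameters.
    z : latent code sampled from the VAE's latent space.
    The callables are hypothetical stand-ins for the paper's functions.
    """
    # 1. Minimally clothed body in the rest pose: template plus
    #    shape- and pose-dependent offsets.
    body = T + shape_blend(beta) + pose_blend(theta)

    # 2. Clothing layer: per-vertex displacements predicted from the
    #    latent code, the pose, and the clothing type.
    clothed = body + clothing_displacement(z, theta, clothing_type)

    # 3. Skinning layer: rotate the joints J(beta) and deform the
    #    posture with linear blend skinning.
    return lbs(clothed, joints(beta), theta)
```

Sampling several values of z while keeping theta fixed is exactly the one-to-many behavior from the introduction: one pose, many outfits.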

2. Clothing Representation

The authors represent clothing as a graph, denoted G = (V, E), where V is the set of vertices (XYZ coordinates in 3D space) and E the set of edges. The model is trained on pairs (V_minimal, V_clothed), where V_minimal contains the vertices of the minimally clothed body. To obtain the displacements (the relative representation of clothing), the authors simply subtract the two.
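As a sketch, assuming both meshes are registered to the same 6,890-vertex SMPL topology as described above, computing the displacements is a single subtraction:

```python
import numpy as np

def clothing_displacements(V_clothed, V_minimal):
    """Per-vertex clothing offsets between the clothed scan and the
    minimally clothed body, both on the 6890-vertex template."""
    assert V_clothed.shape == V_minimal.shape == (6890, 3)
    return V_clothed - V_minimal  # (6890, 3) XYZ displacements
```

Because the two meshes are in vertex-to-vertex correspondence, this difference is well defined, and the network only has to model the clothing layer rather than the full body geometry.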

3. Network

The proposed network consists of a generator (G), a discriminator (D), and two convolutional networks (c1 and c2).

— Generator

The generator is a graph neural network that takes a displacement map (representing the clothing) and aims to reconstruct it. The discriminator detects inconsistencies in the reconstructed clothing, such as unrealistic wrinkles. To avoid losing local features, particularly small wrinkles, the authors feed patches of the generator's output to the discriminator, while the global shape of the clothing is handled by the generator's reconstruction loss.
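A minimal sketch of the patch idea, assuming precomputed vertex neighbourhoods on the template mesh (the patch_idx construction and the tensor shapes are assumptions for illustration, not the paper's exact scheme):

```python
import torch

def sample_patches(displacements, patch_idx):
    """Gather local vertex patches from a displacement map.

    displacements : (B, 6890, 3) predicted or ground-truth offsets
    patch_idx     : (P, K) long tensor of vertex indices, one row per
                    patch (e.g., k-hop neighbourhoods on the template)
    returns       : (B, P, K, 3) patches to be scored by the discriminator
    """
    return displacements[:, patch_idx, :]
```

Each patch is scored for local realism (plausible wrinkle patterns), while the reconstruction loss on the full mesh keeps the global garment shape consistent.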

— Loss

Given the complexity of the network, the loss function has several terms:

  1. A reconstruction term: a simple L1 loss between the ground-truth vertices and the predicted vertices. It preserves the overall quality of the clothing and encourages sharp edges in the output, which helps generate small wrinkles.
  2. An edge term: an L2 distance between corresponding mesh edges, which promotes smooth, fabric-like cloth surfaces.
  3. A KL term on the latent distribution, part of the variational bound of the VAE component, which keeps the approximate posterior close to a standard Gaussian prior.
  4. An adversarial (GAN) term, the usual cross-entropy game between generator and discriminator: the first part expresses the discriminator's goal of recognizing real data, while the second expresses the generator's goal of deceiving the discriminator with its output.

GAN loss:

$$\mathcal{L}_{\text{GAN}} = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z))\right)\right]$$

Main loss term:

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \lambda_{\text{edge}}\,\mathcal{L}_{\text{edge}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}} + \lambda_{\text{GAN}}\,\mathcal{L}_{\text{GAN}}$$
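Putting the four terms together, a hedged PyTorch-style sketch might look as follows. The weights w_edge, w_kl, w_gan and the exact adversarial formulation are assumptions for illustration, not the paper's reported hyperparameters:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, gt, edges, mu, logvar, d_fake,
                  w_edge=1.0, w_kl=1e-3, w_gan=0.1):
    """Sketch of the four loss terms described above (weights assumed).

    pred, gt   : (B, 6890, 3) predicted / ground-truth vertices
    edges      : (E, 2) long tensor of vertex indices per mesh edge
    mu, logvar : VAE posterior parameters
    d_fake     : discriminator logits on generated patches
    """
    # 1. L1 reconstruction: preserves overall quality, favours sharp wrinkles.
    l_rec = (pred - gt).abs().mean()

    # 2. Edge loss: L2 distance between predicted and ground-truth edge
    #    vectors, encouraging smooth, fabric-like surfaces.
    pe = pred[:, edges[:, 0]] - pred[:, edges[:, 1]]
    ge = gt[:, edges[:, 0]] - gt[:, edges[:, 1]]
    l_edge = ((pe - ge) ** 2).mean()

    # 3. KL term: keeps the approximate posterior close to a standard
    #    Gaussian prior (the variational-bound term of the VAE).
    l_kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()

    # 4. Adversarial term (generator side): fool the discriminator into
    #    scoring generated patches as real.
    l_gan = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

    return l_rec + w_edge * l_edge + w_kl * l_kl + w_gan * l_gan
```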

4. Dataset

The authors created a dataset consisting of 80,000 3D scans of male and female subjects under two conditions: minimally clothed, to capture the body shape, and fully clothed, with outfit combinations categorized by keywords such as "long-long".

Some interesting examples of the model's output:
