Demystifying KGI: Virtual Try-On with Pose-Garment Keypoints Guided Inpainting

Kailash Ahirwar
TryOn Labs
Apr 16, 2024

Virtual Try-On is transforming online shopping. It lets customers visualize how different clothing items would look on them from the comfort of their homes, making shopping more personalized and, in turn, reducing returns, a common and costly problem for online retailers.

In this article, I explain the research paper “Virtual Try-On with Pose-Garment Keypoints Guided Inpainting”, in which the authors proposed a three-step keypoints-guided inpainting approach for virtual try-on. Let’s take a closer look!

Problem

There are countless approaches to Virtual Try-On, yet existing models still distort garment shapes and lose pattern details. Most approaches first warp the garment image to the target pose and then blend the person image with the warped garment and a target segmentation map to generate the final image. Inappropriate warping of the garment image or an inaccurate estimate of the target segmentation map usually distorts the garment shape, and the blending step can blur the garment or wash out its pattern details.

Existing approaches based on the thin-plate spline transformation (TPS) also suffer from severe distortion at overlapping parts, cuffs, and necklines.

There are two approaches to performing virtual try-on:

  1. 3D model-based — the high computational cost of 3D modeling and the need for additional sensing devices make it impractical at scale.
  2. Image-based — preferred over 3D model-based approaches because it works directly on ordinary 2D images.

Why now?

Due to the explosion in online shopping, virtual try-on has become convenient and highly cost-effective. It allows one to visualize the fitting results without physically wearing the garments.

Proposed approach

The authors propose a pose-garment keypoints guided inpainting (KGI) method for the image-based virtual try-on task that produces high-fidelity try-on images and preserves the shapes and patterns of the garments well.

Key steps:

  1. Extract keypoints from the given garment and person images, respectively, and then construct graphs from the two sets of extracted keypoints.
  2. Feed the graphs into a two-stream graph convolutional network to predict pose-oriented garment keypoints.
  3. Use the predicted keypoints to perform garment warping and generate a target segmentation map. For garment warping, separate the garment into five sub-segments, namely left low, left up, center, right up, and right low, and then use the paired original/pose-oriented keypoints to warp each sub-segment individually. The final warped garment is composed from the five warped sub-segments; this handles overlapping deformations.
  4. Generate the target segmentation map using pose keypoints, pose-oriented garment keypoints, and the source segmentation map extracted from the given person image.
  5. Inputs for the try-on image generation are the warped garment image, the target segmentation map, the given person image, the source segmentation map, and the pose keypoints. To avoid blurring and loss of pattern details, the person image is recomposed with incomplete fitting areas, and the missing regions are pinpointed according to the semantic segmentation maps.

Here are the main contributions of the paper:

They proposed:

  1. A pose-garment keypoints guided inpainting method for the image-based virtual try-on task.
  2. A graph-based model to extract the pose-oriented garment keypoints for garment warping and target segmentation map estimation.
  3. A semantic-conditioned inpainting scheme to generate the final try-on image.

They also conducted extensive experiments to verify the effectiveness of KGI, showing quantitative and qualitative improvements over prior methods.

Methods

The authors proposed a three-step process:

  1. Pose-Oriented Garment Keypoints Detection — Human pose and garment keypoints are extracted from source images and constructed as graphs to predict the garment keypoints at the target pose.
  2. Segmentation Map Generation, Cloth Warping, and Person Image Recomposition — The predicted keypoints are used as guide information to predict the target segmentation map and warp the garment image.
  3. Semantic-conditioned Inpainting — The try-on image is finally generated with a semantic-conditioned inpainting scheme, using the segmentation map and the recomposed person image as conditions to avoid blurring and loss of pattern details.

Let’s dive deep into the three-step process.

1. Pose-Oriented Garment Keypoints Detection

In this step, keypoints are extracted with off-the-shelf models for human pose estimation and fashion landmark detection. The keypoints are represented as graph structures composed of nodes and edges to better model the relationships between different keypoints.

The pose graph consists of 10 nodes and 18 edges. The nodes are the joints of the upper human body, and the edges follow the corresponding skeleton connections.

Figure 3: Illustration of pose and garment graphs.

In Figure 3, 3(a) shows the pose graph with 10 nodes and 18 edges; 3(b) shows the garment graph with 32 keypoint nodes; and 3(b) and 3(c) show the 64 edges representing the garment contour and the 28 edges representing its symmetry structure.

They devised a two-stream graph neural network with graph convolution blocks depicted in Figure 4.

In Figure 4, the main stream takes the garment graph as input for node feature regression. The pose graph is embedded through the side stream, which conditions the regression task by hierarchically providing pose information to the main stream at different feature levels.
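To make the architecture more concrete, here is a minimal PyTorch sketch of a two-stream graph convolutional network in the spirit of Figure 4. It is not the authors' implementation: the layer widths, the mean-pooling fusion of pose features, and the placeholder adjacency matrices are assumptions; only the node counts (32 garment keypoints, 10 pose keypoints) follow Figure 3.

```python
import torch
import torch.nn as nn

class GraphConvBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, x, adj):
        # x: (B, N, in_dim), adj: (N, N) normalized adjacency with self-loops
        return self.act(self.linear(adj @ x))

class TwoStreamGCN(nn.Module):
    def __init__(self, dims=(2, 64, 128, 64)):
        super().__init__()
        self.garment_blocks = nn.ModuleList(
            [GraphConvBlock(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])
        self.pose_blocks = nn.ModuleList(
            [GraphConvBlock(dims[i], dims[i + 1]) for i in range(len(dims) - 1)])
        # Project pooled pose features so they can condition every garment node.
        self.fuse = nn.ModuleList(
            [nn.Linear(dims[i + 1], dims[i + 1]) for i in range(len(dims) - 1)])
        self.head = nn.Linear(dims[-1], 2)   # predict (x, y) per garment keypoint

    def forward(self, g_feat, g_adj, p_feat, p_adj):
        for g_block, p_block, fuse in zip(self.garment_blocks, self.pose_blocks, self.fuse):
            g_feat = g_block(g_feat, g_adj)
            p_feat = p_block(p_feat, p_adj)
            # Hierarchical conditioning: add pooled pose context to every garment node.
            g_feat = g_feat + fuse(p_feat.mean(dim=1, keepdim=True))
        return self.head(g_feat)             # (B, 32, 2) pose-oriented garment keypoints

# Example: batch of 4, garment graph with 32 nodes, pose graph with 10 nodes.
model = TwoStreamGCN()
g_kpts, p_kpts = torch.rand(4, 32, 2), torch.rand(4, 10, 2)
g_adj, p_adj = torch.eye(32), torch.eye(10)  # replace with real normalized adjacencies
pred = model(g_kpts, g_adj, p_kpts, p_adj)   # -> torch.Size([4, 32, 2])
```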

Objective function

The network is trained in a supervised manner. The loss function L_kp combines a node loss L_N and an edge loss L_E.

M_N and M_E denote the number of nodes and the number of edges in the graphs. x_i, x_j, x′_i, and x′_j are the features of the i-th and j-th nodes in graphs g and g′, respectively, and a_ij is 1 if an edge exists between nodes i and j and 0 otherwise.
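The formula itself appeared only as an image in the original post. A plausible reconstruction consistent with the definitions above (the exact distance and the weighting factor λ are assumptions; see the paper for the precise form) is:

```latex
L_N = \frac{1}{M_N}\sum_{i=1}^{M_N} \lVert x_i - x'_i \rVert_2^2,
\qquad
L_E = \frac{1}{M_E}\sum_{i,j} a_{ij}\,\bigl\lVert (x_i - x_j) - (x'_i - x'_j) \bigr\rVert_2^2,
\qquad
L_{kp} = L_N + \lambda\, L_E
```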

2. Segmentation Map Generation, Cloth Warping, and Person Image Recomposition

In this step, they perform the following steps:

  1. With the predicted pose-oriented garment keypoints, they generate the target segmentation map and perform cloth warping. Afterward, they recompose a person image that will later be inpainted.
  2. The garment region and part of the skin regions are removed from the source segmentation map (generated from the given person image) to produce a garment-agnostic segmentation map.
  3. They draw sketches of the human skeleton and garment contour using pose keypoints and pose-oriented garment keypoints.
  4. The sketches are stacked with the garment-agnostic segmentation map and fed into an autoencoder to generate the target segmentation map.

This model is trained with a cross-entropy loss in a supervised manner. The pose-oriented garment keypoints are also used for fine-grained garment warping.
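As an illustration, here is a hedged PyTorch sketch of this training step. The channel counts, the number of parsing classes, and the tiny stand-in autoencoder are assumptions; the point is only how the sketches and the garment-agnostic segmentation map are stacked and supervised with cross-entropy.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 13                      # assumed number of human-parsing classes

autoencoder = nn.Sequential(          # stand-in for the paper's autoencoder
    nn.Conv2d(1 + 1 + NUM_CLASSES, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, NUM_CLASSES, 1),
)

def training_step(pose_sketch, garment_sketch, agnostic_seg, target_seg):
    # pose_sketch, garment_sketch: (B, 1, H, W); agnostic_seg: (B, C, H, W) one-hot
    x = torch.cat([pose_sketch, garment_sketch, agnostic_seg], dim=1)
    logits = autoencoder(x)                                   # (B, C, H, W)
    return nn.functional.cross_entropy(logits, target_seg)    # target: (B, H, W) class ids

loss = training_step(torch.rand(2, 1, 64, 48), torch.rand(2, 1, 64, 48),
                     torch.rand(2, NUM_CLASSES, 64, 48),
                     torch.randint(0, NUM_CLASSES, (2, 64, 48)))
```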

Problem: TPS for garment warping is ineffective when encountering folding and occlusions.

Solution: Divide the garment into five sub-segments: left low, left up, center, right up, and right low. Then use the paired original and pose-oriented keypoints to warp each sub-segment individually.

Lastly, combine the five warped sub-garments and recompose the incomplete try-on image for the final inpainting step.
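Below is a hedged sketch of this per-segment warping using SciPy's thin-plate-spline RBF interpolator. It is not the authors' implementation: the backward-mapping formulation, the simple overwrite compositing, and the dummy keypoints in the usage example are assumptions.

```python
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.ndimage import map_coordinates

def tps_warp_segment(garment, seg_mask, src_kpts, tgt_kpts, out_shape):
    """Warp one garment sub-segment with a thin-plate spline fitted to its paired
    keypoints. Backward mapping: every target pixel looks up its source pixel."""
    H, W = out_shape
    # TPS that maps target keypoints back to the corresponding source keypoints.
    tps = RBFInterpolator(tgt_kpts, src_kpts, kernel="thin_plate_spline")
    ys, xs = np.mgrid[0:H, 0:W]
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    src_xy = tps(grid)                                  # source (x, y) per target pixel
    coords = [src_xy[:, 1].reshape(H, W), src_xy[:, 0].reshape(H, W)]  # (rows, cols)
    warped = np.stack(
        [map_coordinates(garment[..., c] * seg_mask, coords, order=1)
         for c in range(garment.shape[-1])], axis=-1)
    warped_mask = map_coordinates(seg_mask.astype(np.float64), coords, order=1)
    return warped, warped_mask

def warp_garment(garment, sub_masks, src_kpts_per_seg, tgt_kpts_per_seg, out_shape):
    """Warp the five sub-segments independently, then composite them so that
    overlapping regions (e.g. sleeves over the torso) are handled per segment."""
    out = np.zeros((*out_shape, garment.shape[-1]))
    for mask, src_k, tgt_k in zip(sub_masks, src_kpts_per_seg, tgt_kpts_per_seg):
        warped, wmask = tps_warp_segment(garment, mask, src_k, tgt_k, out_shape)
        out = np.where(wmask[..., None] > 0.5, warped, out)
    return out

# Usage with dummy data: five sub-segments, each with several paired (x, y) keypoints.
garment = np.random.rand(256, 192, 3)
masks = [np.ones((256, 192)) for _ in range(5)]
src_k = [np.random.rand(6, 2) * [191, 255] for _ in range(5)]
tgt_k = [np.random.rand(6, 2) * [191, 255] for _ in range(5)]
warped = warp_garment(garment, masks, src_k, tgt_k, (256, 192))
```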

3. Semantic-conditioned Inpainting

This is the final step. Here, the missing regions in the recomposed person image are inpainted according to the target segmentation map and the existing pixels. A binary mask indicates which regions are kept during inpainting. A semantic-conditioned inpainting model based on denoising diffusion is used for this step.

DDPM — an image x₀ can be transformed into white Gaussian noise by progressively adding noise over T time steps; in reverse, noise sampled from the standard Gaussian distribution can be reconstructed into an image x₀ by predicting and removing the noise step by step.

Timesteps — [1, T]

Inpainting starts from time step T. Here, x₀ is the image to be inpainted, x_T is the noise sampled from the Gaussian distribution, m is the content-keeping mask, and s is the segmentation map.

At each time step t > 1, the image x_{t−1} is the composition of x^{keep}_{t−1} (the known region, obtained by diffusing x₀ forward to the same noise level) and x^{inpaint}_{t−1} (the denoised estimate for the missing region): x_{t−1} = m ⊙ x^{keep}_{t−1} + (1 − m) ⊙ x^{inpaint}_{t−1}.
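As a concrete illustration, here is a simplified Python sketch of this keep/inpaint composition inside a DDPM sampling loop. The linear beta schedule, the zero-output stand-in for the denoiser ϵ_θ, and the tensor shapes are assumptions; the real KGI model is a trained network conditioned on the segmentation map, so treat this purely as a sketch of the composition logic.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)

def eps_theta(x_t, t, s):
    """Toy stand-in for the semantic-conditioned denoising network."""
    return torch.zeros_like(x_t)

@torch.no_grad()
def semantic_inpaint(x0, m, s):
    """x0: recomposed person image, m: content-keeping mask (1 = keep, 0 = inpaint),
    s: target segmentation map used as the semantic condition."""
    x_t = torch.randn_like(x0)                       # start from pure noise at t = T
    for t in reversed(range(T)):
        eps = eps_theta(x_t, t, s)
        # Denoised estimate for the missing region (DDPM posterior mean + noise).
        mean = (x_t - betas[t] / (1 - alphas_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
        x_inpaint = mean + betas[t].sqrt() * noise
        # Known region re-noised from x0 to noise level t-1 (exactly x0 at the end).
        ab_prev = alphas_bar[t - 1] if t > 0 else torch.tensor(1.0)
        x_keep = ab_prev.sqrt() * x0 + (1 - ab_prev).sqrt() * torch.randn_like(x0)
        # Composition step: keep known pixels, inpaint the rest.
        x_t = m * x_keep + (1 - m) * x_inpaint
    return x_t

result = semantic_inpaint(torch.rand(1, 3, 64, 48),
                          torch.ones(1, 1, 64, 48),
                          torch.zeros(1, 13, 64, 48))
```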

diffusion model — ϵ_{θ}

The network architecture uses spatially-adaptive normalization to embed the segmentation map into the diffusion model.

The overall loss function of the diffusion model is defined in terms of the following quantities: t is a time step sampled from [0, T], s is the semantic segmentation map, x₀, x_{t−1}, and xₜ are the images at the corresponding time steps, ϵ and ϵ_θ are the noise and the denoising diffusion model, respectively, and q and p_θ are the diffusion-process posterior and the distribution of the model's estimates.
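The equation was again an image in the original post. A standard form consistent with these definitions (whether the paper uses exactly this hybrid weighting λ is an assumption) is:

```latex
\mathcal{L} =
\mathbb{E}_{t,\, x_0,\, \epsilon}\Bigl[\,\bigl\lVert \epsilon - \epsilon_\theta(x_t, t, s) \bigr\rVert_2^2\,\Bigr]
\;+\; \lambda\, \mathbb{E}_{t,\, x_0}\Bigl[\, D_{\mathrm{KL}}\bigl(q(x_{t-1} \mid x_t, x_0)\,\big\Vert\, p_\theta(x_{t-1} \mid x_t, s)\bigr) \Bigr]
```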

Experiments

  1. They conducted experiments on the VITON-HD dataset consisting of 13679 pairs of garment and person images.
  2. They conducted experiments under both paired and unpaired experimental settings for fair comparison with prior methods.
  3. They used Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) to evaluate the quality of models by comparing the distributions of generated images and ground truths. For paired images, in addition to FID and KID, they also computed Structural Similarity (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS); a short sketch of computing these metrics follows below.
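For illustration, here is a hedged sketch of how these four metrics can be computed with the torchmetrics library. The dummy tensors, image sizes, and subset size are placeholders, not the authors' evaluation setup.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance
from torchmetrics.image import StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

real = torch.randint(0, 256, (8, 3, 256, 192), dtype=torch.uint8)   # ground-truth images
fake = torch.randint(0, 256, (8, 3, 256, 192), dtype=torch.uint8)   # generated try-on images

# Distribution-level metrics (usable in both paired and unpaired settings).
fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=4)                         # subset_size <= #samples
fid.update(real, real=True); fid.update(fake, real=False)
kid.update(real, real=True); kid.update(fake, real=False)

# Pairwise metrics (only meaningful when a ground-truth person image exists).
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)  # inputs in [0, 1]
real_f, fake_f = real.float() / 255.0, fake.float() / 255.0

print("FID:", fid.compute().item())
print("KID:", kid.compute()[0].item())        # compute() returns (mean, std)
print("SSIM:", ssim(fake_f, real_f).item())
print("LPIPS:", lpips(fake_f, real_f).item())
```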

Paired Settings Results

Task: The garment region of the person image is replaced with its paired garment and the generated try-on images are expected to be similar to the original person images.

  1. Experiments are conducted at three image resolutions: 1024x768, 512x384, and 256x192.
  2. KGI consistently performs best in terms of the SSIM, FID, and KID metrics at all image resolutions, with higher SSIM scores than prior methods.
  3. For LPIPS, their method performs better than the CP-VTON, ACGPN, and VITON-HD methods at all image resolutions.
  4. KGI preserves color and detailed pattern information well because, after the image recomposition, a content-keeping mask is used during the final inpainting.

Unpaired Setting Results

Task: fit an arbitrary garment on the given person image.

  1. Experiments are conducted at 1024x768 image resolution.
  2. The try-on images generated by KGI well preserve the shape, color, and textures of the garment image without obvious artifacts and semantic errors.
  3. SSIM and LPIPS are not applicable as there is no ground truth under the unpaired setting.

Conclusion

They proposed a pose-garment keypoints guided inpainting (KGI) method for image-based virtual try-on tasks, which produces high-fidelity try-on images and well preserves the patterns and shapes of the garments.

Let’s recap:

  1. Pose keypoints and garment keypoints are extracted from the source images and constructed as graphs to predict pose-oriented garment keypoints.
  2. The predicted keypoints are used as guide information for garment warping and the target segmentation map generation. The given person image is recomposed with the warped garment image based on the semantic information of the target segmentation map.
  3. The missing regions of the recomposed person image are finally filled with a semantic-conditioned inpainting scheme.

They conducted extensive experiments on the VITON-HD dataset under both paired and unpaired settings. Their qualitative and quantitative results outperform prior methods at different image resolutions.

About Tryon Labs:

We are building Tryon AI that empowers online apparel stores with Generative AI for Virtual Try-On and Cataloging. Online fashion stores spend a lot on photographers, models, and studios to create catalogs. Online shoppers sometimes struggle to pick clothes that will look nice on them.

Tryon AI cuts cataloging costs for online fashion stores and improves the shopping experience for customers, offering shoppers a seamless and immersive way to try on clothes online from the comfort of their homes.

Check out our open-source implementation of TryOnDiffusion: https://github.com/tryonlabs/tryondiffusion

Visit our website: https://www.tryonlabs.ai or contact us at contact@tryonlabs.ai

Join our discord server: https://discord.gg/HYbupTzV7E

References:

  1. HR-VITON — https://arxiv.org/abs/2206.14180
  2. VITON-HD — https://arxiv.org/abs/2103.16874
  3. DDPM — https://arxiv.org/abs/2006.11239
  4. ClothFlow — https://openaccess.thecvf.com/content_ICCV_2019/papers/Han_ClothFlow_A_Flow-Based_Model_for_Clothed_Person_Generation_ICCV_2019_paper.pdf
  5. TryOnDiffusion — https://arxiv.org/pdf/2306.08276.pdf
  6. OOTDiffusion — https://arxiv.org/pdf/2403.01779.pdf
  7. DCI-VTON — https://arxiv.org/pdf/2308.06101.pdf
  8. TryOnGAN — https://tryongan.github.io/tryongan/static_files/resources/tryongan_paper.pdf
  9. OpenPose — https://github.com/CMU-Perceptual-Computing-Lab/openpose, https://arxiv.org/pdf/1812.08008.pdf
  10. KGI implementation — https://github.com/lizhi-ntu/KGI
