Demystifying DCI-VTON: Diffusion-Based Conditional Inpainting for Virtual Try-On

Unnikrishnan Menon
TryOn Labs
Mar 27, 2024

With the rising popularity of diffusion models in this era of Generative AI, virtual try-on is an application that has attracted many researchers. It synthesizes a new composite image by combining the characteristics of different images, typically a person image and a garment image, giving consumers a more realistic preview of how clothes will look on them. This is a complex task with a variety of challenges, such as generating the required garment in the correct orientation, preserving its finer details, and retaining the characteristics of the target human to produce a realistic output.

Diffusion models have recently surfaced as a promising alternative for producing high-quality images across a range of applications, compared to Generative Adversarial Networks (GANs), which suffer from issues like mode collapse. This article aims to concisely explain the Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON) mechanism proposed in the paper “Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow” (https://arxiv.org/pdf/2308.06101.pdf), which employs diffusion models to achieve virtual try-on by framing it as an inpainting task.

The official GitHub implementation link for this paper: https://github.com/bcmi/DCI-VTON-Virtual-Try-On/tree/main?tab=readme-ov-file

DCI-VTON utilizes two main models to produce virtual try-on outputs:

Warping Network

This model is responsible for warping the cloth image so that its orientation aligns with the pose and body characteristics of the target person. This is done using a “Parser-Based Appearance Flow Network (PB-AFN)”, which generates a dense flow field: a set of coordinate vectors, each indicating which pixel in the cloth image should be used to fill in a given pixel in the person image. [PB-AFN paper link: https://openaccess.thecvf.com/content/CVPR2021/papers/Ge_Parser-Free_Virtual_Try-On_via_Distilling_Appearance_Flows_CVPR_2021_paper.pdf]

In DCI-VTON, the inputs to the warping module are the cloth image I𝒸 along with the human segmentation map Sₚ (this segmentation map is cloth-agnostic so that the clothing already worn in the input human image does not interfere) and the DensePose map P, concatenated along the channel dimension.
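To make the flow idea concrete, here is a minimal, illustrative PyTorch sketch of flow-based warping. It is not the authors' code: the flow here is a dummy zero (identity) field purely to show the sampling mechanics that a predicted appearance flow would drive.

```python
# Minimal sketch of flow-based warping: given a dense appearance flow that says,
# for every pixel of the person image, where to sample from the cloth image,
# torch.nn.functional.grid_sample performs the lookup.
import torch
import torch.nn.functional as F

B, H, W = 1, 256, 192
cloth = torch.rand(B, 3, H, W)   # cloth image I_c
flow = torch.zeros(B, 2, H, W)   # per-pixel (x, y) offsets; identity flow for illustration

# Build a sampling grid in normalized [-1, 1] coordinates, as grid_sample expects.
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
base = torch.stack([xs, ys], dim=-1).float().unsqueeze(0)   # (B, H, W, 2)
grid = base + flow.permute(0, 2, 3, 1)                      # add the predicted offsets
grid[..., 0] = 2.0 * grid[..., 0] / (W - 1) - 1.0           # x -> [-1, 1]
grid[..., 1] = 2.0 * grid[..., 1] / (H - 1) - 1.0           # y -> [-1, 1]

warped_cloth = F.grid_sample(cloth, grid, mode="bilinear",
                             padding_mode="border", align_corners=True)
```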

Overview of the method described in the DCI-VTON paper. Taken from Fig. 2 of the paper: https://arxiv.org/pdf/2308.06101.pdf

PB-AFN is built around an appearance flow warping module with two components: a dual-branch Pyramid Feature Extraction Network (PFEN) and an Appearance Flow Estimation Network (AFEN). PFEN extracts two branches of pyramid features across N levels, one from I𝒸 and the other from Sₚ and P. AFEN contains N flow networks, where the iᵗʰ network takes in the extracted pyramid features at the iᵗʰ level as well as the flow output of the (i-1)ᵗʰ flow network to produce a more refined flow. The cloth image is warped according to the Nᵗʰ (final) flow output.
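A rough sketch of this coarse-to-fine refinement loop is below, with toy stand-ins for the feature extractors and flow networks. The layer sizes, channel counts, and upsampling details are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative cascade of flow networks: each level refines the flow estimated
# at the previous (coarser) level using the pyramid features at its own scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

N = 5  # number of pyramid levels (assumed)

class FlowNet(nn.Module):
    """Toy flow estimator: predicts a 2-channel flow from concatenated features."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, feat_cloth, feat_person, prev_flow=None):
        x = torch.cat([feat_cloth, feat_person], dim=1)
        flow = self.conv(x)
        if prev_flow is not None:
            # Refine the upsampled coarser flow instead of starting from scratch
            # (the *2 rescales pixel offsets for the doubled resolution).
            flow = flow + 2 * F.interpolate(prev_flow, size=flow.shape[-2:],
                                            mode="bilinear", align_corners=False)
        return flow

# Pyramid features from coarsest (level 0) to finest (level N-1); 16 channels assumed.
feats_c = [torch.randn(1, 16, 8 * 2**i, 6 * 2**i) for i in range(N)]
feats_p = [torch.randn(1, 16, 8 * 2**i, 6 * 2**i) for i in range(N)]

flow_nets = [FlowNet(16) for _ in range(N)]
flow = None
for i in range(N):
    flow = flow_nets[i](feats_c[i], feats_p[i], flow)  # F_i
# `flow` now plays the role of F_N, used to warp the cloth image at full resolution.
```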

DCI-VTON utilizes the Total Variation Loss given by:
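Roughly, this is an L₁ penalty on the spatial gradients of the estimated flows (written here in its standard form):

$$
\mathcal{L}_{TV} = \sum_{i=1}^{N} \left\lVert \nabla F_i \right\rVert_1
$$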

Where Fᵢ is the iᵗʰ flow output. The total variation loss helps smooth the warped result despite the high degrees of freedom of the appearance flow mechanism.

To preserve the finer details of the cloth image such as repeating patterns or text, a Second-Order Smoothing Constraint is added to the loss function:
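Following the PF-AFN formulation, this constraint penalizes second-order differences of the flow, measured with a generalized Charbonnier function 𝒫:

$$
\mathcal{L}_{sec} = \sum_{i=1}^{N} \sum_{t} \sum_{\pi \in N_t} \mathcal{P}\left( F_i^{t-\pi} + F_i^{t+\pi} - 2F_i^{t} \right)
$$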

Where Fᵢᵗ is the tᵗʰ point in Fᵢ, Nₜ is the set of horizontal, vertical, and diagonal neighbors around the tᵗʰ point, and 𝒫 denotes the generalized Charbonnier loss function.

Additionally, the perceptual loss and L₁ loss are included, yielding the final warping loss function:
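In slightly simplified form, the overall warping objective combines, at each pyramid scale, an L₁ term and a VGG perceptual term between the warped cloth and the clothing region of the person, plus the two smoothness penalties above (a sketch of the structure; see the paper for the exact per-scale weighting):

$$
\mathcal{L}_{warp} = \sum_{i=1}^{N} \Big( \big\lVert W(M_c \odot I_c, F_i) - D_i(S_c \odot I_p) \big\rVert_1 + \lambda_{VGG} \sum_{m} \big\lVert \Phi_m\big(W(M_c \odot I_c, F_i)\big) - \Phi_m\big(D_i(S_c \odot I_p)\big) \big\rVert_1 \Big) + \lambda_{TV}\,\mathcal{L}_{TV} + \lambda_{sec}\,\mathcal{L}_{sec}
$$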

λ_VGG, λ_TV, and λ_sec are hyperparameters that dictate the contributions of the individual loss terms; DCI-VTON sets them to 0.2, 0.01, and 6 respectively in its experiments.

Here, M𝒸 denotes the mask of the cloth image I𝒸, S𝒸 is the clothes mask of the person image Iₚ, W is the warping function, D is the downsampling function, and Φₘ denotes the mᵗʰ feature map of a pre-trained VGG-19 network.

Diffusion Model

Given a person image Iₚ and a cloth image I𝒸, the aim of the diffusion model is to combine them into a realistic image that retains the person attributes of Iₚ while wearing the cloth from I𝒸.

Training pipeline for the Diffusion Model taken from Fig. 3 of https://arxiv.org/pdf/2308.06101.pdf

The diffusion model training pipeline described in the paper is divided into two branches, namely the reconstruction branch and the refinement branch, both of which are optimized simultaneously during training.

Reconstruction Branch

The aim of the reconstruction branch is to learn how to generate the complete person wearing the desired cloth starting from a Gaussian distribution. It operates like a standard diffusion model, employing a reverse diffusion process to generate realistic images. The authors use a Latent Diffusion Model (LDM), which operates in latent space rather than pixel space; this significantly reduces the computational cost of training and provides a natural way to add conditioning through mechanisms like cross-attention.

The ground truth image (I₀) is converted to a latent representation (z₀) using an Encoder, z₀ = E(I₀). The corresponding Decoder of the same Autoencoder can be used to reconstruct the pixel-space images from the latent representation. The main diffusion steps of noising and denoising are performed on this latent space representation.

The forward noising process is then performed on z₀ as per
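This is the standard DDPM/LDM forward process, under which the noised latent at step t is distributed as:

$$
q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\; \sqrt{\bar{\alpha}_t}\, z_0,\; (1-\bar{\alpha}_t)\mathbf{I}\right), \qquad \alpha_t = 1-\beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s
$$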

Here, 𝛽ₜ is a pre-defined variance schedule over 𝑇 steps.

In order to maintain the characteristics of the cloth image as well as information about the desired orientation of the cloth on the person, a local condition is provided as a guide at each denoising step. This local condition is created by adding the warped cloth to the (cloth-agnostic) inpainting image to obtain I_lc, which is transformed to the latent space as z_lc = E(I_lc). The denoising is done by an enhanced diffusion UNet which takes {z, z_lc, m} concatenated along the channel dimension as input. Here m is the inpainting mask spanning the torso and arms region; it tells the model which regions to focus the denoising on, allowing it to correct poor warping results and irregularities in areas where the cloth and body meet.

Moreover, a global condition c, obtained from I𝒸 via a frozen pre-trained CLIP image encoder, is incorporated into the UNet via cross-attention to emphasize the retention of cloth image features. The reconstruction branch therefore uses the following loss function:
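This is the usual noise-prediction objective of latent diffusion, written here with the conditioning inputs described above (a sketch of the form; the paper's notation may differ slightly):

$$
\mathcal{L}_{LDM} = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0,\mathbf{I}),\, t} \left[ \big\lVert \epsilon - \epsilon_\theta\left(z_t, z_{lc}, m, c, t\right) \big\rVert_2^{2} \right]
$$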

where ϵ_θ is the noise predicted by the model.
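As a purely illustrative sketch (the names, shapes, and the stand-in toy_unet are assumptions, not the authors' implementation), the conditioned input to the UNet is assembled like this:

```python
# The noised latent, the local-condition latent, and the inpainting mask are
# concatenated along the channel axis; the CLIP image embedding would enter the
# real UNet through cross-attention as the context.
import torch
import torch.nn as nn

B, H8, W8 = 1, 64, 48                     # latent spatial size (H/8, W/8 assumed)
z_t    = torch.randn(B, 4, H8, W8)        # noised latent z
z_lc   = torch.randn(B, 4, H8, W8)        # latent of warped cloth + inpainting image
mask   = torch.ones(B, 1, H8, W8)         # inpainting mask m (torso + arms)
clip_c = torch.randn(B, 1, 768)           # global CLIP image embedding c

unet_input = torch.cat([z_t, z_lc, mask], dim=1)   # 4 + 4 + 1 = 9 channels

# Stand-in for the 9-channel diffusion UNet: a single conv just to show shapes.
toy_unet = nn.Conv2d(9, 4, kernel_size=3, padding=1)
noise_pred = toy_unet(unet_input)         # the real model also consumes t and clip_c
print(noise_pred.shape)                   # torch.Size([1, 4, 64, 48])
```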

While the reconstruction branch can eventually generate a realistic image of the person with the general characteristics of the cloth, the refinement branch is required to capture the cloth's finer details.

Refinement Branch

The Refinement Branch helps capture details of the cloth such as thickness, the orientation and color of patterns, and their spatial arrangement. To do this, while ensuring that the final results conform to those generated by the reconstruction branch, the refinement branch uses the coarse result (the combination of the warped cloth Ĩ𝒸 and the cloth-agnostic image Iₐ) as its initial condition I₀′, along with the same local and global conditions used in the reconstruction branch.

The latent representation of I₀′ is obtained as z₀′ = E(I₀′), after which the forward noising process is applied to obtain the noised latent zₜ′. The UNet diffusion model then takes {zₜ′, z_lc, m} concatenated along the channel dimension as input and outputs the predicted noise ϵ̂. The refined denoised latent ẑ is obtained by inverting the noising equation with this predicted noise, and the pixel-space representation of the final predicted image is recovered as Î = Ɗ(ẑ), where Ɗ is the decoder.
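Inverting the forward-noising equation above with the predicted noise gives the usual one-step estimate of the clean latent (same ᾱ notation as before):

$$
\hat{z} = \frac{1}{\sqrt{\bar{\alpha}_t}} \left( z_t' - \sqrt{1-\bar{\alpha}_t}\, \hat{\epsilon} \right)
$$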

The loss function for the Refinement Branch is defined as:
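In essence it is a VGG-based perceptual loss between the decoded refinement output Î and the ground truth; a simplified sketch of that form is:

$$
\mathcal{L}_{perceptual} = \sum_{m} \big\lVert \Phi_m(\hat{I}) - \Phi_m(I_{gt}) \big\rVert_1
$$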

where I_gt denotes the ground truth image.

On combining the losses, the main objective function used for training the diffusion model becomes:
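Given the two branches above, the combined training objective takes the form:

$$
\mathcal{L} = \mathcal{L}_{LDM} + \lambda_{perceptual}\, \mathcal{L}_{perceptual}
$$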

where λ_perceptual is a hyperparameter used for balancing the two loss components.

Experiments

Dataset

The authors have performed extensive benchmarking of their proposed model on the VITON-HD dataset, which contains 13,679 frontal-view woman and top-clothing image pairs at a resolution of 1024×768.

Training the warping module

  1. 100 epochs using the Adam optimizer with a learning rate of 5×10⁻⁵.
  2. λ_VGG = 0.2, λ_TV = 0.01, λ_sec = 6.
  3. Trained at 256×192 resolution.

Training the diffusion model

  1. Latent space of size (H/f)×(W/f)×c, with channel dimension c = 4.
  2. AdamW optimizer with a learning rate of 1×10⁻⁵.
  3. λ_perceptual = 1×10⁻⁴.
  4. Trained on 2 NVIDIA Tesla A100 GPUs for 40 epochs.

Here are some of the impressive results that the authors of the paper have managed to achieve:

Taken from Fig. 5 of DCI VTON paper: https://arxiv.org/pdf/2308.06101.pdf

The above image shows how well the model fits the garment onto the person's original image while preserving the pose and the details of the garment.

The following image further highlights the model's ability to preserve finer garment details, like striped patterns, while warping the cloth to match the person's pose.

Taken from Fig. 4 of DCI VTON paper: https://arxiv.org/pdf/2308.06101.pdf

Special credits to Unnikrishnan Menon and Anirudh Rajiv Menon for compiling this article.

About Tryon Labs:

We are building Tryon AI, which empowers online apparel stores with Generative AI for virtual try-on and cataloging. Online fashion stores spend a lot on photographers, models, and studios to create catalogs, and online shoppers sometimes struggle to pick clothes that will look good on them.

Tryon AI cuts cataloging costs for online fashion stores and improves the shopping experience for customers, offering shoppers a seamless and immersive way to try on clothes online from the comfort of their homes.

Check out our open-source implementation of TryOnDiffusion: https://github.com/tryonlabs/tryondiffusion

Visit our website: https://www.tryonlabs.ai or contact us at contact@tryonlabs.ai

Join our discord server: https://discord.gg/FuBXDUr3
