A Journey through Warping Modules in Virtual Try-on: From Explicit to Implicit

Muneeb Mushtaq
TryOn Labs
Apr 10, 2024

Introduction:

The problem of Virtual Try-On has recently come into the limelight, especially with the rise of diffusion models. Much research has been conducted with different approaches, some using Generative Adversarial Networks (GANs) and others diffusion models, both claiming to solve the problem of Virtual Try-On. Most of these models are trained on a popular dataset called VITON-HD as a base dataset. The objective of this write-up is to provide a technical explanation, to compare the different approaches along with their merits and demerits, and, lastly, to give an overview of the Virtual Try-On paradigm and the research being conducted in it.

Overview:

There are multiple research papers that have shown promising results, both on paper and in practice. One such paper is Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow (11th August 2023), which uses a diffusion model to tackle the problem of Virtual Try-On. It proposes a model called DCI, trained on the VITON-HD dataset, whose results surpass previously proposed methods. The comparison below illustrates the claims.

Fig. 1. Result comparison between various methods. (Source)

From the above, we can see that there are mainly two problems that arise in Virtual Try-On: warping the garment properly and preserving the texture of the garment. The image above is taken from Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow itself.

GAN-based models preserve the exact texture and text on the garment but struggle to generate occluded areas, such as unseen segments of garments and human body parts. Diffusion-based inpainting methods, on the other hand, can preserve the garment texture to some extent but struggle with colour accuracy and warping. Hence, DCI comes out the winner in such a case, as it is able to warp the garment without distorting its texture.

Warping Modules

1. DCI Warping Module

The DCI model proposed in that paper utilizes a specialized architecture to handle the complexities of virtual try-on by dividing the process into distinct modules. The first is the Warping Module, which focuses on manipulating the garment image to fit the target pose of the person: it takes the original garment and warps it to align with the person's body shape and pose. After the garment has been successfully warped, it is passed on to the second module.

The second module is the Refinement Branch, where the warped garment undergoes further processing to ensure a natural and seamless integration onto the person. This involves inpainting and refinement using a Diffusion-based Model to address any imperfections or mismatches between the garment and the person’s body. The combined use of these two modules enables DCI to effectively address the challenges of garment warping and blending, resulting in high-quality virtual try-on outcomes.
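To make the two-stage idea concrete, here is a minimal PyTorch sketch of a warp-then-refine pipeline in the spirit of DCI. The warper and refiner callables, their signatures, and the compositing step are illustrative assumptions, not the paper's actual code.

```python
import torch.nn.functional as F

# A minimal sketch of a DCI-style warp-then-refine pipeline.
# `warper` and `refiner` stand in for full networks; their signatures
# are assumptions made for illustration.

def warp_garment(image, grid):
    """Warp an image with a predicted appearance-flow sampling grid.

    image: (B, C, H, W) tensor
    grid:  (B, H, W, 2) sampling grid in normalized [-1, 1] coordinates
    """
    return F.grid_sample(image, grid, align_corners=False)

def try_on(garment, garment_mask, person, warper, refiner):
    # Stage 1 (Warping Module): predict a dense flow field that aligns
    # the flat garment with the target body shape and pose.
    grid = warper(garment, person)              # assumed output: (B, H, W, 2)
    warped = warp_garment(garment, grid)
    warped_mask = warp_garment(garment_mask, grid)

    # Coarse composite: paste the warped garment onto the
    # clothing-agnostic person image.
    coarse = warped * warped_mask + person * (1.0 - warped_mask)

    # Stage 2 (Refinement Branch): a diffusion-based model inpaints
    # seams, shading, and regions the warp could not cover.
    return refiner(coarse, warped)
```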

Fig. 2. Working of the DCI network (Source)

2. KGI Warping Module

The KGI (Keypoint Guided Inpainting) model presents an innovative approach to the virtual try-on problem by introducing a Warping Module that addresses the limitations of traditional thin-plate spline (TPS) transformation techniques. Unlike TPS, which often leads to severe distortion, especially in areas of overlap, KGI’s warping module strategically divides the garment into five sections. These sections are then individually adjusted and reassembled to create a warped garment image that preserves the integrity of the original garment, avoiding the common issues of distortion, blurring, and loss of detail.

This approach is particularly effective in handling complex areas such as cuffs and necklines, where TPS transformations typically falter. The Warping Module operates in tandem with a graph convolution network that includes two streams: a main stream that processes the garment graph and a side stream that embeds pose information. This structure enables the KGI model to achieve a high level of precision and quality in the virtual try-on process.

Fig. 3. Warping in KGI (Source)

Therefore, we can conclude that such an approach does in fact yield better results. KGI claims to fill a gap left by thin-plate spline (TPS) transformation, which suffers from severe distortion at overlapping parts, with further distortions at the cuff or neckline; the blending procedure may also blur the garment image or lead to the loss of pattern details. We can see the difference in the figure below.

Fig. 4. Occlusion handling in KGI warping module (Source)

As we can see from the figure, TPS distorts areas with occlusion, whereas the warping module of KGI splits the garment into five parts and recombines them to form the warped garment, preventing the distortion.
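To make the split-and-recombine idea concrete, here is a minimal PyTorch sketch: the garment is isolated into parts with binary masks, each part is warped with its own sampling grid, and the warped parts are composited back together. The part masks and per-part grids are stand-ins; KGI's actual implementation differs in detail.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch of part-based warping: warp each garment part
# independently, then recombine. Inputs are assumed, not KGI's real API.

def warp_by_parts(garment, part_masks, part_grids):
    """garment:    (B, 3, H, W) garment image
    part_masks: list of (B, 1, H, W) binary masks, one per part
                (e.g. torso, sleeves, collar, hem in a 5-part split)
    part_grids: list of (B, H, W, 2) sampling grids, one per part
    """
    out = torch.zeros_like(garment)
    for mask, grid in zip(part_masks, part_grids):
        part = garment * mask                                   # isolate one part
        warped_part = F.grid_sample(part, grid, align_corners=False)
        warped_mask = F.grid_sample(mask, grid, align_corners=False)
        # Composite: later parts overwrite earlier ones where they overlap,
        # avoiding the smearing a single global TPS warp produces.
        out = warped_part * warped_mask + out * (1.0 - warped_mask)
    return out
```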

Fig. 5. How pose keypoints are transformed using graph convolution layers (Source)

Fig. 5 shows how the warping module works in more depth. The network consists of two graph convolution streams: the main stream takes the garment graph of a given garment image as input and outputs the garment graph at the target pose, while the side stream takes the pose graph as input, hierarchically embeds the pose information, and provides conditions to the main stream.
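As a toy illustration of this two-stream design, the sketch below conditions a garment-keypoint stream on a pose-keypoint stream at every layer. The feature dimension, depth, and fusion rule (adding a pooled pose embedding) are assumptions made for illustration, not KGI's actual configuration.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """One graph convolution layer over keypoint nodes."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (B, N, dim) node features; adj: (N, N) normalized adjacency.
        return torch.relu(self.linear(adj @ x))

class TwoStreamGCN(nn.Module):
    def __init__(self, dim=64, depth=3):
        super().__init__()
        self.main = nn.ModuleList([GraphConv(dim) for _ in range(depth)])
        self.side = nn.ModuleList([GraphConv(dim) for _ in range(depth)])
        self.head = nn.Linear(dim, 2)   # predict 2-D keypoint positions

    def forward(self, garment_nodes, pose_nodes, adj_g, adj_p):
        g, p = garment_nodes, pose_nodes
        for main_layer, side_layer in zip(self.main, self.side):
            p = side_layer(p, adj_p)    # side stream: embed pose hierarchically
            # Main stream: update garment nodes, conditioned on a pooled
            # pose embedding (a simple fusion rule chosen for illustration).
            g = main_layer(g, adj_g) + p.mean(dim=1, keepdim=True)
        return self.head(g)             # garment keypoints at the target pose

model = TwoStreamGCN()
garment = torch.randn(1, 32, 64)        # 32 garment keypoints, 64-d features
pose = torch.randn(1, 18, 64)           # 18 body keypoints
target_kpts = model(garment, pose, torch.eye(32), torch.eye(18))  # (1, 32, 2)
```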

3. TryOnDiffusion Warping Module

TryOnDiffusion introduces a novel concept in the realm of virtual try-on technology with its parallel UNet architecture. This architecture eschews the conventional two-step process of warping the garment to the target body followed by blending. Instead, it integrates both steps into a single pass through the use of two UNets: one for the garment and another for the person.

The garment-UNet focuses on processing the segmented garment image, stopping after a specific upsampling block to reduce parameter count. Meanwhile, the person-UNet handles the more complex integration of the garment with the target person: it receives a clothing-agnostic RGB image and a noisy image, concatenated along the channel dimension, since the two are pixel-wise aligned.

Both UNets leverage separate pose embeddings, which are then fused into the person-UNet through an attention mechanism. These embeddings, along with feature modulation across various scales via FiLM and the incorporation of positional encoding and noise augmentation, enable TryOnDiffusion to perform implicit warping and blending. This unified approach allows for a more streamlined and efficient process, delivering high-quality virtual try-on results without the need for explicit flow computation.

Fig. 6. The Parallel-UNet concept, first introduced in TryOnDiffusion by Google, which handles warping implicitly (Source)
  • The system utilises two UNets, one for handling the garment and the other for the person.
  • The person-UNet receives inputs of clothing-agnostic RGB Ia and noisy image zt, concatenated at the channel dimension due to pixel-wise alignment.
  • The garment-UNet takes the segmented garment image Ic as input and stops after the 32×32 upsampling block to save parameters.
  • Pose embeddings for both person and garment are computed separately and fused into the person-UNet using an attention mechanism.
  • Pose embeddings are also used to modulate features across all scales using FiLM, along with positional encoding of diffusion time step and noise augmentation levels.
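The toy PyTorch sketch below shows the core of this idea: instead of concatenating a pre-warped garment, the person stream attends over garment features through cross-attention, which is where the implicit warping happens. The shapes, dimensions, and fusion placement are illustrative assumptions, not TryOnDiffusion's actual configuration.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse garment-UNet features into the person-UNet via cross-attention."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, person_feat, garment_feat):
        # person_feat, garment_feat: (B, C, H, W) feature maps at the
        # same resolution inside the two UNets.
        b, c, h, w = person_feat.shape
        q = person_feat.flatten(2).transpose(1, 2)    # (B, H*W, C) queries
        kv = garment_feat.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values
        # Every person location attends over all garment locations; this
        # learned correspondence is the "implicit warping".
        out, _ = self.attn(q, kv, kv)
        return (q + out).transpose(1, 2).reshape(b, c, h, w)

fusion = CrossAttentionFusion(dim=64)
person_feat = torch.randn(1, 64, 16, 16)
garment_feat = torch.randn(1, 64, 16, 16)
fused = fusion(person_feat, garment_feat)             # (1, 64, 16, 16)
```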

Comparison:

The difference between the approaches lies mainly in the garment warping, except in the TryOnDiffusion paper, which combines both processes in a single pass.

Both DCI and KGI use separate models to handle warping the garment and putting it on the person, which seems a logically appealing approach. Both maintain the garment texture and warp quality this way, but DCI suffers from issues generating the neck area and the bottom end of garments, as it is not trained on such data by default, and it also struggles with occluded garments. This is where KGI comes into the picture: it uses a better warping module and technique to handle occluded garments, as shown in Fig. 4 above.

Now, if we look at TryOnDiffusion, it blends the two processes, as shown in Fig. 6. During preprocessing, the target person is segmented out of the person image, creating a “clothing-agnostic RGB” image; the target garment is segmented out of the garment image; and pose is computed for both the person and garment images. These inputs are fed into the 128×128 Parallel-UNet (the key contribution) to create a 128×128 try-on image, which is then passed to the 256×256 Parallel-UNet together with the try-on conditional inputs. The output of the 256×256 Parallel-UNet is sent to a standard super-resolution diffusion model to create the final 1024×1024 image; the 256×256 Parallel-UNet itself is similar to the 128×128 one.
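The high-level sketch below traces this cascade. Every argument is a placeholder callable standing in for a full model or preprocessing step; the names are illustrative, not the paper's API.

```python
# A high-level sketch of the TryOnDiffusion cascade. All callables are
# placeholders for full models; signatures are assumptions for illustration.

def tryon_cascade(person_img, garment_img, preprocess,
                  parallel_unet_128, parallel_unet_256, sr_diffusion):
    # Preprocessing: clothing-agnostic RGB image, segmented garment, and
    # pose keypoints computed for both the person and the garment image.
    agnostic, garment, person_pose, garment_pose = preprocess(person_img, garment_img)
    conditions = (agnostic, garment, person_pose, garment_pose)

    # Stage 1: the 128x128 Parallel-UNet produces the base try-on image.
    tryon_128 = parallel_unet_128(conditions)

    # Stage 2: the 256x256 Parallel-UNet refines it, conditioned on the
    # same try-on inputs plus the stage-1 output.
    tryon_256 = parallel_unet_256(tryon_128, conditions)

    # Stage 3: standard super-resolution diffusion, 256x256 -> 1024x1024.
    return sr_diffusion(tryon_256)
```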

Reason Behind Parallel UNet:

Even though the latest clothing try-on models are successful, they still use a typical UNet design with channel-wise concatenation for combining the conditioning images. This works well for tasks like super-resolution, inpainting, and colourization, where input and output are pixel-wise aligned. However, it works less well for try-on, because fitting a garment involves complex spatial changes like stretching the clothing. To tackle this, the TryOnDiffusion authors propose a new design called Parallel-UNet, made specifically for trying on clothes, in which the clothes are warped implicitly using cross-attention.
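The toy comparison below illustrates the difference: channel-wise concatenation fuses features only at the same spatial location, so it assumes the garment is already aligned with the person, while cross-attention lets every person position pull information from any garment position.

```python
import torch
import torch.nn as nn

person = torch.randn(1, 64, 16, 16)    # person-stream features
garment = torch.randn(1, 64, 16, 16)   # garment-stream features

# (a) Channel-wise concatenation: position (i, j) of the person only ever
# mixes with position (i, j) of the garment, so the garment must already
# be spatially aligned (fine for super-resolution, bad for try-on).
concat_fused = nn.Conv2d(128, 64, kernel_size=1)(torch.cat([person, garment], dim=1))

# (b) Cross-attention: each person position attends over *all* garment
# positions, so an unwarped garment can still be matched to the body.
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
q = person.flatten(2).transpose(1, 2)    # (1, 256, 64)
kv = garment.flatten(2).transpose(1, 2)  # (1, 256, 64)
attn_fused, _ = attn(q, kv, kv)          # (1, 256, 64)
```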

Muneeb Mushtaq

Deep Learning Engineer, TryOn Labs

About Tryon Labs:

We are building Tryon AI that empowers online apparel stores with Generative AI for Virtual Try-On and Cataloging. Online fashion stores spend a lot on photographers, models, and studios to create catalogs. Online shoppers sometimes struggle to pick clothes that will look nice on them.

Tryon AI cuts cataloging costs for online fashion stores and improves the shopping experience for customers: stores can offer shoppers a seamless and immersive way to try on clothes online from the comfort of their homes.

Check out our open-source implementation of TryOnDiffusion: https://github.com/tryonlabs/tryondiffusion

Visit our website: https://www.tryonlabs.ai or contact us at contact@tryonlabs.ai

Join our discord server: https://discord.gg/FuBXDUr3
