Vestiaire Collective

How to create your own AI background removal tool?

Lessons learned from building a clipping engine internally

Aurélien Houdbert
Vestiaire Connected

--

Photo by ShareGrid on Unsplash

Introduction — Why you should clip images

Images are the heart of Vestiaire Collective. Thanks to images, potential buyers are able to assess the quality, color and condition of products. This makes image quality and consistency critical to user engagement and conversion rate.

When browsing our apps and website, you may have noticed that all main item pictures on Vestiaire Collective have no background: they are clipped with a third-party background removal tool. Some advantages of using a background removal tool include:

  • Luxury look and feel of the platform
  • The background becomes less distracting
  • Consistency & better image quality
  • Better assessment of the color and quality of the product
Products for sale on Vestiaire Collective’s website

While there are many existing providers in the market, building your own tool can save costs and provide a customized solution. This is what inspired us to explore how we could create our own background removal deep learning model. In this article, we will share our experience with building a background removal tool at Vestiaire Collective, including the challenges we faced and the solutions we found.

Understanding the basics of background removal — A quick review of available solutions

Removing the background from an image may seem like a simple task, but it is actually quite challenging. The background of an image can be composed of different objects, textures, and colors, and it can be difficult to distinguish between the background and the foreground objects (even for human eyes).

Background removal is a highly researched subject in AI and recently gained attention with the rise of deep learning models. We can distinguish two different types of approaches: Instance Segmentation and Salient Object Detection (SOD).

Instance Segmentation

Is a technique that specializes in identifying and labeling objects in an image, while also segmenting them from the background. Mask R-CNN is a good example of an instance segmentation network.

Mask R-CNN prediction example

Salient Object Detection (SOD)

Is a technique that identifies the most visually significant object(s) in an image and separates them from the background. The goal of Salient Object Detection is to highlight the most important parts of an image, which are typically the objects or regions that draw the viewer’s attention. UNet or U2Net are great examples of SOD networks.

U2Net prediction example

After a quick review of state-of-the-art models, we identified U2Net as the most effective approach for our use case. Indeed, U2Net (a SOD model) only needs precise segmentation masks and doesn't require object labels to segment images. SOD models were designed to accurately delineate foreground from background, unlike Instance Segmentation, which is optimized to locate objects in the image. We also wanted a model as general and flexible as possible, one able to handle labels unseen during training.

Collecting the data — Data pre-processing is your best friend

As mentioned in the previous section, in order to train U2Net properly, we need to build a dataset of images paired with their corresponding segmentation masks.

Original image and its corresponding segmentation mask

Good news: Vestiaire Collective has been clipping images for years with the help of a third party provider. This means we have access to an almost unlimited source of images. 🎉

Bad news: At Vestiaire Collective, all images are stored in JPEG format, and clipped images are cropped, centered, and saved on a white background.

So why is that an issue?

  • The white background makes it difficult to extract a clean segmentation mask from the clipped images. If you try to treat white pixels as background, you may end up with “holes” in the object you are trying to extract if it contains white parts (see the sketch after this list).
  • Because clipped images were cropped and centered, they are no longer aligned with the original image. To train U2Net, we need the mask and the image to be perfectly pixel-aligned.
  • JPEG images are compressed versions of the original images, which results in poor mask quality around edges.
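
To illustrate the first issue, here is a minimal sketch of the naive white-threshold approach (using OpenCV and NumPy; the file names are hypothetical), which creates holes wherever the product itself contains near-white pixels:

```python
import cv2
import numpy as np

# Load the clipped image (product on a white background).
# "clipped.jpg" is a hypothetical file name.
clipped = cv2.imread("clipped.jpg")

# Naive approach: every near-white pixel is treated as background.
# A pixel is "near-white" if all three channels exceed the threshold.
WHITE_THRESHOLD = 245
near_white = np.all(clipped >= WHITE_THRESHOLD, axis=2)

# Foreground mask: everything that is not near-white.
mask = (~near_white).astype(np.uint8) * 255

# Problem: white parts of the product (e.g. a white sole or a label)
# also pass the near-white test and appear as holes in the mask.
cv2.imwrite("naive_mask.png", mask)
```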

To solve these issues we had to go through two main pre-processing steps:

Image Stitching

Is the process of combining multiple overlapping images to create a single, wider image. This is achieved by identifying common features in the images and aligning them to create a seamless, panoramic view. In our case, the clipped image was directly extracted from the original picture, so this technique works well for realigning images.

Image stitching
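
Our production implementation isn't shown here, but a minimal feature-based realignment sketch with OpenCV (ORB features plus a RANSAC homography; file names are hypothetical) gives the general idea:

```python
import cv2
import numpy as np

# Hypothetical file names: the original photo and its clipped version
# (cropped, centered, white background).
original = cv2.imread("original.jpg", cv2.IMREAD_GRAYSCALE)
clipped = cv2.imread("clipped.jpg", cv2.IMREAD_GRAYSCALE)

# Detect and describe local features in both images.
orb = cv2.ORB_create(nfeatures=5000)
kp_orig, des_orig = orb.detectAndCompute(original, None)
kp_clip, des_clip = orb.detectAndCompute(clipped, None)

# Match descriptors and keep the best correspondences.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_clip, des_orig), key=lambda m: m.distance)[:200]

# Estimate the transform that maps the clipped image back onto the original.
src = np.float32([kp_clip[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_orig[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp the clipped image (or its mask) into the original image's frame.
h, w = original.shape
realigned = cv2.warpPerspective(clipped, H, (w, h))
```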

Mask Refinement

Is used to remove the artifacts arising from the white-background removal and the JPEG format. To remove these artifacts, we use a smoothing technique (Gaussian blur) combined with morphological transforms (a combination of dilation and erosion).

Segmentation mask artifacts
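
The exact parameters we used aren't reproduced here, but a minimal refinement sketch with OpenCV (kernel sizes and file names are illustrative assumptions) combines these operations as follows:

```python
import cv2

# "raw_mask.png" is a hypothetical extracted mask with JPEG / white-background
# artifacts (speckles, ragged edges, small holes).
mask = cv2.imread("raw_mask.png", cv2.IMREAD_GRAYSCALE)

kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))

# Closing (dilation then erosion) fills small holes inside the object;
# opening (erosion then dilation) removes isolated speckles in the background.
refined = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
refined = cv2.morphologyEx(refined, cv2.MORPH_OPEN, kernel)

# A light Gaussian blur followed by re-thresholding smooths the jagged edges
# left by JPEG compression.
refined = cv2.GaussianBlur(refined, (5, 5), 0)
_, refined = cv2.threshold(refined, 127, 255, cv2.THRESH_BINARY)

cv2.imwrite("refined_mask.png", refined)
```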

These techniques proved effective but were not sufficient to blindly recreate a clean dataset. About a third of the images were of poor quality, with “white holes” too large to be corrected by the mask-refinement step. In the end, we still needed to review our dataset manually.

Thanks to these methods, we were able to build a 5K-image dataset!

Note: Later in the project, we realized that this mask-refinement approach would not scale once we needed more data to reach even better performance. Our final dataset contains only PNG images coming from our manual clipping provider here at Vestiaire. Still, this first 5K JPEG dataset provided very encouraging results that gave us traction with stakeholders and helped kick-start the project.

Also, even though our final dataset was built from a different data source (PNG format with no background), we still used the stitching method to realign clipped images with the originals.

Building the model — U2Net

As mentioned earlier, the model we selected is U2Net. The paper and official code are publicly available.

We followed the implementation from the paper and used the exact same training settings. We trained the model from scratch on a dataset of 18,000 hand-curated images.
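
We won't reproduce the full training script here, but as a rough sketch of that setup (assuming the official U-2-Net repository is on the Python path, which provides the U2NET class; the dummy data loader is purely illustrative), the core of the training loop looks like this:

```python
import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import DataLoader, TensorDataset

from model import U2NET  # assumes the official U-2-Net repository layout

net = U2NET(in_ch=3, out_ch=1)
optimizer = Adam(net.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0)
bce = nn.BCELoss(reduction="mean")

def multi_bce_loss(outputs, target):
    # U2Net returns the fused map plus six side outputs; the training loss is
    # the sum of the BCE of every output against the same ground-truth mask.
    return sum(bce(out, target) for out in outputs)

# Dummy (image, mask) batches for illustration; the real dataset consists of
# 320x320 crops of our images and their realigned segmentation masks.
images = torch.rand(8, 3, 320, 320)
masks = torch.rand(8, 1, 320, 320).round()
loader = DataLoader(TensorDataset(images, masks), batch_size=4)

for image, mask in loader:
    optimizer.zero_grad()
    outputs = net(image)                  # tuple (d0, d1, ..., d6) of sigmoid maps
    loss = multi_bce_loss(outputs, mask)
    loss.backward()
    optimizer.step()
```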

When building the model, we made several general observations:

  • Better dataset quality led to better results than a bigger dataset of lesser quality. Of course, more data of better quality would result in even greater performance!
  • The model has “only” 40 million parameters. The pre-trained model provided by the research team was trained on a dataset of 10,000 images. Retraining from scratch is a viable option if you have enough training data. It is particularly useful if your custom dataset differs significantly from the pre-training dataset (which was our case: our data consists of fashion items, while the pre-training data is DUTS-TR).
    e.g., in our case, we wanted to remove human body parts from images, but the pre-training dataset included numerous examples of humans treated as foreground. When we fine-tuned the pre-trained model on our custom dataset, this conflict resulted in undetermined regions, making the background removal less accurate.

Evaluating the model — A tricky task

At Vestiaire Collective, all images are manually reviewed, and clipped images are visually checked by a human. A certain proportion of the images clipped by our third-party provider does not pass this manual check. Our goal with this project is to be at least as good as the current system, but more economical.

So here comes the difficult part of the project. Background removal quality strongly relies on visual criteria, and it is very difficult to find a metric that perfectly reflects the current rejection rate. Usual quality metrics such as f1-score, Dice coefficient, or IoU are not always sufficient to assess the overall quality of the model: the human rejection rate doesn't correlate well with these classic segmentation metrics.

A much more useful metric is the “relax-f1”, used in the original U2Net paper. This metric is nothing more than an f1-score computed only on the edges of the clipped object. It is particularly effective because the visual quality of a clipping mostly comes from the quality of its edges.

relax-f1 visualisation
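
The exact implementation from the paper isn't reproduced here, but a minimal sketch of such a boundary-restricted f1-score, using OpenCV (the tolerance radius rho and the file names are illustrative assumptions), could look like this:

```python
import cv2
import numpy as np

def boundary(mask: np.ndarray) -> np.ndarray:
    """Extract a thin boundary from a binary (0/1) mask."""
    kernel = np.ones((3, 3), np.uint8)
    eroded = cv2.erode(mask, kernel, iterations=1)
    return mask - eroded

def relax_f1(pred: np.ndarray, gt: np.ndarray, rho: int = 3) -> float:
    """Relaxed boundary f1: a boundary pixel counts as correct if it lies
    within `rho` pixels of a boundary pixel of the other mask."""
    pred_b, gt_b = boundary(pred), boundary(gt)
    kernel = np.ones((2 * rho + 1, 2 * rho + 1), np.uint8)
    gt_zone = cv2.dilate(gt_b, kernel)      # tolerance band around ground-truth edges
    pred_zone = cv2.dilate(pred_b, kernel)  # tolerance band around predicted edges

    precision = (pred_b * gt_zone).sum() / max(pred_b.sum(), 1)
    recall = (gt_b * pred_zone).sum() / max(gt_b.sum(), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example with binary masks loaded as 0/1 uint8 arrays (hypothetical files).
pred = (cv2.imread("pred_mask.png", cv2.IMREAD_GRAYSCALE) > 127).astype(np.uint8)
gt = (cv2.imread("gt_mask.png", cv2.IMREAD_GRAYSCALE) > 127).astype(np.uint8)
print(relax_f1(pred, gt, rho=3))
```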

In order to assess the production performance of the model and compare it against our third-party provider, we use a dataset of 1,000 images manually reviewed by our curation agents. This process is long and costly and prevents quick iterations.

Metrics such as relax-f1 were a powerful way to carefully tune and select the best models before sending them for manual performance evaluation.

Post-processing to enhance and refine model predictions

The model choice, data quality, and training strategy are all very important, but in the end, what differentiates a bad clipping from a good one is the post-processing.

The most important thing to understand here is that the quality metrics discussed in the previous section are not impacted by our post-processing step. This means that even if those metrics indicate good results, there is still a chance that the output will not look visually appealing, which is what our post-processing strategy tries to solve.

To understand post-processing, it helps to know that the model (in our case, U2Net) outputs a per-pixel probability map at a low resolution (320 × 320 px). When upscaling this map to the original image's dimensions, even minor errors and inconsistencies become more apparent and visually disturbing. For example, a blurry region in the predicted foreground can appear much larger when upscaled, leading to poor visual quality, particularly around the edges of the clipping.

Upscaling blurriness effect on the segmentation prediction

The edges are regions of uncertainty, with a smooth gradient between 0 and 1: this is a natural consequence of the model's attempt to delineate the background from the foreground. In our example, we can also notice a blurry region at the bottom of the shoe, which can lead to unwanted effects if left unaddressed.

To resolve this, we can try to binarize the map to get rid of these blurry areas (notice how the bottom of the sole was corrected).

Binarization of the segmentation prediction

However, this method may not be entirely sufficient. Although the edges are now better defined, the smooth transition between the background and foreground has been lost, resulting in a stair-step effect that is less visually appealing.

We need to take two additional steps to address this issue: blurring the mask and stretching it. The blurring will reintroduce a smooth transition between the background and foreground. However, this approach can result in too much blurring, which can further degrade the image. To overcome this problem, we can use a linear stretching step to reduce the blurring radius and achieve smooth, steep, visually appealing edges.
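
Here is a minimal sketch of that post-processing chain (the threshold, blur radius, and stretching bounds are illustrative assumptions, not our production values):

```python
import cv2
import numpy as np

def postprocess(prob_map: np.ndarray, out_size: tuple,
                threshold: float = 0.5, blur_radius: int = 5,
                low: float = 0.25, high: float = 0.75) -> np.ndarray:
    """Turn the low-resolution probability map (320x320, values in [0, 1])
    into a full-resolution alpha mask."""
    # 1. Upscale to the original image resolution (out_size is (width, height)).
    mask = cv2.resize(prob_map, out_size, interpolation=cv2.INTER_LINEAR)

    # 2. Binarize to remove blurry, uncertain regions.
    mask = (mask > threshold).astype(np.float32)

    # 3. Blur to reintroduce a smooth foreground/background transition
    #    (and remove the stair-step effect of binarization).
    mask = cv2.GaussianBlur(mask, (2 * blur_radius + 1, 2 * blur_radius + 1), 0)

    # 4. Linear stretching: map [low, high] to [0, 1] and clip, which keeps
    #    the transition smooth but steep so edges stay crisp.
    mask = np.clip((mask - low) / (high - low), 0.0, 1.0)
    return mask

# Usage: `prob_map` would be the 320x320 output of U2Net for one image,
# and out_size the original image's dimensions.
prob_map = np.random.rand(320, 320).astype(np.float32)
alpha = postprocess(prob_map, out_size=(1024, 1024))
```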

Results and Lessons Learned

Our background removal tool has achieved an impressive level of performance, better than our current third-party provider, while significantly reducing costs.

Clipping engine results on fashion items

One of the biggest challenges we faced during this project was obtaining sufficient quantities of high-quality data. We spent a huge amount of time searching for suitable data sources in our database. However, even with limited available data, we were able to jump-start the project and generate interest from stakeholders. This traction then enabled us to access more and better-quality data, unlocking budgets and resources.

Final considerations

Building our own background removal tool internally helped us drastically reduce image curation costs. In this project, data was the key to success.

Although the model itself (U2Net) greatly impacts the clipping quality of your tool, keep in mind that post-processing must not be ignored, as it may be what gets you to perfect results.
