How we remove the background from product images at BestPrice.gr

Alexandros Chariton
Published in BestPrice Dev · Jul 11, 2022


Product images convey information about a product with little processing effort on the consumer’s end. Our users cannot physically interact with the product, so there is an obvious need for visual descriptors whose primary job is to provide accurate information about it. Appealing images take it a step further: they create a positive first impression that attracts the interest of potential customers.

People visit BestPrice.gr not only to buy a product they already have in mind, but also to browse. If someone is looking to buy a laptop, for example, they may scroll through hundreds of different laptops, stopping for nothing more than a few seconds before moving on. This behavior is common among e-commerce users, and when it is exhibited the experience can be quite poor if certain criteria are not met.

Some of these criteria concern each product image in isolation, for example its quality or brightness, but there is also the criterion of consistency. It can be difficult for my brain to process so many images at this rate when there is little to no connection between them, except perhaps the fact that they all describe products of the same category. This is most apparent in fashion categories, where the visual signal is much more valuable than for, say, laptops. I find myself spending more time scrolling through our jackets when one image shows just the jacket and the product next to it shows a model wearing it. To this end, a common practice in e-commerce is to convert everything to a similar format, ideally with a white background, so that the product in question is highlighted. This is where the background removal tool comes in: it accepts an image of arbitrary size as input and outputs the same image with a white background.

Removal

Computer vision is a field of Computer Science revolving around processing images and videos. Before moving on, it’s important to describe how an image is represented in this field. Each image is associated with two numbers, its height and width, measured in pixels. An image of height 100 and width 100 has a total of 100 * 100 = 10,000 pixels and, since we are interested in color images, we need to associate each pixel with a color. Usually we represent colors using 3 numbers, their so-called RGB values (from Red, Green, Blue), where each number denotes the color intensity on the respective channel and lies between 0 and 255. With this, an image is an element of a real-valued vector space of dimensionality (Height x Width x 3), so a colored image of 100 pixels height and 100 pixels width corresponds to a total of 3 * 10,000 = 30,000 numbers. Expressing images as vectors allows for mathematical operations such as matrix multiplications.
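As a quick illustration of this representation, here is a minimal NumPy sketch of a 100x100 RGB image; arrays like this are what most computer vision libraries operate on:

```python
import numpy as np

# A 100x100 RGB image is a (height, width, 3) array of 8-bit intensities.
height, width = 100, 100
image = np.zeros((height, width, 3), dtype=np.uint8)

# Paint the top-left quadrant pure red (R=255, G=0, B=0).
image[:50, :50] = [255, 0, 0]

print(image.shape)  # (100, 100, 3)
print(image.size)   # 30000 numbers in total, as computed above
```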

The background of an image is very tricky to define. There is no universal definition for this concept, and even humans may disagree on what constitutes the background of a specific image. Saliency masks are grayscale images that work as probability heatmaps, indicating the pixels within an image that an observer’s eyes are most likely to fall on when encountering the visual stimulus. These masks can be produced by showing images to groups of people while their gaze is recorded with eye-tracking equipment. The highlighted parts of these masks can be thought of as the foreground of the images in question, as we would expect attracting one’s first glance to be a property of the foreground. This correlation implies that producing such a heatmap for an image might work for our task at hand. Datasets with pairs of images and annotated masks already exist, for example the DUTS Image Dataset, so it looks like there is ground for supervised learning to take over.

Example of an image and its saliency mask. Notice the uncertainty for the stool, which is depicted as grayscale.
Example of a product image with ambiguous background.

Supervised learning works by providing pairs of input and desired output (ground truth) to a machine learning model, so that it can correct itself until it predicts the output for a given input sufficiently well. This is a simplistic explanation, but the point is that the mechanism connecting the input to the desired output cannot be identified and hardcoded, with if statements for example; providing a large number of inputs and outputs might lead to the model figuring out this connection (if possible, because sometimes it just isn’t). This procedure is called training, because an ML model is trained to adapt to our dataset, and it is typically expensive in terms of computational resources. It requires a large number of floating point operations (FLOPs), so GPUs or TPUs are often employed to speed up the process by parallelizing a large percentage of these operations.

A neural network (NN) is undoubtedly the best ML algorithm choice here, as these models have provided astonishing results in computer vision, far better than the rest. A characteristic of this family of algorithms is that there is a very large number of user-defined hyperparameters, such as the network architecture, that need to be set and greatly affect performance. There is no universal recipe for setting them; it is highly problem-dependent and requires trial and error. Research in this field greatly reduces the search space for us, as there are many published works on image segmentation with saliency masks. We base our work on “U²-Net: Going Deeper with Nested U-Structure for Salient Object Detection”, which proposes a network with a “U structure” for this very task and also provides a pretrained network we can use on the spot. This network was trained on the DUTS dataset, and examples of input-output pairs for product images can be seen below.

Left: Original image, the network’s input. Right: Saliency mask, the ground truth of the output

Producing saliency masks with this pretrained network requires two things:

1) The network details

These are the tunable parameters (the network weights) that have already been tuned by the researchers, following the training procedure of feeding examples to the algorithm described above. We also need the architecture of the model, a network graph that works as the skeleton of the machine learning model and connects those weights.

2) The input details

This is also available and is something all ML algorithms require. Such details involve input transformations like image resizing, min-max scaling or standardization. To run inference on a new image we must convert it with the same transformations the researchers used when they trained the model, otherwise we will not get the intended output. All of the above is available on their GitHub page, so we can start building the background removal tool. The code is available in Python, the most popular language for machine learning applications, and it uses a library called PyTorch, which along with TensorFlow is one of the main tools used to develop NNs.
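To make this concrete, here is a rough sketch of how loading the pretrained network and preparing an input might look. The import path, the `U2NET(3, 1)` constructor, the weights file name and the normalization statistics are assumptions based on the authors’ repository and should be checked against it:

```python
import torch
from torchvision import transforms
from PIL import Image

# Assumes the U^2-Net code and weights from the authors' GitHub repo are
# available locally; module path, class name and file names are illustrative.
from model import U2NET

net = U2NET(3, 1)                       # 3 input channels (RGB), 1 output channel (mask)
net.load_state_dict(torch.load("u2net.pth", map_location="cpu"))
net.eval()

# Input transformations: resize to the fixed 320x320 input size and
# standardize with the statistics used during training (ImageNet-style
# mean/std here as an assumption; use whatever the repo specifies).
preprocess = transforms.Compose([
    transforms.Resize((320, 320)),
    transforms.ToTensor(),              # scales intensities to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("product.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 320, 320)

with torch.no_grad():
    fused, *side_outputs = net(batch)   # the first output is the fused saliency map
    mask = fused.squeeze().numpy()      # 320x320 map of foreground scores; the authors
                                        # also min-max normalize it to [0, 1]
```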

We can set up a pipeline that accepts an image as input, applies the required transformations and outputs the saliency mask of the image. One important detail is that the network’s output is fixed in size: it is always a 320x320 grayscale image (1 channel instead of the 3 in RGB images), which is black for pixels predicted as background, white for pixels predicted as foreground and gray for more uncertain predictions. This requires handling, because we need a way to deduce the background of an image of arbitrary size using a fixed-size heatmap. We could either resize the original input image to 320x320 or resize the saliency mask to fit the original image. The benefit of the latter is that we end up with a white-background image of the same size as the original, so we will move forward only considering this option. When it comes to resizing images, options such as nearest-neighbor interpolation or linear interpolation are the most natural.

Usually the input images are larger than 320x320, so we need to upsample the mask to match their dimensions. Nearest-neighbor interpolation expands the image by assigning the intensity value of a pixel to its new neighboring pixels until the appropriate size is reached. So, in our case, if a pixel located at the center of the image is predicted as foreground with a probability of 0.9, the pixels around it in the upscaled mask would also be foreground with the same probability. Linear interpolation works similarly, except that it expands the image by estimating the missing intensity values with first-degree polynomials. In the image context, bilinear interpolation is used, meaning the missing intensities are calculated along both directions across the image.
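A sketch of this upscaling step with OpenCV; the function name is ours, and `cv2.resize` supports both interpolation modes discussed above:

```python
import cv2
import numpy as np

def upscale_mask(mask: np.ndarray, original: np.ndarray) -> np.ndarray:
    """Resize the 320x320 saliency map to the original image's size.

    mask: 320x320 float array in [0, 1], original: (H, W, 3) uint8 array.
    """
    h, w = original.shape[:2]
    # cv2.resize expects the target size as (width, height).
    # INTER_NEAREST copies the nearest known value; INTER_LINEAR (bilinear
    # for images) interpolates along both axes, giving smoother borders.
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_LINEAR)
```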

Interpolation examples. Rounding is used for Bilinear interpolation

Producing these masks is the most important part of background removal, but we’re not there yet. The actual removal still has to take place, and so far what we have is the input image and a foreground probability for each pixel.

The steps from here on are more straightforward. We could, for example, multiply these probabilities with the pixel intensities to produce the “expected image foreground”, which would have a soft transition from the foreground to the white background. This might not be desirable in most cases; we may want to capitalize on a high-contrast border around the product to highlight the product itself. This can be done by converting the probability heatmap into a map of 1s and 0s, where 1 denotes that a pixel belongs to the foreground and 0 otherwise. We can set a threshold, 0.7 for example, keep only pixels with probabilities larger than this threshold and convert everything else to white. A side benefit of converting these scores to binary decisions is that it tends to suppress noisy predictions. Arguably the most difficult part of the task is estimating accurate scores for each pixel, so if, for example, a neighborhood of pixels is assigned scores that fluctuate around 0.8, everything in it is treated as foreground regardless of the fluctuations. Ideally the network would learn that such large fluctuations are rarely justified, but it is still a good idea to take some burden off the most difficult component of the mechanism.

Of course there are alternatives to binary thresholding (such as Otsu’s method), but a fixed threshold is usually good enough for the task. In our app, the threshold can be provided as a user parameter, so one might attempt removal with a large threshold and gradually decrease it if the removal is too strict. For this part we can use OpenCV, a reliable Python library for computer vision tasks, or Pillow, which is also quite popular in the field.
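Putting the last two paragraphs together, a minimal sketch of the compositing step could look as follows; the function names are ours, not a library API:

```python
import numpy as np

def remove_background(image: np.ndarray, mask: np.ndarray, threshold: float = 0.7) -> np.ndarray:
    """Replace pixels predicted as background with white.

    image: (H, W, 3) uint8 array, mask: (H, W) float array in [0, 1],
    threshold: user-provided cut-off; lower values keep more pixels.
    """
    foreground = mask > threshold            # binary decision per pixel
    result = np.full_like(image, 255)        # start from an all-white canvas
    result[foreground] = image[foreground]   # copy the product pixels over
    return result

def soft_remove(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """The soft alternative: blend towards white using the raw probabilities."""
    alpha = mask[..., None]                  # (H, W, 1), broadcast over channels
    return (alpha * image + (1 - alpha) * 255).astype(np.uint8)
```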

The difficult part of the removal is now done. What’s left is the app part: we need to expose this functionality as an API endpoint, so that our Content team can send their images and get the white-background versions back. We will not dive deep into this part as there are numerous tutorials describing how to build such a service; in Python, libraries such as Flask or FastAPI are excellent choices. It is also worth mentioning that there are other ways to improve the removal quality with basic image processing. For example, we found that for product images it is usually a good idea to heavily sharpen the image before predicting its saliency mask, because the border around the object becomes clearer.
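For completeness, here is a minimal FastAPI sketch of such an endpoint, not our actual service. The helpers `predict_mask`, `upscale_mask` and `remove_background` are the hypothetical pieces sketched earlier, and the sharpening step reflects the preprocessing trick mentioned above:

```python
import io

import numpy as np
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import Response
from PIL import Image, ImageFilter

app = FastAPI()

@app.post("/remove-background")
async def remove_background_endpoint(file: UploadFile = File(...), threshold: float = 0.7):
    data = await file.read()
    pil_image = Image.open(io.BytesIO(data)).convert("RGB")

    # Sharpen before prediction: the product's border becomes clearer
    # for the saliency network. The original pixels are kept for the output.
    sharpened = pil_image.filter(ImageFilter.SHARPEN)

    image = np.array(pil_image)
    mask = predict_mask(sharpened)            # hypothetical: 320x320 saliency map
    mask = upscale_mask(mask, image)          # resized to the input's dimensions
    result = remove_background(image, mask, threshold)

    buffer = io.BytesIO()
    Image.fromarray(result).save(buffer, format="PNG")
    return Response(content=buffer.getvalue(), media_type="image/png")
```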

Images of our catalog. The image on the right sort of breaks the pattern.
Image on the right after removal. Kudos to our wonderful Content team for providing the images.

Conclusions

The solution above works reasonably well for a difficult Computer Vision task. Considering the expensive alternative of manually removing the background, or of doing nothing at all, there is value in having this tool around in our organization. It’s worth mentioning that the most popular service for this task is offered by remove.bg, which also highlights the benefits of background removal in an e-commerce context. The open source community gives us the opportunity to build things that would have been considered complicated a few years ago, without putting too much effort into it. ML research is also moving rapidly, so there’s a good chance we will come up with a better removal tool soon enough. It’s always a good idea to stay up to date with the latest advances and take advantage of the great work being done in the field.
