Fashion Recommendation via Image Search

Thititorn Seneewong Na Ayutthaya
Published in KBTG Life · 7 min read · Dec 13, 2023

These days, more and more companies are opting to sell their products on e-commerce marketplaces in addition to going direct-to-consumer. At the same time, many buyers prefer to shop through online platforms for their variety, convenience, and promotional discounts. The growth of online shopping platforms is a win-win for all, including:

  • Users, who gain access to a wide range of products, save time, avoid travel, and find good deals.
  • Sellers, who get more opportunities to sell their products to customers everywhere.
  • Platform owners, since a large user base generates more value and revenue for the platform itself.

One of the industries with the highest market value in e-commerce is fashion. This is because its products are at once one of the basic human needs and luxury goods.

Global Revenue in the Fashion Market

Recommending relevant results that buyers will be interested in remains one of the most important tasks for e-commerce platforms. In this article, we will discuss a type of product recommendation system (RecSys) based on product images, with the goal of offering relevant products aligned with the buyer’s interests, thus increasing the chances of one or more purchases.

Sample Menu of Recommended Products on a General Platform

Various techniques have been proposed in RecSys, ranging from simple statistics such as sales figures and product view counts, to a user’s history of past purchases, to machine learning models that predict which products a user may be interested in buying.

For RecSys techniques using Machine Learning, there are several methods, such as…

  • General Recommendation: This is a common technique that utilizes the transaction history of user-item interactions to make item recommendations. One well-known technique in this category is Collaborative Filtering.
  • Content-based Filtering: This is a technique that suggests items by assessing the similarity of item features to the items that a user has shown interest in.
  • Context-aware Recommendation: This approach incorporates various user-related contexts (such as gender, age, or income) to assist in making recommendations.
  • Sequential Recommendation: This method leverages user behavior data from the past (long-term) and considers the order or temporal aspects to make recommendations.
  • Session-based Recommendation: Similar to sequential recommendation, this technique uses data from individual sessions (short-term) to make recommendations.
  • Knowledge-based Recommendation: This technique involves utilizing explicit knowledge or domain-specific information related to users, items, or various domains to aid in making recommendations.

In addition to the mentioned techniques, there are many other approaches, such as graph-based recommendation systems, reinforcement learning, explainable recommendation systems, etc.
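To make the first category concrete, here is a minimal sketch of item-based collaborative filtering on a toy user-item interaction matrix (the data and item indices are purely illustrative, not from any real platform):

```python
import numpy as np

# Toy user-item interaction matrix (rows = users, columns = items);
# 1 means the user purchased or viewed the item. Illustrative data only.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

def item_similarity(m):
    """Cosine similarity between item columns of the interaction matrix."""
    norms = np.linalg.norm(m, axis=0, keepdims=True)
    unit = m / np.clip(norms, 1e-12, None)
    return unit.T @ unit

sim = item_similarity(interactions)

def recommend_for_user(user_idx, k=2):
    """Score unseen items by their similarity to the items this user interacted with."""
    seen = interactions[user_idx] > 0
    scores = sim[:, seen].sum(axis=1)
    scores[seen] = -np.inf          # never re-recommend items the user already has
    return np.argsort(scores)[::-1][:k]

print(recommend_for_user(0))  # items most similar to what user 0 already likes
```

Real systems replace this dense matrix with sparse factorized representations, but the core idea — scoring unseen items via similarity over co-interaction patterns — is the same.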

When we shop for products on various online shopping platforms, we typically encounter different product recommendation zones, and each zone may employ different techniques.

For example, on the main webpage, there are product recommendations for each category based on the sales volume of products in that category.

Recommendation in Video Games Category Based on Popularity

Or techniques like content-based filtering, which recommends products with features or characteristics similar to those the user is interested in.

Recommendation Based on Related Item

Or, when we log in and use the platform, there will be a zone recommending items based on our purchase history.

Recommendation Based on Past Purchases

In this article, we will discuss how we can use image search in a fashion recommendation system to suggest similar items (which can be described as a technique within content-based filtering).

The basic concept behind recommending products using image search is quite simple. We first transform images into vectors, then compare the query vector against the vectors of the other product images in the database, and finally rank by similarity to decide which items to recommend.

Image Search Based on Ranking Similarity of Vector
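The comparison step above is typically cosine similarity between embedding vectors. A minimal sketch, using small toy vectors as stand-ins for real image-encoder outputs:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d embeddings standing in for real image-encoder outputs
# (real encoders produce hundreds of dimensions).
query      = np.array([0.9, 0.1, 0.0, 0.4])
red_dress  = np.array([0.8, 0.2, 0.1, 0.5])   # visually close to the query
blue_jeans = np.array([0.1, 0.9, 0.7, 0.0])   # visually different

candidates = {"red_dress": red_dress, "blue_jeans": blue_jeans}
ranked = sorted(candidates,
                key=lambda name: cosine_similarity(query, candidates[name]),
                reverse=True)
print(ranked)  # most visually similar item first
```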

One of the advantages of using image search for content-based filtering in the fashion domain is that most fashion products share largely similar metadata, such as clothing type or color, which makes metadata alone a weak discriminator.

Examples of Images Per Category

However, for clothing, what primarily represents the differences between each item usually comes from the patterns or visual appearance on the clothing itself. Therefore, in the fashion domain, image vectors can effectively represent the features of the products.

Image search results with visual search will show visually similar images

Moreover, there is research on models that can represent both images and text, known as CLIP (Contrastive Language-Image Pretraining) [1]. The model consists of an image encoder and a text encoder, trained so that the vectors of semantically matching image-text pairs end up as close as possible. This training technique is known as contrastive learning.

CLIP Model
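The contrastive objective can be sketched as follows: both encoders emit L2-normalized vectors, every image is scored against every text, and training pushes the matching (diagonal) pairs to dominate. The embeddings below are toy stand-ins for encoder outputs, and the temperature value is only indicative:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy embeddings standing in for CLIP encoder outputs (real CLIP uses ~512-d).
image_embs = l2_normalize(np.array([
    [0.9, 0.1, 0.2],    # photo of a striped shirt
    [0.1, 0.8, 0.3],    # photo of leather boots
]))
text_embs = l2_normalize(np.array([
    [0.85, 0.15, 0.1],  # "a striped shirt"
    [0.2,  0.9,  0.25], # "leather boots"
]))

# Contrastive training pushes matching image/text pairs (the diagonal)
# toward high similarity and mismatched pairs toward low similarity.
logits = 100.0 * image_embs @ text_embs.T   # 100.0 stands in for the learned temperature
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
print(probs.argmax(axis=1))  # each image matches its own caption: [0 1]
```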

Since the vectors obtained from the image encoder and the text encoder live in a shared space, we can use them in various search applications: text-to-image, image-to-text, image-to-image, and text-to-text. Note that the text encoder and the image encoder are trained concurrently, which leaves room for further improvement if we want to investigate how each encoder represents the data separately, as well as the alignment between the two representation distributions.

Text to Image Search by CLIP Model

To ensure that the model performs well on our type of data, we used a model trained on domain-specific data: FashionCLIP [2], which has been trained on over 700k image-text pairs from the Farfetch dataset.

FashionCLIP

However, since most fashion product images are shot on a human model, encoding the entire image may produce a vector that captures information beyond the product itself. Therefore, before converting fashion product images into vectors, we should first segment and crop the parts of the image that contain the actual product we want to encode.

For this purpose, we can use detection or segmentation models capable of distinguishing clothing and accessory regions on the body. In this example, we employed the SegFormer model [3], trained on the ATR dataset [4] (a human parsing dataset), to segment and crop the regions of interest within the product images.

An Example From the Human Parsing Dataset

We use the SegFormer model to segment the image, post-process the result to crop the product region of interest, and then encode the cropped image into the product’s vector. We have observed that similarity scores between images that have undergone this cropping process tend to be higher.

Examples of Results From Using the Segmentation Model and Performing Post-processing in Image Cropping
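The post-processing step reduces to: keep only the pixels whose predicted class is a garment, then crop the image to their bounding box. A minimal sketch with a toy mask (the label values are illustrative; real human-parsing label maps define their own class ids):

```python
import numpy as np

def crop_to_mask(image, mask, target_labels):
    """Crop an image to the bounding box of pixels whose segmentation label
    is in target_labels (e.g. the 'upper clothes' class of a human parser)."""
    keep = np.isin(mask, list(target_labels))
    if not keep.any():
        return image  # nothing detected; fall back to the full image
    rows = np.where(keep.any(axis=1))[0]
    cols = np.where(keep.any(axis=0))[0]
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

# Toy 6x6 "image" and mask: label 4 marks the garment region (label id is illustrative).
image = np.arange(36).reshape(6, 6)
mask = np.zeros((6, 6), dtype=int)
mask[2:5, 1:4] = 4
crop = crop_to_mask(image, mask, {4})
print(crop.shape)  # (3, 3)
```

In practice the mask comes from the segmentation model’s per-pixel argmax, and the crop is what gets passed to the image encoder.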

Once we have vectors for all the products in the database, we can recommend similar products for each one by comparing its vector with those of the other products and ranking them.
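At database scale this is a batch top-k lookup: normalize every vector once, take dot products, and exclude the product itself from its own ranking. A sketch with random vectors standing in for real product embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy database of 100 product vectors standing in for real image embeddings.
db = rng.normal(size=(100, 64))
db /= np.linalg.norm(db, axis=1, keepdims=True)   # unit-normalize once up front

def recommend_similar(item_idx, k=5):
    """Rank every other product by cosine similarity to the given one."""
    scores = db @ db[item_idx]          # cosine similarity (vectors are unit-length)
    scores[item_idx] = -np.inf          # exclude the product itself
    return np.argsort(scores)[::-1][:k]

top5 = recommend_similar(0)
print(top5)
```

For large catalogs, the brute-force matrix product is usually replaced by an approximate nearest-neighbor index, but the ranking logic is the same.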

The same process also lets us recommend products based on images that users upload as queries.

Results of Image Search Based on Query Image

In our next installment, we will extend this into incorporating users’ purchase history. This will allow us to explore more advanced and personalized approaches to enhance the user experience.

References

[1] Radford, Alec, et al. “Learning transferable visual models from natural language supervision.” International conference on machine learning. PMLR, 2021.

[2] Chia, Patrick John, et al. “Contrastive language and vision learning of general fashion concepts.” Scientific Reports 12.1 (2022): 18958.

[3] Xie, Enze, et al. “SegFormer: Simple and efficient design for semantic segmentation with transformers.” Advances in Neural Information Processing Systems 34 (2021): 12077–12090.

[4] Liang, Xiaodan, et al. “Deep human parsing with active template regression.” IEEE transactions on pattern analysis and machine intelligence 37.12 (2015): 2402–2414.

Follow KBTG Life for more stories like this. We have great articles both in Thai and English that are masterfully crafted by KBTG people, so don’t miss out!
