Boost Time to Market by Exploring Possible Visual Recommenders in FashionTech

Neurons Lab

This story was originally published on the Neurons Lab blog on Dec 28, 2022.

How can you avoid product or startup failure at the stage of building solution requirements?

To avoid survivorship bias, you should check out this startup failure analysis. Among the top three causes of failure are running out of cash, a bad market fit, and tough competition. Of course, no one knows what guarantees success, but you can at least be more aware of what guarantees failure: not having enough time to become profitable, to pivot to the right problem, or to build a competitive advantage.

Here are a few examples of startup failures in the FashionTech industry that can be attributed to problems with the solution itself:

Fab.com: struggled to differentiate itself from other online marketplaces and failed to generate sufficient revenue from its sale of home decor and fashion products.
Rent the Runway: faced challenges with inventory management, which led to delays in fulfilling orders and frustrated customers.
Nasty Gal: struggled to maintain profitability and faced backlash over the quality of its products, leading to a decline in sales and customer loyalty.
The Honest Company: faced criticism over the safety and effectiveness of its products, leading to a loss of customer trust and a decline in sales.

This is why money and time should be allocated judiciously in all product lifecycle stages. Here, we describe a couple of insights regarding Computer Vision-based Fashion Recommender (CVFR) alternative approaches.

Knowing these approaches in advance could accelerate the move from solution exploration to the requirement-building stages. For example, some companies may spend one or two months collecting and analyzing all the requirements for the future project, which is a significant time span.

In the case of CVFR, there are a lot of questions that arise for product and engineering teams:

Is there any problem with outfit recommendations for consumers? What kind of problem? Should an outfit be generated from scratch, or should a partial outfit be completed with complementary items?
Who wants to get such recommendations for free? What about on a paid basis?
How should recommendations work in the customer journey? Or, should we disrupt this journey? Is disruption feasible?
What are current SOTA approaches?
What does “fit well or compatible” mean regarding an outfit for consumers with different tastes?
How should requirements for CVFR be tested?

Your team is surely moving through a list of similar questions.

First, pay attention to the problem-solution fit, which requires you to answer the question, “Is this worth solving?”

Next, the product-market fit requires you to answer, “Have I built something people want?”

Engage the testing mindset for product development

“The biggest reason people do that is that they don’t pay enough attention to the user.” P. Graham

First of all, engage the testing mindset. You should plan a design and testing approach that yields as much knowledge about consumer preferences as possible.

This part is essential if you are working in a field with broad, subjective differences in taste, such as fashion and apparel.

Step 1: Focus on various consumer-expressed “data touch points” about their tastes and use them effectively.

Open- and closed-source datasets
Careful research into which annotations in the data were applied by end consumers and which by designers or other subject matter experts (SMEs)
How can you augment and improve those datasets?

Step 2: Think with your engineers about the tasks you should reflect on from your problem space.

ML approaches are evolving, so you can always start with something simple to get the first results and gradually come to a State-of-the-Art solution.
Carefully rank and weigh Data Science approaches by effort, risk, and benefit, in terms of both technical performance and data requirements.
Infrastructure requirements and related costs will influence unit economics for every recommendation.

Step 3: Use a backward testing approach.

Try to begin with testing in mind: how and what are you planning to measure?
Make thorough use of technical, qualitative, and subjective metrics for annotation and testing.
You should consider the solution’s testing iteration schedule and scale.
The correct testing framework will mitigate most risks, accelerate the time to market, and decrease cash burn.

How can you build the requirements for CVFR with your Data Science team?

Let’s start with the high-level task separation in the AI FashionTech area. Next, we will answer a list of crucial questions to give you a complete understanding of how to use existing solutions for CVFR.

1. What kind of tasks should the Data Science team resolve to feed the recommendation engine?

There is a range of CV applications in FashionTech. This includes:

Fashion Detection for visual similarity
Fashion Analysis for trend forecasting
Fashion Synthesis for virtual try-on
Fashion Recommendations for outfit generation and outfit compatibility scoring

You can check out this article for more details about the use of emerging technologies in FashionTech for improving customer engagement.

2. What are the main tasks in the fashion recommendation area?

Compatibility prediction (CP): predict the compatibility of items in a given fashion outfit (for example, one generated by a user). In the schema from Han et al., the most compatible outfit is shown on top and the least compatible one on the bottom (for example, an outfit with no shoes or with colors that are not stylish).

Fill-in-the-blank (FITB): following Han et al., suggest the item that fits best with an existing set, that is, the candidate with the highest compatibility prediction score. Usually, the choice is among just a few objects (e.g., four candidates).

Complementary Item Retrieval (CIR): outfit complementary item retrieval requires you to find compatible pieces to complete an outfit. For example, find a hat that is compatible with a partially constructed outfit consisting of a top, bottom, and shoes. This is similar to FITB in that it requires you to “complete” an outfit; however, FITB requires you to select the best item from a fixed set of choices, whereas retrieval requires you to choose the best item from the entire database. This problem is essential for retailers, as customers try to find garments that go well with what they have already selected.

Prior research focused on pairwise item comparisons. Pairwise complementary retrieval scores a pair of garments, for example the top and bottom of an outfit, and matches the pairs with the highest scores. In the schema from Veit et al., the items on the left are shown next to the items on the right that the model predicts to be compatible with them.

Later, most researchers shifted their focus to the outfit level and to large-scale datasets. The compatibility prediction scores from earlier methods can also be used to rank items, but doing so is impractical in a large-scale setting (imagine an e-commerce collection of thousands of items), because every candidate in the catalog has to be scored against the outfit for each query.
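To make the pairwise idea concrete, here is a minimal sketch. It assumes each garment already has an embedding vector and uses cosine similarity as a stand-in for the compatibility score; the published models learn their own compatibility space, so the vectors, item names, and helper functions below are purely illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity, used here as a stand-in compatibility score."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_match(query_vec, candidates):
    """Return (item_id, score) of the candidate with the highest pair score."""
    scored = ((item_id, cosine(query_vec, vec)) for item_id, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical embeddings for one top and three candidate bottoms.
top = [1.0, 0.0, 0.5]
bottoms = {
    "jeans": [0.9, 0.1, 0.4],
    "skirt": [0.0, 1.0, 0.0],
    "shorts": [0.2, 0.8, 0.1],
}

match_id, match_score = best_match(top, bottoms)
print(match_id, round(match_score, 3))
```

Scoring every candidate like this is fine for a handful of items, but the cost grows with the size of the catalog for each query.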

Based on the example from Lin et al., on the left, there are incomplete outfits and missing target items (in the bounding box). On the right are retrieval results from over 3,000 images with ground truth in a green box:

Outfit generation: generating an outfit from scratch given text and/or image (multimodal) inputs from users (for example, a user can ask for a “business” or “casual” style).

3. What approaches have been used for resolving similar tasks so far?

Graph Convolutional Network (GCN) has been used to build a graph of fashion items for Compatibility Prediction (CP). Compatibility is posed as an edge prediction problem. There is an encoder that computes new embeddings for each product depending on their connections, and a decoder that predicts the compatibility score of two items.
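The encoder/decoder split can be sketched in a few lines. This is a deliberately simplified illustration, not the published model: the encoder performs a single neighbor-averaging step without learned weights, and the decoder is a sigmoid over a dot product. The item names, features, and edges are hypothetical.

```python
import math


def encode(features, edges):
    """One message-passing step: average each item's features with its neighbors'."""
    neighbors = {item: {item} for item in features}  # include a self-loop
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    dim = len(next(iter(features.values())))
    return {
        item: [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
        for item, nbrs in neighbors.items()
    }


def decode(embeddings, a, b):
    """Edge prediction: map the dot product of two embeddings to a score in (0, 1)."""
    dot = sum(x * y for x, y in zip(embeddings[a], embeddings[b]))
    return 1.0 / (1.0 + math.exp(-dot))


# Hypothetical 2-d item features; an edge means "seen together in an outfit".
features = {
    "shirt": [1.0, 0.0],
    "jeans": [0.8, 0.2],
    "sneakers": [0.9, 0.1],
    "gown": [-1.0, 0.5],
}
edges = [("shirt", "jeans"), ("jeans", "sneakers")]

embeddings = encode(features, edges)
score_casual = decode(embeddings, "shirt", "sneakers")
score_mixed = decode(embeddings, "shirt", "gown")
print(round(score_casual, 3), round(score_mixed, 3))
```

In a trained GCN the averaging step is combined with learned weight matrices and non-linearities, and the decoder is trained on known compatible and incompatible pairs.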

Bidirectional LSTM: a bidirectional long short-term memory (BiLSTM) network models an outfit as an ordered sequence. The model leverages visual data alongside one-hot encoded product descriptions and treats CP and FITB as next-item prediction problems over that sequence.

Transformer-based architecture: an outfit is modeled as an unordered set of items. An image encoder (E_img) and a text encoder (E_text) extract the image and text features. For the Complementary Item Retrieval (CIR) task, given an outfit and a target item description, a transformer encoder learns a target item embedding that is used to retrieve compatible items to complete the outfit.

Papadopoulos et al. used a transformer-based architecture to resolve newly formulated tasks, such as:

1) Mismatching Item detection tasks

2) Compatibility prediction as a regression — most previous approaches defined compatibility as a binary classification task

4. What datasets can be used for AI training?

There are a lot of datasets for fashion recommendations; here are a few of them.

Polyvore is one of the most popular datasets for experiments with recommendations. Polyvore (www.polyvore.com) was a popular fashion website where users could create and upload outfit data. Every user-created outfit can be treated as a set of “annotated,” positively compatible items.

Polyvore contains 164,379 items that form 21,899 outfits. The maximum number of items per outfit is 8 and the average is 6.5.

These fashion outfits contain rich multimodal information like images and descriptions of fashion items, the outfit’s number of likes, hashtags, etc.

The Fashion-Gen Outfits dataset from Rostamzadeh et al. contains descriptions (tagging) written by professional stylists.
The Amazon Products dataset emphasizes which users viewed items and which users bought them (simultaneously or not), a significant factor for e-commerce applications. An e-commerce business looks for “voting by money spent” on a garment, not just visual satisfaction expressed through likes. The dataset contains over 180 million relationships between almost 6 million products of different categories. There are four relationships between items: users who viewed A also viewed B; users who viewed A bought B; users who bought A also bought B; and users who bought A and B simultaneously.
AemicaChow collected a lot of fashion datasets that you can use for your experiments.

There are a lot of nuances related to those datasets that you should keep in mind.

Understand clearly which task you aim to resolve.
Select positive and negative samples (compatible and non-compatible outfits) carefully. End consumers tag the data and define compatibility according to their subjective taste. For example, negative samples can come from the same category as the positive ones or from other categories.
Build and test joint and disjoint sets with overlapping garments for training, testing, and validation.
Check the text description carefully, as it can be very noisy.
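Two of the points above, negative sampling and disjoint splits, can be sketched as follows. This is a minimal illustration under stated assumptions: outfits are lists of (item_id, category) tuples, a negative is built by swapping one item for a random item of the same category, and the disjoint split simply drops training outfits that share an item with the test set. The data format and function names are hypothetical.

```python
import random


def make_negative(outfit, catalog, rng):
    """Swap one item for a random item of the same category.

    Returns a new outfit, or None if the category has no other item.
    """
    idx = rng.randrange(len(outfit))
    item_id, category = outfit[idx]
    pool = [c for c in catalog.get(category, []) if c != item_id]
    if not pool:
        return None
    negative = list(outfit)
    negative[idx] = (rng.choice(pool), category)
    return negative


def disjoint_split(outfits, test_fraction, rng):
    """Split outfits so that no item appears in both train and test."""
    shuffled = list(outfits)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test = shuffled[:n_test]
    test_items = {item_id for outfit in test for item_id, _ in outfit}
    # Drop training outfits that overlap with the test set.
    train = [
        outfit for outfit in shuffled[n_test:]
        if all(item_id not in test_items for item_id, _ in outfit)
    ]
    return train, test


rng = random.Random(0)  # fixed seed so the sketch is reproducible
outfits = [
    [("t1", "top"), ("b1", "bottom"), ("s1", "shoes")],
    [("t2", "top"), ("b2", "bottom"), ("s2", "shoes")],
    [("t1", "top"), ("b3", "bottom"), ("s3", "shoes")],
    [("t3", "top"), ("b4", "bottom"), ("s4", "shoes")],
]

# Build a category -> item_ids catalog from the outfits themselves.
catalog = {}
for outfit in outfits:
    for item_id, category in outfit:
        items = catalog.setdefault(category, [])
        if item_id not in items:
            items.append(item_id)

negative = make_negative(outfits[0], catalog, rng)
train, test = disjoint_split(outfits, test_fraction=0.25, rng=rng)
print(negative, len(train), len(test))
```

Keep in mind that a randomly swapped item may still look compatible to some consumers, so negatives built this way are noisy labels rather than ground truth.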

5. How can you test CVFR solutions?

You can start with quantitative metrics, but be aware that the ground truths are labeled or annotated subjectively, which means that quantitative metrics describe performance on the dataset, not in the real-world application.

Compatibility prediction as a classification task (choosing compatible vs. non-compatible items) uses AUC as the metric.
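As a reminder of what AUC measures, here is a minimal sketch that computes it directly from its pairwise definition: the probability that a randomly chosen compatible outfit gets a higher score than a randomly chosen non-compatible one, with ties counted as half. The labels and scores are made-up illustration data.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# 1 = compatible outfit, 0 = non-compatible outfit (hypothetical data).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.5]

auc_value = auc(labels, scores)
print(round(auc_value, 4))
```

The pairwise loop is quadratic, so for large evaluation sets a library implementation is the practical choice.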

For FITB, given a subset of items in an outfit and a set of candidate items (e.g., four things, one positive, and three negatives), the task is to select the most compatible candidate. The performance is evaluated based on the overall accuracy.
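The FITB evaluation loop can be sketched as below. The question format and the toy scoring function are assumptions made for illustration: each item is reduced to a single "style" number, and a candidate scores higher the closer it is to the outfit's average. A real system would plug in its compatibility model as `score_fn`.

```python
def fitb_accuracy(questions, score_fn):
    """Share of questions where the top-scored candidate is the true answer."""
    correct = 0
    for question in questions:
        scores = [score_fn(question["outfit"], c) for c in question["candidates"]]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == question["answer"]
    return correct / len(questions)


def toy_score(outfit, candidate):
    """Toy stand-in for a compatibility model (higher is better)."""
    return -abs(sum(outfit) / len(outfit) - candidate)


# Each question: a partial outfit, four candidates, index of the true item.
questions = [
    {"outfit": [1.0, 1.2], "candidates": [5.0, 1.1, 3.0, 9.0], "answer": 1},
    {"outfit": [4.0, 4.4], "candidates": [4.1, 0.0, 8.0, 2.0], "answer": 0},
    {"outfit": [2.0, 2.0], "candidates": [7.0, 2.5, 1.9, 6.0], "answer": 1},
]

accuracy = fitb_accuracy(questions, toy_score)
print(round(accuracy, 3))
```

With four candidates per question, random guessing yields an accuracy of 0.25, which is a useful baseline when reading FITB results.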

For the CIR task, recall@top-k (abbreviated as R@k) is the metric: it measures how often the positive (ground-truth) image is ranked among the top k items retrieved from the database. Incomplete outfits serve as queries, and the database contains pictures with similar styles; the better the model, the further forward the positive image moves in the ranking.
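A minimal sketch of R@k, assuming one ground-truth item per query and a ranked list of item IDs returned by the retrieval model for each query. The IDs and rankings are made up for illustration.

```python
def recall_at_k(ranked_lists, positives, k):
    """Share of queries whose ground-truth item appears in the top k results."""
    hits = sum(
        1 for ranked, positive in zip(ranked_lists, positives)
        if positive in ranked[:k]
    )
    return hits / len(positives)


# One ranked result list per incomplete-outfit query (hypothetical item IDs).
ranked_lists = [
    ["a", "b", "c", "d"],
    ["x", "y", "z", "w"],
    ["m", "n", "o", "p"],
]
positives = ["b", "w", "m"]  # ground-truth missing item for each query

r_at_1 = recall_at_k(ranked_lists, positives, k=1)
r_at_3 = recall_at_k(ranked_lists, positives, k=3)
print(round(r_at_1, 3), round(r_at_3, 3))
```

Reporting R@k for several values of k shows how quickly the correct item surfaces as the result list grows.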

For measuring the task of “voting by money,” Cucurull et al. used the Amazon product dataset and measured the purchasing prediction accuracy.

Qualitative performance measuring consists of a couple of approaches:

Visual inspection of the results by users and Subject Matter Experts
Real-world series of experiments with real e-commerce platforms to collect “purchasing” annotations

For example, Sarkar et al. ran their A/B tests using Amazon Mechanical Turk. The team formulated the question and built an experiment for a complementary retrieval task that compared the modeled recommendation with the ground truth. The experiment showed that the modeled garments were of comparable quality (measured by how often participants chose them) when paired against the ground truth.

Chen et al. ran a 7-day experiment with actual recommendations and measured the Click-Through Rate (CTR). Their experiment showed a CTR improvement from the outfit recommendations produced by their model.
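When you run a CTR experiment of your own, the basic arithmetic looks like this. The sketch computes the relative CTR uplift and a standard two-proportion z-statistic; the click and impression counts are invented, and this is a generic analysis, not necessarily the one used in the study above.

```python
import math


def ctr(clicks, impressions):
    """Click-through rate: clicks divided by impressions."""
    return clicks / impressions


def two_proportion_z(clicks_a, n_a, clicks_b, n_b):
    """z-statistic for the difference between two click-through rates."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / std_err


# Hypothetical counts: control (A) vs. outfit recommendations (B).
control_ctr = ctr(200, 10_000)
treatment_ctr = ctr(260, 10_000)
uplift = (treatment_ctr - control_ctr) / control_ctr
z_score = two_proportion_z(200, 10_000, 260, 10_000)
print(round(uplift, 3), round(z_score, 2))
```

A z-score above roughly 1.96 corresponds to significance at the 5% level in a two-sided test, which helps separate a real improvement from noise in a short experiment.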

To sum things up

As you can see, achieving product-market fit requires a thorough selection of testing approaches during the solution-building stage.

The backward testing mindset helps to define tasks, select appropriate methods, and build an iterative testing structure. In a field as subjective as fashion and apparel, these elements need the attention of the entire development team.

Deciding how much to invest in quantitative, qualitative, and user preference-based testing approaches will accelerate the time to market and improve the quality of MVPs on the way to the scale-up stage.

Want to receive a consultation for utilizing AI in the fashion field?

Drop us a line to receive an expert evaluation on creating AI prototypes to deliver results. We have experience allocating industry experts, PhDs, engineers, and data scientists to match your business needs and technical solution design.

Contact Neurons Lab now