Buy Me That Look

Naman Gupta
May 11 · 11 min read

This blog is all about a fashion recommendation system. It is unique compared with other recommendation systems because, given a photo, the system recommends clothes or articles similar to those worn by the model in the picture. The architecture and design components are inspired by the paper "Buy Me That Look: An Approach for Recommending Similar Fashion Products".


  1. Business problem
  2. ML/DL formulation
  3. Business Constraints
  4. Data Acquisition and Analysis
  5. Research section
  6. My Approach
  7. End Results
  8. Future Work
  9. References

1. Business Problem:

The research paper, in short, aims to detect all fashion products in an image and retrieve similar clothes from a database, each with a product buy link.

Online business has become important in day-to-day life for everyone. Virtual stores allow people to shop from the comfort of their homes without the pressure of a salesperson. In this paper, the authors focus on the retrieval of multiple fashion items at once and propose an architecture that detects all products from an image and recommends similar kinds of products.

In stores, we can carry a piece of clothing and ask the salesperson to show us similar products matching its color, design, thickness, etc. Online this is not possible, and searching for similar products is time-consuming. So here we upload an image and search for similar items using computer vision.

2. ML/DL formulation

Let’s discuss the architecture of the model in this section. We will divide the problem into four stages:

Stage 1: (Pose Estimation)

In this stage, we detect whether the image is a full-front-pose image or not, so this is a binary classifier (Yes/No).

Stage 2: (Localization)

In this stage, we detect all the articles (clothes) in the image and locate where each article is placed. This is both a classification and a regression problem: classification for article detection, and regression for localization (bounding-box coordinates).

Stage 3: (Image_embeddings)

In this stage, we generate the embeddings (dense vectors) for the images, as discussed below.

Stage 4: (Getting similar Images)

In this stage, we use the Faiss library to fetch similar clothes based on the search query.
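The four stages above can be sketched as a minimal pipeline. Every function here is a hypothetical placeholder standing in for the real components (HRNet, Mask R-CNN, DenseNet121, Faiss), not the actual implementation:

```python
# Minimal sketch of the four-stage pipeline; all logic below is stubbed.

def is_full_front_pose(image):
    # Stage 1: binary classifier (Yes/No) for full-front-pose images.
    return True  # placeholder decision

def detect_articles(image):
    # Stage 2: detect articles and their bounding boxes (x, y, w, h).
    return [("top", (10, 20, 80, 120)), ("jeans", (12, 140, 76, 160))]

def embed(crop):
    # Stage 3: map a cropped article to a dense embedding vector.
    return [0.0] * 8  # placeholder 8-d vector instead of 1024-d

def search_similar(embedding, k=3):
    # Stage 4: nearest-neighbour lookup in the catalogue index.
    return [f"product_{i}" for i in range(k)]

def recommend(image):
    # Glue the four stages together for one query image.
    if not is_full_front_pose(image):
        return []
    results = []
    for label, box in detect_articles(image):
        crop = (image, box)  # stand-in for actual cropping
        results.append((label, search_similar(embed(crop))))
    return results
```

Swapping each stub for the real model keeps the overall control flow unchanged.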

3. Business Constraints

  • Scalable: Our system architecture should be scalable, because thousands of new images are added to the site every day.
  • Low latency: A customer is not going to wait minutes, or even more than 5–10 seconds, for a recommendation, so our architecture should retrieve recommendations within that time frame.
  • Since this is an offline recommendation system, hard real-time constraints do not apply. Interpretability matters from the customer's side: it helps to state why a recommendation is made. When recommending a product, the system adds a link to the product the user viewed that triggered the recommendation.

4. Data Acquisition and Analysis

Data was scraped from Myntra using Selenium. This data is not labeled.

The data contains different types of clothes (upper wear, lower wear, and footwear) for women. Since this data was scraped from Myntra, we do not have masks or bounding boxes for article localization/detection. So, for the article detection and localization part, I took data from the Kaggle competition iMaterialist (Fashion) 2019 at FGVC6. This fashion dataset has approximately 45.2k files, with output in encoded-pixel format along with class labels.

5. Research section

  1. Main Paper:

I took the picture below from the research paper.

Buy me that look, Fashion recommendation system blueprint.


From the paper, the architecture is as explained below.

  1. Using a pose detection classifier, we detect full-shot images, and among those full-shot (FFS) images we find the front-facing ones.
  2. Front-facing images are passed to a CNN trained with active learning, which detects the fashion objects in the image and localizes them.
  3. Image embeddings are created for all images available in the catalog and stored in the database. Triplet-net-based embedding learning is used to generate the embeddings; a simple CNN-based autoencoder can also be used.
  4. When a query image is passed, similar images are retrieved from the database. The authors use cosine similarity to fetch similar items.

2. Pose detection:

That blog covers various approaches to the pose detection problem. Using these pre-trained models saves a lot of time: choose the architecture that works best on the dataset, then fine-tune or modify it to get the best results.

3. Localization / article detection:

That blog explains the different labeled datasets available for fashion object detection. Models pre-trained on these datasets can be used on top of our data to increase accuracy. The blog gives a detailed explanation of how to use the TensorFlow Object Detection API, with good code snippets that help try the models below first as a black box and then choose the architecture that gives the best results on our dataset.

The TensorFlow Object Detection API comes with several pre-implemented architectures with weights pre-trained on the COCO (Common Objects in Context) dataset, such as:

  • SSD (Single-Shot Multibox Detector) with MobileNets
  • SSD with Inception V2
  • R-FCN (Region-based Fully Convolutional Networks) with ResNet 101
  • Faster R-CNN (Region-based Convolutional Neural Networks) with ResNet 101
  • Faster R-CNN with Inception ResNet v2

4. Triplet Loss:

That blog has a good explanation of how to use triplet loss for image-similarity problems. My understanding: a triplet-loss architecture helps us learn distributed embeddings through the notion of similarity and dissimilarity. It is a neural network architecture in which multiple parallel networks that share weights are trained together. At prediction time, input data is passed through one of the networks to compute the distributed embedding representation of the input.

Loss function: The cost function for Triplet Loss is as follows:

L(a, p, n) = max(0, D(a, p) - D(a, n) + margin)

where D(x, y) is the distance between the learned vector representations of x and y. As the distance metric, the L2 distance or (1 - cosine similarity) can be used. The objective of this function is to keep the distance between the anchor and the positive smaller than the distance between the anchor and the negative.
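As a sanity check, the loss above can be computed directly. This is a minimal pure-Python sketch using L2 distance for D; real training code would of course compute it over batches of learned embeddings:

```python
import math

def l2(x, y):
    # Euclidean (L2) distance between two equal-length vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # L(a, p, n) = max(0, D(a, p) - D(a, n) + margin)
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```

When the positive is already much closer to the anchor than the negative is (by more than the margin), the loss is zero and the triplet contributes no gradient.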

6. My Approach

Here I will be explaining my implementation of the business problem.


In Module 1, for pose detection, I tried HRNet and TensorFlow Lite models. Both models' outputs are almost identical, so I picked HRNet; the snippet below shows that both models give similar results.

So, I used the HRNet model from my research section to find all full-pose, front-facing images in my corpus.


Result of Module 1

If the image is found to be a full pose, it is sent to Module 2.


In Module 2, we have to detect all the articles and localize them. For that, I used the Mask R-CNN model, trained on data from the Kaggle competition "iMaterialist (Fashion) 2019 at FGVC6". After localization, we crop the detected articles and pass them to Module 3 to generate embeddings.

How does Mask R-CNN work?

Mask R-CNN (regional convolutional neural network) is a two-stage framework: the first stage scans the image and generates proposals (areas likely to contain an object), and the second stage classifies the proposals and generates bounding boxes and masks. Mask R-CNN is an extension of its predecessor, Faster R-CNN, by the same authors: Faster R-CNN is a popular framework for object detection, and Mask R-CNN extends it with instance segmentation, among other things.

This implementation requires TensorFlow 1.15.3 and Keras 2.2.4. It does not work with TensorFlow 2.0+ or Keras 2.2.5+ because a third-party library had not been updated at the time of writing.

!pip install --no-deps tensorflow==1.15.3

!pip install --no-deps keras==2.2.4

Mask R-CNN is basically an extension of Faster R-CNN. Faster R-CNN is widely used for object detection tasks. The Mask R-CNN framework is built on top of Faster R-CNN. So, for a given image, Mask R-CNN, in addition to the class label and bounding box coordinates for each object, will also return the object mask.

  1. Faster R-CNN first uses a ConvNet to extract feature maps from the images
  2. These feature maps are then passed through a Region Proposal Network (RPN) which returns the candidate bounding boxes
  3. We then apply an RoI ( Region of Interest ) pooling layer on these candidate bounding boxes to bring all the candidates to the same size
  4. And finally, the proposals are passed to a fully connected layer to classify and output the bounding boxes for objects

Similar to the ConvNet that we use in Faster R-CNN to extract feature maps from the image, we use the ResNet 101 architecture to extract features from the images in Mask R-CNN. So, the first step is to take an image and extract features using the ResNet 101 architecture. These features act as an input for the next layer.

Now, we take the feature maps obtained in the previous step and apply a Region Proposal Network (RPN). This basically predicts whether an object is present in a region (or not). In this step, we get the regions or feature maps that the model predicts contain some object.

The regions obtained from the RPN might be of different shapes, right? Hence, we apply a pooling layer and convert all the regions to the same shape. Next, these regions are passed through a fully connected network so that the class label and bounding boxes are predicted.

Till this point, the steps are almost like how Faster R-CNN works. Now comes the difference between the two frameworks. In addition to this, Mask R-CNN also generates the segmentation mask.

For that, we first compute the regions of interest so that the computation time can be reduced. For all the predicted regions, we compute the Intersection over Union (IoU) with the ground-truth boxes. We can compute IoU like this:

IoU = Area of the intersection / Area of the union

Now, only if the IoU is greater than or equal to 0.7 do we consider it a region of interest; otherwise, we neglect that region. We do this for all the regions and then select only the set of regions whose IoU is at least 0.7.
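The IoU formula above can be sketched for axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection rectangle (zero area if the boxes do not overlap).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    # Union = area(A) + area(B) - intersection.
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1)
             - inter)
    return inter / union if union else 0.0
```

A proposal would then be kept as a region of interest only when `iou(pred, gt) >= 0.7`.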


Localization / article detection


In Module 3, I tried DenseNet121, ResNet50, ResNet101, MobileNet, and InceptionV3. Of all these, DenseNet121 gave the best results.

DenseNet121's embeddings have lower sparsity compared with the others, so I chose DenseNet121.

DenseNet121 generates a 1024-dimensional embedding with low sparsity.

As we saw, we have 8 categories of data; I divided them into 3 super-categories for indexing, as below.

Upper_wear: women shirts, tops, tees
Lower_wear: women jeans & jeggings, women skirts, women trousers
Foot_wear: women casual shoes, flats, heels
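Routing each category to its super-category index can be expressed as a simple lookup. The exact label strings below are assumptions based on the list above, not the real scraped labels:

```python
# Hypothetical label strings; the real dataset labels may be named differently.
SUPER_CATEGORY = {
    "women_shirts_tops_tees": "upper_wear",
    "women_jeans_jeggings":   "lower_wear",
    "women_skirts":           "lower_wear",
    "women_trousers":         "lower_wear",
    "women_casual_shoes":     "foot_wear",
    "women_flats":            "foot_wear",
    "women_heels":            "foot_wear",
}

def super_category(label):
    # Route a detected article to the index it should be searched in.
    return SUPER_CATEGORY[label]
```

Keeping one index per super-category means a detected shoe is only ever compared against other footwear, never against tops or jeans.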


In Module 4, I used the FAISS (Facebook AI Similarity Search) library to retrieve similar articles.

Faiss works only with float32 ndarrays, so we first convert our embeddings into float32 ndarrays.

I created 3 Faiss indexes: one each for upper wear, lower wear, and footwear.

As the Faiss GitHub page explains, IndexFlatL2 is a brute-force index that uses Euclidean distance to find the nearest neighbors, so I used it. To reduce the space complexity, I also used an IndexIVFPQ quantizer. Cosine similarity could be used as well, but the vectors would have to be normalized first; cosine distance is more commonly used for text similarity.
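Conceptually, what IndexFlatL2 does is an exhaustive L2 search. This pure-Python sketch mirrors that behavior on small lists (the real Faiss index does the same over float32 arrays, far faster):

```python
import math

def flat_l2_search(index_vectors, query, k=3):
    # Exhaustive nearest-neighbour search: compute the L2 distance from the
    # query to every indexed vector, then return the k closest (distance, id).
    dists = []
    for i, vec in enumerate(index_vectors):
        d = math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vec)))
        dists.append((d, i))
    dists.sort()
    return dists[:k]
```

IndexIVFPQ trades a little accuracy for memory and speed by clustering the vectors and compressing them with product quantization, instead of scanning everything as above.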

Adding to the index takes only one argument: the vectors themselves. If we pass multiple vectors, they must all have the same shape.

Faiss's search method uses the index to retrieve similar articles. The query vector passed to search must have the same shape as the indexed vectors.

Since the embedding generated in Module 3 is returned as a plain list of length 1024, not as a float32 ndarray row vector, we have to convert it before searching.

7. End Results

So we have our final solution. The model is able to detect and retrieve fashion objects from a given image. There are a few wrong detections and retrievals, but this is because the model was trained for only a few epochs; some wrong retrievals happen because whole images are embedded rather than the cropped objects, and the database is also very small. Overall, we have a first-cut solution that can be further expanded and optimized. Please check it out via the GitHub link.

8. Future Work

  1. Reduce the latency of the end-to-end application.
  2. For embeddings, try different approaches, such as building your own model with a good score.
  3. Collect more data for object detection and use the latest segmentation approach other than MASK RCNN.
  4. Train the Triplet-net Based Embedding layer network for getting similar images as per the research paper.
  5. Deploy the model into production for real-time recommendations.



If you have any queries, feel free to comment below, or contact me on LinkedIn. You can find my complete project here.

Analytics Vidhya

