Introducing Complete the Look: a scene-based complementary recommendation system
Eric Kim & Eileen Li | Visual Search
On the Visual Search team at Pinterest, we’re constantly working on ways to help people discover new ideas visually, even when they don’t have the words to describe what they’re looking for. In a traditional image search system, the objective is to return results that are visually similar to a query image. Pinterest, however, is a visual discovery engine: we need to identify and return visual components from a broader scene to recommend ideas like an outfit or a living room style, and to differentiate and personalize across queries. This makes the larger scene just as important as the main piece in any given Pin. Every visual object within a Pin is an opportunity to search and discover.
In our latest work on recommendations for inspiration and shoppable products, we’ve built Complete the Look, which leverages rich scene context to recommend visually compatible results in Fashion and Home Decor Pins. Complete the Look uses context such as an outfit, body type, season, indoor vs. outdoor setting, the pieces of furniture in a room, and the room’s overall aesthetics to power taste-based recommendations across visual search technology.
In early testing, we’ve found the technology performs significantly better than previous recommendation systems. You can find more details in the paper, accepted at the Conference on Computer Vision and Pattern Recognition 2019 (CVPR): Complete the Look: Scene-based Complementary Product Recommendation.
Modeling style compatibility is challenging due to its complexity and subjectivity. Existing work focuses on predicting compatibility between product images (e.g. an image containing a t-shirt and an image containing a pair of jeans). However, these approaches ignore real-world ‘scene’ images, such as a street style Pin. Scene images add complexity through variations in lighting and pose, but they can also provide key context (e.g. the user’s body type, or the season) for making more accurate recommendations.
Our solution was Complete the Look, a novel approach to performing visual complements. A visual complement system should recommend results that complement, or go well with, a query image. For instance, you might be visually searching for shoes that go well with a dress. Results for this query are not bounded by visual similarity but can explore alternate dimensions of stylistic similarity. A visual complement system can be useful for completing your outfit or finding the perfect chairs to go with your new table.
Complete the Look Task
Before we discuss the CTL model details, let’s formalize some terminology. We define a scene image as a real-world image “in the wild”, such as a person out on a sunny day or a chic bedroom setting. This is in contrast to a product image, which is a close-up image of a product, typically with a white background.
We define the CTL task as: given a scene image and a product image, compute a distance that reflects the visual complementarity between the scene and the product, where a smaller distance means a better match. Such a distance measure can be used either by a binary classifier or by a reranker.
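As a minimal sketch of those two uses, assume a distance function is already available. The names `toy_distance`, `is_compatible` and `rerank_by_distance`, and the threshold, are hypothetical and not part of the CTL system; `toy_distance` is only a placeholder for the learned measure.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
DistanceFn = Callable[[Vector, Vector], float]


def toy_distance(scene: Vector, product: Vector) -> float:
    """Placeholder for the learned CTL distance (assumption: squared L2)."""
    return sum((a - b) ** 2 for a, b in zip(scene, product))


def is_compatible(scene: Vector, product: Vector,
                  distance: DistanceFn, threshold: float) -> bool:
    """Binary classifier: compatible if the distance is below a threshold."""
    return distance(scene, product) < threshold


def rerank_by_distance(scene: Vector, candidates: List[Vector],
                       distance: DistanceFn) -> List[Tuple[int, float]]:
    """Reranker: return (candidate index, distance) pairs, closest first."""
    scored = [(index, distance(scene, c)) for index, c in enumerate(candidates)]
    return sorted(scored, key=lambda item: item[1])
```

In both cases the model only has to supply the distance; the threshold or the ranking cutoff is a separate product decision.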
Dataset
To train our model, we collected a labeled dataset, which we released publicly here. The dataset consists of positive examples of scene and product image pairs, along with the product category and bounding box annotations. Each pair is augmented with a negative product image that is randomly sampled from the same category. Our model takes this triplet as input during training.
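The triplet construction described above could look roughly like the following sketch. The `Pair` and `Triplet` records and the `build_triplets` helper are hypothetical names, and excluding the positive product from the negative pool is an assumption rather than something the dataset description states.

```python
import random
from typing import Dict, List, NamedTuple


class Pair(NamedTuple):
    scene_id: str
    product_id: str
    category: str


class Triplet(NamedTuple):
    scene_id: str
    positive_id: str
    negative_id: str
    category: str


def build_triplets(pairs: List[Pair], rng: random.Random) -> List[Triplet]:
    # Group distinct product ids by category so that each negative is
    # drawn from the same category as its positive.
    by_category: Dict[str, List[str]] = {}
    for pair in pairs:
        products = by_category.setdefault(pair.category, [])
        if pair.product_id not in products:
            products.append(pair.product_id)

    triplets = []
    for pair in pairs:
        # Assumption: the positive itself is never used as the negative.
        pool = [p for p in by_category[pair.category] if p != pair.product_id]
        if not pool:
            continue  # no other product in this category to sample from
        triplets.append(
            Triplet(pair.scene_id, pair.product_id, rng.choice(pool), pair.category)
        )
    return triplets
```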
Since we want to discourage the model from memorizing the exact products, we add a preprocessing step that crops the product out of the original scene image.
This additional step forces the model to learn the compatibility of scene and product independently of visual similarity.
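The post does not spell out the cropping rule, so the following is only one plausible heuristic: given the product's bounding box, keep the largest rectangular strip of the scene that does not overlap it. The function names and the (left, top, right, bottom) pixel convention are assumptions made for this sketch.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), in pixels


def box_area(box: Box) -> int:
    left, top, right, bottom = box
    return max(0, right - left) * max(0, bottom - top)


def crop_region_excluding(width: int, height: int, product_box: Box) -> Box:
    """Return the largest full-width or full-height strip of the scene
    that does not overlap the product's bounding box."""
    left, top, right, bottom = product_box
    candidates = [
        (0, 0, width, top),          # strip above the product
        (0, bottom, width, height),  # strip below the product
        (0, 0, left, height),        # strip to the left of the product
        (right, 0, width, height),   # strip to the right of the product
    ]
    return max(candidates, key=box_area)
```

For a 100x200 street style image with shoes annotated at `(20, 150, 80, 200)`, this keeps the region above the shoes, so the scene still shows the outfit but not the product itself.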
Model Overview
The CTL model is a deep convolutional feed-forward neural network and consists of two modules: the image featurizer, and the CTL head. The CTL head combines global feature similarity with a local spatial attention mechanism that encourages the model to focus on specific regions of the image to inform its decision. We used the ResNet50 model architecture as the image featurizer, pretrained on ImageNet. In all experiments, we do not fine-tune the ResNet50 network.
The CTL model consists of three steps:
(1) Featurize the scene and product images.
First, the model generates base features for the scene and product images using the ResNet50 network. We use the block4 feature map.
(2) Compute global similarity.
Next, we compute a global similarity measure between the scene and each positive and negative product image.
This is done by computing scene and product embeddings from the ResNet50 feature maps and taking the L2 norm of their difference.
The two terms in the norm are the scene and product embeddings, respectively.
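A minimal sketch of this global distance follows, with a feature map represented as a list of per-region feature vectors. Average pooling and L2 normalization are assumptions about how the embeddings are obtained, and any learned projection layers are omitted; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def global_average_pool(feature_map: List[Vector]) -> List[float]:
    """Average the per-region feature vectors into one vector."""
    count = len(feature_map)
    dim = len(feature_map[0])
    return [sum(region[k] for region in feature_map) / count for k in range(dim)]


def l2_normalize(vector: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def global_distance(scene_map: List[Vector], product_map: List[Vector]) -> float:
    """L2 distance between the pooled, normalized scene and product embeddings."""
    scene_embedding = l2_normalize(global_average_pool(scene_map))
    product_embedding = l2_normalize(global_average_pool(product_map))
    return l2_distance(scene_embedding, product_embedding)
```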
(3) Compute local similarity.
We compute a category-based local attention saliency map that encourages the model to focus on fine-grained details within the scene to inform its decision.
Here, we match the product embedding against every spatial region in the scene image’s intermediate feature map, e.g. block3 of the ResNet50 base network. Because not all scene regions are equally relevant, we weight the matching via a category-based attention map, defined in terms of the L2 distance between each scene region embedding f_i and the target category embedding e_c.
Here s and p are the scene and product, c is the category of p, f_i is the scene embedding for region i, and e_c is an L2-normalized category embedding for category c.
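The sketch below shows one way such an attention-weighted match could be computed. Turning the region-to-category distances into weights with a softmax over their negatives, so that regions closer to the category embedding count more, is an assumption of this sketch; the function names are hypothetical, and the product embedding is assumed to be L2-normalized already.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def l2_normalize(vector: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def l2_distance(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(regions: List[Vector], category_embedding: Vector) -> List[float]:
    """One weight per scene region f_i, based on its distance to e_c."""
    scores = [-l2_distance(l2_normalize(f_i), category_embedding) for f_i in regions]
    return softmax(scores)


def local_distance(regions: List[Vector], product_embedding: Vector,
                   category_embedding: Vector) -> float:
    """Attention-weighted sum of region-to-product distances."""
    weights = attention_weights(regions, category_embedding)
    return sum(
        w * l2_distance(l2_normalize(f_i), product_embedding)
        for w, f_i in zip(weights, regions)
    )
```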
The attention map is category-based because different items care about different things when it comes to compatibility. For instance, it’s important that shoes match well with the rest of the outfit, whereas for Home Decor it’s important for the throw pillows to match the overall aesthetics of the room.
The final similarity measure is the average of the global and local similarity:
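Written out, with $d_{\text{global}}$ and $d_{\text{local}}$ as assumed names for the global and local measures from steps (2) and (3), and s and p the scene and product, this average would read:

```latex
d(s, p) = \frac{1}{2}\left( d_{\text{global}}(s, p) + d_{\text{local}}(s, p) \right)
```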
Loss function
We train the model using the triplet loss formulation, where the input triplets are: (scene image, positive image, negative image). We use the hinge loss, which encourages the distance between the scene and the positive product image to be smaller than the distance between the scene and the negative product image.
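A triplet hinge loss of this kind can be sketched as follows. The margin value of 0.2 is a hypothetical default chosen for illustration, not a number taken from this post, and the function names are likewise assumptions.

```python
from typing import List, Tuple


def triplet_hinge_loss(d_pos: float, d_neg: float, margin: float = 0.2) -> float:
    """Zero once the positive is closer than the negative by at least `margin`."""
    return max(0.0, d_pos - d_neg + margin)


def mean_triplet_loss(distances: List[Tuple[float, float]],
                      margin: float = 0.2) -> float:
    """Average the hinge loss over (d_pos, d_neg) pairs from a batch of triplets."""
    losses = [triplet_hinge_loss(d_pos, d_neg, margin) for d_pos, d_neg in distances]
    return sum(losses) / len(losses)
```

Triplets that already satisfy the margin contribute nothing, so training effort goes to the cases where the negative product still looks too compatible with the scene.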
Experiments
We compared our CTL model to several baselines on three offline evaluation datasets, in both the Fashion and Home Decor settings. For both the binary classification and the Top-K accuracy settings, we find that our CTL model consistently outperforms the baselines.
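For reference, the two evaluation settings can be computed from model distances roughly as in the sketch below. The exact protocol is not given in this post, so the definitions are assumptions: binary classification counts a triplet as correct when the positive is closer than the negative, and Top-K counts a query as a hit when the ground truth product ranks among the K closest candidates.

```python
from typing import List, Tuple


def binary_accuracy(triplet_distances: List[Tuple[float, float]]) -> float:
    """Fraction of (d_pos, d_neg) pairs where the positive is closer."""
    correct = sum(1 for d_pos, d_neg in triplet_distances if d_pos < d_neg)
    return correct / len(triplet_distances)


def top_k_accuracy(queries: List[Tuple[List[float], int]], k: int) -> float:
    """Each query is (candidate distances, index of the ground truth product)."""
    hits = 0
    for distances, truth_index in queries:
        # Rank candidate indices from closest to farthest.
        ranked = sorted(range(len(distances)), key=lambda i: distances[i])
        if truth_index in ranked[:k]:
            hits += 1
    return hits / len(queries)
```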
Binary classification:
It’s interesting to note that directly using the ResNet50 features for the CTL task is no better than random chance. This suggests that visual compatibility is different from visual similarity, and thus it is necessary to learn the notion of compatibility from data.
Top-K accuracy:
Qualitative Results
Here are recommendations that the CTL model produces for several images in the test set:
Note that the full scenes and (ground truth) product images are shown only for demonstration and are not inputs to our system.
Qualitatively speaking, the recommended products are compatible with the scenes. The model has learned to suggest products that are not only visually similar to the ground truth (e.g. the same color), but also others that share the same style (e.g. minimalist).
Attention maps
Here is a visualization of the attention maps that the CTL model generates on test scene images:
The ‘A’ column is our attention map, and the ‘S’ column is the output of a generic salient object detector, DeepSaliency.
In the fashion domain, our model learns to focus on the subject’s outfit when recommending complements. In contrast, the attention maps in the interior design domain are more diffuse, and attend to many objects rather than a single subject. This suggests that the model considers the overall aesthetics of the room when recommending complementary products, rather than focusing on a single specific object in the room.
Summary
“Complete the Look” is a novel approach to performing visual complements in a way that leverages rich context from scene images to provide highly personalized recommendations. This project is one of the many exciting problems in the visual search space that the Visual Search team is working on at Pinterest. We’ll continue working on ways to power recommendations across Pinterest using the latest in visual search technology.
Acknowledgments
This work was done in collaboration with Wang-Cheng Kang while he was a PhD visual search intern at Pinterest. We would like to thank Julian McAuley, Jure Leskovec, and Charles Rosenberg for their guidance during the project.
Additionally, we would like to thank Ruining He, Zhengqin Li, Larkin Brown, Zhefei Yu, Kaifeng Chen, Jen Chan, Seth Park, Aimee Rancer, Andrew Zhai, Bo Zhao, Ruimin Zhu, Cindy Zhang, Jean Yang, Mengchao Zhong, Michael Feng, Dmitry Kislyuk, and Chen Chen for their help in this work.