Building Lens your Look: Unifying text and camera search

Nov 14, 2017

Eric Kim | Pinterest engineer, Visual Search

In February we launched Lens to help Pinners find recipes, style inspiration and products using the camera in our app to search. Since then, our team has been working on new ways of integrating Lens into Pinterest to improve discovery with visual search in the areas Pinners love most, particularly fashion. What we’ve learned is that some searches are better served by text and others by images, but for certain types of searches, it’s best to have both. That’s why we built Lens your Look, an outfit discovery system that seamlessly combines text and camera search to make Pinterest your personal stylist.

Launching today, Lens your Look enables you to snap a photo of an item in your wardrobe and add it to your text search to see outfit ideas inspired by that item. It’s an application of multi-modal search, where we integrate both text search and camera search to give Pinners a more personalized search experience. We use large-scale, object-centered visual search to provide us with a finer-grained understanding of the visual contents of each Pin. Read on to learn how we built the systems powering Lens your Look!

Architecture: Multi-modal search

Lens your Look is built on two of Pinterest’s core systems: text search and visual search. Combining them into a unified architecture lets us power unique search experiences like Lens your Look.

The unified search architecture consists of two stages: candidate generation and visual reranking.

Candidate generation

In the Lens your Look experience, when we detect the user has done a text search in the fashion category, we give them the option to also take a photo of an article of clothing using Lens. Armed with both a text query and an image query, we leverage Pinterest Search to generate a high-quality set of candidate Pins.

On the text side, we harness the latest and greatest of our Search infrastructure to generate a set of Pins matching the user’s original text search query. For instance, if the user searched for “fall outfits,” Lens your Look finds candidate results from our corpus of outfit Pins for the fall season.

We also use visual cues from the Lens photo to assist with candidate generation. Our visual query understanding layer outputs useful information about the photo, such as visual objects, salient colors, semantic category, stylistic attributes and other metadata. By combining these visual signals with Pinterest’s text search infrastructure, we’re able to generate a diverse set of candidate Pins for the visual reranker.
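
To make this concrete, here’s a minimal Python sketch of how signals from visual query understanding might be combined with the text query when generating candidates. The signal names, corpus fields and matching logic are illustrative assumptions for this post, not Pinterest’s actual search API or schema.

```python
# Hypothetical sketch: filter a toy Pin corpus with both the text query and
# visual signals (category, salient colors) extracted from the Lens photo.

def generate_candidates(text_query, visual_signals, corpus):
    """Return Pins that match the text query and are consistent with the photo's visual signals."""
    query_terms = set(text_query.lower().split())
    wanted_colors = set(visual_signals.get("colors", []))
    wanted_category = visual_signals.get("category")

    candidates = []
    for pin in corpus:
        # Text match: require overlap between query terms and the Pin's annotations.
        if not query_terms & set(pin["annotations"]):
            continue
        # Visual filters: keep Pins consistent with the detected category
        # and salient colors, when those signals are present.
        if wanted_category and pin.get("category") != wanted_category:
            continue
        if wanted_colors and not wanted_colors & set(pin.get("colors", [])):
            continue
        candidates.append(pin)
    return candidates

# Example usage with a toy corpus.
corpus = [
    {"id": 1, "annotations": {"fall", "outfits"}, "category": "outfit", "colors": ["red", "brown"]},
    {"id": 2, "annotations": {"fall", "outfits"}, "category": "outfit", "colors": ["blue"]},
    {"id": 3, "annotations": {"summer", "dresses"}, "category": "outfit", "colors": ["red"]},
]
signals = {"category": "outfit", "colors": ["red"]}
print([p["id"] for p in generate_candidates("fall outfits", signals, corpus)])  # [1]
```

In the real system these constraints are applied inside the search infrastructure at much larger scale; the sketch only shows the shape of the idea, with both modalities narrowing the candidate set before reranking.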

Visual reranking

Next, we visually rerank the candidate Pins against the query image, in this case the Pinner’s article of clothing. The goal is to ensure the top results include clothing that closely matches the query image. Lens your Look makes use of our visual object detection system, which allows us to rerank based on objects in the image, such as specific articles of clothing, rather than on the entire image.

Reranking by visual objects gives us a more nuanced view into the visual contents of each Pin and is a major reason Lens your Look works so well. For more details on the visual reranking system, see our paper recently published at the WWW 2017 conference.
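
As a rough illustration of object-level reranking, the sketch below scores each candidate Pin by the highest cosine similarity between the query object’s embedding (e.g., the Pinner’s shirt) and the embeddings of the objects detected in that Pin, then sorts by that score. The data layout and the random embeddings are stand-ins; real embeddings come from the visual models described in the next section.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rerank_by_object(query_object_emb, candidates):
    """candidates: list of (pin_id, list_of_object_embeddings) pairs."""
    scored = []
    for pin_id, object_embs in candidates:
        # Score the Pin by its most similar detected object,
        # rather than by a single whole-image embedding.
        best = max(cosine(query_object_emb, emb) for emb in object_embs)
        scored.append((pin_id, best))
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Example usage with toy 4-dimensional embeddings.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=4)
candidates = [
    ("pin_a", [rng.normal(size=4) for _ in range(3)]),
    ("pin_b", [rng.normal(size=4) for _ in range(2)]),
]
print(rerank_by_object(query_emb, candidates))
```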

Multi-task training: Teaching fashion to our visual models

Now that we have object-based candidates, we assign a visual similarity score to each one. Although we’ve written about transfer learning methods in the past, we needed a more fine-grained representation for Lens your Look. Specifically, our visual embeddings have to model stylistic attributes such as color, pattern, texture and material, which allows our visual reranking system to return results at a more fine-grained level. For instance, red-striped shirts will only be matched with other red-striped shirts, not with blue-striped shirts or red plaid shirts.

To accomplish this, we augmented our deep convolutional classification networks to simultaneously train on multiple tasks while maintaining a shared embedding layer. In addition to the typical classification or metric learning loss, we also incorporate task-specific losses, such as predicting fashion attributes and color. This teaches the network to recognize that a striped red shirt shouldn’t be treated the same as a solid navy shirt. Our preliminary results show that incorporating multiple training losses leads to an overall improvement in visual retrieval performance, and we’re excited to continue pushing this frontier.
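
As a rough sketch of what such multi-task training could look like, the PyTorch snippet below puts a shared embedding layer on top of backbone features and trains it with a weighted sum of per-task losses: a category classification loss, a multi-label fashion-attribute loss and a color loss. The backbone dimension, head sizes and loss weights are illustrative assumptions, not our actual configuration.

```python
import torch
import torch.nn as nn

class MultiTaskFashionModel(nn.Module):
    def __init__(self, backbone_dim=2048, embed_dim=256,
                 n_categories=1000, n_attributes=50, n_colors=12):
        super().__init__()
        # Shared embedding layer on top of convolutional backbone features.
        self.embed = nn.Linear(backbone_dim, embed_dim)
        # Task-specific heads trained jointly with the shared embedding.
        self.category_head = nn.Linear(embed_dim, n_categories)
        self.attribute_head = nn.Linear(embed_dim, n_attributes)  # multi-label
        self.color_head = nn.Linear(embed_dim, n_colors)

    def forward(self, backbone_features):
        z = torch.relu(self.embed(backbone_features))
        return z, self.category_head(z), self.attribute_head(z), self.color_head(z)

def multitask_loss(outputs, category_y, attribute_y, color_y, weights=(1.0, 0.5, 0.5)):
    _, cat_logits, attr_logits, color_logits = outputs
    # Weighted sum of the classification loss and the task-specific losses.
    loss = weights[0] * nn.functional.cross_entropy(cat_logits, category_y)
    loss = loss + weights[1] * nn.functional.binary_cross_entropy_with_logits(attr_logits, attribute_y)
    loss = loss + weights[2] * nn.functional.cross_entropy(color_logits, color_y)
    return loss

# Toy forward/backward pass on random backbone features for a batch of 8 images.
model = MultiTaskFashionModel()
features = torch.randn(8, 2048)
loss = multitask_loss(model(features),
                      torch.randint(0, 1000, (8,)),
                      torch.randint(0, 2, (8, 50)).float(),
                      torch.randint(0, 12, (8,)))
loss.backward()
```

Because every task shares the same embedding layer, attributes like color and pattern shape the embedding itself, which is what lets the reranker keep red-striped shirts near other red-striped shirts rather than near blue-striped or plaid ones.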

Conclusion

Since launching our first visual search product in 2015, the visual search team has developed our infrastructure to support a variety of new features, from powering image search in the Samsung Galaxy S8 to today’s launch of Lens your Look. With one of the largest and most richly annotated image datasets around, we have an unending list of exciting ideas to expand and improve Pinterest visual search. If you’d like to help us build innovative visual search features like Lens your Look, join us!

Acknowledgements: Lens your Look is a collaborative effort at Pinterest. We’d like to thank Yiming Jen, Kelei Xu, Cindy Zhang, Josh Beal, Andrew Zhai, Dmitry Kislyuk, Jeffrey Harris, Steven Ramkumar and Laksh Bhasin for their collaboration on this product, Trevor Darrell for his advisement, and the rest of the visual search team.
