Unifying visual embeddings for visual search at Pinterest

Andrew Zhai | Pinterest Tech Lead, Visual Search

Throughout the years, Pinterest has launched a variety of visual search products powered by computer vision to help people discover new ideas to try and products to buy. We started in 2015 with our visual cropping tool, allowing users to search within a Pin’s image (for example, a lamp in a larger living room scene) and browse visually similar content on Pinterest. In 2017, we launched Lens camera search, opening up Pinterest’s visual search to the real world by turning every Pinner’s phone camera into a powerful discovery system. And in 2019, we launched Automated Shop the Look so Pinners can find and shop for exact products within a Pinterest home-decor scene. Visual search is one of the fastest-growing products at Pinterest, with hundreds of millions of searches per month.

At the core of these visual search technologies are visual embeddings, which power the match systems that enable Pinners to browse 200B+ ideas from any image — via Pin or camera — and search for exact products. Visual search has not only extended discovery from the online world to the offline one, but also allowed Pinners to dive into any Pin they see on Pinterest and search further.

Evolution of visual embeddings

Pinterest visual embeddings have evolved in terms of both modeling and data over the years.

We started our work with visual embeddings in 2014 as we aimed to develop our first productionized visual search product: the visual cropping tool. We heavily leveraged engagement datasets from Pinterest, since both the queries and the corpus for this product are Pin images. When we started working on Lens in 2016, the biggest challenge we had was learning the domain shift from camera query images to Pin images. Because camera images often aren’t the type of highly engaging visual content we normally see on Pinterest, we collected a human-curated dataset matching camera images to Pin images. Similarly, when we aimed to automate Shop the Look in 2018, we wanted to specifically optimize for exact product matches. Using existing Shop the Look datasets as a noisy candidate set, we again leveraged human curation to generate high-quality exact-product-match training datasets. In parallel with the dataset efforts, we naturally progressed through SOTA image classification architectures, starting with AlexNet and VGG16 in 2014, moving to ResNet and ResNeXt in 2016, and SE-ResNeXt in 2018.

Visual Cropper, Lens, and Shop the Look optimize for different objectives. With the Visual Cropper, we want to optimize for general Pinterest browsing over our corpus of 200B+ ideas; with Lens, we need to handle the domain shift from camera to Pin images; and with Shop the Look, we aim to find exact product matches from our catalog of products.

When improving an existing application with new embeddings, we would simply replace the old embeddings with newer, better-performing ones. When developing embeddings for a new application, however, in the spirit of modular design and simplicity, we applied the application-specific dataset to the SOTA model architecture of the time and generated new, independent embeddings for that application. Over time, we observed that having separate embeddings per application (Visual Cropper, Lens camera search, Shop the Look) became technical debt: our focus shifted to developing and improving embeddings for each new application rather than the existing ones, even as the underlying training infrastructure evolved. For example, prior to 2019 the Visual Cropper embedding had last been deployed in 2015, using Caffe as the training/serving framework and VGG16 as the backbone model. In comparison, the Shop the Look embedding in 2018 relied on our PyTorch training and Caffe2 serving infrastructure with SE-ResNeXt as the backbone. The Visual Cropper embedding was also very important as a general content signal for 10+ clients within Pinterest, since it was trained on Pinterest engagement, while our newer embeddings focused on human-curated signals with less impact across the Pinterest ecosystem.

Over the years, we developed independent embeddings for specific visual search products, making it difficult to simultaneously improve our visual search products. We look to simplify with one unified visual embedding for all visual search products.

With the engineering resources we had, we clearly could not continue scaling to new applications under our existing embedding-development paradigm. Given both cost and engineering resource constraints, and an interest in improved performance, we aimed to learn one unified multi-task visual embedding that could perform well for all three visual search applications.

Metric learning background

Our visual embedding is a compressed vectorized representation of an image that is the output of a convolutional neural network trained for a target similarity via metric learning.

Visual similarity is defined by the distance between visual embeddings extracted from convolutional neural networks.
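Concretely, "distance between visual embeddings" typically means cosine distance (equivalently, dot product on L2-normalized vectors). A minimal sketch with made-up 4-d vectors — real embeddings are much higher-dimensional:

```python
import numpy as np

def cosine_distance(a, b):
    """Distance between two embedding vectors (smaller = more similar)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

# Toy 4-d "embeddings": images of the same object should land nearby.
lamp_a = np.array([0.9, 0.1, 0.0, 0.4])
lamp_b = np.array([0.8, 0.2, 0.1, 0.5])
bowl   = np.array([0.0, 0.9, 0.7, 0.1])

assert cosine_distance(lamp_a, lamp_b) < cosine_distance(lamp_a, bowl)
```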

Traditionally, metric learning trains on explicit relational datasets, one example of which is a triplet dataset (q, p, n), where q is an anchor image, p is a positive image known to be related to q, and n is a negative image that is unrelated. During training, image embeddings are explicitly compared against each other. One of the main challenges of metric learning is deciding how to select the most informative negative image n.
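The standard triplet hinge loss makes the negative-selection problem concrete: a negative that is already far from the anchor contributes zero loss (and hence no gradient). A minimal numpy sketch, with toy 2-d vectors:

```python
import numpy as np

def triplet_loss(q, p, n, margin=0.2):
    """Hinge loss: push d(q, p) below d(q, n) by at least `margin`."""
    d_pos = np.linalg.norm(q - p)
    d_neg = np.linalg.norm(q - n)
    return max(0.0, d_pos - d_neg + margin)

q = np.array([1.0, 0.0])       # anchor
p = np.array([0.9, 0.1])       # positive: related to q
easy_n = np.array([0.0, 1.0])  # easy negative: already far from q
hard_n = np.array([0.85, 0.2]) # hard negative: deceptively close to q

# The easy negative yields zero loss (no learning signal), which is why
# mining informative negatives is the hard part of triplet training.
assert triplet_loss(q, p, easy_n) == 0.0
assert triplet_loss(q, p, hard_n) > 0.0
```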

Alternatively, proxy-based metric learning trains on classification datasets — (q, “nail”), (p, “nail”), (n, “bowl”) — where relationships between images are implicitly defined: similar images share the same label while dissimilar images have different labels. During training, image embeddings are compared against label embeddings (proxies) in a classification loss. Negative sampling issues are alleviated with proxy-based approaches since we usually have many fewer labels than images, which allows us to compare image embeddings against a large number of negative label embeddings during each minibatch iteration of training (sometimes fitting all label embeddings in GPU memory). At Pinterest, we train our visual embeddings with the proxy-based paradigm, as we have seen proxy-based approaches perform at least comparably to traditional metric learning approaches.
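A minimal numpy sketch of the proxy idea — the label count, embedding dimensionality, and temperature below are illustrative, not Pinterest's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)
num_labels, dim = 5, 16   # toy sizes; real training uses far more labels

# One proxy vector per label: the rows of a learnable classifier weight
# matrix, trained jointly with the image network.
proxies = rng.normal(size=(num_labels, dim))
proxies /= np.linalg.norm(proxies, axis=1, keepdims=True)

def proxy_softmax_loss(embedding, label, temperature=0.05):
    """Cross-entropy of an image embedding against all label proxies.
    Every other proxy acts as a negative, so no triplet mining is needed."""
    z = embedding / np.linalg.norm(embedding)
    logits = proxies @ z / temperature   # similarity to every proxy
    logits -= logits.max()               # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

# An embedding near the "nail" proxy (index 2) scores a lower loss for
# that label than for an unrelated one (index 0).
emb = proxies[2] + 0.05 * rng.normal(size=dim)
assert proxy_softmax_loss(emb, 2) < proxy_softmax_loss(emb, 0)
```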

Metric learning learns relationships between images explicitly. Proxy-based metric learning learns relationships between images implicitly, by clustering images around their relevant labels.

Unified visual embedding as a solution

Extending the proxy-based approach of our previous work, we trained our multi-task visual embedding by combining the application-specific datasets with multiple softmax classification losses. Application-specific datasets are uniformly blended in each minibatch, and all the tasks share a common base network up to the point where the embedding is generated, after which each task splits into its own branch. Each task branch is simply a fully connected layer (whose weights are the proxies) followed by the softmax cross-entropy loss. We train our model with PyTorch in a DistributedDataParallel manner with FP16 training using the Apex library, with extensions to support our architecture. Please refer to our KDD’19 paper for details, such as visualizations of the application-specific datasets, subsampling of proxies during training to improve training efficiency, and binarization of our embedding to improve serving efficiency.
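The sharing scheme described above — one trunk, one embedding, one proxy branch per task — can be sketched in PyTorch. The backbone stand-in, task names, label counts, and layer sizes here are hypothetical, not Pinterest's actual configuration:

```python
import torch
import torch.nn as nn

class UnifiedEmbeddingModel(nn.Module):
    """Sketch: shared trunk -> one embedding -> one FC (proxy) branch per task."""

    def __init__(self, embed_dim=256, labels_per_task=None):
        super().__init__()
        # Hypothetical tasks and label counts for illustration only.
        labels_per_task = labels_per_task or {"cropper": 1000, "lens": 500, "stl": 300}
        # Stand-in for the real backbone (e.g., SE-ResNeXt) + embedding projection.
        self.trunk = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, embed_dim))
        # One fully connected layer per task; its weight rows act as the proxies.
        self.branches = nn.ModuleDict(
            {task: nn.Linear(embed_dim, n, bias=False) for task, n in labels_per_task.items()}
        )

    def forward(self, images, task):
        embedding = self.trunk(images)           # shared across all tasks
        logits = self.branches[task](embedding)  # task-specific proxy branch
        return embedding, logits

model = UnifiedEmbeddingModel()
images = torch.randn(4, 3, 32, 32)  # a minibatch slice drawn from one task's dataset
emb, logits = model(images, "lens")
loss = nn.functional.cross_entropy(logits, torch.randint(500, (4,)))
```

At serving time only the shared embedding is exported; the per-task branches exist purely to supply training signal.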

Our visual embedding model architecture trained with PyTorch DistributedDataParallel with FP16 Mixed Precision training

Using offline retrieval metrics for each application (Visual Cropper (visual search within Pins), Lens camera search, Shop the Look), we saw that:

  1. Multi-tasking over the datasets led to better performance on every application than training the same architecture on that application’s individual dataset alone.
  2. Our unified embeddings outperformed the old deployed application-specialized embeddings where improvement comes from both newer datasets and better model architectures. With our unified embeddings, we now have a framework that is both easy to maintain (one model to train and maintain) and outperforms existing benchmarks.


Offline metrics showed that our unified embeddings outperform existing systems. To launch our embeddings, however, what really mattered were online A/B experiments, in which we tested two versions of each system: one with the unified embedding and one with the currently deployed specialized embedding for the corresponding application. At Pinterest, we measure both engagement and relevance, the former coming from live user feedback and the latter from human judgement templates tuned specifically for the given product’s objective (e.g., browsing vs. shopping, visual cropper vs. camera).

Human Judgement (first row) and A/B experiment engagement results on our visual search products. Overall our unified visual embedding led to significant gains in relevance and engagement over our existing specialized embeddings.

Overall, the unified embeddings performed very well in both engagement and relevance, as shown in the table above. Repinners and clickthroughers measure the percentage of users engaging in each action, while repins and clickthroughs measure action volume. Automated Shop the Look was still in development as we evaluated our unified embeddings, hence there are no live A/B experiment results for it. Unified embeddings also led to significant cost savings, as we were able to unify our visual search retrieval infrastructure under one embedding.

Visualization of the specialized embeddings (first row) vs. the unified visual embedding (second row) on the Visual Cropper. By combining all training datasets into one model, we have a unified visual search system that is optimized for Pin engagement, can find exact products, and understands camera images semantically.


We’ve launched unified embeddings and replaced the specialized embeddings across all visual search products at Pinterest. The embedding unification has allowed us to simplify our training and serving infrastructure and iterate globally across all products so we can move faster towards our most important objective: building and improving Pinterest for Pinners.

Thanks for reading! More details of this work are presented in our KDD’19 paper.

Acknowledgments: Visual Embeddings is a collaborative effort at Pinterest. Special thanks to Hao-Yu Wu, Eric Tzeng, Dong Huk Park, Chuck Rosenberg, Raymond Shiau, Kunlong Gu, Zhiyuan Zhang, Josh Beal, Eric Kim, Jeffrey Harris, Angela Guo, Dmitry Kislyuk, and Michael Feng for the collaboration on this project.

Pinterest Engineering Blog
