Building Pinterest Lens: a real world visual discovery system

Andrew Zhai | Pinterest tech lead, Visual Search

Recently, we announced Lens BETA, a new way to discover objects and ideas from the world around you using your phone’s camera. Just tap the Lens icon in the Pinterest app, point it at anything, and Lens will return visually similar objects, related ideas or the object in completed projects or contexts. Lens enables you to go beyond traditional uses of your phone’s camera–taking selfies or saving a scene–and turns it into a powerful discovery system. It brings the magic of Pinterest into the real world, so that anything you see can lead to a related idea on Pinterest. Here we’ll share how we built Lens and the main technical challenges we overcame.

Background

In 2015, we launched our first visual search experience, which enables people to pinpoint parts of an image and get visually similar results. With visual search, we gained a platform to advance our technology and incrementally improve the system by optimizing not only for relevant results but for engaging ones, too. Pinners have responded positively to these improvements and now generate more than 250 million unique visual searches every month.

As the next evolution of visual search, we introduced real-time object detection. This not only made visual search easier to use, but also let us steadily build a corpus of objects as people saved and selected them. Since its launch, we’ve generated billions of objects in just six months, and have used this data to build new technologies, such as Lens and object search.

If you’re interested in a more in-depth look at how we scaled our visual search technology to billions of images and applied it across Pinterest, please take a look at our Visual Discovery at Pinterest paper, which was accepted for publication at the World Wide Web (WWW) conference this year.

Lens architecture

A single Pin can take you down a rabbit hole of related ideas, enabling you to discover high quality content from 150M people around the world. As we developed Lens, we wanted to parallel this experience, so a single real world camera image could connect you to the 100B ideas on Pinterest.

Lens combines our understanding of images and objects with our discovery technologies to offer Pinners a diverse set of results. For example, if you take a picture of a blueberry, Lens doesn’t just return blueberries: it also gives you more results such as recipes for blueberry scones and smoothies, beauty ideas like detox scrubs or tips for growing your own blueberry bush.

To do this, Lens’ overall architecture is separated into two logical components.

  1. The first component is our query understanding layer, where we derive information about the given input image. Here we compute visual features: we detect objects, compute salient colors and assess lighting and image quality conditions. Using the visual features, we also compute semantic features such as annotations and category.
  2. The second component is our blender, as the results Lens returns come from multiple sources. We use our visual search technology to return visually similar results, object search technology to return scenes or projects with visually similar objects (more on this below) and image search which uses the derived annotations to return personalized text search results that are semantically (not visually) relevant to the input image. It’s the job of the blender to dynamically change blending ratios and result sources based on the information derived in the query understanding layer. For instance, image search won’t be triggered if our annotations are low confidence, and object search won’t be triggered if no relevant objects are detected.
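
The blender's triggering rules can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration, not Lens' actual implementation: the QueryUnderstanding fields, the ANNOTATION_THRESHOLD value, the BASE_WEIGHTS ratios and the blend_ratios function are all hypothetical names and numbers. Only the two rules themselves come from the description above: image search is skipped when annotations are low confidence, and object search is skipped when no relevant objects are detected.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class QueryUnderstanding:
    """Hypothetical output of the query understanding layer."""
    detected_objects: List[str] = field(default_factory=list)
    annotation_confidence: float = 0.0


# Hypothetical values: the post does not state real thresholds or ratios.
ANNOTATION_THRESHOLD = 0.5
BASE_WEIGHTS = {"visual_search": 0.5, "object_search": 0.3, "image_search": 0.2}


def blend_ratios(query: QueryUnderstanding) -> Dict[str, float]:
    """Return the share of results to take from each triggered source."""
    weights = dict(BASE_WEIGHTS)
    # Image search relies on annotations, so skip it when they are unreliable.
    if query.annotation_confidence < ANNOTATION_THRESHOLD:
        weights.pop("image_search")
    # Object search needs at least one detected object to query with.
    if not query.detected_objects:
        weights.pop("object_search")
    # Renormalize so the remaining sources still sum to 1.
    total = sum(weights.values())
    return {source: weight / total for source, weight in weights.items()}
```

In this sketch, a blurry photo with no detected objects and low-confidence annotations falls back entirely to visual search, while a clear photo of a recognizable object draws from all three sources.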

As described above, Lens results aren’t strictly visually similar; they come from multiple sources, some of which are only semantically relevant to the input image. By giving Pinners results beyond the visually similar, Lens is a new type of visual discovery tool that bridges real world camera images to the Pinterest taste graph.

Building object search

Sometimes you see something you love, like a cool clock or a pair of sneakers, but you don’t know how to style the shoe or how the clock would look in a room. Object Search, a core component of Lens, is a new technology we built to address these problems.

With the advances of deep learning resulting in technology such as improved image representations and object detection, we can now understand images like never before.

Traditionally, visual search systems have treated whole images as the unit. These systems index global image representations to return images that are holistically similar to the given input image. With better image representations as a result of advancements in deep learning, visual search systems have reached an unprecedented level of accuracy. However, we wanted to push the bounds of visual search technology to go beyond the whole image as the unit. By utilizing our corpus of billions of objects, combined with our real-time object detector, we can understand images on a more fine-grained level. Now, for the first time, we know both the location and the semantic meaning of billions of objects in our image corpus.

Object search is a visual search system that treats objects as the unit. Given an input image, we find the most visually similar objects in billions of images in a fraction of a second, map those objects to the original image and return scenes containing the similar objects.
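
To make the "objects as the unit" idea concrete, here is a toy Python sketch of the three steps: find similar objects, map them to their source images and return the scenes. It assumes each indexed object is stored as an embedding vector alongside the ID of the image it was cropped from. The object_search function, its parameters and the brute-force cosine ranking are illustrative assumptions; the post does not describe the real index, and a system searching billions of objects in a fraction of a second would presumably not use an exhaustive scan like this one.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def object_search(
    query_embedding: Sequence[float],
    object_embeddings: Sequence[Sequence[float]],
    object_image_ids: Sequence[str],
    top_k: int = 3,
) -> List[str]:
    """Return up to top_k scene IDs whose objects best match the query object."""
    # Step 1: rank every indexed object by similarity to the query object.
    ranked = sorted(
        range(len(object_embeddings)),
        key=lambda i: cosine(query_embedding, object_embeddings[i]),
        reverse=True,
    )
    # Steps 2 and 3: map each object to its source image and collect
    # distinct scenes in ranked order.
    scenes: List[str] = []
    for i in ranked:
        image_id = object_image_ids[i]
        if image_id not in scenes:
            scenes.append(image_id)
        if len(scenes) == top_k:
            break
    return scenes
```

Because several objects can come from the same image, the sketch deduplicates by image ID so that each scene appears only once in the results.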

Future of visual discovery

The BETA launch of Lens is really just the beginning. We’re continuing to improve our visual technologies to better understand images, as we face challenges where the image is the only available signal that we have to understand user intent. This is especially difficult in the case of real world camera images as people take photos in a variety of lighting conditions with inconsistent image quality and various orientations.

We’re excited by the possibilities that objects and visual search together can bring and are continuing to explore new ways of utilizing our massive scale of objects and images to build discovery products for Pinners around the world.

If you’re interested in tackling these computer vision challenges and building awesome products for Pinners, please join us!

Acknowledgements: Lens is a collaborative effort at Pinterest. We’d like to thank Maesen Churchill, Jeff Donahue, Shirley Du, Jamie Favazza, Michael Feng, Naveen Gavini, Jack Hsu, Yiming Jen, Jason Jia, Eric Kim, Dmitry Kislyuk, Vishwa Patel, Albert Pereta, Steven Ramkumar, Eric Sung, Eric Tzeng, Kelei Xu, Mao Ye, Zhefei Yu, Cindy Zhang, and Zhiyuan Zhang for the collaboration on the product launch; Trevor Darrell for his advisement; and Yushi (Kevin) Jing, Vanja Josifovski and Evan Sharp for their support.