Pinterest’s Visual Lens: How computer vision explores your taste
The science behind personalized visual recommendations
When it comes to looking for something you want to try — a new salad recipe, a new classy dress, a new chair for your living room — you really need to see it first. Humans are visual creatures. We use our eyes to decide if something looks good, or if it matches our style.
I’m a huge fan of Pinterest, and particularly of its Visual Lens. Why? It lets me discover things that intrigue me and dig deeper into them. When I spot something out in the world that looks interesting, words often fail me when I try to search for it online later. I have a rich, colorful picture in my mind, but I can’t translate it into the words I need to find it. Pinterest’s Visual Lens is a way to discover ideas without having to find the right words to describe them first.
Just point Lens at a pair of shoes, then tap to see related styles or even ideas for what else to wear them with. Or try it on a table to find similar designs, and even other furniture from the same era. You can also use Lens with food. Just point it at cauliflower or potatoes to see what recipes come up. Patterns and colors can also lead you in fun, interesting or even just plain weird new directions.
So how does Pinterest do such an amazing job of searching through vision and personalizing visual recommendations for its users? After two weeks of digging through the company’s engineering blog and press coverage, I feel grateful to have finally gotten a glimpse behind the curtain. It turns out that the product is one instance of the machine learning applications at Pinterest, which span a wide variety of business areas. Let’s zoom out for a second to look at how machine learning is used at Pinterest.
An Overview of Machine Learning Usage at Pinterest
As a visual discovery engine, Pinterest has many challenging problems that can be solved using machine learning techniques:
- Which interests should we recommend to a new user?
- How do we generate an engaging home feed?
- How do pins relate to each other?
- Which interests does a pin belong to?
The critical moment came in January 2015, when Pinterest acquired Kosei, a machine learning startup with expertise in recommender systems. Since then, machine learning has been used across Pinterest in multiple areas: the Discovery team provides recommendations and related content, and predicts the likelihood that a person will pin content; the Growth team uses intelligent models to determine which emails to send and to prevent churn; the Monetization team predicts ad performance and relevance; and the Data team builds out a real-time distributed system for machine learning with Spark.
Let’s dig a little bit deeper into how Pinterest engineers are leveraging machine learning to keep the website’s 175 million+ users pinning and sharing:
- Identifying Visual Similarities: Machine learning can not only determine the subject of an image but also identify visual patterns and match them to other photos. Pinterest is using this technology to process 150 million image searches per month, helping users find content that looks like pictures they’ve already pinned.
- Categorizing and Curating: If a user pins a mid-century dining-room table, the platform can now offer suggestions of other objects from the same era. The key? Metadata, such as the names of pinboards and websites where images have been posted, helps the platform understand what photos represent.
- Predicting Engagement: While many platforms prioritize content from a user’s friends and contacts, Pinterest pays more attention to an individual’s tastes and habits — what they’ve pinned and when — enabling the site to surface more personalized recommendations.
- Prioritizing Local Taste: Pinterest is an increasingly global platform, with more than half of its users based outside the U.S. Its recommendation engine has learned to suggest popular content from users’ local region in their native language.
- Going Beyond Images: Analyzing what’s in a photo is a big factor in the site’s recommendations, but it doesn’t offer the whole story. Pinterest also looks at captions from previously pinned content and which items get pinned to the same virtual boards. That allows Pinterest to, say, link a particular dress to the pair of shoes frequently pinned alongside it, even if they look nothing alike.
Pinterest Lens is one product in the company’s effort to identify visual similarities, alongside a host of other engineering work. All of it draws on machine learning algorithms and techniques from the ever-growing field of computer vision, which I’ll explain in depth below.
A Brief History of Computer Vision at Pinterest
Computer vision is a field of computer science and subfield of machine learning that works on enabling computers to see, identify and process images in the same way that human vision does, and then provide the appropriate output. It is like imparting human intelligence and instincts to a computer. Pinterest uses computer vision heavily to power their visual discovery products.
Pinterest set its sights on visual search in 2014. That year, the company acquired VisualGraph, an image-recognition startup, established its computer vision team with a small group of engineers, and began to show its work.
In 2015, it launched visual search, a way to search for ideas without text queries. For the first time, visual search gave people a way to get results even when they can’t find the right words to describe what they’re looking for.
In summer 2016, visual search evolved as Pinterest rolled out object detection, which finds all the objects in a pin’s image in real-time and serves related results. Since then, visual search has become one of its most-used features, with hundreds of millions of visual searches every month, and billions of objects detected.
In early 2017, it introduced three new products built on top of its visual discovery infrastructure:
- Pinterest Lens is a way to use a phone’s camera to discover ideas inspired by what users see in the world around them.
- Shop The Look is a way to shop and buy products users see inside Pins.
- Instant Ideas is a way to transform users’ home feed with similar ideas in just a tap.
Most recently, about two months back, it announced a few more ways for users to find products and ideas:
- Lens Your Look is a new way to find outfit ideas inspired by your wardrobe, and the next big step for Pinterest Lens.
- Responsive Visual Search is a seamless, immersive way to search images by zooming into the objects within a Pin.
- Pinterest Pincodes let you pull out the Pinterest camera and scan any Pincode to see curated ideas on Pinterest inspired by what you’re looking at in the real world.
Let’s dig deeper into the computer vision models that Pinterest engineers employ for their visual discovery work behind Pinterest Lens.
How Pinterest Lens Works
1 — Lens Architecture:
Lens combines Pinterest’s understanding of images and objects with its discovery technologies to offer Pinners a diverse set of results. For example, if you take a picture of a blueberry, Lens doesn’t just return blueberries: it also gives you more results such as recipes for blueberry scones and smoothies, beauty ideas like detox scrubs or tips for growing your own blueberry bush.
To do this, Lens’ overall architecture is separated into two logical components.
- The first component is a query understanding layer, where Pinterest derives information about the given input image. Here Pinterest computes visual features: detecting objects, extracting salient colors, and assessing lighting and image quality. Using these visual features, it also computes semantic features such as annotations and category.
- The second component is Pinterest’s blender, since the results Lens returns come from multiple sources. Pinterest uses visual search technology to return visually similar results; object search technology to return scenes or projects with visually similar objects; and image search, which uses the derived annotations to return personalized text search results that are semantically (not visually) relevant to the input image. It’s the blender’s job to dynamically adjust the blending ratios and result sources based on the information derived in the query understanding layer.
As shown above, Lens results aren’t strictly visually similar; they come from multiple sources, some of which are only semantically relevant to the input image. By giving Pinners results beyond the visually similar, Lens is a new type of visual discovery tool that bridges real-world camera images to the Pinterest taste graph.
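The two-layer flow described above can be sketched in a few lines. This is a hypothetical toy model, not Pinterest’s actual code: all function names, features, and ratios are illustrative stand-ins for the query understanding layer and the blender.

```python
# Toy sketch of Lens's two-layer architecture: query understanding
# derives features from the input image, then a blender mixes results
# from several sources using ratios chosen from those features.
# All names and values here are illustrative, not Pinterest's API.

def understand_query(image):
    """Derive visual and semantic features from the input image (stubbed)."""
    return {
        "objects": ["blueberry"],
        "annotations": ["blueberry", "fruit", "smoothie"],
        "category": "food",
    }

def blend(query_features, sources, ratios):
    """Interleave results from each source according to its blending ratio."""
    results = []
    for name, ratio in ratios.items():
        candidates = sources[name](query_features)
        take = max(1, int(ratio * 10))  # scale ratio to a result count
        results.extend(candidates[:take])
    return results

sources = {
    "visual_search": lambda q: [f"visually similar: {o}" for o in q["objects"]],
    "object_search": lambda q: [f"scene containing: {o}" for o in q["objects"]],
    "image_search": lambda q: [f"text result: {a}" for a in q["annotations"]],
}

features = understand_query("camera_frame.jpg")
# A food query might weight semantic (text) results more heavily --
# which is how a blueberry photo can surface scone recipes.
ratios = {"visual_search": 0.3, "object_search": 0.2, "image_search": 0.5}
print(blend(features, sources, ratios))
```

In a real system the ratios would be set by a learned model conditioned on the query features; the point of the sketch is only the separation between understanding and blending.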
Let’s go ahead and dissect the blender component of Lens, which includes image search, object search, and visual search.
2 — Image Search:
Pinterest’s image search technology dates back to 2015, when the company shared a white paper detailing the system architecture of its scalable machine vision pipeline and insights from the experiments behind it. Pinterest conducted a comprehensive set of experiments using a combination of benchmark datasets and A/B testing on two Pinterest applications: Related Pins and an experiment with similar looks.
In particular, the similar-looks experiment allowed Pinterest to show visually similar Pin recommendations based on specific objects in a Pin’s image. It experimented with different ways to surface object recognition that would let Pinners click into the objects. It then used object recognition to detect products such as bags, shoes, and skirts in a Pin’s image. From these detected objects, it extracted visual features to generate product recommendations (“similar looks”). In the initial experiment, a Pinner would discover recommendations via a red dot on an object in the Pin; clicking the red dot loads a feed of Pins featuring visually similar objects.
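The recommendation step above (detect an object, extract its visual features, find the nearest products) can be illustrated with a small nearest-neighbor sketch. The embeddings below are random stand-ins for real CNN features, and the catalog and function names are made up for illustration.

```python
import numpy as np

# Illustrative sketch of the "similar looks" idea: embed each detected
# object (bag, shoe, skirt) as a feature vector, then recommend products
# whose embeddings lie nearest to it by cosine similarity.
# The random vectors stand in for real learned visual features.

rng = np.random.default_rng(0)
product_catalog = {f"product_{i}": rng.normal(size=128) for i in range(50)}

def recommend_similar(object_embedding, catalog, k=3):
    """Return the k catalog items closest to the detected object's embedding."""
    names = list(catalog)
    matrix = np.stack([catalog[n] for n in names])
    # Cosine similarity between the query object and every product.
    sims = matrix @ object_embedding / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(object_embedding)
    )
    top = np.argsort(sims)[::-1][:k]
    return [names[i] for i in top]

detected_bag = rng.normal(size=128)  # feature vector for a detected handbag
print(recommend_similar(detected_bag, product_catalog))
```

At Pinterest’s scale the brute-force similarity scan would be replaced by an approximate nearest-neighbor index, but the ranking idea is the same.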
3 — Visual Search:
Visual search improved dramatically when Pinterest introduced automatic object detection for the most popular categories on Pinterest in 2016, so people could visually search for products within a Pin’s image.
Since an image can contain dozens of objects, Pinterest’s motivation was to make it as simple as possible to start a discovery experience from any of them. In the same way auto-complete improves the experience of text search, automatic object detection makes visual search a more seamless experience. Object detection in visual search also enables new features, like object-to-object matching. For example, say you spot a coffee table you love, either on Pinterest or at a friend’s house; soon you’ll be able to see how it would look in many different home settings.
Pinterest’s first challenge in building automatic object detection was collecting labeled bounding boxes for regions of interest in images as its training data. Since launch, the system has processed nearly 1 billion image crops (visual searches). By aggregating this activity across the millions of images with the highest engagement, Pinterest learns which objects Pinners are interested in. It aggregates the annotations of the visually similar results for each crop and assigns a weak label from among hundreds of object categories. In one heat-map visualization of this activity, two clusters of user crops form, one around the “scarf” annotation and another around the “bag” annotation.
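The weak-labeling idea above can be sketched as follows: pool the user crops on one image into overlapping clusters, collect the annotations of each crop’s visually similar results, and label each cluster by majority vote. The crop data, box format, and clustering rule here are all invented for illustration.

```python
from collections import Counter

# Hedged sketch of weak labeling from crop activity: each user crop
# carries the annotations of its visually similar results; overlapping
# crops are pooled and each cluster is labeled by majority vote.
# The boxes and annotations below are made-up illustrative data.

# (crop_box as (x1, y1, x2, y2), annotations of similar results)
crops = [
    ((10, 10, 60, 60), ["scarf", "scarf", "shawl"]),
    ((12, 8, 58, 62), ["scarf", "wrap", "scarf"]),
    ((70, 40, 120, 110), ["bag", "handbag", "bag"]),
    ((72, 44, 118, 108), ["bag", "bag", "purse"]),
]

def overlaps(a, b):
    """True if two (x1, y1, x2, y2) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def weak_labels(crops):
    """Cluster overlapping crops, then weak-label each cluster by majority vote."""
    clusters = []
    for box, annotations in crops:
        for cluster in clusters:
            if overlaps(box, cluster["box"]):
                cluster["votes"].update(annotations)
                break
        else:
            clusters.append({"box": box, "votes": Counter(annotations)})
    return [(c["box"], c["votes"].most_common(1)[0][0]) for c in clusters]

print(weak_labels(crops))  # two clusters: one labeled "scarf", one "bag"
```

This mirrors the scarf/bag example in miniature: two spatial clusters of crops emerge, each dominated by one annotation.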
Since Pinterest’s visual search engine can use any image as a query — including unseen content from the web and even your camera — detection must happen in real-time, in a fraction of a second. One of the most widely used detection models Pinterest has experimented with extensively is Faster R-CNN, which uses a deep network to detect objects within images in two major steps.
First, it identifies regions of an image that are likely to contain objects of interest by running a fully convolutional network over the input image to produce a feature map. For each location on the feature map, the network considers a fixed set of regions, varying in size and aspect ratio, and uses a binary softmax classifier to determine how likely each of these regions is to contain an object of interest. If a promising region is found, the network also outputs adjustments to this region so that it better frames the objects.
Once the network has found regions of interest, it examines the most promising ones and attempts to identify each as a particular category of object, or discards it if no object is found. For each candidate region, the network performs spatial pooling over the corresponding portion of a convolutional feature map, producing a feature vector with a fixed size independent of the size of the region. This pooled feature is then used as the input to a detection network, which uses a softmax classifier to identify each region as either background or one of the object categories. If an object is detected, the network once again outputs adjustments to the region boundaries to further refine detection quality. Finally, a round of non-maximum suppression (NMS) is performed over the detections to filter out any duplicates, and the results are presented to the user.
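The final NMS step is simple enough to show in full. Below is a minimal, self-contained sketch of the standard greedy algorithm: keep the highest-scoring box, drop any box that overlaps it beyond an IoU threshold, and repeat. The threshold and example boxes are illustrative choices, not Pinterest’s settings.

```python
# Minimal sketch of greedy non-maximum suppression (NMS), the duplicate
# filtering step at the end of the detection pipeline described above.

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(detections, iou_threshold=0.5):
    """detections: list of (box, score) pairs. Returns the kept detections."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every remaining box that heavily overlaps the kept one.
        remaining = [d for d in remaining if iou(best[0], d[0]) < iou_threshold]
    return kept

detections = [
    ((10, 10, 50, 50), 0.9),      # strong detection
    ((12, 12, 52, 52), 0.75),     # near-duplicate of the first
    ((100, 100, 140, 140), 0.8),  # separate object
]
print(nms(detections))  # the 0.75 near-duplicate is suppressed
```

Production detectors typically use a vectorized implementation of the same algorithm, but the greedy keep-and-suppress logic is identical.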
4 — Object Search:
Traditionally, visual search systems have treated whole images as the unit. These systems index global image representations to return images similar holistically to the given input image. With better image representations as a result of advancements in deep learning, visual search systems have reached an unprecedented level of accuracy. However, Pinterest wanted to push the bounds of visual search technology to go beyond the whole image as the unit. By utilizing its corpus of billions of objects, combined with its real-time object detector, Pinterest can understand images on a more fine-grained level. Now, it knows both the location and the semantic meaning of billions of objects in its image corpus.
Object search is a visual search system that treats objects as the unit. Given an input image, Pinterest finds the most visually similar objects among billions of images in a fraction of a second, maps those objects back to their original images, and returns scenes containing the similar objects.
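The object-to-scene mapping described above can be sketched with a toy index that stores, for each object embedding, the scene image it came from. All embeddings, index entries, and function names below are hypothetical stand-ins for the real system.

```python
import numpy as np

# Hypothetical sketch of object search: index object embeddings alongside
# the scene image each object came from, find the nearest objects to a
# query object, then return the scenes containing them. Random vectors
# stand in for real learned visual features.

rng = np.random.default_rng(1)

# (object_embedding, scene_image_id) pairs -- the object-level index.
object_index = [(rng.normal(size=64), f"scene_{i % 10}") for i in range(100)]

def object_search(query_embedding, index, k=5):
    """Return the scene ids containing the k most similar objects."""
    embeddings = np.stack([e for e, _ in index])
    sims = embeddings @ query_embedding / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    ranked = np.argsort(sims)[::-1][:k]
    scenes = []
    for i in ranked:
        scene = index[i][1]
        if scene not in scenes:  # de-duplicate scenes
            scenes.append(scene)
    return scenes

query = rng.normal(size=64)  # embedding of a detected object, say a coffee table
print(object_search(query, object_index))
```

The key difference from whole-image search is the unit of indexing: matches are found at the object level and only then mapped back to the scenes that contain them.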
The Future of Visual Discovery at Pinterest
In a world where everyone has a camera in her pocket, many experts believe that visual search — taking photos instead of searching via text queries — will become the de facto way we look up information.
Pinterest is sitting on what might be the cleanest, biggest data set in the world for training computers to see images, the equivalent of a small nation hiding a nuclear arsenal: billions of photos of furniture, food, and clothing that have been hand-labeled by Pinterest’s own users for years.
At Pinterest, users come to casually window-shop a better life, starting with remarkably unspecific queries like “dinner ideas” or “fashion” that they might search again and again, week after week. Thanks to both this behavior and the site’s gridded layout of photo Pins, Pinterest can build visual search into its platform not to offer one perfect answer, but an imperfect collection of inspiration.
According to Pinterest’s CEO, Ben Silbermann, the company is doing three things with computer vision. It tries to understand the aesthetic qualities of a product or a service, so it can make better recommendations in general. It wants to be able to look inside an image that has multiple items, zoom in on part of it, and computationally say, “Hey, it’s this type of object. Here’s where you can find something similar.” Then, eventually, it wants to make the camera a tool with which you can query the world around you. Computer vision is the fundamental technology that powers all three of these things.
Pinterest has a fundamental principle guiding its work on computer vision: to help people discover and do things that they love. Vision is a rare, unclaimed space. Sharing text? That quadrant belongs to Facebook and Twitter. Sharing visuals? Facebook, Instagram, and Snapchat. Searching text? That’s Google and Bing. But searching through vision? Pinterest seems to be leading the way.
I hope this was informative and tickled your curiosity as it did mine. For now, I’ll be working my way through my own Pinterest Lens, discovering my new favorite objects, knowing and appreciating all the computer vision that’s going on behind the scenes.