Introducing automatic object detection to visual search

Dmitry Kislyuk | Pinterest engineer, Visual Search

When we launched visual search last year, we gave a first look at what’s possible when you use images as search queries. Now, more than 130 million visual searches are done every month, as people search for the objects, styles and colors they see in Pins and get related recommendations. It’s a whole new kind of search, and a technological challenge.

Today, we’re introducing automatic object detection for the most popular categories on Pinterest, so people can visually search for products within a Pin’s image. As we look to the future of visual search, we’re also starting to preview new camera search technology that’ll give Pinners recommendations for the products they find in the real world. Pinners will soon be able to snap a photo of a single object, like sneakers, and get recommendations on Pinterest, or even take a photo of an entire room and get results for multiple items.

Deep learning at Pinterest

Visual search is one of the many fields transformed in recent years by advances in deep learning. Convolutional neural networks represent images and videos as feature vectors that preserve both semantic concepts and visual information, and these vectors allow for fast retrieval when combined with optimized nearest neighbor techniques. We leveraged this idea, along with our richly annotated image dataset, last November when we released a visual search product that makes searching inside a Pin’s image as simple as dragging a cropper. For our initial launch, we extracted features from the fully-connected-6 layer of a fine-tuned VGG model over a billion Pinterest images and indexed them into a distributed service, as described in our KDD paper.
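
To make the retrieval idea concrete, here's a minimal sketch of nearest neighbor search over feature vectors. It is not our production system: it uses a brute-force scan with cosine similarity rather than an optimized index, and the function names, the toy three-dimensional vectors and the `pin_*` ids are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, index, k=2):
    """Return the ids of the k indexed vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), image_id)
              for image_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:k]]


# Hypothetical toy embeddings; real CNN features have thousands of dimensions.
index = {
    "pin_a": [0.9, 0.1, 0.0],
    "pin_b": [0.0, 1.0, 0.2],
    "pin_c": [0.8, 0.3, 0.1],
}
query = [1.0, 0.2, 0.0]
print(nearest_neighbors(query, index, k=2))  # ['pin_a', 'pin_c']
```

A linear scan like this costs one comparison per indexed image, which is why a billion-image index needs approximate or distributed nearest neighbor techniques instead.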

Since an image can contain dozens of objects, we wanted to make it as simple as possible to start a discovery experience from any of them. In the same way auto-complete improves the experience of text search, automatic object detection makes visual search a more seamless experience. Object detection in visual search also enables new features, like object-to-object matching. For example, say you spot a coffee table you love, either on Pinterest or at a friend’s house. Soon you’ll be able to see how it would look in many different home settings.

Building automatic object detection

Our first challenge in building automatic object detection was collecting labeled bounding boxes for regions of interest in images as our training data. Since launch, we’ve processed nearly 1 billion image crops (visual searches). By aggregating this activity across the millions of images with the highest engagement, we learn which objects Pinners are interested in. For each crop, we aggregate the annotations of its visually similar results and assign a weak label drawn from hundreds of object categories. An example of how this looks is shown in the heatmap visualization below, where two clusters of user crops are formed, one around the “scarf” annotation, and another around the “bag” annotation.
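
The weak-labeling step can be sketched roughly as a vote over the annotations attached to a crop's visually similar results. This is only an illustration under assumptions: the `weak_label` function, the `min_share` threshold and the sample annotations are hypothetical, and the actual aggregation may weigh results differently.

```python
from collections import Counter


def weak_label(neighbor_annotations, min_share=0.5):
    """Pick the dominant annotation among a crop's visually similar results.

    neighbor_annotations: one list of annotation strings per similar result.
    Returns None when no annotation appears on at least min_share of results.
    """
    if not neighbor_annotations:
        return None
    counts = Counter()
    for annotations in neighbor_annotations:
        counts.update(set(annotations))  # count each annotation once per result
    label, votes = counts.most_common(1)[0]
    if votes / len(neighbor_annotations) < min_share:
        return None
    return label


# Hypothetical annotations for the results similar to one user crop.
crop_neighbors = [["scarf", "fashion"], ["scarf"], ["bag"], ["scarf", "winter"]]
print(weak_label(crop_neighbors))  # scarf
```

Crops whose neighbors disagree get no label at all, which trades some training data for cleaner labels.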

Since our visual search engine can use any image as a query, including unseen content from the web and even your camera, detection must happen in real time, in a fraction of a second. One of the most widely used detection models, and one we’ve experimented with extensively, is Faster R-CNN, which uses a deep network to detect objects within images in two major steps. First, it identifies regions of an image that are likely to contain objects of interest by running a fully convolutional network over the input image to produce a feature map. For each location on the feature map, the network considers a fixed set of regions, varying in size and aspect ratio, and uses a binary softmax classifier to determine how likely each of these regions is to contain an object of interest. If a promising region is found, the network also outputs adjustments to this region so that it better frames the object.
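
As a rough sketch of the region proposal step described above, the snippet below builds the fixed set of candidate regions for one feature-map location and applies a predicted adjustment to a region. The scales and aspect ratios are the defaults from the Faster R-CNN paper, not necessarily the values we use, and the function names are hypothetical. The adjustment uses the standard R-CNN box parameterization (center shifts scaled by box size, log-space width and height changes).

```python
import math


def make_anchors(center_x, center_y,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) boxes centered on one feature-map location.

    Each box has area scale**2 and a width/height ratio of `ratio`.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale * math.sqrt(ratio)
            height = scale / math.sqrt(ratio)
            anchors.append((
                center_x - width / 2, center_y - height / 2,
                center_x + width / 2, center_y + height / 2,
            ))
    return anchors


def apply_adjustment(box, deltas):
    """Shift and resize a box using predicted (dx, dy, dw, dh) deltas."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    width, height = x2 - x1, y2 - y1
    center_x = x1 + width / 2 + dx * width    # shift is relative to box size
    center_y = y1 + height / 2 + dy * height
    new_width = width * math.exp(dw)          # size change is in log space
    new_height = height * math.exp(dh)
    return (center_x - new_width / 2, center_y - new_height / 2,
            center_x + new_width / 2, center_y + new_height / 2)


anchors = make_anchors(300, 200)
print(len(anchors))  # 9 candidate regions per location
```

With three scales and three aspect ratios, every location contributes nine candidates, and the classifier scores each one before any adjustment is applied.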

Once the network has found regions of interest, it examines the most promising ones and either identifies each as a particular category of object or discards it if no objects are found. For each candidate region, the network performs spatial pooling over the corresponding portion of a convolutional feature map, thereby producing a feature vector with a fixed size independent of the size of the region. This pooled feature is then used as the input to a detection network, which uses a softmax classifier to identify each region as either background or one of our object categories. If an object is detected, the network once again outputs adjustments to the region boundaries to further refine detection quality. Finally, a round of non-maximum suppression (NMS) is performed over the detections to filter out any duplicates, and the results are presented to the user.
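
The final NMS step can be illustrated with a short greedy implementation. This is a generic sketch rather than our deployed code: the 0.5 overlap threshold, the choice to suppress only within the same category, and the sample detections are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy per-category NMS over (box, score, label) detections.

    Visits detections from highest to lowest score and keeps one only if it
    does not overlap an already kept detection of the same label too much.
    """
    kept = []
    for detection in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, label = detection
        if all(label != other[2] or iou(box, other[0]) <= iou_threshold
               for other in kept):
            kept.append(detection)
    return kept


# Hypothetical detections: two near-identical "bag" boxes and one "scarf".
detections = [
    ((10, 10, 110, 110), 0.9, "bag"),
    ((12, 12, 112, 112), 0.8, "bag"),
    ((200, 50, 260, 150), 0.7, "scarf"),
]
for box, score, label in non_max_suppression(detections):
    print(label, score, box)
```

Here the second "bag" box overlaps the first almost entirely, so only the higher-scoring one survives alongside the "scarf" detection.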

One of the key tricks that enables high-speed detection with Faster R-CNN is that the convolutional features used in the region proposer and in the detection network are one and the same. A significant portion of the network latency is spent producing this intermediate convolutional feature map, and by sharing it between the two network components, we avoid computing it twice. This enables us to identify objects in a fraction of a second.

Last year, we deployed our own implementation of this model to compute targeted visual similarity features in Related Pins, one of our recommendations products, which resulted in a 4 percent increase in engagement, as detailed in our technical report.

Since then, we’ve worked on improving both the accuracy and efficiency of this model by applying the recently published advances in deep residual networks (ResNets). Although the resulting network consists of more than 100 convolutional layers, we’ve focused on reducing its GPU memory footprint so that it’s suitable for deployment on AWS, while keeping latencies under 300ms.

With real-time object detection for any image from anywhere, whether on Pinterest, on the web or in the real world, visual search on Pinterest becomes even better. Object detection will roll out to all Pinners and platforms in the coming weeks.

The future of visual search

We’re also building technology that will help people get recommendations on Pinterest for products they discover in the real world, by simply taking a photo. This will enable a new kind of visual search experience, combining image retrieval, object detection and the power of our interest graph. Stay tuned for more information on camera search technology.

Visual Search is a collaborative effort at Pinterest, and we’d like to thank Kelei Xu, Vishwa Patel, Andrew Zhai, Shirley Du, Zhiyuan Zhang, Michelle Vu, Michael Feng and Kevin Jing, along with Eric Tzeng, Jeff Donahue, and Trevor Darrell from Berkeley Vision and Learning Center (BVLC). Additionally, we’d like to thank Mike Repass, Naveen Gavini and Albert Pereta for making the product launch possible.