Building Pin cohesion

Pinterest Engineering
Mar 28, 2019

Wenke Zhang, Sunny Chang, Qinglong Zeng, Andrey Gusev
Pinterest Engineering, Content Quality

At Pinterest, we’re building a visual discovery engine where ideas become actionable with links to more information. Once a Pin is saved, its linked content may be updated or expire over time, so it’s critical to know the quality of the linked page behind the Pin to improve the click-through experience. In this post, we’ll discuss how we built Spark- and TensorFlow-based ML pipelines to compute Pin and link page relatedness signals.


Pinterest hosts billions of Pins (images that are visual bookmarks) from Pinners. To keep onsite and offsite content consistent, we reindex web documents incrementally every day and derive signals to measure relevance between each Pin and its linked page. We rely on embeddings of image and text content to measure visual and semantic similarity.

Figure 1. Example of related and unrelated Pin and web page

The system is decomposed into four components:

  • Text cohesion signal that compares a Pin’s salient keywords with the text content of its linked web page
  • Image cohesion signal that detects image similarity between onsite and offsite images
  • Image-text cohesion signal that compares image and text with a visual classifier
  • Blending classifier that merges these signals into a single measure of proximity between a Pin and its landing page content

Figure 2. Pin cohesion signal overview

Text Relatedness

Text relatedness is an important signal of page cohesiveness. Intuitively, we’d like to see a Pin’s web page match its semantics. For example, if we see a Pin under a board named “modern furniture” with the title “white sofa”, we might expect to see it linked to a retail site with home decor products. To measure this similarity, we extract both onsite and offsite text signals and compare them in the space of textual embeddings.

As people save Pins to their boards, we get a sense of the context surrounding Pins thanks to human-curated content such as board names and the Pins’ titles and descriptions. These signals are usually rich enough to generate text labels for each Pin, which serve as the onsite text signal when calculating text relatedness.

For offsite signals, we need to know what the web page is about. While HTML offers many content types, we focus on the page’s title, description, and main body text. The title and description fields usually convey the most important topic of a web page, and extracting the main body gives us more precise details of that topic. After gathering the raw text, we tokenize it and score the tokens with the Okapi BM25 algorithm. Compared to the classical TF-IDF algorithm, BM25 better normalizes for differences in document length and caps the effect of common terms that occur many times. We also maintain a native web page corpus to ensure keyword quality. For example, website-specific phrases like “free shipping” and “privacy policy” appear on many web pages and therefore receive lower IDF scores. With keyword extraction, we can summarize the text content of a link as a list of text labels, which serves as our offsite text signal.
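The BM25 scoring step can be sketched as follows. This is a minimal illustration, not our production pipeline: the corpus, tokenization, and the `k1`/`b` parameter values are made-up assumptions.

```python
import math
from collections import Counter

def bm25_scores(doc_tokens, corpus, k1=1.5, b=0.75):
    """Score each token in doc_tokens with Okapi BM25 against a corpus.

    Boilerplate terms (e.g. "shipping") appear in many documents of the
    corpus, get a low IDF, and therefore score low.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    # Document frequency of each term across the corpus.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    scores = {}
    for term, freq in tf.items():
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        scores[term] = idf * freq * (k1 + 1) / (
            freq + k1 * (1 - b + b * doc_len / avg_len))
    return scores

# Toy corpus: "shipping" is boilerplate, "sofa" is distinctive.
corpus = [["free", "shipping", "sofa", "white"],
          ["free", "shipping", "lamp"],
          ["free", "shipping", "rug"]]
scores = bm25_scores(["white", "sofa", "free", "shipping"], corpus)
```

Taking the top-N tokens by score then yields the text labels that summarize a page.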

We then compare the onsite and offsite text signals in an embedding space, where texts are represented as vectors and semantically similar phrases lie close to each other. By mapping the extracted text signals into this continuous vector space, we can infer text relatedness between a Pin’s information and its link by computing the cosine similarity of the two.
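The comparison itself is a cosine similarity between the two embedding vectors. A minimal sketch, assuming each side has already been mapped to a fixed-size embedding (the vectors below are toy values, not real embeddings):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings for the onsite labels and the offsite keywords.
onsite = [0.8, 0.1, 0.3]
offsite = [0.7, 0.2, 0.4]
relatedness = cosine_similarity(onsite, offsite)
```

A score near 1 suggests the Pin's curated text and the page's extracted keywords describe the same topic.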

Image Relatedness

Under our current model, image relatedness signal consists of two sub-components: image visual similarity and image semantic similarity.

Image visual similarity is a raw signal designed to answer whether we can find an image on a web page that looks similar to the image saved on Pinterest. We tackle this by computing visual similarity between the Pin image and each image from its web page, then taking the maximum of the scores. To achieve high precision and recall, we use a well-trained image near-duplicate detection model to predict the visual similarity score of image pairs. The model is a TensorFlow feed-forward neural network that takes advantage of transfer learning over visual embeddings. More details can be found in this blog.
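The max-over-pairs aggregation can be sketched as below. The `model` argument is a placeholder for the near-duplicate network: here it is just any callable mapping a pair of embeddings to a score, and the scalar "embeddings" in the example are a toy stand-in.

```python
def image_visual_similarity(pin_embedding, page_embeddings, model):
    """Score the Pin image against every page image; keep the best match.

    `model` stands in for the near-duplicate detection network --
    any callable mapping (embedding, embedding) to a score in [0, 1].
    """
    if not page_embeddings:
        return 0.0  # no images on the page: nothing can match
    return max(model(pin_embedding, e) for e in page_embeddings)

# Dummy "model" over scalar embeddings: closer values -> higher score.
dummy_model = lambda u, v: 1.0 - abs(u - v)
best = image_visual_similarity(0.5, [0.1, 0.45, 0.9], dummy_model)
```

Taking the maximum means a single near-duplicate on the page is enough for the signal to fire, which matches the question the signal is meant to answer.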

Figure 3. Example of semantic-related Image Pair
Figure 4. Image Classification: “audrey hepburn style”, “1950s fashion”

However, a failure of visual similarity detection does not necessarily mean the images are not cohesive. For example, the two images in Figure 3 look quite different from each other, but both show the same product (a white 3-tier metal cart) and should be considered a cohesive pair. Therefore, we built another raw signal to capture image semantic similarity by transforming images into text annotations. We trained a vision model that classifies an image into the top 10K search queries with the most traffic in the past year, as shown in Figure 4. The image features are visual embeddings trained using metric learning that optimizes various vision tasks at Pinterest. The search queries are first filtered to remove popular misspellings and trending queries, and are binary encoded for training. The most engaged Pin images are deduplicated and treated as positive samples for each class. The classifier contains two fully-connected layers and is trained to optimize a sigmoid cross-entropy loss. With images transformed into predicted text annotations, we can judge image semantic similarity just as we do text relatedness, comparing onsite and offsite text signals in a textual embedding space.
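The shape of that classifier, two fully-connected layers with a sigmoid cross-entropy loss over independent query classes, can be sketched in plain Python. The production model is a TensorFlow network over real visual embeddings; the tiny dimensions and hand-picked weights here are purely illustrative.

```python
import math

def dense(x, weights, biases):
    """One fully-connected layer: y_i = W_i . x + b_i."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]

def relu(x):
    return [max(0.0, v) for v in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_cross_entropy(logits, labels):
    """Mean sigmoid cross-entropy over independent binary classes."""
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

def classify(embedding, w1, b1, w2, b2):
    """Two FC layers: embedding -> hidden -> per-class probabilities."""
    hidden = relu(dense(embedding, w1, b1))
    return [sigmoid(z) for z in dense(hidden, w2, b2)]

# Toy forward pass: 3-dim embedding -> 4 hidden units -> 2 query classes.
w1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
b1 = [0.0, 0.0, 0.0, 0.0]
w2 = [[1, 0, 0, 0], [0, 1, 0, 0]]
b2 = [0.0, 0.0]
probs = classify([0.5, -0.2, 0.1], w1, b1, w2, b2)
```

Because each class is scored with an independent sigmoid rather than a softmax, an image can legitimately carry several query annotations at once (e.g. both “audrey hepburn style” and “1950s fashion”).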

Image and Text Relatedness

With Pinterest visual search, the image is typically the focal point as users explore content, while text becomes more informative and actionable when a user wants to dig deeper into specific content of interest. It is therefore also important to understand how well image and text are aligned between a Pin and its link page.

We currently use two methods to compare image and text: image classification and optical character recognition (OCR). First, we reuse the classifier discussed earlier to transform an image into text, and compare the output text annotations with the link page’s text. Second, many Pin images contain text that highlights the linked web page, especially high-quality native creator Pins designed for Pinterest; we leverage OCR to extract this text and obtain descriptive information about the Pin. Once an image is mapped to text, we compare image and text similarity using textual embeddings as described in the previous section.

Figure 5. OCR

The final set of raw signals consists of:

  • Text relatedness score to measure if the human-curated keywords of a Pin are relevant to a link’s text content
  • Image relatedness score to measure visual and semantic similarity between Pin and link page images, indicating whether the web page contains images of related themes and styles
  • Image-text relatedness score that measures if a link’s text content is complementary to the Pin image topic
  • Text-image relatedness score to measure if a link’s images are relevant to its salient keywords extracted from onsite text content

With all signals in place, we build a binary classifier on a human labeled gold dataset to decide the overall Pin cohesion.
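As a sketch, the blending step could be a logistic model over the four raw scores. The weights, bias, and threshold below are made-up placeholders; the production classifier's parameters are learned from the human-labeled gold dataset.

```python
import math

def blend_cohesion(text, image, image_text, text_image,
                   weights=(1.2, 1.5, 0.8, 0.6), bias=-1.8):
    """Blend the four raw relatedness scores into a cohesion probability.

    Weights and bias are illustrative placeholders, not learned values.
    """
    raw = (text, image, image_text, text_image)
    z = sum(w * s for w, s in zip(weights, raw)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # logistic squashing to (0, 1)

def is_cohesive(*scores, threshold=0.5):
    """Binary Pin cohesion decision from the blended probability."""
    return blend_cohesion(*scores) >= threshold
```

In practice any binary classifier trained on the gold set would slot in here; the point is only that the four raw signals collapse into a single calibrated score.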


We built the Pin cohesion signal to measure similarity between a Pin and its link page. Since launching in November 2018, Pin cohesion has driven metrics gains in search and engagement. The signal powers search, recommendations, and home feed surfaces as a ranking signal to improve content quality on Pinterest and drive offsite traffic.


Pin cohesion is a collaborative project in Pinterest. Special thanks to our interns and the following members of the team: Peter John Daoud, Renju Liu, Nick DeChant, Yan Sun, Omkar Panhalkar, Grace Chin, Jun Liu, Yang Xiao, Heath Vinicombe, Vincent Bannister, Jacob Hanger, and Zhuoyuan Li for all their contributions on this project.