Understanding Pins through keyword extraction

Heath Vinicombe | Software Engineer, Knowledge

Pinterest is well known for its visual discovery engine, but much of our content is also accompanied by text. As a result, it’s important for us to be able to understand text to ensure we recommend relevant ideas to Pinners. In this blog post, I’ll talk about “annotations,” one of the main signals we use at Pinterest for understanding text.

Overview of annotations

Annotations are short keywords or phrases, one to six words long, that describe the subject of a Pin.

In addition to its text content, each annotation also has a confidence score and a language associated with it. We extract multiple annotations per Pin across a total of 28 languages.

[Image: example Pin about the Sloth Sanctuary in Costa Rica]

For the above example Pin, we might extract the following annotations:

  • (EN, sloth sanctuary, 0.99)
  • (EN, sloths, 0.95)
  • (EN, costa rica, 0.90)
  • (EN, caribbean, 0.85)
  • (EN, animals, 0.80)
  • (EN, travel, 0.80)
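
Concretely, each annotation can be modeled as a (language, text, score) tuple. A minimal sketch of that representation follows; the Annotation type and field names are illustrative, not Pinterest's internal schema:

    from typing import NamedTuple

    class Annotation(NamedTuple):
        """One extracted annotation: language code, keyword text, confidence score."""
        lang: str
        text: str
        score: float

    pin_annotations = [
        Annotation("EN", "sloth sanctuary", 0.99),
        Annotation("EN", "sloths", 0.95),
        Annotation("EN", "costa rica", 0.90),
        Annotation("EN", "caribbean", 0.85),
        Annotation("EN", "animals", 0.80),
        Annotation("EN", "travel", 0.80),
    ]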

Using annotations

Annotations are a fundamental signal used in a variety of product surfaces at Pinterest, often as features within machine learning models. We’ve seen significant experiment metric gains from adding new annotation-based features to models, and annotations are typically among the most important features. Examples of where we use annotations include:

  • Ads CTR prediction
  • Home feed candidate generation & ranking
  • Related Pins candidate generation & ranking
  • Search retrieval & ranking
  • Board suggestions for new Pins
  • Detecting unsafe content

Case studies

Annotations are stored in the inverted index. When a user performs a search, we retrieve Pins whose annotations match the user’s query (a sketch of this matching follows the list below). The advantages of storing annotations in the inverted index rather than storing all tokens are:

  • Annotation scores tend to be more correlated with relevance than just TF-IDF alone
  • Storing just the annotations uses less space than storing all tokens, which is important when there are over 200 billion Pins to index
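
A minimal sketch of annotation-based retrieval, assuming a simple in-memory posting-list layout; the real index is a full search system, and the structures and scoring here are illustrative:

    from collections import defaultdict

    # Posting lists: annotation term -> list of (pin_id, annotation_score).
    inverted_index = defaultdict(list)

    def index_pin(pin_id, annotations):
        """Add a Pin's annotations to the index."""
        for term, score in annotations:
            inverted_index[term].append((pin_id, score))

    def search(query_terms):
        """Score each candidate Pin by the sum of its matching annotation scores."""
        pin_scores = defaultdict(float)
        for term in query_terms:
            for pin_id, score in inverted_index.get(term, []):
                pin_scores[pin_id] += score
        return sorted(pin_scores.items(), key=lambda kv: kv[1], reverse=True)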

Related Pins are the list of recommendations you see under “more like this” after tapping on a Pin. Annotations are used to generate some of the features used by the related Pins model. In particular, the annotations for a Pin can be thought of as a sparse vector with indices corresponding to annotation ids and values corresponding to annotation scores. The cosine similarity between the annotation vectors of two Pins is a good measure of the relatedness of the two Pins. In the figure below, the two Maseratis are more similar to one another than they are to the Honda, and this is reflected in the cosine similarity scores.

[Image: annotation vectors and cosine similarity scores for two Maserati Pins and a Honda Pin]
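
A minimal sketch of that similarity computation, with each Pin's annotation vector represented as a dict keyed by annotation id (the sparse-vector view is from the post; the function itself is illustrative):

    import math

    def annotation_cosine(a, b):
        """Cosine similarity between two sparse annotation vectors,
        keyed by annotation id with confidence scores as values."""
        dot = sum(score * b.get(ann_id, 0.0) for ann_id, score in a.items())
        norm_a = math.sqrt(sum(s * s for s in a.values()))
        norm_b = math.sqrt(sum(s * s for s in b.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)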

Pinterest works hard on classifying content that goes against our community guidelines, such as self-injury and pornography. Annotations are one of the signals used by content safety filters to detect unsafe content and prevent our Pinners from encountering it.

System Overview

The workhorse of the annotation system is a weekly Scalding batch workflow that computes annotations for all Pins. However, the issue with such batch workflows is that there may be a multiple-day lag before annotations are computed for fresh Pins. To mitigate this, we also have an “Instant Annotator” service that computes annotations for fresh Pins within seconds of their creation and stores them in HBase. Annotation consumers can fall back to these instant annotations if the batch annotations have not yet been computed for a Pin.
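
The consumer-side fallback might look like the following sketch; the dict-backed stores are illustrative stand-ins (the real instant store is an HBase table):

    # Illustrative stand-ins for the real serving stores.
    batch_store = {}    # weekly Scalding batch output
    instant_store = {}  # HBase table written by the Instant Annotator

    def get_annotations(pin_id):
        """Prefer batch annotations; fall back to instant ones for fresh Pins."""
        annotations = batch_store.get(pin_id)
        if annotations is None:
            annotations = instant_store.get(pin_id)
        return annotations or []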

See below for an overview of the various components, which are discussed in more detail in the following sections.

[Diagram: overview of the annotation system components]

Annotations dictionary


Annotations are limited to a finite vocabulary known internally as the Dictionary. This dictionary is stored in a MySQL database along with additional metadata. A UI makes it easy to view dictionary terms, add new terms, delete terms and view change logs.

The advantage of using such a dictionary over allowing annotations to be arbitrary ngrams is that it guarantees the annotations will be valid and useful phrases instead of misspellings (e.g., “recipies”), stopwords (e.g., “the”), fragments (e.g., “of liberty”) and generic phrases (e.g., “ideas”, “things”). Furthermore, the dictionary is a convenient place to store additional metadata such as translation and knowledge graph relations. This dictionary is used by many teams at Pinterest and not just for annotations.

The dictionary initially started with popular topics that were manually entered by users, but it has grown to include additional sources of terms such as search queries, hashtags, etc. A significant amount of human curation has gone into building the dictionary to ensure its quality is maintained, and we periodically use heuristics to trim out bad terms and use a spell checker to remove misspellings. We have around 100,000 terms in the dictionary for each language.

Candidate Extraction

The first step in computing annotations for a Pin is to extract potential candidates from a variety of text sources such as:

  • Pin title, description, url
  • Board name and description
  • Page title and description of the link
  • Search queries that frequently lead to clicks on the Pin
  • Names of objects detected in the image using a visual classifier

The following steps are used to extract candidates (a sketch of the pipeline follows the list):

  1. A text language detector determines the language of the text.
  2. The text is tokenized into words with a language-specific tokenizer.
  3. A sliding window is used to generate all ngrams containing between 1 and 6 words.
  4. The ngrams are normalized by stripping out accents and punctuation and then stemming or lemmatizing depending on the language.
  5. Ngrams are matched against the annotations dictionary.
  6. The extracted annotations are canonicalized to reduce duplication (e.g., “sloth” is canonicalized to “sloths” since it is not useful to have both of these annotations on a Pin). Canonical mappings are stored in the dictionary.
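
A minimal sketch of steps 2 through 6, assuming a naive word-boundary tokenizer and skipping the language-specific normalization of step 4; the dictionary and canonical mappings are passed in, and all names are illustrative:

    import re

    MAX_NGRAM = 6

    def extract_candidates(text, dictionary, canonical):
        """Tokenize, slide an ngram window, match against the dictionary,
        and canonicalize the matches."""
        tokens = re.findall(r"\w+", text.lower())                # step 2 (naive tokenizer)
        candidates = set()
        for n in range(1, MAX_NGRAM + 1):                        # step 3: sliding window
            for i in range(len(tokens) - n + 1):
                ngram = " ".join(tokens[i:i + n])                # step 4's stemming omitted
                if ngram in dictionary:                          # step 5: dictionary match
                    candidates.add(canonical.get(ngram, ngram))  # step 6: canonicalize
        return candidates

    # Example:
    # extract_candidates("Baby sloth at the Sloth Sanctuary in Costa Rica",
    #                    dictionary={"sloth", "sloths", "sloth sanctuary", "costa rica"},
    #                    canonical={"sloth": "sloths"})
    # -> {"sloths", "sloth sanctuary", "costa rica"}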

Features

Features are extracted for each annotation candidate to be later used for scoring.

Pin — Annotation features:

  • TF-IDF (see the sketch after this list)
  • Embedding similarity — cosine similarity between Pin embedding and annotation embedding
  • Source — some text sources tend to yield higher quality annotations than others, and annotations that were extracted from multiple sources (e.g., both Pin title and board title) tend to be better than annotations that were only present in a single source (e.g., just board title)
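
For reference, the core TF-IDF quantity looks like the following sketch; the production feature likely differs in weighting and smoothing, so this is just the textbook form:

    import math

    def tf_idf(term_count, doc_length, num_docs, doc_frequency):
        """Textbook TF-IDF for a candidate annotation within a Pin's text."""
        tf = term_count / max(doc_length, 1)
        idf = math.log(num_docs / (1 + doc_frequency))
        return tf * idf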

Annotation features:

  • IDF
  • Category Entropy — annotations that are popular across multiple categories tend to be more generic and less useful (see the sketch after this list)
  • Search frequency
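
Category entropy can be computed as the Shannon entropy of an annotation's distribution over Pin categories; a minimal sketch, with the function and inputs illustrative:

    import math

    def category_entropy(category_counts):
        """Shannon entropy of an annotation's distribution over Pin categories.
        High entropy means the annotation appears evenly across many categories
        (generic); low entropy means it is category-specific."""
        total = sum(category_counts.values())
        if total == 0:
            return 0.0
        entropy = 0.0
        for count in category_counts.values():
            p = count / total
            if p > 0:
                entropy -= p * math.log2(p)
        return entropy

    # "ideas" spread evenly over many categories scores high (generic), while
    # "sloth sanctuary" concentrated in animals/travel scores low (specific).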

We found our model performed better when we normalized our features such that the value distribution was similar across language and Pin popularity (i.e., number of repins).
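
The post doesn't specify the exact normalization scheme; one simple way to make a feature's distribution comparable across groups is rank normalization within each (language, popularity bucket) group, sketched below:

    import numpy as np

    def rank_normalize(values):
        """Map feature values to [0, 1] by their rank within one group
        (e.g., all candidates sharing a language and repin-count bucket)."""
        ranks = values.argsort().argsort()
        return ranks / max(len(values) - 1, 1)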

Model

Not all annotations we extract as candidates are relevant to the Pin. For example, take the following Pin description:

“The Sloth Sanctuary in Costa Rica is the only sloth sanctuary in the world. Click to read more about my journey there + see pics of baby sloths!”

From that description, we extract annotations such as “world”, “journey” and “read” that are not relevant to the Pin and do not make good keywords. The purpose of our model, then, is to score annotations so that we can filter out irrelevant ones and keep only the most useful.

Training labels are obtained through crowdsourcing: judges are asked to label, for a given (Pin, annotation) pair, whether the annotation is relevant to the Pin. Around 150,000 labels per language are used.

Initially, we started with a logistic regression model that predicts the probability that an annotation is relevant to the Pin. This yielded decent results and performed much better than previous versions of annotations that did not use a model. Later, we migrated to a Gradient Boosted Decision Tree model trained with XGBoost. Switching to this model gave us a 4% absolute improvement in precision and simplified our feature engineering, since we could remove all monotonic feature transforms and no longer needed to impute values for missing features.
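
A minimal sketch of training such a relevance classifier with XGBoost; the feature matrix, labels, and hyperparameters below are placeholders, not Pinterest's actual configuration:

    import numpy as np
    import xgboost as xgb

    # One row per (Pin, annotation) pair; columns are the features described
    # above. Labels are the crowdsourced relevance judgments (0 or 1).
    X_train = np.random.rand(1000, 6)          # placeholder feature matrix
    y_train = np.random.randint(0, 2, 1000)    # placeholder relevance labels

    model = xgb.XGBClassifier(
        n_estimators=200,       # illustrative hyperparameters
        max_depth=6,
        learning_rate=0.1,
        objective="binary:logistic",
    )
    model.fit(X_train, y_train)

    # Predicted probability that each annotation is relevant to its Pin;
    # low-scoring candidates are filtered out.
    relevance = model.predict_proba(X_train)[:, 1]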

Conclusion

High-quality keywords can be turned into incredibly useful signals with a variety of applications across recommendations, retrieval, and ranking. Pinterest has seen many engagement and relevance wins through adopting such signals.

Acknowledgments: Thanks to Anant Subramanian, Arun Prasad, Attila Dobi, Heath Vinicombe, Jennifer Zhao, Miwa Takaki, Troy Ma and Yunsong Guo for their contributions to this project.
