Understanding Pins through keyword extraction
Heath Vinicombe | Software Engineer, Knowledge
Pinterest is well known for its visual discovery engine, but much of our content is also accompanied by text. As a result, it’s important for us to be able to understand text to ensure we recommend relevant ideas to Pinners. In this blog post, I’ll talk about “annotations,” one of the main signals we use at Pinterest for understanding text.
Overview of annotations
Annotations are short keywords or phrases between one and six words that describe the subject of the Pin.
In addition to its text content, each annotation also has a confidence score and a language associated with it. We extract multiple annotations per Pin across a total of 28 languages.
For the above example Pin, we might extract the following annotations:
- (EN, sloth sanctuary, 0.99)
- (EN, sloths, 0.95)
- (EN, costa rica, 0.90)
- (EN, caribbean, 0.85)
- (EN, animals, 0.80)
- (EN, travel, 0.80)
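As an illustration, each annotation can be pictured as a small record holding a language, a phrase and a confidence score. The sketch below is only one possible representation: the `Annotation` class, its field names and the assumption that scores lie in [0, 1] are ours, not Pinterest's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record for one annotation on a Pin."""
    language: str  # language code, e.g. "EN"
    text: str      # keyword or phrase of one to six words
    score: float   # confidence score, assumed here to lie in [0, 1]

    def __post_init__(self):
        n_words = len(self.text.split())
        if not 1 <= n_words <= 6:
            raise ValueError("an annotation must contain between one and six words")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score is assumed to lie in [0, 1]")


# The example annotations listed above, as records.
pin_annotations = [
    Annotation("EN", "sloth sanctuary", 0.99),
    Annotation("EN", "sloths", 0.95),
    Annotation("EN", "costa rica", 0.90),
    Annotation("EN", "caribbean", 0.85),
    Annotation("EN", "animals", 0.80),
    Annotation("EN", "travel", 0.80),
]
```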
Annotations are a fundamental signal used in a variety of product surfaces at Pinterest, often as features within machine learning models. We’ve seen significant gains in experiment metrics from adding new annotation-based features to models, and annotations are typically among the most important features. Examples of where we use annotations include:
- Ads CTR prediction
- Home feed candidate generation & ranking
- Related Pins candidate generation & ranking
- Search retrieval & ranking
- Board suggestions for new Pins
- Detecting unsafe content
Annotations are stored in the inverted index. When a user performs a search, annotations are used to retrieve Pins with annotations matching the user’s query. The advantages of storing annotations in the inverted index rather than storing all tokens are:
- Annotation scores tend to be more correlated with relevance than TF-IDF alone
- Storing just the annotations uses less space than storing all tokens, which is important when there are over 200 billion Pins to index
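To make the idea concrete, here is a minimal in-memory sketch of an inverted index keyed by annotation text. It is not the production index: the function names, the Pin ids and the exact-match lookup are assumptions made for illustration, and a real system would normalize the query in the same way as the annotations.

```python
from collections import defaultdict


def build_inverted_index(pin_annotations):
    """Map each annotation text to a posting list of (pin_id, score) pairs."""
    index = defaultdict(list)
    for pin_id, annotations in pin_annotations.items():
        for text, score in annotations.items():
            index[text].append((pin_id, score))
    # Keep the highest-scoring Pins first in every posting list.
    for postings in index.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(index)


def search(index, query):
    """Return the ids of Pins whose annotations match the query exactly."""
    return [pin_id for pin_id, _ in index.get(query, [])]


# Hypothetical Pins with their annotation scores.
pins = {
    "pin_a": {"sloths": 0.95, "costa rica": 0.90},
    "pin_b": {"sloths": 0.60, "animals": 0.80},
    "pin_c": {"travel": 0.80},
}
index = build_inverted_index(pins)
results = search(index, "sloths")
```

Only the annotations are indexed, so the index holds a few entries per Pin rather than one entry per token of the Pin's text.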
Related Pins are the list of recommendations you see under “more like this” after tapping on a Pin. Annotations are used to generate some of the features used by the related Pins model. In particular, the annotations for a Pin can be thought of as a sparse vector with indices corresponding to annotation ids and values corresponding to annotation scores. The cosine similarity between the annotation vectors of two Pins is a good measure of the relatedness of the two Pins. In the figure below, the two Maseratis are more similar to one another than they are to the Honda, and this is reflected in the cosine similarity scores.
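The cosine similarity between two sparse annotation vectors can be sketched as follows. The vectors are represented as dictionaries, annotation text stands in for annotation ids to keep the example readable, and all scores are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    # Iterate over the shorter vector when computing the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(score * b[key] for key, score in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a Pin with no annotations is treated as unrelated
    return dot / (norm_a * norm_b)


# Hypothetical annotation vectors for three Pins.
maserati_1 = {"maserati": 0.90, "sports car": 0.80, "cars": 0.70}
maserati_2 = {"maserati": 0.95, "luxury car": 0.60, "cars": 0.70}
honda = {"honda": 0.90, "cars": 0.80}

sim_maseratis = cosine_similarity(maserati_1, maserati_2)
sim_mixed = cosine_similarity(maserati_1, honda)
```

With these made-up scores, the two Maserati Pins share more high-scoring annotations, so `sim_maseratis` comes out larger than `sim_mixed`.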
Content Safety Filter
Pinterest works hard on classifying content that goes against our community guidelines, such as self-injury and pornography. Annotations are one of the signals used by content safety filters to detect unsafe content and prevent our Pinners from encountering it.
The workhorse of the annotation system is a weekly Scalding batch workflow that computes annotations for all Pins. However, the issue with such batch workflows is that there may be a lag of several days before annotations are computed for fresh Pins. To mitigate this, we also have an “Instant Annotator” service that computes annotations for fresh Pins within seconds of their creation and stores them in HBase. Annotation consumers can fall back to these instant annotations if the batch annotations have not yet been computed for a Pin.
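From a consumer's point of view, the fallback might look like the sketch below. Plain dictionaries stand in for the batch output and for the HBase table, and `get_annotations` is a hypothetical helper, not an actual Pinterest API.

```python
def get_annotations(pin_id, batch_store, instant_store):
    """Prefer batch annotations; fall back to instant ones for fresh Pins."""
    batch = batch_store.get(pin_id)
    if batch is not None:
        return batch
    # The Pin is too new for the weekly batch job; use the instant annotations.
    return instant_store.get(pin_id, [])


# Stand-ins for the two stores (hypothetical data).
batch_store = {"old_pin": [("sloths", 0.95)]}
instant_store = {"fresh_pin": [("costa rica", 0.90)]}

old = get_annotations("old_pin", batch_store, instant_store)
fresh = get_annotations("fresh_pin", batch_store, instant_store)
```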
See below for an overview of the various components, which are discussed in more detail in the following sections.
Annotations are limited to a finite vocabulary known internally as the Dictionary. This dictionary is stored in a MySQL database along with additional metadata. A UI makes it easy to view dictionary terms, add new terms, delete terms and view change logs.
The advantage of using such a dictionary over allowing annotations to be arbitrary ngrams is that it guarantees the annotations will be valid and useful phrases instead of misspellings (e.g., “recipies”), stopwords (e.g., “the”), fragments (e.g., “of liberty”) and generic phrases (e.g., “ideas”, “things”). Furthermore, the dictionary is a convenient place to store additional metadata such as translation and knowledge graph relations. This dictionary is used by many teams at Pinterest and not just for annotations.
The dictionary started with popular topics that were manually entered by users, but it has grown to include additional sources of terms such as search queries and hashtags. A significant amount of human curation has gone into building the dictionary to ensure its quality is maintained, and we periodically use heuristics to trim out bad terms and a spell checker to remove misspellings. We have around 100,000 terms in the dictionary for each language.
The first step in computing annotations for a Pin is to extract potential candidates from a variety of text sources such as:
- Pin title, description and URL
- Board name and description
- Page title and description of the link
- Search queries that frequently lead to clicks on the Pin
- Names of objects detected in the image using a visual classifier
The following steps are used to extract candidates:
- A text language detector determines the language of the text.
- The text is tokenized into words with a tokenizer according to language.
- A sliding window is used to generate all ngrams containing between 1 and 6 words.
- The ngrams are normalized by stripping out accents and punctuation and then stemming or lemmatizing depending on the language.
- Ngrams are matched against the annotations dictionary.
- The extracted annotations are canonicalized to reduce duplication (e.g., “sloth” is canonicalized to “sloths” since it is not useful to have both of these annotations on a Pin). Canonical mappings are stored in the dictionary.
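The steps above can be sketched in a few lines of Python. This is a simplified illustration rather than the production pipeline: language detection is skipped (English is assumed), a whitespace split stands in for the language-specific tokenizer, stemming and lemmatization are omitted, and `DICTIONARY` is a tiny hypothetical dictionary that maps normalized terms to their canonical annotation.

```python
import re
import unicodedata

MAX_NGRAM_WORDS = 6

# Hypothetical dictionary: normalized term -> canonical annotation.
DICTIONARY = {
    "sloth": "sloths",
    "sloths": "sloths",
    "sloth sanctuary": "sloth sanctuary",
    "costa rica": "costa rica",
    "world": "world",
}


def tokenize(text):
    """Whitespace tokenizer; a stand-in for a language-specific one."""
    return text.split()


def ngrams(tokens, max_words=MAX_NGRAM_WORDS):
    """Slide a window over the tokens, yielding every ngram of 1..max_words words."""
    for size in range(1, max_words + 1):
        for start in range(len(tokens) - size + 1):
            yield " ".join(tokens[start:start + size])


def normalize(ngram):
    """Lowercase, then strip accents and punctuation (no stemming here)."""
    decomposed = unicodedata.normalize("NFKD", ngram.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punctuation = re.sub(r"[^\w\s]", "", no_accents)
    return " ".join(no_punctuation.split())


def extract_candidates(text):
    """Return the canonical dictionary annotations found in the text."""
    candidates = set()
    for ngram in ngrams(tokenize(text)):
        canonical = DICTIONARY.get(normalize(ngram))
        if canonical is not None:
            candidates.add(canonical)
    return candidates


description = "The Sloth Sanctuary in Costa Rica is the only sloth sanctuary in the world."
candidates = extract_candidates(description)
```

Storing the canonical form as the dictionary value means matching and canonicalization happen in a single lookup, and collecting results in a set removes duplicates such as “sloth” and “sloths”.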
Features are extracted for each annotation candidate to be later used for scoring.
Pin — Annotation features:
- Embedding similarity — cosine similarity between Pin embedding and annotation embedding
- Source — some text sources tend to yield higher quality annotations than others, and annotations that were extracted from multiple sources (e.g., both Pin title and board title) tend to be better than annotations that were only present in a single source (e.g., just board title)
- Category Entropy — annotations that are popular across multiple categories tend to be more generic and less useful
- Search frequency
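As an example of one of these features, category entropy can be computed as the Shannon entropy of an annotation's distribution over categories. The exact definition used in production is not given here, so the sketch below assumes per-category occurrence counts as input; the category names and counts are made up.

```python
import math


def category_entropy(category_counts):
    """Shannon entropy (in bits) of an annotation's category distribution."""
    total = sum(category_counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in category_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts of Pins per category carrying each annotation.
generic_term = {"travel": 250, "food": 250, "fashion": 250, "diy": 250}
specific_term = {"animals": 990, "travel": 10}

generic_entropy = category_entropy(generic_term)
specific_entropy = category_entropy(specific_term)
```

An annotation spread evenly over many categories gets a high entropy, while one concentrated in a single category gets an entropy close to zero.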
We found our model performed better when we normalized our features such that the value distribution was similar across language and Pin popularity (i.e., number of repins).
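The normalization method is not spelled out above. One plausible approach, shown purely as an assumption, is to standardize each feature within groups of rows that share a language and a popularity bucket. The bucketing by number of digits in the repin count and the row layout are both hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def popularity_bucket(repins):
    """Bucket Pins by order of magnitude of repins (hypothetical scheme)."""
    return len(str(int(repins)))


def normalize_feature(rows, feature):
    """Z-score a feature within each (language, popularity bucket) group."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["language"], popularity_bucket(row["repins"]))
        groups[key].append(row[feature])
    stats = {key: (mean(values), pstdev(values)) for key, values in groups.items()}

    normalized = []
    for row in rows:
        mu, sigma = stats[(row["language"], popularity_bucket(row["repins"]))]
        # A constant feature within a group carries no information.
        normalized.append((row[feature] - mu) / sigma if sigma > 0 else 0.0)
    return normalized


# Made-up rows: raw search frequency differs by an order of magnitude per language.
rows = [
    {"language": "EN", "repins": 5, "search_frequency": 100.0},
    {"language": "EN", "repins": 7, "search_frequency": 300.0},
    {"language": "FR", "repins": 3, "search_frequency": 10.0},
    {"language": "FR", "repins": 8, "search_frequency": 30.0},
]
normalized = normalize_feature(rows, "search_frequency")
```

After normalization, the English and French rows have the same distribution even though their raw values differ by a factor of ten.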
Not all annotations we extract as candidates are relevant to the Pin. For example, take the following Pin description:
“The Sloth Sanctuary in Costa Rica is the only sloth sanctuary in the world. Click to read more about my journey there + see pics of baby sloths!”
From that description, we extract annotations such as “world”, “journey” and “read” that are not relevant to the Pin and do not make good keywords. The purpose of our model then is to score annotations so that we can filter out irrelevant ones and only keep the most useful ones.
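Once each candidate has a score, filtering can be as simple as the sketch below. The threshold value, the optional cap on the number of annotations and the scores themselves are hypothetical and chosen only to illustrate the idea.

```python
def filter_annotations(scored, threshold=0.5, max_annotations=None):
    """Keep annotations scoring at or above the threshold, best first."""
    kept = [(text, score) for text, score in scored.items() if score >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    if max_annotations is not None:
        kept = kept[:max_annotations]
    return kept


# Made-up model scores for candidates extracted from the description above.
scored_candidates = {
    "sloth sanctuary": 0.99,
    "costa rica": 0.90,
    "world": 0.12,
    "journey": 0.08,
    "read": 0.03,
}
kept = filter_annotations(scored_candidates)
```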
Training labels are obtained through crowdsourcing: for a given (Pin, annotation) pair, judges are asked whether the annotation is relevant to the Pin. Around 150,000 labels per language are used.
Initially, we started with a logistic regression model that predicts the probability that an annotation is relevant to the Pin. This yielded decent results and performed much better than previous versions of annotations that did not use a model. Later, we migrated to a Gradient Boosted Decision Tree model trained with XGBoost. Switching to this model gave us a 4% absolute improvement in precision and simplified our feature engineering, since we could remove all monotonic feature transforms and no longer needed to impute values for missing features.
High-quality keywords can be built into incredibly useful signals with a variety of applications across recommendations, retrieval and ranking. Pinterest has seen many engagement and relevance wins from adopting such signals.
Acknowledgments: Thanks to Anant Subramanian, Arun Prasad, Attila Dobi, Heath Vinicombe, Jennifer Zhao, Miwa Takaki, Troy Ma and Yunsong Guo for their contributions to this project.