Pin2Interest: A scalable system for content classification

Pinterest Engineering
Aug 16, 2019


Eileen Li | Software Engineer, Knowledge

Content understanding is at the core of any great recommendation system. By knowing what each Pin is about, Pinterest can connect Pinners with the most relevant and inspiring content that best matches their interests.

Taxonomy of interests

The Interest Taxonomy captures popular concepts that appear within Pinterest and organizes them into a structured hierarchy. This highly curated taxonomy tree currently has 10 levels of granularity, with 24 top-level concepts such as “women’s fashion” and “DIY crafts” and tens of thousands of total interests. The structure and vocabulary will continue to expand. To give a feel for the types of interests we have, here is a sample from our “Home Decor” vertical:

The Interest Taxonomy is the centralized way by which we classify content, including Pins, boards, Pinners, and search queries.

Mapping Pins to interests

We built Pin2Interest (P2I) to map our corpus of 200B+ Pins into a dynamic taxonomy. P2I provides high quality, human-understandable labels for Pins that belong to our rich taxonomy of interests.

For the Pin above, we get P2I results, in format (taxonomy_level, label, score):

  • (3, dogs, 0.83)
  • (2, mammals, 0.7)
  • (1, animals, 0.82)
  • (3, skiing, 0.83)
  • (3, snowboarding, 0.79)
  • (2, winter sports, 0.75)
  • (3, ski trips, 0.76)
  • (2, travel ideas, 0.65)
  • (1, travel, 0.71)

In the taxonomy, dogs is a child of mammals, which in turn is a child of animals, while skiing and snowboarding are children of winter sports.
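The hierarchy described above can be modeled as a simple parent-pointer table. Here is a minimal Python sketch; the interest names come from the example results, but the data structure itself is illustrative, not Pinterest’s actual representation:

```python
# Minimal sketch of the taxonomy as parent pointers (illustrative only).
PARENT = {
    "dogs": "mammals",
    "mammals": "animals",
    "skiing": "winter sports",
    "snowboarding": "winter sports",
}

def level(interest: str) -> int:
    """Depth in the taxonomy: top-level concepts like 'animals' are level 1."""
    depth = 1
    while interest in PARENT:
        interest = PARENT[interest]
        depth += 1
    return depth

def ancestors(interest: str) -> list:
    """All parents up to the root, nearest first."""
    chain = []
    while interest in PARENT:
        interest = PARENT[interest]
        chain.append(interest)
    return chain
```

With this table, `level("dogs")` is 3 and `ancestors("dogs")` is `["mammals", "animals"]`, matching the (taxonomy_level, label) pairs in the example output.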

Case studies

P2I is used extensively across different systems at Pinterest. The results from P2I can be used to identify and remove unsafe content, generate personalized recommendations, and create ranking features for machine learning models. P2I results may also be directly exposed to Pinners.

Below are some examples of P2I signal consumers:

  • User to Interest (U2I) — mapping of users to interests
  • Query to Interest (Q2I) — mapping of search queries to interests
  • Home feed ranking — recommendations on your home feed
  • Search ranking & retrieval — recommendations from search results
  • Ads interest targeting & retrieval — product for showing promoted content

Using P2I as ranking features

P2I is the foundation on top of which other X2I systems are built. The result is a common classification space in which relationships can be drawn among users, queries, boards, Pins, and more.

We create ranking features in our search system from the similarity between the interests of search queries and Pins, and we do something similar on our home feed between users and Pins. We also take advantage of the Interest Taxonomy’s structure, creating separate features for each level of the taxonomy to capture interest matches from broad to granular.


  • Feature1 — cos_sim_l1_interests = sum of the products of the Pin’s and query’s matching L1 interest scores
  • Feature2 — cos_sim_l2_interests = sum of the products of the Pin’s and query’s matching L2 interest scores
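A minimal sketch of how such per-level matching features could be computed, assuming interests are keyed by (level, name) with their P2I scores attached; the names and scores below are illustrative:

```python
def interest_match_feature(pin_interests, query_interests, taxo_level):
    """Sum of products of the Pin's and query's interest scores, restricted
    to one taxonomy level and to interests present on both sides."""
    return sum(
        p_score * query_interests[(lvl, name)]
        for (lvl, name), p_score in pin_interests.items()
        if lvl == taxo_level and (lvl, name) in query_interests
    )

# Toy interest maps: {(taxonomy_level, label): score}
pin = {(1, "travel"): 0.71, (2, "winter sports"): 0.75, (3, "skiing"): 0.83}
query = {(1, "travel"): 0.9, (3, "skiing"): 0.6}

cos_sim_l1_interests = interest_match_feature(pin, query, 1)  # broad match
cos_sim_l3_interests = interest_match_feature(pin, query, 3)  # granular match
```

Because the levels are computed separately, the ranking model can weigh a broad L1 match (travel) differently from a granular L3 match (skiing).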

Using P2I for ads interest targeting

Interest targeting is an advertising product that allows advertisers to select the audience they want to reach by choosing categories from the Interest Taxonomy.

For the following query Pin:

We show some related results on Pin closeup:

Promoted Pins, labeled with “Promoted By”, are Pins that have been promoted by advertisers and generate revenue for Pinterest. Using our interest targeting product, each Promoted Pin has advertiser-specified interests, such as Travel and Outdoor Sports. The P2I signal is used to enforce interest targeting by making sure the query Pin shares the same interests as the ones targeted. In the example above, the two Promoted Pins circled in red share the same Travel interest as the query Pin, while other candidates without any overlapping interests have been filtered out.
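The enforcement step described above amounts to a set-intersection check between the query Pin’s interests and each candidate’s targeted interests. A toy sketch (pin ids and interest names are hypothetical):

```python
def enforce_interest_targeting(query_pin_interests, promoted_candidates):
    """Keep only Promoted Pins whose advertiser-specified interests
    overlap the query Pin's P2I interests (illustrative sketch)."""
    query = set(query_pin_interests)
    return [
        pin_id
        for pin_id, targeted in promoted_candidates
        if query & set(targeted)  # at least one shared interest
    ]

candidates = [
    ("promo_1", {"travel"}),          # shares "travel": kept
    ("promo_2", {"outdoor sports"}),  # no overlap: filtered out
]
enforce_interest_targeting({"travel", "skiing"}, candidates)
```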

How it works

P2I is a machine learning system composed of two modules: candidate generation and ranking. The pipeline is written in Scalding (Scala + MapReduce) and runs daily to reclassify interests for our corpus of Pins.

Candidate generation

In candidate generation, we use relatively cheap, high-recall methods to obtain interests that may be relevant for each Pin. In this stage, we generate at most 200 candidates (70 on average) from the following methods:

Lexical Expansion

We obtain a small amount of highly precise matches by performing lexical expansion on text from the Pin. Lexical expansion may include matching terms with a low edit distance, reordering of annotations, lemmatization, etc.

Below are some examples:
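A minimal sketch of this kind of lexical matching, with a toy single-rule lemmatizer standing in for real NLP tooling (the production system is not implemented this way; this only illustrates the edit-distance and lemmatization ideas):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def lemma(term: str) -> str:
    """Toy lemmatizer: strip a trailing plural 's'."""
    return term[:-1] if term.endswith("s") else term

def lexical_candidates(pin_terms, taxonomy_labels, max_dist=1):
    """High-precision matches: Pin terms within a small edit distance
    of a taxonomy label, after lemmatization."""
    out = set()
    for term in pin_terms:
        for label in taxonomy_labels:
            if edit_distance(lemma(term), lemma(label)) <= max_dist:
                out.add(label)
    return out

lexical_candidates(["dogs", "skiiing"], ["dog", "skiing", "travel"])
# "dogs" matches "dog" after lemmatization; "skiiing" is one edit from "skiing"
```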

Pin/Board co-occurrence

Using these highly precise lexical matches, we then cast a larger net by aggregating the co-occurring terms on Pins and co-occurring Pins on boards.

The result is a mapping of {pin: [list of interest candidates]}, which is the input for the second stage, ranking.
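One way the board co-occurrence expansion could look, assuming the high-precision lexical matches serve as seeds (all identifiers below are made up for illustration):

```python
from collections import defaultdict

def expand_candidates(seed_matches, board_to_pins):
    """Cast a wider net: each Pin inherits interest candidates from the
    other Pins on its boards (sketch of the co-occurrence step)."""
    candidates = defaultdict(set)
    for pins in board_to_pins.values():
        # Union of the high-precision seed interests found on this board.
        board_interests = set().union(*(seed_matches.get(p, set()) for p in pins))
        for p in pins:
            candidates[p] |= board_interests
    return {pin: sorted(ints) for pin, ints in candidates.items()}

seeds = {"pin_a": {"skiing"}, "pin_b": {"travel"}}
boards = {"board_1": ["pin_a", "pin_b", "pin_c"]}
expand_candidates(seeds, boards)
# pin_c had no lexical match of its own but now carries both
# "skiing" and "travel" as candidates via board co-occurrence
```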


Ranking

We then score each (Pin, interest) pair from the candidate generation stage and keep the pairs with the highest relevance scores in our final classification output. More specifically, we extract features for each candidate pair and rank them with a binary classifier (1 for relevant, 0 for irrelevant). We currently use a gradient-boosted decision tree (GBDT) model.

Some of the features include:

  • Embedding similarity features (text embeddings such as FastText, Pin embeddings such as PinSage)
  • TF-IDF features from Pin annotations (where the Pin is the document and the interest is the term)
  • Taxonomy features (taking advantage of hierarchy, such as L1 parent match)
  • Pin engagement features (gender, popularity, etc.)
  • Context features (type of Pin, country, etc.)
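Putting the ranking stage together, here is a toy sketch: a feature vector is extracted per (Pin, interest) pair and scored by a model that stands in for the trained GBDT. All names, values, and the averaging “model” are illustrative assumptions, not the production implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def extract_features(pin, interest):
    """Feature vector for one (Pin, interest) candidate pair; the entries
    mirror the embedding, TF-IDF, and taxonomy feature families above."""
    return [
        cosine(pin["pin_embedding"], interest["label_embedding"]),
        pin["annotation_tfidf"].get(interest["name"], 0.0),
        1.0 if interest["l1_parent"] in pin["l1_interests"] else 0.0,
    ]

def rank_candidates(pin, candidates, model, threshold=0.5):
    """Score every candidate pair and keep the relevant ones, best first."""
    scored = [(c["name"], model(extract_features(pin, c))) for c in candidates]
    return sorted([(n, s) for n, s in scored if s >= threshold],
                  key=lambda t: -t[1])

pin = {
    "pin_embedding": [1.0, 0.0],
    "annotation_tfidf": {"skiing": 0.4},
    "l1_interests": {"travel"},
}
candidates = [
    {"name": "skiing", "label_embedding": [1.0, 0.0], "l1_parent": "travel"},
    {"name": "dogs", "label_embedding": [0.0, 1.0], "l1_parent": "animals"},
]
model = lambda f: sum(f) / len(f)  # toy stand-in for the trained GBDT
```

Here `rank_candidates(pin, candidates, model)` keeps only the skiing pair; the dogs pair has no embedding, TF-IDF, or taxonomy support and falls below the threshold.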

Fetching interests instantly

Newly created Pins receive interest predictions through an alternative flow we call “Instant Interests”. This flow shares much of the logic of the daily batch workflow, with the caveat that we have far less information about a new Pin: the same model is triggered with only the features available at Pin creation. These predictions are available before the daily batch workflow runs, and they are important for the activation and distribution of fresh Pins.
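One way to realize this, assuming the model consumes a fixed-order feature vector and that unavailable features default to zero; both the feature names and the defaulting convention are assumptions for illustration:

```python
# Fixed feature order the model was trained with (names are illustrative).
FEATURE_ORDER = ["embedding_sim", "annotation_tfidf", "engagement", "context_country"]

# Engagement features don't exist yet for a brand-new Pin.
AVAILABLE_AT_CREATION = {"embedding_sim", "annotation_tfidf", "context_country"}

def instant_feature_vector(features: dict) -> list:
    """Build the model input from creation-time features only; anything
    unavailable at creation defaults to 0.0 (assumed convention)."""
    return [features.get(name, 0.0) if name in AVAILABLE_AT_CREATION else 0.0
            for name in FEATURE_ORDER]

instant_feature_vector({"embedding_sim": 0.9, "engagement": 0.7, "context_country": 1.0})
# the engagement value is masked out even though one was supplied
```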

Scaling P2I

The system is designed to accommodate a growing taxonomy. When new interests are added, P2I just works without additional engineering effort: no models need to be retrained, since features can be extracted for each new interest with the existing pipeline. This flexibility allows the team to quickly iterate on requests from the product team (e.g., trending interests) and reuse the same system across many versions of the taxonomy.

Another challenge the team faces is making the system work efficiently for hundreds of billions of Pins. We’ve built a modular system where the components can be worked on, debugged, and improved in parallel. Bottlenecks can be easily identified and flagged. Compression is leveraged whenever appropriate. Engineers have learned to develop with efficiency in mind at every step, since an operation that takes 0.01% more resources per Pin is amplified to something meaningful when multiplied by 200 billion.

Taking it international

Making Pinterest work internationally is incredibly important to the success of our business. P2I was designed to support additional countries with little to no engineering work. Our current system supports 17 languages and counting.

Each node of the Interest Taxonomy represents a concept, independent of any language. A cat Pin is about the same concept whether it’s “cat” or “chat” (French). Internally, we use numeric ids rather than text labels to uniquely identify interests. The text for interests is only used during lexical expansion in candidate generation. To aid this stage, we provide translations of terms from each language to English prior to lexical expansion. As a result, adding additional languages to P2I is straightforward from an engineering perspective.
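A minimal sketch of this translate-then-match flow, with a toy translation table standing in for real machine translation (ids and vocabulary are made up):

```python
# Toy translation table; a real system would call an MT service (assumption).
TO_ENGLISH = {"chat": "cat", "chien": "dog"}

# Language-independent numeric ids uniquely identify interests internally.
INTEREST_IDS = {"cat": 101, "dog": 102}

def match_interests(terms):
    """Translate terms to English, then look them up in the taxonomy
    vocabulary; downstream stages only ever see numeric interest ids."""
    out = []
    for term in terms:
        english = TO_ENGLISH.get(term, term)
        if english in INTEREST_IDS:
            out.append(INTEREST_IDS[english])
    return out

match_interests(["chat", "cat", "maison"])
# "chat" and "cat" both resolve to the same concept id; "maison" has
# no taxonomy match and is dropped
```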


Pin2Interest maps Pins into a dynamic, highly curated taxonomy. The system is flexible and adapts to new requirements, such as interests taxonomy changes and international expansion. The result is a robust signal that categorizes our hundreds of billions of pieces of content and is used by systems across Pinterest.

Acknowledgments: The author would like to thank members of the P2I team — Jinyu Xie, Rui Huang, Song Cui, Miwa Takaki (PM) & Yunsong Guo (EM), and also the following people for their contributions — Attila Dobi, Dhananjay Shrouty, Heath Vinicombe, Rui Li, Troy Ma & the Stanford Protege Team.