Taste Graph part 1: Assigning interests to Pins
Brian Johnson | Pinterest Head of Knowledge Engineering
The Taste Graph is a collection of technologies and data that define a common vocabulary for understanding our content, users and partners. It helps us understand how a person’s interests and preferences evolve over time across categories like food, fashion, home decor and more. With more than 100B Pins and a growing user base of over 200M monthly active users, the Taste Graph deepens and expands every day. Over the years, we’ve made major investments to better understand Pins and how they relate to users who’ve saved them. Using the Taste Graph to improve our understanding allows us to serve more relevant recommendations across our app, whether it’s in home feed, Related Pins or Promoted Pins.
We recently rebuilt how we understand content in the Taste Graph, so we can serve more personalized recommendations and other results based on not just what someone is interested in, but what their taste is. Today we’re announcing that we’re now leveraging the same technology that powers personalization across all of Pinterest to enhance the relevancy and targeting of our ads. By incorporating the Taste Graph technology into our interest targeting, we’ve expanded the number of interests an advertiser can target to more than 5,000, a 10x increase. In early testing, many advertisers saw CTR increases over 50 percent and drops in CPCs over 20 percent.
In this post we’ll discuss a fundamental part of the Taste Graph: how we assign interests to Pins.
Taste Graph components
For context, there are three major components that make up the Taste Graph:
- Pins: We need to understand what a Pin is about. One of the ways we do this is to associate and score metadata signals for each Pin, including user activity, interests, category relationships, demographic affinities, graph walks, and vector space embeddings.
- Users: We associate, aggregate and score signals for each user to understand what their interests are. These signals are boosted by recency and frequency of a user’s engagement with Pins.
- Interests: We create a multi-level graph of interests. These are a subset (thousands) of frequently encountered interests, organized into a hierarchy of levels by similarity and user activity.
Assigning interests to Pins
Interests are one of the foundational signals at Pinterest that ripple across all of our product surfaces. We define an interest as a single word, such as “tree,” or a short phrase, such as “Apple TV” or “romantic travel destinations.” Building on previous work that assigned interests to Pins, recent focus areas for the team include:
- Improving precision
- Expanding interests per-Pin
- Expanding language coverage
- Improving space efficiency
At a high level, the process we use to assign interests to Pins is to:
- Extract text for all Pins in a PinJoin (a PinJoin is a collection of Pins with the same image)
- Normalize text fragments by tokenizing and lemmatizing
- Score interests with a machine learned model
Let’s go through each of these steps.
Pins are more than just images. They’re dynamic, rich collections of data from different web pages with metadata that evolves over time.
A PinJoin may contain thousands of Pins sharing the same image. For each Pin we collect the following text with its source. Text is an important feature in the machine-learned classifier we’ll discuss further below.
- Link text
- Board name
- Link alt text
- Image caption
- Page title
- Page meta title
- Page meta description
- Page meta keywords
- Structured data such as Schema.org text. Pinterest supports multiple category-specific schemas; for recipes this data may include:
The point here isn’t to fully enumerate all possible text sources for a Pin, but rather to show that a Pin is a rich source of textual data.
We process the text for each Pin in a PinJoin and create a set of features suitable for machine-learned scoring.
for each pin in pinjoin:
    for each text_source in pin:
        extract interests from text_source
We extract interests from text by lemmatizing the text and then matching it against a dictionary. For example, “car,” “cars,” “car’s” and “cars’” all lemmatize to car.
Here’s a very short lemmatization example from nlp.stanford.edu:
am, are, is => be
car, cars, car's, cars' => car
the boy's cars are different colors => the boy car be differ color
Once we have the text lemmatized we match it against a dictionary of interests. This allows us to match only high quality, frequent text, and discard random text and blacklisted text such as spam. This gives us a set of (interest ID, source, frequency) triplets.
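The extraction and matching steps above can be sketched roughly as follows. This is a minimal illustration, not Pinterest's implementation: the lemma table, the interest dictionary, the numeric IDs, the source names and the `MAX_PHRASE_LEN` limit are all hypothetical, and a real pipeline would use a proper lemmatizer rather than a lookup table.

```python
from collections import Counter

# Hypothetical toy lemma table; a real pipeline would use a full lemmatizer.
LEMMAS = {"cars": "car", "car's": "car", "cars'": "car", "trees": "tree"}

# Hypothetical interest dictionary: lemmatized phrase -> numeric interest ID.
INTEREST_IDS = {"car": 101, "tree": 102, "apple tv": 103}

MAX_PHRASE_LEN = 3  # assumed longest phrase (in tokens) in the dictionary


def lemmatize(text):
    """Lowercase, split on whitespace, strip punctuation, map to lemmas."""
    tokens = [t.strip(".,!?\"") for t in text.lower().split()]
    return [LEMMAS.get(t, t) for t in tokens if t]


def extract_triplets(pinjoin):
    """Return a Counter mapping (interest_id, source) -> frequency."""
    counts = Counter()
    for pin in pinjoin:
        for source, text in pin.items():
            tokens = lemmatize(text)
            # Try every phrase of 1..MAX_PHRASE_LEN tokens against the dictionary.
            for n in range(1, MAX_PHRASE_LEN + 1):
                for i in range(len(tokens) - n + 1):
                    phrase = " ".join(tokens[i:i + n])
                    interest_id = INTEREST_IDS.get(phrase)
                    if interest_id is not None:
                        counts[(interest_id, source)] += 1
    return counts


# Each Pin is modeled here as a dict of text source -> text (an assumption).
pinjoin = [
    {"board_name": "Classic cars", "page_title": "Cars under trees"},
    {"board_name": "My car's makeover", "page_title": "Apple TV setup"},
]
triplets = [
    (interest_id, source, freq)
    for (interest_id, source), freq in sorted(extract_triplets(pinjoin).items())
]
```

Text that does not appear in the dictionary (for example “makeover” or “setup” here) simply produces no triplet, which is how random or unwanted text gets discarded in this sketch.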
We’re also interested in space efficiency. At this point in the pipeline, we drop both interest text and language. Processing and storage continue with ID only. Numeric IDs significantly improve processing speed and reduce storage space requirements. The original interest text and language can be looked up by ID in the interest dictionary. Our English dictionary currently has ~3M interests, with ~50 percent in use.
There are additional techniques that can be used to further expand or collapse the set of interests associated with a PinJoin (which we’ll cover in a future post). These include synonyms, similar Pins, neighboring Pins and vector space embedding.
Now we’re ready to choose the best interests for a PinJoin. At the PinJoin level we may have thousands of Pins and tens of thousands of interests. We use supervised machine learning to score these interests. What we’re looking for is the top N interests per language for each PinJoin. We’ve empirically chosen N=25: from a cost/benefit standpoint, fewer than 10 interests provides a sparse signal and more than 25 is too expensive. At our scale, data of low value takes too much time and space to process, serve and store.
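Selecting the top N interests per language can be sketched as below. This is only an illustration under stated assumptions: the `(interest_id, language, score)` tuple shape and the function name are hypothetical, and the language is passed explicitly here for readability even though the pipeline described above would look it up by interest ID.

```python
import heapq
from collections import defaultdict

TOP_N = 25  # the value the post reports choosing empirically


def top_interests_per_language(scored, n=TOP_N):
    """scored: iterable of (interest_id, language, score) for one PinJoin.

    Returns {language: [(interest_id, score), ...]} with at most n entries
    per language, highest score first.
    """
    by_language = defaultdict(list)
    for interest_id, language, score in scored:
        by_language[language].append((interest_id, score))
    return {
        language: heapq.nlargest(n, items, key=lambda item: item[1])
        for language, items in by_language.items()
    }


# Hypothetical scores for a tiny PinJoin; n=2 keeps the example short.
scored = [(1, "en", 0.91), (2, "en", 0.15), (3, "en", 0.78), (4, "fr", 0.66)]
best = top_interests_per_language(scored, n=2)
```

Using `heapq.nlargest` avoids fully sorting tens of thousands of candidates when only a small N is kept.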
Model training is based on a crowdsourced human judgment data set where we ask people to tell us whether a keyword is relevant to a particular Pin. For example, in the image below, a human judge would say that “San Francisco” is relevant to the Pin, while “island” would be judged irrelevant.
Interest scoring features (sample)
- Word embedding (cohesive)
- Frequency counts from Pin, board and link texts (popularity)
- Normalized TF-IDF scores (uniqueness)
- Category affinities: do interests belong to the same or similar categories? (cohesive)
- Position within text (importance)
- Whitelisting (Wikipedia titles, vertical glossaries/taxonomies, entity dictionaries)
- Graph queries (cohesive)
- Pluralization (normalization)
- Head queries (importance)
We tried a variety of models, including logistic regression, support vector machines, decision trees, random forests, and simple feed forward nets. We chose to use logistic regression for both simplicity and result quality.
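To make the scoring step concrete, here is a minimal sketch of how a logistic regression model turns a feature vector into a relevance score. The feature names, weights and bias are hypothetical and chosen only for illustration; in practice the weights would be learned from the human judgment data, and the real feature set is much richer than this.

```python
import math

# Hypothetical weights; a trained model would learn these from labeled data.
WEIGHTS = {"frequency": 0.8, "tfidf": 1.5, "position": -0.4, "whitelisted": 1.2}
BIAS = -2.0


def score_interest(features):
    """Logistic regression: sigmoid of the bias plus the weighted feature sum."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


# Two made-up candidate interests for the same PinJoin.
relevant = score_interest(
    {"frequency": 2.0, "tfidf": 0.9, "position": 0.1, "whitelisted": 1.0}
)
irrelevant = score_interest(
    {"frequency": 0.2, "tfidf": 0.1, "position": 0.9, "whitelisted": 0.0}
)
```

The output is a value between 0 and 1, so candidate interests can be ranked directly by score. Part of the simplicity of this model is that each weight shows how much a single feature pushes the score up or down.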
Error analysis and evaluation metrics
Offline metrics are essential. We use human judgment data to facilitate rapid offline iteration and evaluation. We always run online experiment(s) before a full production launch to validate offline results and measure real world gains.
Results are good, but there’s still room for improvement. We improved precision from 71 percent to 79 percent. On average we have ~8 interests per-Pin, per-language, with significantly higher coverage for our more popular Pins. We currently have interest dictionaries for 32 languages, with 23 languages active in production.
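As a small worked example of the offline metric, precision over human judgments can be computed as the fraction of assigned interests that judges marked relevant. The sample below is made up to match the reported 79 percent and is not real evaluation data; the function name and input shape are assumptions.

```python
def precision(judgments):
    """judgments: list of booleans, True when a judge marked the
    assigned interest as relevant to the Pin."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


# Hypothetical sample: 100 judged (Pin, interest) pairs, 79 marked relevant.
sample = [True] * 79 + [False] * 21
p = precision(sample)
```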
Let’s wrap up with a bit of humor. Large data sets, excellent human curation, natural language processing and machine learning can provide great interest tagging and the excellent recommendations visible on Pinterest.
Stay tuned for the follow-on Taste Graph blog posts where we’ll cover how we assign interests to users and how we map interests into a hierarchy.
Acknowledgements: Thanks to all of my colleagues who have helped with this project and provided valuable feedback. These include, but are not limited to, John, Andrew, Arun, Heath, Poorvi, Luke, Davina, Jason, the Knowledge team, and the Content team.
Brian joined Pinterest in 2017 as the Head of Knowledge. He was previously at eBay, Handspring, Excite@Home, Synopsys, and AT&T Bell Labs. Brian received his Ph.D. in Computer Science from the University of Maryland.
This blog post expands on the following posts: