Ningning Hu | Pinterest engineer, Discovery
The core value of Pinterest is to help people find the things they care about, by connecting them to Pins and people that relate to their interests. We’re building a service that’s powered by people, and supercharged with technology.
The interest graph — the connections that make up the Pinterest index — creates bridges between Pins, boards, and Pinners. It’s our job to build a system that helps people to collect the things they love, and connect them to communities of engaged people who share similar interests and can help them discover more. From categories like travel, fitness, and humor, to more niche areas like vintage motorcycles, craft beer, or Japanese architecture, we’re building a visual discovery tool for all interests.
The interests platform is built to support this vision. Specifically, it’s responsible for producing high quality data on interests, interest relationships, and their association with Pins, boards, and Pinners.
Figure 1: Feedback loop between machine intelligence and human curation
In contrast with conventional methods of generating such data, which rely primarily on machine learning and data mining techniques, our system relies heavily on human curation. The ultimate goal is to build a system that’s both machine and human powered, creating a feedback mechanism by which human curated data helps drive improvements in our machine algorithms, and vice versa.
Figure 2: System components
Raw input to the system includes existing data about Pins, boards, Pinners, and search queries, as well as explicit human curation signals about interests. With this data, we’re able to construct a continuously evolving interest dictionary, which provides the foundation to support other key components, such as interest feeds, interest recommendations, and related interests.
Generating the interest dictionary
From a technology standpoint, interests are text strings that represent entities for which a group of Pinners might have a shared passion.
We generated an initial collection of interests by extracting frequently occurring n-grams from Pin and board descriptions, as well as board titles, and filtering these n-grams using custom built grammars. While this approach provided a high coverage set of interests, we found many terms to be malformed phrases. For instance, we would extract phrases such as “lamborghini yellow” instead of “yellow lamborghini”. This proved problematic because we wanted interest terms to represent how Pinners would describe them, and so, we employed a variety of methods to eliminate malformed interests terms.
We first compared terms with repeated search queries performed by a group of Pinners over a few months. Intuitively, this criterion matches well with the notion that an interest should be an entity for which a group of Pinners are passionate.
Later we filtered the candidate set through public domain ontologies like Wikipedia titles. These ontologies were primarily used to validate proper nouns as opposed to common phrases, as all available ontologies represented only a subset of possible interests. This is especially true for Pinterest, where Pinners themselves curate special interests like “mid century modern style.”
Finally, we also maintain an internal blacklist to filter abusive words and x-rated terms as well as Pinterest specific stop words, like “love”. This filtering is especially important to interest terms which might be recommended to millions of users.
We arrived at a fair quality collection of interests following the above algorithmic approaches. In order to understand the quality of our efforts, we gave a 50,000 term subset of our collection to a third party vendor which used crowdsourcing to rate our data. To be rigorous, we composed a set of four criteria by which users would evaluate candidate Interests terms:
- Is it English?
- Is it a valid phrase in grammar?
- Is it a standalone concept?
- Is it a proper name?
The crowdsourced ratings were both interesting if not somewhat expected. There was a low rate of agreement amongst raters, with especially high discrepancy in determining whether an interest’s term represented a “standalone concept.” Despite the ambiguity, we were able to confirm that 80% of the collection generated using the above algorithms satisfied our interests criteria.
This type of effort, however, is not easy to scale. The real solution is to allow Pinners to provide both implicit and explicit signals to help us determine the validity of an interest. Implicit signals behaviors like clicking and viewing, while explicit signals include asking Pinners to specifically provide information (which can be actions like a thumbs up/thumbs down, starring, or skipping recommendations).
To capture all the signals used for defining the collections of terms, we built a dictionary that stores all the data associated with each interest, including invalid interests and the reason why it’s invalid. This service plays a key role in human curation, by aggregating signals from different people. On top of this dictionary service, we can build different levels of reviewing system.
Identifying Pinner interests
With the Interests dictionary, we can associate Pins, boards, and Pinners with representative interests. One of the initial ways we experimented with this was launching a preview of a page where Pinners can explore their interests.
Figure 3: Exploring interests
In order to match interests to Pinners, we need to aggregate all the information related with a person’s interests. At its core, our system recommends interests based upon Pins with which a Pinner interacts. Every Pin on Pinterest has been collected and given context by someone who thinks it’s important, and in doing so, is helping other people discover great content. Each individual Pin is an incredibly rich source of data. As discussed in a previous blog post on discovery data model, one Pin often has multiple copies — different people may Pin it from different sources, and the same Pin can be repinned multiple times. During this process, each Pin accumulates numerous unique textual descriptions which allows us to connect Pins with interests terms with high precision.
However, this conceptually simple process requires non-trivial engineering effort to scale to the amount of Pins and Pinners that the service has today. The data process pipeline (managed by Pinball) composes over 35 Hadoop jobs, and runs periodically to update the user-interest mapping to capture users’ latest interest information.
The initial feedback on the explore interests page has been positive, proving the capabilities of our system. We’ll continue testing different ways of exposing a person’s interests and related content, based on implicit signals, as well as explicit signals (such as the ability to create custom categories of interests).
Calculating related interests
Related interests are an important way of enabling the ability to browse interests and discover new ones. To compute related interests, we simply combine the co-occurrence relationship for interests computed at Pin and board levels.
Figure 4: Computing related interests
The quality of the related interests is surprisingly high given the simplicity of the algorithm. We attribute this effect to the cleanness of Pinterest data. Text data on Pins tend to be very concise, and contain less noise than other types of data, like web pages. Also, related interests calculation already makes use of boards, which are heavily curated by people (vs. machines) in regards to organizing related content. We find that utilizing the co-occurrence of interest terms at the level of both Pins and boards provides the best tradeoff between achieving high precision as well as recall when computing the related interests.
One of the initial ways we began showing people related content was through related Pins. When you Pin an object, you’ll see a recommendation for a related board with that same Pin so you can explore similar objects. Additionally, if you scroll beneath a Pin, you’ll see Pins from other people who’ve also Pinned that original object. At this point, 90% of all Pins have related Pins, and we’ve seen 20% growth in engagement with related Pins in the last six months.
Powering interest feeds
Interests feeds provide Pinners with a continuous feed of Pins that are highly related. Our feeds are populated using a variety of sources, including search and through our annotation pipeline. A key property of the feed is flow. Only feeds with decent flow can attract Pinners to come back repeatedly, thereby maintaining high engagement. In order to optimize for our feeds, we’ve utilized a number of real-time indexing and retrieval systems, including real-time search, real-time annotating, and also human curation for some of the interests.
To ensure quality, we need to guarantee quality from all sources. For that purpose, we measure the engagement of Pins from each source and address quality issue accordingly.
Figure 5: How interest feeds are generated
More to come
Accurately capturing Pinner interests and interest relationships, and making this data understandable and actionable for tens of millions of people (collecting tens of billions of Pins), is not only an engineering challenge, but also a product design one. We’re just at the beginning, as we continue to improve the data and design ways to empower people to provide feedback that allows us to build a hybrid system combining machine and human curation to power discovery. Results of these effort will be reflected in future product releases.
If you’re interested in building new ways of helping people discover the things they care about, join our team!
Acknowledgements: The core team members for the interests backend platform are Ningning Hu, Leon Lin, Ryan Shih and Yuan Wei. Many other folks from other parts of the company, especially the discovery team and the infrastructure teams, have provided very useful feedback and help along the way to make the ongoing project successful.
Ningning Hu is an engineer at Pinterest.