PinText: A Multitask Text Embedding System in Pinterest

Jinfeng Zhuang | Software Engineer, Content Knowledge

Word embeddings have been actively studied and developed by researchers and practitioners in the machine learning community since the neural network language model was proposed. Researchers have not only proposed principled algorithms and open-source code, but have also released embedding models pre-trained on public corpora such as Wikipedia, Twitter, and Google News. It is well known that high-quality word representations usually contribute the most to text classification and other general tasks.

At Pinterest, textual data more often consists of concrete annotation terms and short phrases rather than long sentences and paragraphs, which makes word embedding a more fundamental component than complex neural network architectures designed for specific tasks. Although pre-trained word embeddings provide a good baseline, we observe a clear gap between industrial applications and academic research in storage cost, memory cost, availability of supervised information, throughput, and latency. These gaps motivate us to build text embedding as a practical, system-level solution. Our key design goals include:

  • Use supervised information instead of unsupervised, co-occurrence-style word embeddings like word2vec.
  • Do word-level instead of character-level embedding to ensure the learned embedding dictionary can be loaded into memory when necessary.
  • Learn a shared word embedding for all downstream tasks instead of an end-to-end embedding coupled with a particular application.
  • Eventually replace all open-source embeddings to reduce maintenance cost.

We have three surfaces on Pinterest: home feed, related Pins, and search. It is an intuitive choice to use multitask learning (MTL) to combine all the information together to make the learned model generalize better. The user engagement in each task naturally provides supervised information. After we learn the embedding with such information, we use either a Kubernetes+Docker solution or a map-reduce system for large-scale batch computation. We can also build an inverted index of embedding tokens for online search. Figure 1 illustrates the high-level architecture of the text embedding solution. Figure 2 demonstrates the three surfaces, a particular user profile, and a Pin.

Figure 1: A simplified representation of the PinText architecture consists of offline training, index building, and online serving. Left: We use Kafka to collect users’ engagement data to construct training data. Middle: We use locality-sensitive hashing (LSH) to compute embedding tokens and build an inverted index for each Pin. Embedding vectors and k-nearest neighbors (KNN) results can be cached. Right: Use LSH tokens of embedding vectors to retrieve Pins, and use embedding similarity in the ranking model.
Figure 2: Examples from the Pinterest iOS app of two core concepts: user (a) and Pin (b). Figures (c, d, e) present examples of home feed, related Pins, and search for this particular user and the idea “Alaska train ride”. When users save and click content, we receive positive votes via the logging system. The learning task is to mine the semantic text embeddings behind these actions. Note that throughout this work we know the voting results but not who voted.

In the discussion below, let HF, RP, and SR denote the home feed, related Pin and search tasks, respectively. Each training example is a pair of entities <q, p> in each task. Specifically,

  • In SR, q is a search query and p is a Pin. We extract the Pin’s title and description as the Pin’s text. Refer to figure 2(b) as an example.
  • In RP, q is a subject Pin and p is a related Pin.
  • In HF, q is the user’s interest text derived by our personalization team and p is a Pin.

We use the average of an entity’s associated word embeddings as the overall entity embedding. The learning objective is then to make the cosine similarity of a positive entity pair’s embeddings greater than that of a pairing with randomly sampled background entities. We use Pin saves and clicks to define positive pairs. Taking the SR task as an example, a positive training data pair <q, p> may be defined by a user saving or clicking Pin p in the results of search query q.
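The averaging step above can be sketched in a few lines of NumPy. This is a toy illustration, not Pinterest's implementation: the dictionary, dimension, and words are made up, and the embedding matrix is random.

```python
import numpy as np

# Toy embedding table: `dictionary` maps a word to its row index in D.
d = 4  # embedding dimension (toy value)
dictionary = {"alaska": 0, "train": 1, "ride": 2}
D = np.random.default_rng(0).normal(size=(len(dictionary), d))

def embed_entity(text: str) -> np.ndarray:
    """Entity embedding = average of the embeddings of its known words."""
    rows = [D[dictionary[w]] for w in text.lower().split() if w in dictionary]
    return np.mean(rows, axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = embed_entity("alaska train ride")   # e.g., a search query
p = embed_entity("train ride")          # e.g., a Pin's title text
score = cosine(q, p)  # similarity used to compare positive vs. negative pairs
```

During training, `score` for a saved/clicked pair should come out higher than for a randomly sampled pair.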

Figure 3: Simplified architecture for the multitask word-embedding model

Figure 3 shows an overview of the learning architecture. Now let’s give a mathematical formulation of the above concepts. Let D ∈ Rⁿ×d denote the learned dictionary, where n is the number of words and d is the dimension of the embedding vectors. Given a word wᵢ, its embedding F(wᵢ) = wᵢ is simply the i-th row of D, with wᵢ ∈ Rᵈ. We compute the embedding of an entity q by averaging its word embeddings. Note that we use bold lowercase characters to denote embedding row vectors (e.g., q := F(q)). To train a single-task embedding model, we define the objective function J(Ɛ) by enforcing that the similarity of a positive pair <q, p⁺> is greater than the similarity of a few randomly sampled background pairs <q, p⁻>. Taking the search task as an example (with q being a search query):
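The objective function appears as an image in the original post and did not survive extraction. Based on the hyperparameters ρ, μ, and γ and the choices of L and S described in the next paragraph, a plausible reconstruction is:

```latex
J(\mathcal{E}) \;=\; \sum_{\langle q,\,p^{+}\rangle \in \mathcal{E}}
\;\sum_{i=1}^{\rho} L\bigl(S(\mathbf{q},\mathbf{p}^{+}),\,
S(\mathbf{q},\mathbf{p}^{-}_{i})\bigr),
\qquad
L(x,y) = \max\bigl(0,\; \mu - (x - y)\bigr),
\qquad
\text{s.t. } \|\mathbf{w}_j\| \le \gamma \;\; \forall j,
```

where Ɛ is the set of positive pairs and each positive pair is contrasted against ρ sampled negatives.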

where the negative sampling ratio ρ, the ranking margin μ, and the embedding-vector radius γ are hyperparameters tuned through experimentation. We fix L and S to be the hinge loss and cosine similarity, respectively. In this way, we implement the objective-function design goal above. For a particular entity q, we enforce that its similarity x with a positive entity is greater than its similarity y with a random negative entity by a margin μ; otherwise it incurs a loss of μ − (x − y). The heuristic here is that a good semantic embedding should capture users’ engagement.

The multitask learning objective function is a simple aggregation of 3 learning tasks, where all tasks share the same word embedding lookup table (as represented in figure 3):
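The multitask equation is also an image in the original post; given the description, it is presumably the sum of the three per-task objectives over the shared dictionary D:

```latex
J_{\mathrm{MTL}}(D) \;=\; J_{\mathrm{HF}}(D) \;+\; J_{\mathrm{RP}}(D) \;+\; J_{\mathrm{SR}}(D).
```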

With this MTL objective function, we can apply gradient descent with respect to each entry in embedding dictionary D to learn the word embeddings directly.
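A minimal sketch of one such gradient step, assuming toy sizes and a numerical gradient for clarity (in practice this would use autodiff and mini-batches):

```python
import numpy as np

# Toy shared dictionary D and hyperparameters (made-up values).
rng = np.random.default_rng(1)
n, d = 5, 4
D = rng.normal(size=(n, d))
mu, lr = 0.2, 0.05  # margin and learning rate

def avg(idx):
    """Entity embedding: average of the word rows in D."""
    return D[idx].mean(axis=0)

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def sgd_step(q_idx, pos_idx, neg_idx):
    """One descent step on the margin loss for a (query, positive, negative) triple."""
    global D
    loss = mu - (cos(avg(q_idx), avg(pos_idx)) - cos(avg(q_idx), avg(neg_idx)))
    if loss <= 0:
        return 0.0  # margin already satisfied, no update
    # Numerical gradient over only the word rows involved in this triple.
    eps = 1e-5
    grad = np.zeros_like(D)
    for i in set(q_idx + pos_idx + neg_idx):
        for j in range(d):
            D[i, j] += eps
            lp = mu - (cos(avg(q_idx), avg(pos_idx)) - cos(avg(q_idx), avg(neg_idx)))
            D[i, j] -= eps
            grad[i, j] = (lp - loss) / eps
    D -= lr * grad
    return float(loss)

example_loss = sgd_step([0, 1], [2], [3])  # query words 0,1; positive word 2; negative word 3
```

Only the rows of D touched by the triple receive updates, which is what makes training on the shared dictionary scale across all three tasks.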

Figure 4: Breakdown of the contribution of each task. The multitask training data is the distinct union of the three single tasks’ data.

We evaluate this PinText embedding by KNN Query2Interest classification, where we predict a list of interest terms for a search query. An example of labeled data for this task is: {Query: wood grain cabinets kitchen, Labels: [__label__home_decor, __label__diy_and_crafts]}. We compute query and interest embeddings with various text embedding models and then use the nearest interest as the prediction. It turns out this supervised model is significantly better than pre-trained unsupervised models, and clearly better than single-task PinText learning. Note that PinText-MTL is only 2% more accurate than PinText-SR (Table 1), but its gain in word coverage is much larger, which means it should perform even better in other scenarios. Refer to figure 4(b) for the token contribution of each task.
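The nearest-interest prediction reduces to an argmax over cosine similarities. A toy sketch with made-up interest names and random vectors (only the first interest's vector is meaningful here):

```python
import numpy as np

# Hypothetical interest vocabulary and embeddings (random for illustration).
rng = np.random.default_rng(2)
interests = ["home_decor", "diy_and_crafts", "travel"]
interest_vecs = rng.normal(size=(len(interests), 8))

def predict_interest(query_vec: np.ndarray) -> str:
    """Return the interest whose embedding is closest to the query by cosine."""
    sims = interest_vecs @ query_vec / (
        np.linalg.norm(interest_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    return interests[int(np.argmax(sims))]

# A query embedding that happens to sit near "home_decor" in this toy space.
query_vec = interest_vecs[0] + 0.01 * rng.normal(size=8)
label = predict_interest(query_vec)
```

In the real evaluation, multi-label queries take the k nearest interests rather than only the top one.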

With the learned word embedding dictionary, we can derive entity embeddings and run KNN search on top of them for retrieval. We also use textual terms as index tokens. However, the textual representation of complex entities has some limitations:

  • Completeness: Some terms are semantically close to each other while having totally different spellings. For long-tail queries, it is often difficult to find matching terms in candidates.
  • Compactness: A user or Pin may have up to hundreds of annotation terms. It is hard to summarize the theme of such a complex entity by using concrete text terms. Lengthy textual representation results in ambiguity.
  • Continuity: When a partial match happens, we need a quantitative, continuous way to decide whether we should return candidates for a particular query.

Pinterest specializes in visual search that shows many possible inspirational ideas rather than providing concrete answers to factual questions. This open-ended query nature makes the issues above even more obvious. Using a text embedding model, each entity can be compressed into a fixed-length real vector, which provides compact semantic representations in a unified universe. Therefore, we can match a query to candidates by a similarity measure in this common space instead of relying on exact term match, which to a large extent solves the compactness and completeness issues. The similarity score can be used as a continuous measure to filter candidates in a natural way or as a discriminative feature in supervised models. See the ads retrieval examples in Figure 5 below.
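The LSH tokens mentioned in Figure 1 can be illustrated with the classic random-hyperplane scheme: each bit records which side of a random hyperplane the embedding falls on, and the resulting bit string becomes an inverted-index token. This is a generic sketch of the technique, with toy sizes, not Pinterest's exact hashing setup.

```python
import numpy as np

# One random hyperplane per bit of the token.
rng = np.random.default_rng(3)
d, n_bits = 8, 6
hyperplanes = rng.normal(size=(n_bits, d))

def lsh_token(vec: np.ndarray) -> str:
    """Hash an embedding to a short bit string; similar vectors tend to collide."""
    bits = (hyperplanes @ vec > 0).astype(int)
    return "".join(map(str, bits))

v = rng.normal(size=d)
# Cosine similarity ignores scale, and so does the token:
assert lsh_token(v) == lsh_token(1.5 * v)
```

Pins sharing a token with the query's embedding can be fetched from the inverted index, then re-ranked by exact cosine similarity.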

Figure 5: Examples of PinText-based retrieval and search keyword broad match in Pinterest. These top four queries exhibit the diversity advantage of embedding-based retrieval when the query is not specific or exact term match is not good. The cosine similarity between queries and Pins can also serve as a ranking feature.

We are continuously investing in this area. Averaging word embeddings keeps our system simple and efficient, but it also loses information such as word order. Because our text usually consists of concrete terms rather than natural-language sentences, it is not straightforward to apply NLP models to it. We are actively developing the next version of PinText.

Acknowledgments: The author would like to thank teammates from the Pinterest Infra team — including Lida Li, June Liu, Suli Xu, Chengjie Liu — as well as the Ads Quality team for their seamless collaboration. We thank Stephanie deWet, Mukund Narasimhan, Jiafan Ou, and Nick Liu for many fruitful discussions.

Pinterest Engineering Blog
