Multi-task Learning and Calibration for Utility-based Home Feed Ranking

Pinterest Engineering Blog | Sep 4, 2020

Ekrem Kocaguneli | Software Engineer, Homefeed Ranking; Dhruvil Deven Badani | Software Engineer, Homefeed Ranking; Sangmin Shin | Engineering Manager, Homefeed Ranking

Home feed is one of the most important surfaces at Pinterest, driving a significant portion of engagement from the 400+ million people who visit each month. From a business standpoint, the home feed is also a revenue driver, as it is where most ads are shown to Pinners. Therefore, the way we surface personalized, engaging and inspiring recommendations in home feed is critical.

In this post we will cover how we switched from a single-output-node deep neural net (DNN) to a multi-task learning (MTL) based DNN. We will also cover how we calibrated each output node’s probability prediction so the predictions can be combined into a utility value, as well as the benefits of this new architecture.

Background and motivation

We used to score each user-Pin pair with a single output from a DNN trained with logistic loss (Figure 1) and rank Pins using these scores. This is a common setup for ranking models: an action such as click is chosen as the optimization target, and the ranker learns the click probability from historical engagement data, which is then used for ranking.

Figure 1. The ranking model with a single output node.

However, at Pinterest, engagement is multi-objective: there are long-clicks, close-ups and repins in addition to clicks. The single-output DNN incorporated these objectives into the ranking score by augmenting the logistic loss in the cost function with action weights. In Equation 1, y(i) ∈ {0,1} is the actual label and ŷ(i) ∈ [0,1] is the predicted score of the i-th instance out of m instances. The output of such a model is not a probability, but rather a ranking score that captures which Pin will be more engaging (a combination of click, long-click, close-up and repin). We call this combined ranking score the pinnability score.
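For concreteness, a weighted logistic loss of the kind described above can be sketched as follows, where w(i) is the action weight assigned to the i-th instance (the exact weighting scheme is an illustrative assumption, not the production formula):

    \mathcal{L} = -\frac{1}{m} \sum_{i=1}^{m} w^{(i)} \left[ y^{(i)} \log \hat{y}^{(i)} + \left(1 - y^{(i)}\right) \log\left(1 - \hat{y}^{(i)}\right) \right]

With all w(i) = 1 this reduces to the plain logistic loss of Equation 1.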

The pinnability score is effective at ranking user-Pin pairs, but it has a few shortcomings: the business value of different actions is baked into the training data and the weighted loss. Hence, the pinnability score is simply a floating-point number used to rank; it is not a probability and is not comparable from one model to another. This makes debugging and interpreting the model quite challenging.

So, we switched to a more flexible and interpretable ranking method based on the utility of a Pin to a Pinner. At a high level, the utility of a Pin is a combination of the probability values of different actions as in Equation 3. The actual utility function used in our ranking is more involved than Equation 3, the details of which are beyond the scope of this post.
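As a rough sketch of what such a utility looks like (the action set and weights below are illustrative, not the production values):

    \text{utility}(u, p) = \sum_{a \in \text{actions}} W_a \cdot P_a(u, p), \quad \text{e.g. } a \in \{\text{repin}, \text{click}, \text{long-click}, \text{close-up}\}

where P_a(u, p) is the calibrated probability that Pinner u performs action a on Pin p, and W_a is the business weight of that action.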

To get action-specific probability values P(action), we employed MTL where each output node predicts an action-specific probability. However, once we decide on the action weights W(action), we need the probability values to be stable and accurate between different models. To enable this, we used calibration models.

Utility-based ranking via MTL and calibration enabled us to do several things better:

  • Quickly change home feed characteristics: As in Equation 3, we weigh each action differently. At any time, we can decide to shift surface characteristics based on business needs (e.g., to promote more videos or to promote more repinnable/clickable content). These types of shifts previously required a cycle of training data augmentation, model parameter updates and multiple A/B experiments spanning multiple weeks. Now, we can simply adjust the utility weights of treatment groups and observe the effects within a few hours.
  • Ability to compare different Pin types: It used to be difficult to compare different Pin types such as organic and video. For example, we mainly look at view time on video Pins to measure engagement. The current setup enabled us to have different utility functions for different Pin types.
  • Improved model interpretability: Since we can monitor the calibration for each action type, we can better interpret changes. For example, if we see a candidate generator getting increased distribution, we can check its calibration and determine whether the increase is justified.

Multi-task learning model

MTL enables us to have multiple output nodes with representation sharing [1]. Each output node (head) optimizes for a certain action type, such as repin or click.

Each head predicts a binary label: whether the action happened or not. The predicted value from each head is designed to be a probability score. Our current MTL model is shown in Figure 2, along with the additional calibration models.

Figure 2. MTL-based ranking model with multiple output nodes for separate action types and corresponding calibration models.

We also need to redefine the cost function. Unlike the previous pinnability model, which had a single binary label, in this model the label is a vector of n actions (n=4 in the first version). Hence, the loss for each round of predictions is summed over all 4 actions as in Equation 4, where yᵢ ∈ ℝ⁴ and ℒ is the logistic loss given in Equation 1. Thanks to the parameter sharing that MTL provides among different objectives, switching to MTL alone, even without utility-based ranking, provided engagement metric improvements.
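As a minimal sketch of this setup (PyTorch, with hypothetical layer sizes and head names; the production AutoML features and architecture are not shown):

    import torch
    import torch.nn as nn

    ACTIONS = ["repin", "click", "long_click", "closeup"]  # assumed head names

    class MultiTaskRanker(nn.Module):
        """Shared fully-connected layers with one output node (head) per action."""
        def __init__(self, input_dim: int, hidden_dim: int = 256):
            super().__init__()
            # Hidden layers are shared across all objectives (hard parameter sharing).
            self.shared = nn.Sequential(
                nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            )
            self.heads = nn.ModuleDict({a: nn.Linear(hidden_dim, 1) for a in ACTIONS})

        def forward(self, x):
            h = self.shared(x)
            # Per-action logits; apply a sigmoid at serving time to get uncalibrated probabilities.
            return {a: head(h).squeeze(-1) for a, head in self.heads.items()}

    def multi_task_loss(logits: dict, labels: dict) -> torch.Tensor:
        """Sum of per-action logistic losses over the heads, in the spirit of Equation 4."""
        bce = nn.BCEWithLogitsLoss()
        return sum(bce(logits[a], labels[a].float()) for a in ACTIONS)

At serving time, torch.sigmoid applied to each head’s logit gives the uncalibrated per-action probability, which is then passed to the corresponding calibration model.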

The DNN in Figure 2 is only the last part (the fully-connected layers) of a larger AutoML model we use for ranking. The larger model is composed of 4 components that learn feature representations from raw features. While we will not go into the first 3 components in this blog post, it suffices to say that they are responsible for learning representations of and crosses among features without requiring engineers to worry about feature engineering.

Calibration of output node predictions via a calibration model

Calibration is a post-processing technique used to improve the probability estimates of a learner; it tells us whether we are over- or under-predicting. There are a number of techniques that can be used, such as Platt scaling, isotonic regression [6] or downsampling correction [3]. For binary classification, calibration measures the relationship between observed behavior (e.g., empirical click-through rate) and predicted behavior (e.g., predicted CTR), as given in Equation 5. We need a calibration model on each DNN output node to make sure that the predicted probabilities align well with empirical rates.
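One common way to express this relationship (and a plausible reading of Equation 5, though the exact form is an assumption) is the ratio of aggregate predicted to aggregate observed positives, where a value of 1.0 means the model is calibrated in aggregate:

    \text{calibration} = \frac{\sum_{i=1}^{m} \hat{y}^{(i)}}{\sum_{i=1}^{m} y^{(i)}} \approx \frac{\text{predicted CTR}}{\text{empirical CTR}}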

Initially, we tried to simply incorporate the positive downsampling rate 𝛼 and the negative downsampling rate 𝛽 as in Equation 6, where p is the probability estimate of the DNN and q is the adjusted probability.
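Assuming 𝛼 and 𝛽 denote the fractions of positives and negatives kept during downsampling, the standard odds correction (following [3]) takes the form below; whether this matches Equation 6 exactly is an assumption:

    q = \frac{\beta \, p}{\beta \, p + \alpha \, (1 - p)}

With 𝛼 = 1 (no positive downsampling) this reduces to the familiar q = p / (p + (1 - p)/𝛽) from [3].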

This method did not work well because our training data generation pipeline not only downsamples positives and negatives, but also enforces stratified sampling around geography, user-state and positive/negative distribution (which helps us rank better, but makes calibration harder).

We realized that the calibration had to act as a transfer-learning layer that maps ranking-optimized probabilities to empirical rates. For this work, we opted for a logistic regression (LR) model, which can be viewed as a heavily extended Platt scaling technique. Instead of learning just a weight on the probability score (p) and a bias term (b) to produce the calibrated probability (p*), as in Equations 7 and 8, we used an LR model with 80+ features.
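For reference, standard Platt scaling fits only a single weight and bias on a transform of the model score; a plausible rendering of Equations 7 and 8 (the exact transform of p is an assumption) is:

    p^{*} = \sigma\left(w \cdot \log\frac{p}{1 - p} + b\right), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}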

Training data generation and featurization

In order to train calibration models that can learn empirical rates, we created a new training data generation pipeline without any stratified sampling.

The pipeline comprised two parts:

  1. Raw data logs coming from application servers, which contain the Pin IDs, context information and the raw logs required for featurization (shown in purple in Figure 3)
  2. Label information that we obtain after Pinners view and act on the impressed Pins, which is stored in feedview logs (shown in orange in Figure 3).

Combining 1) and 2) provided us with label and raw feature information to create training data for each calibration model.

The data for each calibration model is the same except for the labels. The repin calibration data marks only repins as positive labels, click data marks only clicks as positive, and so on.

Figure 3. Calibration model training data generation pipeline.

We uniformly sampled 10% of the user logs to reduce training data size. We collected 7 days of logs as training data to get rid of any day-of-week effects and used the following day as test data. The performance on test data matched the model’s online calibration performance.

While we mainly relied on total calibration error (Equation 5) and reliability diagrams, we also used the following performance measures (sketched in code below):

  • Log loss
  • Expected calibration error [4,5]
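A minimal sketch of these two measures (the equal-width binning and bin count for ECE are illustrative choices, not the production configuration):

    import numpy as np

    def log_loss(y_true: np.ndarray, y_pred: np.ndarray, eps: float = 1e-15) -> float:
        """Average negative log-likelihood of the observed binary labels."""
        p = np.clip(y_pred, eps, 1 - eps)
        return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

    def expected_calibration_error(y_true: np.ndarray, y_pred: np.ndarray, n_bins: int = 20) -> float:
        """Binary-probability adaptation of ECE [4, 5]: bin predictions into
        equal-width bins and average the per-bin gap between the empirical
        positive rate and the mean predicted probability, weighted by bin size."""
        bin_idx = np.minimum((y_pred * n_bins).astype(int), n_bins - 1)
        ece = 0.0
        for b in range(n_bins):
            mask = bin_idx == b
            if mask.any():
                ece += mask.mean() * abs(y_true[mask].mean() - y_pred[mask].mean())
        return float(ece)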

Features and model training

To learn a good mapping of probabilities coming from a DNN trained on stratified-sampled data, we needed to provide the model with a number of features capturing different aspects:

  • Bias and position features: Binary and categorical features such as app type, device type, gender, bucketized positions and cluster IDs. The cluster IDs are generated by an internal algorithm mapping Pins to a pool of clusters.
  • User and Pin performance features: The performance (repin rate, click rate, close-up rate) of Pins and Pinners at different time granularities such as the past 3 hours, 1 day, 3 days, 30 days and 90 days.
  • Feedback-loop features: The empirical action rates on the platform in the last 30 minutes, aggregated overall, by country and by gender crossed with country. These features were included to capture fluctuations during the day.

As for the model, we chose logistic regression (LR) trained with cross-entropy (CE) loss. CE loss helps LR achieve good calibration: depending on the true class and the predicted probability, it incurs a smaller or larger loss, rewarding correct classifications and heavily penalizing wrong classifications made with high certainty (Figure 4).

Figure 4. Cross-entropy loss and why LR calibrates well.
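The CE loss referenced in Figure 4 is the standard binary cross-entropy for a true label y and predicted probability p:

    \mathcal{L}_{CE}(y, p) = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]

As p moves away from the true label, the loss grows without bound, which is what penalizes confident mistakes so heavily.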

The final model is akin to Platt scaling with a large number of features.
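A plausible form of this model, assuming the uncalibrated probability p enters through its log-odds alongside the other calibration features x_1, …, x_n (the exact parameterization is an assumption):

    p^{*} = \sigma\left(w_{0} \cdot \log\frac{p}{1 - p} + \sum_{k=1}^{n} w_{k} x_{k} + b\right)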

Replaying a new DNN on calibration training data

One problem we haven’t addressed so far is how to train a calibration model for a newly trained model, DNN_new. Because DNN_new is not serving yet, all of the calibration training data is generated by the production model, DNN_prod. This means the calibration features are different, since DNN_new and DNN_prod may produce different predicted scores.

We solved this by a three-step simulation process (Figure 5):

  1. We make sure to keep DNN logs and calibration logs for the same 10% of users. Let’s call these the “common logs” (although the calibration and DNN logs used for training are different, they are created for the same user-Pin pairs).
  2. We get predictions from DNN_new against DNN evaluation data generated from the common logs.
  3. We replace the uncalibrated probability feature values in the calibration training logs with the predictions from step 2 (a simplified sketch of this replacement follows Figure 5).
Figure 5. Featurization step with simulation.
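A simplified sketch of this replacement (step 3), assuming the logs are available as pandas DataFrames keyed by user and Pin IDs; the column names and schema are hypothetical:

    import pandas as pd

    def replay_new_dnn(calib_logs: pd.DataFrame, new_dnn_scores: pd.DataFrame, action: str) -> pd.DataFrame:
        """Swap the production DNN's uncalibrated probability feature for the
        new DNN's predictions on the same user-Pin pairs (hypothetical schema)."""
        # new_dnn_scores holds DNN_new predictions obtained by scoring the DNN
        # evaluation data built from the common logs (step 2).
        scores = new_dnn_scores[["user_id", "pin_id", f"p_{action}_new"]]
        out = calib_logs.merge(scores, on=["user_id", "pin_id"], how="inner")
        # Overwrite the uncalibrated probability feature so the calibration model
        # is trained as if DNN_new, not DNN_prod, had served these requests.
        out[f"uncalibrated_p_{action}"] = out.pop(f"p_{action}_new")
        return out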

Monitoring and alerting

We monitor and alert in real time on the calibrated probabilities of the production model.

We also have a daily report to monitor the production and experimental models’ calibration errors. If they are over- or under-calibrated beyond a certain threshold for any of the monitored actions, we alert the on-call engineer.

The calibration error and the calibrated probabilities are highly sensitive to changes in features. The changes can be either in DNN features (which affect the uncalibrated probability values and, in turn, the calibrated probabilities) or in calibration features. We were able to capture incidents in our system via calibration monitoring and alerting before they significantly affected topline metrics.

Additional use cases

In addition to utility-based ranking, the MTL / calibration framework unlocked multiple use cases.

Video distribution

Video distribution was one of our main objectives in 2019, and achieving it via the old framework was difficult. In the MTL framework, we first defined a positive label for videos: was the video viewed for more than 10 seconds? We then added a new output node to the MTL model to predict this label, calibrated it, and added it into a video-specific utility that includes only video-relevant actions: repins, close-ups and 10-second views. This increased our video distribution by 40% with increased engagement rates.
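A sketch of how per-Pin-type utilities can be expressed; the weights and action sets are illustrative, not the production values:

    # Hypothetical utility weights per Pin type; the calibrated probabilities
    # come from the corresponding MTL heads and calibration models.
    UTILITY_WEIGHTS = {
        "organic": {"repin": 1.0, "click": 0.6, "long_click": 0.8, "closeup": 0.3},
        "video":   {"repin": 1.0, "closeup": 0.3, "video_10s_view": 1.5},
    }

    def utility(pin_type: str, calibrated_probs: dict) -> float:
        """Weighted sum of calibrated action probabilities for a given Pin type."""
        weights = UTILITY_WEIGHTS[pin_type]
        return sum(w * calibrated_probs.get(action, 0.0) for action, w in weights.items())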

Hide modeling

We were also able to model negative engagement using the MTL framework, with a small twist in the utility function.

Similar to videos, we first defined a label for negative engagement: Pin hides. Then we added an MTL node and calibration model for hides. Lastly, we added a high negative weight on hide probabilities (Equation 11).
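A plausible reading of Equation 11, with W_hide > 0 so that the hide term subtracts from the utility (the exact weights are illustrative):

    \text{utility}(u, p) = \sum_{a \in \text{positive actions}} W_a \cdot P_a(u, p) \;-\; W_{\text{hide}} \cdot P_{\text{hide}}(u, p)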

Learnings and pitfalls

Although MTL is a powerful tool for handling multiple objectives, it has its pitfalls. For example, in the hard parameter sharing approach [1] we used, the hidden layers are shared by multiple objectives, so each newly added objective affects the others. It is therefore important that tasks are complementary: simply adding the hide head increased hides. We were able to reverse this in the utility function by effectively pushing Pins with high hide predictions to the end of the list.

MTL is also not a silver bullet. The task being modeled still needs a decent amount of training data to affect the shared hidden layers. For example, besides hides, we also tried to model Pin reports. However, the number of reports can be small: adding an output node for reports had no effect, and we were unable to calibrate the action.

Conclusion

This work led us to several wins in Pinner engagement, business goals and developer velocity:

  • We were able to show more relevant Pins to users by improving the accuracy of our predictions.
  • We improved engineering velocity by separating the model predictions from the ranking layer. We can now iterate on ranking functions by modifying utility terms and, in parallel, iterate on the models.
  • We helped the business by enabling stakeholders to quickly adjust ranking based on business needs.

Acknowledgments

This was a large project involving contributions from a number of engineers and managers. Specifically we would like to thank Utku Irmak, Crystal Lee, Xin Liu, Chenjin Liang, Cosmin Negruseri, Yaron Greif, Derek Zhiyuan Cheng, Tao Cheng, Randall Keller, Mukund Narasimhan and Vijay Narayanan.

References

[1] Ruder, Sebastian. “An overview of multi-task learning in deep neural networks.” arXiv preprint arXiv:1706.05098 (2017).

[2] Bella, Antonio, et al. “Calibration of machine learning models.” Handbook of Research on Machine Learning Applications (2009): 128–146.

[3] He, Xinran, et al. “Practical lessons from predicting clicks on ads at Facebook.” Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. ACM, 2014.

[4] Guo, Chuan, et al. “On calibration of modern neural networks.” Proceedings of the 34th International Conference on Machine Learning - Volume 70. JMLR.org, 2017.

[5] Naeini, Mahdi Pakdaman, Gregory Cooper, and Milos Hauskrecht. “Obtaining well calibrated probabilities using Bayesian binning.” Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.

[6] McMahan, H. Brendan, et al. “Ad click prediction: a view from the trenches.” Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. 2013.
