P-nDCG: a new learning-to-rank metric for large lists of imbalanced binary data

By Ulric Wong

Valassis Engineering Blog
Nov 21, 2019


Source: https://www.pexels.com/photo/close-up-photography-of-a-white-line-209981/

We recently took a closer look at some of the metrics we use for evaluating machine learning models here at Valassis Digital. It is common practice to use such metrics both to evaluate trained models and to guide model selection during hyper-parameter tuning. Many metrics were created with web search results in mind, where only the first handful of returned results matters. In contrast, machine learning models in online advertising must produce extremely large lists of ranked results. Positive cases are also extremely rare in online advertising, meaning models must be trained and evaluated on highly imbalanced data. How many evaluation metrics actually perform well in this highly specialized use case?

At Valassis Digital we run online advertising campaigns by serving advertisements (“ads”) to people (“users”) as they view web pages in their browsers. This is accomplished via a real-time auction that occurs when a user loads a web page. Interested parties offer competing bids for the right to display a digital ad in one of the available ad slots, usually along the top or side of the page. The goal is to get a user to make a purchase or take an action that benefits the client, i.e., to “convert”.

Figure 1. Example of a user’s experience of being served an ad, clicking, and eventually completing a conversion in their web browser.

Valassis Digital attempts to optimize cost-per-action (CPA) for clients by maximizing the number of ads served that lead a user to convert, while minimizing serving to those who do not convert. The scale and real-time nature of the problem mean that Valassis Digital trains and deploys machine learning classifiers to decide how much monetary value to assign to bids for opportunities to serve ads. The lists of users scored by our machine learning models can be huge: millions of users per ad campaign per day. Conversions are usually rare events, so the data sets used to train models to predict whether a person will convert if served an ad exhibit extreme class imbalance.

Valassis Digital’s machine learning classifiers produce lists of millions of user scores ranked by the probability of conversion, which is a parameter for real-time bidding. Therefore, a desirable metric for evaluating these models is one that deals well with:

  1. evaluation of long lists of user scores with millions of entries
  2. imbalanced data
  3. evaluation of both Type I and Type II error
  4. ranking of conversions within a list of observations scored by a trained estimator
  5. evaluation of absolute performance of a trained estimator on a set of data
  6. evaluation of relative performance between trained estimators applied to the same data

However, many of the most popular model evaluation metrics do not have these desirable qualities.

Popular metrics for binary lists

At Valassis Digital we have employed a number of different metrics for evaluating machine learning classifiers. Several metrics can be used to evaluate ranked, binary lists, including AUC:ROC (area under the curve: Receiver Operating Characteristics), AUC:PR (area under the curve: Precision-Recall) and NDCG (Normalized Discounted Cumulative Gain).

Given a list of users and conversion probabilities, we can choose a static threshold in order to discriminate between positive and negative cases. This allows for the calculation of the confusion matrix and subsequent computation of Sensitivity, Specificity, FPR, etc. For reference see this definition of classification terms.
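As a rough sketch, the confusion matrix and the derived rates at a single threshold might be computed as follows (the arrays and the 0.5 cutoff are hypothetical, for illustration only):

```python
import numpy as np

def confusion_rates(y_true, scores, threshold=0.5):
    """Confusion-matrix counts and derived rates at a fixed score threshold."""
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "sensitivity": tp / (tp + fn),   # true positive rate (recall)
        "specificity": tn / (tn + fp),   # true negative rate
        "FPR": fp / (fp + tn),           # false positive rate = 1 - specificity
        "precision": tp / (tp + fp),
    }

# Hypothetical conversion labels and classifier scores
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
scores = np.array([0.9, 0.4, 0.2, 0.7, 0.6, 0.1, 0.3, 0.8])
print(confusion_rates(y_true, scores, threshold=0.5))
```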

If we vary the threshold, instead of choosing a static one, we can plot the ROC and PR curves, which allow us to compute AUC:ROC and AUC:PR. Because both the ROC and PR curves are generated from the confusion matrix, there is a one-to-one correspondence between the two curves [Davis and Goadrich, 2006]. Moreover, when one classifier dominates another in ROC space, meaning its ROC curve is uniformly higher, it also dominates in PR space.

However, it seldom happens that one classifier completely dominates another for every pair of FPR and TPR, or of precision and recall. For this reason, it is common to compute the AUC in order to choose between classifiers. Both AUC:ROC and AUC:PR are threshold-free, i.e., they evaluate the score estimates without choosing a value that separates positive and negative predictions, so they can be used to evaluate ranked lists.
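Both areas can be computed directly from scores and labels without picking a threshold; a minimal sketch with scikit-learn might look like this (average precision is used here as the usual estimate of AUC:PR, and the arrays are hypothetical, matching the earlier sketch):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical conversion labels and classifier scores
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])
scores = np.array([0.9, 0.4, 0.2, 0.7, 0.6, 0.1, 0.3, 0.8])

auc_roc = roc_auc_score(y_true, scores)            # area under the ROC curve
auc_pr = average_precision_score(y_true, scores)   # average precision, a common estimate of AUC:PR
print(f"AUC:ROC = {auc_roc:.3f}, AUC:PR = {auc_pr:.3f}")
```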

Finally, the PR curve is preferable to the ROC curve for our use case, since the latter is insensitive to class balance, i.e., it is invariant to the ratio of positive to negative cases [Saito and Rehmsmeier, 2015]. At Valassis Digital we routinely deal with extremely imbalanced data sets and therefore prefer the PR curve.

A learning-to-rank metric: nDCG

A popular metric used to evaluate ranked lists is Normalized Discounted Cumulative Gain (nDCG); for a reference, see Discounted cumulative gain.

The general definition of NDCG is:

Equation 1. Formal definition of standard NDCG
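One standard way to write these quantities, using the customary base-2 logarithmic discount (the base does not matter after normalization, as noted below), is:

```latex
\mathrm{DCG}_v = \sum_{i=1}^{v} \frac{g(\mathrm{rel}_i)}{\log_2(i+1)},
\qquad
\mathrm{IDCG}_v = \sum_{i=1}^{|\mathrm{REL}_v|} \frac{g(\mathrm{rel}_i)}{\log_2(i+1)},
\qquad
\mathrm{nDCG}_v = \frac{\mathrm{DCG}_v}{\mathrm{IDCG}_v}
```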

where v is the total number of elements (or documents) in the list, g(⋅) is a generic function that takes values in ℝ⁺, relᵢ is the relevance score of the iᵗʰ element, and RELᵥ is the list with elements ranked by their value of relᵢ, i.e., an ideally ranked list. A typical choice for g(⋅) is g(x) = x. The function g(x) = 2ˣ − 1 is also used whenever it is of interest to give higher weight to the most relevant documents [Burges et al., 2005]. IDCGᵥ is a normalizing constant so that nDCGᵥ ∈ [0, 1]; it represents the ideal value of the list, attained when all the most relevant documents are placed at the top. Because of this normalization, nDCG is unaffected by the choice of the logarithm base. At Valassis Digital we are dealing with a binary dependent variable (i.e., the user converted / did not convert), so we set the relevance scores relᵢ = yᵢ ∈ {0, 1}. In particular, yᵢ is equal to 1 whenever the iᵗʰ observation is a converter (a positive case), meaning the user interacted with the ad. Finally, users are ranked by the machine learning classifier’s score estimate, which represents the likelihood of the user being a converter.

Discounting functions and NDCG

The logarithmic discount has been primarily employed thanks to the smoothness of the function [Croft et al., 2010]. Nevertheless, other functions can be used, for example the Zipfian 1/x function [Kanoulas and Aslam, 2009], as long as they are decreasing. [Wang et al., 2013] show that any decreasing function that decreases more slowly than the Zipfian has the consistent distinguishability property, i.e., “for two substantially different ranking functions, the ranking measure can decide which one is better consistently on almost all datasets” [Wang et al., 2013], a desirable property for a discount function. We call the nDCGᵥ with logarithmic discount “standard nDCG”, as in [Wang et al., 2013].

The standard nDCG has been primarily employed to evaluate the quality of web search results, as the logarithmic discount places high weight on the first positions of the list and then becomes flatter as the rank increases. For example, consider two lists, each containing 10,000 users and 10 converters. A list with a single converter in the top 100 and the other 9 converters after position 100 will be preferred over a list with all 10 converters ranked just after position 500, even though position 500 is still near the top of a 10,000-row list. However, in our setting we are interested in evaluating lists that contain millions of users. Moreover, when the lists are naturally sparse, meaning that they contain few converters, the standard nDCG metric usually produces very low values, and we empirically observed high variability when computing it.
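To make the two-list comparison concrete, here is a small sketch of standard nDCG for binary lists; the exact converter positions are hypothetical, chosen only to match the scenario just described:

```python
import numpy as np

def ndcg_binary(positive_ranks, n_positives):
    """Standard nDCG with a log2 discount for a binary list, given the
    1-based ranks of the positive cases. Negative cases contribute zero
    gain, so only the positive ranks matter."""
    ranks = np.asarray(positive_ranks, dtype=float)
    dcg = np.sum(1.0 / np.log2(ranks + 1))
    ideal_ranks = np.arange(1, n_positives + 1, dtype=float)
    idcg = np.sum(1.0 / np.log2(ideal_ranks + 1))
    return dcg / idcg

# Hypothetical placements, out of 10,000 users with 10 converters per list:
list_a = [50] + list(range(101, 110))   # one converter in the top 100, nine just after
list_b = list(range(501, 511))          # all ten converters just after position 500

print(ndcg_binary(list_a, 10))  # higher score
print(ndcg_binary(list_b, 10))  # lower score, despite all converters sitting "near" the top
```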

Probability discount for nDCG (P-nDCG)

We now propose a new discounting function for nDCG. The main idea is to take into account probability estimates for the observations while computing the metric, not only for ranking the observations in the list.

Consider, for example, two lists with identical binary outcomes, [1, 0, 1] and [1, 0, 1]. Applying standard nDCG to the lists will of course return the same value, which is equal to 0.91. However, the results should be substantially different if the probability estimates for the two lists are [0.9, 0.8, 0.1] and [0.9, 0.8, 0.7], respectively. In this case, we should prefer the second list.

For this reason we implemented a version of nDCG, which we call P-nDCG, that discounts the positive cases in the ranked list according to the estimated probabilities. In particular:

Equation 2. Numerator and Denominator of P-nDCG
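In formulas, consistent with the definitions and the worked values that follow (the list is assumed sorted by the classifier’s estimated probabilities, highest first):

```latex
\mathrm{DCG} = \sum_{i=1}^{v} y_i \, p_i^{\mathrm{est}},
\qquad
\mathrm{IDCG} = \sum_{i=1}^{n_+} p_i^{\mathrm{sorted}},
\qquad
\text{P-nDCG} = \frac{\mathrm{DCG}}{\mathrm{IDCG}}
```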

where pᵢᵉˢᵗ is the estimated probability for the iᵗʰ observation, yᵢ is the true value, pˢᵒʳᵗᵉᵈ is the vector of estimated probabilities sorted in decreasing order, and n₊ is the number of positive cases in the data set, i.e., n₊ = Σ yᵢ. The highest value of the DCG occurs when the positive cases occupy the first positions of the list and the negative cases follow. With this new metric, we evaluate the two lists above differently: P-nDCG returns the values 0.59 and 0.94, respectively. Finally, notice that p₁ˢᵒʳᵗᵉᵈ ≥ p₂ˢᵒʳᵗᵉᵈ ≥ … is non-increasing by definition, since we are sorting. Fig. 2 shows the logarithmic and Zipfian discounting functions and the newly proposed method, where we define the Zipfian-beta function as 1/xᵝ. Notice from the plot that the log function has a steep decrease and is then almost flat. On the other hand, the probability discount decreases gradually. Flat regions of the P-nDCG discount correspond to users that have very similar profiles.
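To make the computation concrete, here is a minimal sketch of P-nDCG as defined above (the helper name p_ndcg is ours, for illustration only); it reproduces the two example values:

```python
import numpy as np

def p_ndcg(y_true, p_est):
    """P-nDCG: positive cases are discounted by their own estimated probabilities,
    and the ideal DCG assumes the positives occupy the n_+ highest scores."""
    y_true = np.asarray(y_true, dtype=float)
    p_est = np.asarray(p_est, dtype=float)
    n_pos = int(y_true.sum())
    dcg = np.sum(y_true * p_est)                  # gain collected only at the positive cases
    idcg = np.sum(np.sort(p_est)[::-1][:n_pos])   # sum of the n_+ largest estimated probabilities
    return dcg / idcg

print(round(p_ndcg([1, 0, 1], [0.9, 0.8, 0.1]), 2))  # 0.59
print(round(p_ndcg([1, 0, 1], [0.9, 0.8, 0.7]), 2))  # 0.94
```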

Figure 2. A comparison of discounting functions for normalization of NDCG

The nDCG metric is consistently distinguishable when the discount function decreases more slowly than the standard Zipfian function 1/x [Wang et al., 2013]. Since the majority of the models we deploy do not have a parametric form, it is infeasible to characterize the right tail of the distribution of the predicted probabilities analytically. However, in our empirical simulations we found that the right tail of the estimated probabilities decreases more slowly than the Zipfian function.
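Since a formal proof is out of reach for non-parametric models, one informal diagnostic is simply to compare the sorted probability estimates against a scaled Zipfian reference at a few ranks. The sketch below uses a synthetic score vector purely for illustration; with real model outputs, the same comparison indicates how the tail decays relative to 1/x:

```python
import numpy as np

# Hypothetical, synthetic score vector standing in for a classifier's predicted probabilities
rng = np.random.default_rng(0)
p_est = rng.beta(1, 50, size=1_000_000)   # skewed toward small values, as with rare conversions

p_sorted = np.sort(p_est)[::-1]
ranks = np.array([10, 100, 1_000, 10_000, 100_000, 1_000_000])
zipf = p_sorted[0] / ranks                # a Zipfian 1/x reference scaled to the top score

for r, z in zip(ranks, zipf):
    print(f"rank {r:>9}: sorted probability = {p_sorted[r - 1]:.2e}   1/x reference = {z:.2e}")
```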

Conclusion

In this post we described P-nDCG, a new metric designed for binary lists with millions of observations. P-nDCG uses the probability estimates produced by the machine learning classifier both for ranking the users and as a discount function for computing the metric. This modification to the discount function effectively favors lists that assign higher relative importance to positive cases.

As a result, P-nDCG is a more effective metric for machine learning problems involving large, imbalanced binary lists. Compared to standard nDCG, P-nDCG evaluates a classifier’s ranking efficacy deeper into lists of millions of users, rather than just at the top of the list. This allows for more meaningful model evaluation and model selection during hyper-parameter tuning, which in turn yields better model performance. When lists are sparse, P-nDCG produces less variable values than standard nDCG, making it a more interpretable metric for imbalanced data.

Special Thanks

Federico Ferrari, a PhD candidate in statistics at Duke University, developed the P-nDCG metric during his internship with the Data Science Team at Valassis Digital.

References

Christopher Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Gregory N. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML-05), pages 89–96, 2005.

W. Bruce Croft, Donald Metzler, and Trevor Strohman. Search Engines: Information Retrieval in Practice, volume 520. Addison-Wesley, 2010.

Jesse Davis and Mark Goadrich. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning, pages 233–240. ACM, 2006.

Discounted cumulative gain. Wikipedia, the free encyclopedia. URL https://en.wikipedia.org/wiki/Discounted_cumulative_gain.

Evangelos Kanoulas and Javed A. Aslam. Empirical justification of the gain and discount function for nDCG. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pages 611–620. ACM, 2009.

Takaya Saito and Marc Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3):e0118432, 2015.

Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, and Tie-Yan Liu. A theoretical analysis of NDCG ranking measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013), volume 8, page 6, 2013.
