Pointwise, Pairwise and Listwise Learning to Rank

Mayur Bhangale
4 min read · Apr 16, 2020


“Learning to rank is a task to automatically construct a ranking model using training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance.” — Tie-Yan Liu, Microsoft Research (2009)

As described in the previous post, Learning to rank (LTR) is a core part of modern search engines and critical for recommendations and voice and text assistants. We also saw various evaluation metrics and some traditional IR models. This post gives an in-depth overview of the pointwise, pairwise and listwise approaches to LTR.

Problem Definition

The ranking R produced by a ranker function fθ over a document set D is
R = (R1, R2, R3, …)

where documents are ordered by their descending scores:
fθ(R1) ≥ fθ(R2) ≥ fθ(R3) ≥ . . .
and every document is in the ranking:
d ∈ D ⇐⇒ d ∈ R

(medium really makes it difficult to write equations)

Supervised and semi-supervised LTR methods require the following data:

  • Queries, denoted by q
  • Results from the existing search ranking function, a.k.a. the results you want to re-rank, also referred to as a ‘document’ in the web search context. Let’s refer to this as: d
  • Labels for each query-result pair (relevant/not relevant).
    In a real-world setting, such data can be obtained from search logs. We will refer to the label as y.

So for a document d, relevancy can be indicated as y(d). To restrict scope and for ease of understanding, we will not discuss the case where the same document appears for multiple queries, hence keeping the query out of the notation y(d).


Pointwise Learning to Rank

In pointwise LTR, we frame the ranking problem like any other machine learning task: predict the labels using a classification or regression loss.

However, the problem with this approach is that we optimise for being close to the label, not for ranking documents correctly, thereby compromising the ordering.
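To make this concrete, here is a minimal sketch (the function name and toy numbers are illustrative, not from the original post) showing how a pointwise squared-error loss can prefer scores that are numerically close to the labels even when they produce a worse ranking:

```python
import numpy as np

def pointwise_mse(scores, labels):
    """Pointwise LTR loss: each document is scored independently
    against its relevance label; the ordering is never considered."""
    return np.mean((scores - labels) ** 2)

labels = np.array([2.0, 1.0, 0.0])          # graded relevance: d1 > d2 > d3
good_order = np.array([0.5, 0.4, 0.3])      # correct ranking, scores far from labels
bad_order = np.array([1.0, 2.0, 0.0])       # wrong ranking, scores close to labels

# MSE prefers the mis-ranked scores because they are numerically closer
print(pointwise_mse(good_order, labels))    # 0.9
print(pointwise_mse(bad_order, labels))     # ~0.667
```

Even though `good_order` ranks the documents correctly, the pointwise loss rewards `bad_order`, which is exactly the failure mode described above.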

Pairwise Learning to Rank

Building on the shortcomings of the pointwise approach, pairwise LTR is the first real ranking approach: it ranks documents based on relative score differences rather than closeness to the label. The loss here is based on pairs of documents with a difference in relevance.
The unnormalised pairwise hinge loss can be written as:

L(fθ) = Σ over pairs (di, dj) with y(di) > y(dj) of max(0, 1 − (fθ(di) − fθ(dj)))

Here, we sum over all pairs where one document is more relevant than the other, and the hinge loss pushes the score of the more relevant document above that of the less relevant one. Once the score difference exceeds the margin of 1, max() clips the loss to zero and the pair is no longer optimised. This pushes documents apart whenever there is a relevance difference, so the scores don’t have to match the labels; they just have to be properly ranked.
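A minimal sketch of this loss (function name and toy data are my own, assuming a margin of 1 as in the formulation above):

```python
import numpy as np

def pairwise_hinge_loss(scores, labels, margin=1.0):
    """Unnormalised pairwise hinge loss: sum over pairs (i, j)
    with y(di) > y(dj) of max(0, margin - (f(di) - f(dj)))."""
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                loss += max(0.0, margin - (scores[i] - scores[j]))
    return loss

labels = np.array([2, 1, 0])
# correctly ranked, with every pair separated by at least the margin:
print(pairwise_hinge_loss(np.array([3.0, 2.0, 1.0]), labels))  # 0.0
# completely reversed ranking: every relevant pair is penalised
print(pairwise_hinge_loss(np.array([1.0, 2.0, 3.0]), labels))  # 7.0
```

Note that only the score differences matter: shifting all scores by a constant leaves the loss unchanged, which is exactly the "scores don’t have to match the labels" property.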
Pairwise LTR has one issue: every document pair is treated as equally important. This is not useful in a real-world scenario, because we expect our search systems to answer correctly in the top 10 results, not in the top 50.
Such a ranking system does not look at where the pairs it is trying to fix sit in the ranking, so it can compromise the quality of the top 10 results to improve the tail, e.g. the top 50: it improves the ranking far down the list while degrading it at the top.

Another, better approach was definitely required.

Listwise Learning to Rank

The idea behind listwise LTR is to optimise ranking metrics directly.
For example, if we care about DCG (discounted cumulative gain) — a popular ranking metric discussed in the previous post — then with listwise LTR we would optimise this metric directly. If the metric is what tells us the quality of a ranking, then that is what we should optimise as directly as possible.

The problem with DCG?
The rank(di) term inside the discount log2(rank(di) + 1) is not differentiable with respect to the model scores, so we cannot use something like stochastic gradient descent (SGD) here.
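The non-differentiability is easy to see in code: the rank of a document only enters through a sort, and a sort is a piecewise-constant (hence gradient-free) function of the scores. A small sketch, with names of my own choosing:

```python
import numpy as np

def dcg(scores, labels):
    """DCG of the ranking induced by sorting documents by score
    (descending). rank(di) enters only through argsort, which is
    piecewise constant in the scores -- hence not differentiable."""
    order = np.argsort(-scores)              # ranking induced by the scores
    ranks = np.arange(1, len(scores) + 1)    # positions 1, 2, 3, ...
    gains = 2.0 ** labels[order] - 1.0
    return np.sum(gains / np.log2(ranks + 1))

labels = np.array([2.0, 1.0, 0.0])
# the ideal ordering gives the highest DCG for these labels
print(dcg(np.array([3.0, 2.0, 1.0]), labels))
print(dcg(np.array([1.0, 2.0, 3.0]), labels))
# perturbing scores without changing the order leaves DCG unchanged:
print(dcg(np.array([3.1, 2.0, 1.0]), labels) == dcg(np.array([3.0, 2.0, 1.0]), labels))
```

The last line is the crux: infinitesimal changes to the scores usually change DCG not at all (zero gradient), and occasionally change it by a jump, which is why SGD cannot be applied to it directly.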

To solve this problem, we typically:
1. Use probabilistic approximations of ranking (e.g. ListNet)
2. Use heuristics or bounds on the metric (e.g. LambdaRank, LambdaLoss)

For example, the LambdaRank loss is a proven bound on DCG:

L(fθ) = Σ over pairs (di, dj) with y(di) > y(dj) of |ΔDCG(i, j)| · log2(1 + e^(−σ (fθ(di) − fθ(dj))))

Here, we again sum over document pairs, but now each pair’s logistic loss (the log() term in the equation) is weighted by how much DCG changes when the pair is swapped (the absolute DCG delta).
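A rough sketch of this weighting, assuming σ = 1 and the standard 2^label − 1 gain; the function name and the exact form of the ΔDCG computation are my own illustration rather than the canonical LambdaRank implementation:

```python
import numpy as np

def lambda_rank_loss(scores, labels, sigma=1.0):
    """Sketch of the LambdaRank pairwise loss: for each pair with
    y(di) > y(dj), a logistic loss log2(1 + exp(-sigma * (si - sj)))
    weighted by |delta DCG|, the change in DCG from swapping the pair."""
    order = np.argsort(-scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)   # 1-based rank of each doc
    gains = 2.0 ** labels - 1.0
    discounts = 1.0 / np.log2(ranks + 1)
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                # |change in DCG| if documents i and j swapped positions
                delta_dcg = abs((gains[i] - gains[j]) * (discounts[i] - discounts[j]))
                loss += delta_dcg * np.log2(1.0 + np.exp(-sigma * (scores[i] - scores[j])))
    return loss

labels = np.array([2.0, 1.0, 0.0])
print(lambda_rank_loss(np.array([3.0, 2.0, 1.0]), labels))  # small: correct order
print(lambda_rank_loss(np.array([1.0, 2.0, 3.0]), labels))  # large: reversed order
```

Because the |ΔDCG| weight is largest for pairs near the top of the ranking, mistakes among the top results are penalised far more heavily than mistakes in the tail, which addresses the pairwise approach’s "all pairs are equal" problem.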

Conclusion

In this blog post we introduced the pointwise, pairwise and listwise approaches to LTR as presented in the original papers, along with the problems in each approach and why the successors were introduced in the first place.

Connect with me on LinkedIn or Twitter for more on search, relevance and ranking.


Mayur Bhangale

Co-Founder at Sourcewiz. Into NLP, Information Retrieval, Knowledge Graphs and Design.