Search and ranking for information retrieval (IR)

Nitin Agarwal
Data Science at Microsoft
Dec 13, 2022

Information Retrieval (IR) is the process of obtaining, from a collection of available resources, the resources that are relevant to a given question or query. A query is basically a set of keywords used to search for a resource on a platform. You give a query to the IR system, and it returns a ranked list of matching resources as search results.

IR systems are the backbone of search engines. Bing and Google are probably the first examples of search engines that come to mind, but search engines also include Expedia and other flight search tools, LinkedIn and other job search tools, and the search features on sites like Reddit, Instagram, Facebook, YouTube, retail websites, or any other resource you browse and search for information on the internet.

In any Information Retrieval system, many components work in the background to fetch the best-matching, most relevant search results for your query.

In this article, we will discuss how to retrieve the best-matching results for a query and rank them by relevance using different techniques.

Throughout the discussion we will assume that the resource here is a corpus or collection of textual documents. These documents have certain attributes like title, description, and content.

Step 1: Search — Fetch the best matching documents

Given a user query q and a corpus of documents D, we want to find the set of all documents d in D that match the input query.

There are multiple ways we can solve this problem. Some of the approaches are listed below:

  1. Direct Match: Perform a direct text match between the input query and the document title, description, or content. This is the least effective approach, as it is highly unlikely that the user's query will exactly match the text present in the document attributes.
  2. Regex approach: Use regular expressions to match the input query against the document attributes. This works slightly better than direct match because it can match more documents, but it is very difficult to design regular expressions that cover all possible variations of user queries.
  3. Fuzzy matching approach: A couple of algorithms are available for fuzzy matching, such as token sort ratio, token set ratio, and partial ratio. These algorithms return a similarity score between two strings and can be used to compare the input query and the document attributes.
  4. Distance-based approach: Levenshtein distance counts the minimum number of insertions, deletions, and substitutions required to convert one string into another, while Hamming distance counts the positions at which two equal-length strings differ.
  5. TF-IDF vector similarity: Both the query string and the document are represented as vectors. Each dimension of the vector corresponds to a term in the vocabulary (the list of unique terms across all documents and queries), and its value can be binary (1 or 0, indicating presence or absence of the term), the frequency of the term in the query or document, or the TF-IDF (Term Frequency-Inverse Document Frequency) score. Once both the query and the document are represented as vectors, cosine similarity gives the similarity score between the two (see the sketch after this list).
  6. Embedding vector similarity: Both the query and the document are represented as dense vectors of fixed length, which are either learned during model training or initialized with pretrained word vectors, such as Google's word2vec vectors learned using the skip-gram model. Once both are represented as vectors, cosine similarity can again be used to calculate the similarity score.
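
As an illustration of approaches 3 and 5, the sketch below scores a toy corpus against a query with a simple fuzzy ratio (from Python's difflib) and with TF-IDF cosine similarity using scikit-learn. The corpus and query are made up for illustration.

```python
from difflib import SequenceMatcher

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus of document texts (title and description could be concatenated here).
documents = [
    "how to cancel a flight booking and request a refund",
    "job search tips for data science roles",
    "troubleshooting wifi connection issues on windows",
]
query = "cancel flight and get refund"

# Fuzzy matching: a simple character-level similarity ratio per document.
fuzzy_scores = [SequenceMatcher(None, query, doc).ratio() for doc in documents]

# TF-IDF vector similarity: fit the vectorizer on the corpus, transform the query,
# and compute cosine similarity between the query vector and each document vector.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([query])
tfidf_scores = cosine_similarity(query_vector, doc_vectors).ravel()

# Rank documents by TF-IDF similarity, highest first.
ranked = sorted(zip(documents, tfidf_scores), key=lambda pair: pair[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}  {doc}")
```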

So far, we have seen approaches for calculating the similarity between the input query and a document. Each method has advantages and disadvantages, so try out different techniques and select the one that best suits your data.

In the following section we will discuss ranking, which depends not only on the similarity score but also on other attributes of the query and the document.

Step 2: Rank — Rank the documents in order of relevance

In the search step above, we extracted documents similar to the search query based on their syntactic and semantic relatedness. This might not be sufficient in most cases, because we also need to return an ordering of these documents.

One simple approach is to rank these documents by their similarity scores. However, there are many situations in which the search query appears verbatim in a document, yet the document still might not be the best result (perhaps because of relevancy, age, trustworthiness, medium type, or something else). Hence, we need to rank the search results so that the most relevant ones are at the top.

Ranking can be driven by three sets of features:

  1. Query features (q^(k)), which depend only on the query. For example, the query length (the number of words in the query).
  2. Document features (d^(l)), which depend only on the document. For example, document length, document category or topic, and page rank.
  3. Query-Document features (s^(m)), which depend both on query and document. For example, query-document similarity score.

These three feature sets are combined into a single feature vector that serves as the input to the ranking algorithms. The input feature vector for the ith query and jth document is given by

x_ij = (q_i^(1), …, q_i^(k), d_j^(1), …, d_j^(l), s_ij^(1), …, s_ij^(m))    (Equation 1)
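
As a small illustration of equation 1, the snippet below concatenates hypothetical query, document, and query-document features into a single input vector. The specific features are assumptions for illustration, not part of the original setup.

```python
import numpy as np

# Hypothetical query features q_i, document features d_j, and
# query-document features s_ij for one (query, document) pair.
query_features = np.array([4.0])                 # e.g., query length in words
document_features = np.array([350.0, 0.82])      # e.g., document length, page rank
query_doc_features = np.array([0.67])            # e.g., TF-IDF cosine similarity

# Equation 1: x_ij is the concatenation of the three feature groups.
x_ij = np.concatenate([query_features, document_features, query_doc_features])
print(x_ij)  # one row of input data for the ranking model
```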

A ranking problem differs from a typical classification or regression problem because in a ranking problem the outcome is a list of items for a single query, whereas in a classification or regression problem the outcome is a single value for each data point. Therefore, we cannot use evaluation metrics like accuracy, precision, recall, or MSE (mean squared error) for ranking algorithms.

Before diving into ranking algorithms, let us understand some of the evaluation metrics for ranking:

  1. Hit Ratio: It is simply the fraction of queries for which the correct answer is included in the recommendation list of length L.
  2. Mean Reciprocal Rank: The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q.
  3. Mean Average Precision (MAP): MAP is the mean of Average Precision. If we have the AP for each query, it is trivial just to average it over all queries to calculate the MAP.
  4. Precision@K: It is the fraction of relevant items in the top k recommendations.
  5. Recall@K: It is the coverage of relevant items in the top k recommendations.
  6. Discounted Cumulative Gain (DCG): Cumulative Gain (CG) is defined as the sum of gains up to a position k in the recommendation list. Even if we change the relative order of any two items, the CG is unaffected, which is problematic when ranking order is important. DCG solves this problem by introducing a discounting factor based on the rank of each item: it penalizes highly relevant items more heavily if they are placed near the bottom of the list.

  7. Normalized Discounted Cumulative Gain (NDCG): One problem with DCG is that the score grows with the length of the recommendation list. The DCG of a list with the top 10 items will always be higher than the DCG of a list with the top 5 items simply because of the length of the list, not because of its quality. NDCG tackles this issue by introducing the IDCG (Ideal DCG), which is the DCG score of the ideal ranking of the top k items.

NDCG is simply the DCG score normalized by the IDCG: NDCG@k = DCG@k / IDCG@k.
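
A minimal sketch of DCG@k and NDCG@k in NumPy is shown below, using the common exponential gain 2^rel − 1 and a log2 position discount; other gain and discount choices exist, and the example relevance labels are made up.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """DCG@k: exponential gains discounted by log2 of the rank position."""
    relevances = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, relevances.size + 2))
    return np.sum((2 ** relevances - 1) / discounts)

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG@k normalized by the DCG of the ideally sorted list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Relevance labels of the returned documents, in the order the system ranked them.
predicted_order = [3, 2, 3, 0, 1, 2]
print(ndcg_at_k(predicted_order, k=5))
```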

For more details on evaluation metrics for ranking, refer to "Ranking Evaluation Metrics for Recommender Systems" by Benjamin Wang on Towards Data Science.

In our ranking problem, the ranking model is a machine learning model that learns to predict a score r for an input x = (q, d) during a training phase by minimizing a loss. The exact loss function depends on the algorithm used (discussed below), but the core idea of a ranking loss is to compute the difference between the true ranking and the predicted ranking, which we want to minimize. The true ranking is the source of truth that we want to achieve, and it comes in the form of a relevance score for each query-document pair. Defining this source of truth is often limited to a small subset of queries and is biased by who defined the labels and how.

There are three approaches to learning to rank:

  1. Pointwise method
  2. Pairwise method
  3. Listwise method

Pointwise method

In this approach, we formulate the ranking problem as a regression problem. Each data point is a feature vector created from an input query qi and a document dj (refer to equation 1), and the target variable is a continuous value yij, the relevance score of document dj for query qi. Since the target for each data point is a single continuous value, we can use regression techniques and evaluation metrics to train the model.

One drawback of this method is that we need true relevance scores to train the model. In most cases, we might only know the ordering of items and not their absolute relevance scores. Hence, we may want a method that can work with relative relevance.
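
As a sketch of the pointwise formulation, the snippet below fits an off-the-shelf regressor to predict relevance grades from query-document feature vectors and then sorts candidate documents by the predicted score. The feature layout and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical training data: each row is the feature vector x_ij for a
# (query, document) pair; y is the graded relevance label for that pair.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 4))        # e.g., [query_len, doc_len, page_rank, tfidf_sim]
y_train = rng.integers(0, 4, size=500)     # relevance grades 0..3

# Pointwise ranking: fit a standard regressor to predict relevance directly.
model = GradientBoostingRegressor()
model.fit(X_train, y_train)

# At query time, score every candidate document and sort by predicted relevance.
X_candidates = rng.normal(size=(10, 4))
scores = model.predict(X_candidates)
ranking = np.argsort(-scores)              # candidate indices, best first
print(ranking)
```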

Pairwise method

This method works with relative scores and tries to predict whether the first document is more relevant than the second one. The ranking problem is formulated as a binary classification problem. The scoring model still scores one document at a time,

s = f(x)

but the loss function is defined on a pair of documents. For a pair in which document i should be ranked above document j, the loss is

ϕ(s_i − s_j)

where ϕ can be

  1. Hinge function [Herbrich et al., 2000]: ϕ(z) = max(0, 1-z)
  2. Exponential function [Freund et al., 2003]: ϕ(z) = e^(-z)
  3. Logistic function [Burges et al., 2005]: ϕ(z) = log(1 + e^(-z))

RankNet [1] is a popular pairwise approach that uses cross entropy as the loss function and the logistic function to map score differences to probabilities: if f(x1) and f(x2) are the model scores, the predicted probability that x1 should be ranked above x2 is P12 = 1 / (1 + e^(−(f(x1) − f(x2)))). In other words, f(x1) > f(x2) implies that x1 is ranked above x2.

Hence, RankNet optimizes (a smooth, convex approximation to) the number of pairwise errors. Other algorithms, such as LambdaRank and LambdaMART, build on RankNet and are explained later in the article.
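
To make the pairwise idea concrete, here is a small sketch of the logistic pairwise loss ϕ(z) = log(1 + e^(−z)) applied to a score difference, in the spirit of RankNet. The scores are arbitrary illustrative values.

```python
import numpy as np

def pairwise_logistic_loss(score_i, score_j):
    """Logistic pairwise loss for a pair where document i is more relevant than j.

    phi(z) = log(1 + exp(-z)) applied to the score difference z = s_i - s_j,
    so the loss is small when the model scores the more relevant document higher.
    """
    z = score_i - score_j
    return np.log1p(np.exp(-z))

# Hypothetical scores from f(x) for a pair where the first document is preferred.
print(pairwise_logistic_loss(2.0, 0.5))   # small loss: correct ordering
print(pairwise_logistic_loss(0.5, 2.0))   # larger loss: inverted ordering
```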

Listwise method

The previous two methods transform the ranking problem into a regression or classification task, where we minimize a regression loss such as mean squared error or a classification loss such as cross entropy. It is often the case that the evaluation metric and the loss function are different. For example, in classification models you might minimize the binary cross entropy loss while monitoring accuracy, precision, and recall as your evaluation metrics. Listwise methods instead try to directly or indirectly optimize a ranking evaluation metric such as Normalized Discounted Cumulative Gain (NDCG).

For each input query, the outcome is a list of ranked documents, which is evaluated against the ideal ranked list. From this output, we can calculate NDCG for the list of documents. However, using an evaluation metric like NDCG as the loss function makes the optimization problem quite challenging: calculating these metrics requires sorting the documents by their predicted scores, and sorting is non-differentiable. Hence, we cannot use gradient-based techniques directly to optimize the loss function. An alternative is to use non-gradient optimization techniques, such as grid search, coordinate search, or other exhaustive search, to find the optimal solution. These techniques do not scale well, as the search space can grow huge with an increasing number of features. Also, since our evaluation metrics are neither smooth nor convex, these methods do not guarantee a true global optimum. Several lines of research try to tackle this optimization problem:

  1. Approximate the objective function. For example, SoftRank [4] is one such algorithm, in which we optimize SoftNDCG, the expected NDCG over a distribution of rankings. One of the fundamental problems in formulating ranking as a machine learning problem is that many machine learning algorithms require the gradient of the training objective to optimize the model parameters, but IR metrics are non-smooth and hence problematic. IR metrics depend on ranks rather than on absolute scores: as we change the model parameters, the scores change smoothly, but a document's rank changes only when its score passes another document's score, making the metric discontinuous. SoftRank first defines a distribution over all possible ranked lists by introducing uncertainty into the scores. It then defines SoftNDCG as the expected NDCG over this distribution and uses it as the optimization objective. SoftNDCG is non-convex and can easily get stuck in local optima.
  2. Formulate ranking as a structured learning problem in which a ranked list is treated as a unit and the loss is defined as its distance to the ideal ranked list sorted by relevance labels.
  3. Use ranking metrics to dynamically reweight instances during an iterative training procedure, as in AdaRank [3], which builds on the AdaBoost framework.
  4. Formulate ranking as a weight optimization problem with the objective of maximizing NDCG. In this approach, given a set of features, we try to find the set of feature weights that maximizes NDCG. Given an input query q and a feature vector x, f(q, x) is the relevance score for document x. For a given query, we calculate the relevance score for all candidate documents, sort them by score, and compute the NDCG of the resulting ranking. The objective is to find the set of feature weights that maximizes NDCG, and Bayesian Optimization can be used effectively to achieve this (see the sketch after this list).
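
As a concrete illustration of the last approach, here is a minimal sketch that scores documents with a weighted sum of features and searches for the weights that maximize NDCG@10, using scikit-learn's ndcg_score. The features, the synthetic data, and the use of random search (as a simple stand-in for Bayesian Optimization) are all illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical data for one query: each row of X is a document's feature vector
# and y holds the graded relevance label of each document for this query.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # e.g., [tfidf_sim, page_rank, freshness]
y = rng.integers(0, 4, size=20)

def ndcg_for_weights(weights):
    """Score documents with a weighted sum of features and evaluate NDCG@10."""
    scores = X @ weights
    return ndcg_score([y], [scores], k=10)

# Random search over feature weights, standing in for Bayesian Optimization.
best_weights, best_ndcg = None, -1.0
for _ in range(200):
    candidate = rng.uniform(-1.0, 1.0, size=3)
    value = ndcg_for_weights(candidate)
    if value > best_ndcg:
        best_weights, best_ndcg = candidate, value

print(best_weights, best_ndcg)
```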

LambdaRank [2]: The key idea of LambdaRank is to bypass the difficulties introduced by the sort in most IR measures by writing down the gradients directly rather than deriving them from a cost function. To train the model, we do not need the cost itself; we only need the gradients of the cost with respect to the model scores.

LambdaMART [2] is a later version of LambdaRank based on gradient-boosted decision trees instead of the neural networks used in LambdaRank.
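
In practice, LambdaMART-style models are available in gradient boosting libraries. Below is a minimal sketch using LightGBM's LGBMRanker with the lambdarank objective; the data shapes, feature meanings, and hyperparameters are made up for illustration.

```python
import numpy as np
import lightgbm as lgb

# Hypothetical training set: 3 queries with 5 candidate documents each.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(15, 4))            # feature vectors x_ij
y_train = rng.integers(0, 4, size=15)         # graded relevance labels
group_sizes = [5, 5, 5]                       # documents per query, in order

# LambdaMART-style ranker: gradient-boosted trees with the LambdaRank objective.
ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=50, min_child_samples=1)
ranker.fit(X_train, y_train, group=group_sizes)

# Score the candidates of a new query and sort them, best first.
X_new_query = rng.normal(size=(5, 4))
order = np.argsort(-ranker.predict(X_new_query))
print(order)
```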

LambdaRank and LambdaMART sit in between pairwise and listwise methods. They have been shown empirically to be effective, but it is not clear whether they actually optimize NDCG over the training data. Hence, LambdaLoss [5] provides a framework that can define and optimize metric-driven loss functions. It is a probabilistic framework for ranking metric optimization with the negative log likelihood as the loss function, given by

L(y, s) = −log p(y | s) = −log Σ_π p(y | s, π) p(π | s)

This generic loss function has two components — the likelihood function p(y|s,π) and ranked list distribution p(π|s). Different variations of these functions can lead to different loss functions, like the logistic loss used in RankNet.

Conclusion

To solve a ranking problem, it is essential to choose the right evaluation metric and formulate the problem according to the business needs.

In this article, we discussed multiple ranking techniques, including the pointwise method, the pairwise method, and the listwise method. If labelled data with relevance scores is available and absolute relevance scores need to be predicted, use pointwise methods. Use pairwise methods when an error at the top of the list is to be treated the same as an error at a lower rank position. Finally, use listwise methods if you want to evaluate rank-ordered lists, giving more importance to the correctness of items at the top while discounting errors at lower positions.

If you are having trouble determining which of these methods or evaluation metrics to use, make sure you understand the business needs when formulating the problem so that you know what is most important to optimize for.

Nitin Agarwal is on LinkedIn.

References

[1] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to Rank Using Gradient Descent. In Proc. of the 22nd International Conference on Machine Learning (ICML). 89–96.

[2] Christopher J.C. Burges. 2010. From RankNet to LambdaRank to LambdaMART: An Overview. Technical Report MSR-TR-2010-82. Microsoft Research.

[3] Jun Xu and Hang Li. 2007. AdaRank: A Boosting Algorithm for Information Retrieval. In Proc. of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 391–398.

[4] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. 2008. SoftRank: Optimizing Non-Smooth Rank Metrics. In Proc. of the 2008 International Conference on Web Search and Data Mining (WSDM). 77–86.

[5] Xuanhui Wang, Cheng Li, Nadav Golbandi, Michael Bendersky, and Marc Najork. 2018. The LambdaLoss Framework for Ranking Metric Optimization. In Proc. of the 27th ACM International Conference on Information and Knowledge Management (CIKM). 1313–1322.
