Relevance, Ranking and Search

Mayur Bhangale
7 min read · Sep 7, 2019


This is a long overdue post that has been sitting in draft since June 2018.


In the 1960s, researchers were testing search engines on about 1.5 megabytes of text data. Fast forward to 2018: we now have billions of web pages and colossal amounts of data. Yet one issue still persists: relevance.

Relevance is the core concern of Information Retrieval. Roughly speaking, a relevant search result is one in which a person gets what she was searching for. Naively, you could run a simple text match over documents and return whatever matches. This does not work well, mainly because language can express the same idea in many different ways and with many different words, a problem known in IR as the vocabulary mismatch problem. Another issue is drawing a line between topical relevance (a result is relevant to the query if it is about the same topic) and user relevance (a person searching for 'FIFA standings' probably wants results from 2018, the time dimension, rather than old data, unless they say otherwise).

To address the relevance issues mentioned above, researchers propose retrieval models.

A retrieval model is a formal representation of the process of matching a query and a document.

It is the basis of the ranking algorithm that is used in a search engine to produce the ranked list of documents. A good retrieval model will find documents that are likely to be considered relevant by the person who submitted the query. Some retrieval models focus on topical relevance, but a search engine deployed in a real environment must use ranking algorithms that incorporate user relevance.

One interesting feature of such models is that they model statistical properties rather than linguistic structure. In other words, ranking algorithms care far more about word counts than about whether a word is a noun or a verb. This view of text later became popular in natural language processing in the 1990s.

Evaluating an IR system is another challenge, since ranking quality depends on how well results match users' expectations. Cyril Cleverdon led the way in the 1960s and built evaluation methods that are still in use today: precision and recall. Precision is the proportion of retrieved documents that are relevant, and recall is the proportion of relevant documents that are retrieved. Recall assumes that all the relevant documents for a given query are known. Such an assumption is clearly problematic in a web search environment, but with smaller test collections of documents this measure can be useful (see TREC for the best-known test collections). Currently, much of the focus in evaluation is on clickthrough data.
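As a quick illustration, here is a minimal Python sketch of both measures for a single query; the document ids and relevance judgements are made up.

```python
# Minimal sketch of precision and recall for a single query.
# 'retrieved' and 'relevant' are invented document-id sets for illustration.
retrieved = {"d1", "d2", "d3", "d4", "d5"}
relevant = {"d2", "d5", "d7", "d9"}

true_positives = retrieved & relevant
precision = len(true_positives) / len(retrieved)  # fraction of retrieved docs that are relevant
recall = len(true_positives) / len(relevant)      # fraction of relevant docs that were retrieved

print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.40, recall=0.50
```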

Practically, spam is another issue that affects search results. Spam, in the context of IR, is misleading, inappropriate or irrelevant information placed in a document for commercial benefit. Spam is of such importance in web search that an entire subfield, called adversarial information retrieval, has developed to deal with search techniques for document collections that are being manipulated by parties with different interests.

Traditional IR Models

1. Boolean Model of Information Retrieval (BIR)
Used by the earliest search engines, BIR is also called exact-match retrieval, since documents are retrieved only if they exactly match the query and otherwise are not retrieved. Although this defines a very simple form of ranking, BIR is not generally described as a ranking algorithm, because the Boolean retrieval model assumes that all documents in the retrieved set are equivalent in terms of relevance, in addition to the assumption that relevance is binary. A minimal sketch of exact-match retrieval over an inverted index is below.
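The toy corpus, whitespace tokenisation and AND-only semantics in this sketch are simplifications; real engines support richer Boolean operators and far larger indexes.

```python
# Minimal sketch of exact-match (Boolean) retrieval over a toy inverted index.
# The documents and the query are invented for illustration.
docs = {
    1: "information retrieval and ranking",
    2: "web search engines",
    3: "ranking search results with retrieval models",
}

# Build an inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(query):
    """Return documents containing ALL query terms (Boolean AND). No ranking is implied."""
    postings = [index.get(term, set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()

print(boolean_and("retrieval ranking"))  # {1, 3}
```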
2. Vector Space Model
In this model, documents and queries are assumed to be vectors in an n-dimensional vector space, where n is the number of index terms (words, stems, phrases, etc.). A document Dᵢ is represented by a vector of index-term weights, Dᵢ = (dᵢ₁, dᵢ₂, …, dᵢₙ), where dᵢⱼ is the weight of the jth term in document i. A collection of m documents can then be represented as an m × n matrix of term weights, where each row represents a document and each column holds the weights assigned to one term across the documents.

Queries are represented in the same way, as vectors of term weights:
Q = (q₁, q₂, …, qₙ)

A popular example of this family is the TF-IDF weighting scheme, which later inspired the well-known BM25 ranking function. Both aggregate the contributions of individual query terms but ignore any phrasal or proximity signals between the occurrences of different query terms in the document.
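As a rough sketch of vector-space scoring, the snippet below ranks a toy corpus against a query using raw term-frequency × IDF weights and cosine similarity; the corpus, query and weighting details are simplifications, and BM25 additionally adds term-frequency saturation and document-length normalisation.

```python
# Hedged sketch of vector-space ranking with TF-IDF weights and cosine similarity.
import math
from collections import Counter

docs = [
    "fifa world cup standings",
    "world cup 2018 results and standings",
    "stock market results today",
]
query = "fifa standings 2018"

# Document frequencies for the IDF component.
df = Counter(t for d in docs for t in set(d.split()))

def tf_idf_vector(text, df, n_docs):
    counts = Counter(text.split())
    return {t: c * math.log(n_docs / df[t]) for t, c in counts.items() if t in df}

doc_vecs = [tf_idf_vector(d, df, len(docs)) for d in docs]
q_vec = tf_idf_vector(query, df, len(docs))

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

ranked = sorted(((cosine(q_vec, dv), i) for i, dv in enumerate(doc_vecs)), reverse=True)
print(ranked)  # (similarity, document index) pairs, best match first
```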

3. IR as classification
Given a new document, the task of a search engine could be described as deciding whether the document belongs in the relevant set or the non-relevant set. That is, the system should classify the document as relevant or non-relevant, and retrieve it if it is relevant.
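One simplified way to picture this view is to train an off-the-shelf classifier on documents labelled relevant or non-relevant for a query and use the predicted probability of the relevant class as the score. The tiny labelled set below is invented, and a classical treatment would use a probabilistic model such as the binary independence model rather than logistic regression.

```python
# Hedged sketch of "IR as classification": score documents by P(relevant).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented set of documents judged relevant (1) / non-relevant (0)
# for the query "fifa standings".
labelled_docs = [
    ("fifa world cup standings 2018", 1),
    ("latest fifa rankings and standings", 1),
    ("stock market report for today", 0),
    ("recipe for pasta carbonara", 0),
]

texts = [doc for doc, _ in labelled_docs]
labels = [label for _, label in labelled_docs]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
classifier = LogisticRegression().fit(X, labels)

# Score an unseen document: the probability of the "relevant" class acts as the rank score.
new_doc = "fifa group standings and results"
score = classifier.predict_proba(vectorizer.transform([new_doc]))[0, 1]
print(f"P(relevant) = {score:.2f}")
```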

4. Query Likelihood Model
In this model, we calculate the probability that we could pull the query words out of the ‘bag of words’ representing the document. This is a model of topical relevance in the sense that the probability of query generation is the measure of how likely it is that a document is about the same topic as the query.
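A minimal sketch of query-likelihood scoring, assuming unigram language models with Jelinek-Mercer smoothing (the corpus, query and λ = 0.7 are made up; Dirichlet smoothing is another common choice):

```python
# Hedged sketch of query-likelihood scoring with Jelinek-Mercer smoothing.
import math
from collections import Counter

docs = [
    "fifa world cup standings",
    "world cup 2018 results and standings",
    "stock market results today",
]
query = "fifa standings"
lam = 0.7  # weight on the document model vs. the collection model

collection_counts = Counter(t for d in docs for t in d.split())
collection_len = sum(collection_counts.values())

def query_log_likelihood(query, doc):
    doc_counts = Counter(doc.split())
    doc_len = sum(doc_counts.values())
    score = 0.0
    for term in query.split():
        p_doc = doc_counts[term] / doc_len
        p_coll = collection_counts[term] / collection_len
        # Smoothing avoids zero probabilities for query terms missing from the document.
        score += math.log(lam * p_doc + (1 - lam) * p_coll)
    return score

ranked = sorted(docs, key=lambda d: query_log_likelihood(query, d), reverse=True)
print(ranked[0])  # "fifa world cup standings"
```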

5. Relevance Feedback and Pseudo-Relevance Feedback (PRF)
In pseudo-relevance feedback, instead of asking the user for feedback on the search results, we assume that the top k results of a normal retrieval run are relevant. The typical process, sketched in code right after the steps, is:
1. Take the results returned by the initial query as relevant (only the top k, with k between 10 and 50 in most experiments).
2. Select the top 20–30 (indicative number) terms from these documents, using for instance their tf-idf weights.
3. Do query expansion: add these terms to the query, run it again, and return the most relevant documents for the expanded query.
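A small sketch of these three steps, where initial_search is a placeholder for whatever first-pass retrieval function is available and plain term frequency stands in for tf-idf weighting:

```python
# Hedged sketch of pseudo-relevance feedback via query expansion.
from collections import Counter

def pseudo_relevance_feedback(query, initial_search, k=10, n_expansion_terms=20):
    # Step 1: treat the top-k initially retrieved documents as (pseudo-)relevant.
    top_docs = initial_search(query)[:k]

    # Step 2: pick the most prominent terms from those documents
    # (plain term frequency here; tf-idf weights are the more usual choice).
    term_counts = Counter(t for doc in top_docs for t in doc.split())
    for term in query.split():
        term_counts.pop(term, None)  # don't re-add the original query terms
    expansion_terms = [t for t, _ in term_counts.most_common(n_expansion_terms)]

    # Step 3: expand the query and search again.
    expanded_query = query + " " + " ".join(expansion_terms)
    return initial_search(expanded_query)
```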

Learning to Rank (LTR)

The approaches discussed above, and many others, have parameters (e.g. k1 and b in BM25). To get reasonably good ranking performance, you need to tune these parameters on a validation set. But a model perfectly tuned on the validation set can still perform poorly on unseen test queries. So what can be done? Let the machine tune its parameters automatically!

Formally, learning to rank is the application of machine learning, typically supervised or semi-supervised, to the ranking problem. The inputs to LTR models are query-document pairs, each represented by a vector of numerical features, and a model is trained that maps each feature vector to a real-valued score. The training data can be augmented with other features that signal relevance. Most state-of-the-art learning-to-rank algorithms learn the optimal way of combining features extracted from query-document pairs through discriminative training.

For a model to be called a learning-to-rank model, it should have two properties:
1. It should be feature based.
2. It should have a discriminative training process.

While there are many variations in how LTR models can be trained, they are usually classified into three types: pointwise, pairwise and listwise.

(Figure source: https://jobandtalent.engineering/learning-to-retrieve-and-rank-intuitive-overview-part-iii-1292f4259315)

One of the most popular choices for training neural LTR models was RankNet, an industry favourite that was used for years in commercial search engines such as Bing.
While this is the crux of any modern IR system, for the sake of brevity I will skip the details of these models in this post and keep it short; a sketch of the pairwise idea behind RankNet is below.
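Roughly, RankNet trains a scoring model so that, for a pair of documents where one is labelled more relevant than the other, the probability (a sigmoid of the score difference) of ranking it higher is pushed towards 1. The sketch below uses a hand-rolled linear scorer and invented feature vectors purely for illustration; the actual RankNet used a neural network trained by gradient descent over many such pairs.

```python
# Hedged sketch of a RankNet-style pairwise loss for a single labelled pair.
import math

def score(weights, features):
    return sum(w * f for w, f in zip(weights, features))

def ranknet_pair_loss(weights, feat_i, feat_j):
    # P(i ranked above j) is modelled as a sigmoid of the score difference.
    s_diff = score(weights, feat_i) - score(weights, feat_j)
    p_ij = 1.0 / (1.0 + math.exp(-s_diff))
    return -math.log(p_ij)  # cross-entropy with target "i above j"

# Toy query-document feature vectors, e.g. [BM25 score, title match, click signal].
more_relevant = [2.3, 1.0, 0.4]
less_relevant = [1.1, 0.0, 0.1]
weights = [0.5, 0.5, 0.5]

print(ranknet_pair_loss(weights, more_relevant, less_relevant))
# In training, this loss is minimised over many labelled pairs to learn the weights.
```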

Metrics

IR metrics focus on rank-based comparisons of the retrieved result set against an ideal ranking of documents, as determined by manual judgements or by implicit feedback from user behaviour data. The most popular metrics are described below.

  1. Precision and Recall (described above)
  2. Mean average precision (MAP)
    If the set of relevant documents for an information need qⱼ ∈ Q is
    {d₁, …, dₘⱼ} and Rⱼₖ is the set of ranked retrieval results from the top result until you get to the document dₖ.
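The mean average precision is then, following the definition in Manning, Raghavan and Schütze's Introduction to Information Retrieval (reference 2 below):

\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk})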

When a relevant document is not retrieved at all, the precision value in the above equation is taken to be 0. For a single information need, the average precision approximates the area under the uninterpolated precision-recall curve, and so the MAP is roughly the average area under the precision-recall curve for a set of queries.
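In code, a minimal sketch (with invented ranked lists and relevance judgements) could look like this:

```python
# Hedged sketch of mean average precision over a set of queries.
def average_precision(ranked, relevant):
    hits, precisions = 0, []
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / k)  # precision at each relevant document's rank
    # Relevant documents that are never retrieved contribute a precision of 0.
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    return sum(average_precision(ranked, relevant) for ranked, relevant in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}),  # query 1: ranked list, relevant set
    (["d5", "d6", "d7"], {"d6"}),                    # query 2
]
print(mean_average_precision(runs))
```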

3. Normalised discounted cumulative gain (NDCG)
The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalised: the graded relevance value of a result is discounted logarithmically with its position in the list.
Search result lists, however, vary in length depending on the query, so a search engine's performance cannot be compared consistently from one query to the next using DCG alone. Instead, the cumulative gain at each position for a chosen rank p should be normalised across queries. This is done by sorting all relevant documents in the corpus by their relative relevance, producing the maximum possible DCG through position p, also called the Ideal DCG (IDCG) at that position.

(Source: https://en.wikipedia.org/wiki/Discounted_cumulative_gain)
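A minimal sketch of DCG and NDCG, using the common log₂(i + 1) discount and the sort of graded relevance labels used in the worked example on the Wikipedia page linked just above:

```python
# Hedged sketch of NDCG at rank p; the relevance labels below are illustrative.
import math

def dcg(relevances, p):
    # Standard form: sum of rel_i / log2(i + 1) over the top-p positions.
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:p], start=1))

def ndcg(relevances, p):
    ideal = sorted(relevances, reverse=True)  # best possible ordering -> IDCG
    idcg = dcg(ideal, p)
    return dcg(relevances, p) / idcg if idcg > 0 else 0.0

# Graded relevance of the documents in the order the engine returned them.
returned_order = [3, 2, 3, 0, 1, 2]
print(ndcg(returned_order, p=6))  # 1.0 would mean the ideal ordering
```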

References:
1. Bhaskar Mitra and Nick Craswell (2018), "An Introduction to Neural Information Retrieval".
2. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, "Introduction to Information Retrieval".

