Tutorial: Unsupervised Ranking Using Machine Learning
Introduction to Ranking
Every day we encounter ranking, especially machine learning-aided ranking, without even realizing it. Whether you are shopping on Amazon, looking for your next flight, searching for a show to binge-watch on Netflix or querying something on Google, you have been exposed to a ranking algorithm.
Unlike standard classification or regression algorithms, ranking algorithms try to order a bunch of possible results based on a query. In this post, we will focus on querying text and returning results that are relevant to the queried text. Here’s a diagram to help you decide what sort of ranking method you’d use given a ranking task:
Learning to Rank
If we are fortunate enough to have labels for our ranking problem (which is usually the case) then we are in the Learning To Rank (LTR or LETOR) domain. The dataset in this case usually takes the following form:
Each entry begins with a relevance score ranging from 0 to 4, with higher scores indicating higher relevance. This is followed by a unique query id. Finally, we have ~700 processed, normalized features describing the query-document pair. Feature engineering plays a huge role in ranking search engine results.
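Concretely, LTR datasets are usually distributed in an SVMlight-style text format: the relevance label, then `qid:`, then `feature:value` pairs. The line below is an invented illustration, not an entry from a real dataset; a minimal parsing sketch might look like:

```python
# Hypothetical line in the SVMlight-style format common to LTR datasets;
# the feature indices and values here are made up for illustration.
line = "2 qid:10 1:0.45 2:0.03 3:1.0"

parts = line.split()
relevance = int(parts[0])                 # graded relevance label (0-4)
query_id = parts[1].split(":", 1)[1]      # query this document was retrieved for
features = {int(k): float(v)              # feature index -> normalized value
            for k, v in (p.split(":") for p in parts[2:])}
```

Real files simply contain one such line per query-document pair.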
These features are heavily guarded industry secrets, which is why it might be hard to find out exactly how these features were computed and what they represent. Some interesting publicly available features for LTR are the number of slashes in a URL, PageRank as a feature, BM25 score as a feature, and quality score of a web page applied on different levels such as the body, title, url and document level.
Please refer to these posts (Post1, Post2) if you’d like to read more about LTR. (To add more fuel to XGBoost’s popularity fire, XGBoost has LTR capability as well.)
Unsupervised Ranking
To make the problem more interesting, let’s consider a case where we do not have any labeled data to train on. One option is applying pre-ML methods such as PageRank and BM25. Another is to try framing the ranking problem as an Information Retrieval problem and tackling it using context based search. For this example, let’s focus on unsupervised ranking using BM25 and context based search.
Typical ranking architectures using ML retrieve the top-k entries based on the input query using a relatively light model. A re-ranker, which is a heavier model, further refines the ranking of these entries. The following figure illustrates this idea:
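The same two-stage idea can be sketched in a few lines of Python. Both scoring functions below are toy stand-ins supplied by the caller; in practice the light scorer might be BM25 and the re-ranker a Cross-Encoder:

```python
def retrieve_then_rerank(query, documents, light_score, heavy_score, k=100):
    # Stage 1: a cheap model scores every document; keep only the top-k candidates
    candidates = sorted(documents, key=lambda d: light_score(query, d), reverse=True)[:k]
    # Stage 2: an expensive model re-scores just those k candidates
    return sorted(candidates, key=lambda d: heavy_score(query, d), reverse=True)

# Toy stand-in scorer: number of words shared between query and document
def word_overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

docs = ["Paris is beautiful", "Paris is big", "Can I have some coffee?"]
reranked = retrieve_then_rerank("beautiful Paris", docs, word_overlap, word_overlap, k=2)
```

The point of the split is that the expensive model only ever sees k documents, no matter how large the collection is.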
BM25
BM25 is an algorithm with a lot of mileage: it’s more than 30 years old. At the risk of oversimplification, it’s just fancy TF-IDF (Term Frequency-Inverse Document Frequency), where terms that appear too frequently across the corpus (the collection of documents being searched) are penalized accordingly.
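To make that concrete, here is a minimal pure-Python sketch of Okapi BM25. The function name and the default parameters (k1=1.5, b=0.75) are illustrative choices, and this uses one common IDF variant; production libraries add refinements:

```python
import math

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    # Minimal BM25 sketch: one score per document for the given query tokens.
    N = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / N
    scores = []
    for doc in tokenized_corpus:
        score = 0.0
        for term in query_tokens:
            n_t = sum(1 for d in tokenized_corpus if term in d)  # docs containing term
            idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)    # rarer term -> higher weight
            tf = doc.count(term)                                 # term frequency in this doc
            # Saturating TF, normalized by document length relative to the average
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Note how a document that contains none of the query terms scores exactly 0, and how longer-than-average documents are slightly penalized via the `b` term.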
The BM25 algorithm returns a score for each document in the corpus for a particular input query with a higher score representing a better result. When a query contains multiple words we use a linearly weighted combination of scores for each word. This score is influenced by the TF-IDF as well as the length of the document. A small example follows:
from rank_bm25 import BM25Okapi

corpus = [
    "Paris is beautiful",
    "This city is amazing and I would visit again",
    "Can I have some coffee?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
We have created a corpus and initialized the BM25 model. Each sentence in our corpus is considered to be a document. Next we define a function that will return a score for each document:
def return_bm25_scores(query):
    tokenized_query = query.split(" ")
    doc_scores = bm25.get_scores(tokenized_query)  # one BM25 score per document
    print(doc_scores)
    return doc_scores
Let’s check the BM25 scores for a couple of simple queries. (It’s trivial to convert a list of scores to a list of ranks):
As expected, the relevant document returns the highest score, and documents that do not contain the queried words return a score of 0. Let’s see what happens when we query “paris” instead of “Paris”.
Since BM25 matches terms exactly, changing the case does have an effect on the result. However, we could simply lowercase the entire corpus as well as the query to rectify this. Let’s check whether changing the order of the words has an impact:
BM25 is order agnostic and doesn’t really take context into account. This will be remedied by the Bi-Encoder and Cross-Encoder (SBERT) models.
Framing Ranking as a Contextual Text Similarity task
This process uses sentence embeddings where each sentence is represented as a vector. Similar to the process used to compute BM25 scores, we compute the sentence embedding for a query and compare that to the sentence embeddings of all the entries in the corpus. Each query-document pair is assigned a score and then ranked against one another.
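The comparison step is typically cosine similarity between embedding vectors. A minimal sketch follows; the three-dimensional vectors below are toy stand-ins, not real sentence embeddings, which typically have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query_emb = [0.9, 0.1, 0.0]           # toy "query" embedding
doc_embs = [[0.8, 0.2, 0.1],          # toy "document" embeddings
            [0.0, 0.1, 0.9]]

# Score every document against the query, then sort document indices by score
scores = [cosine_similarity(query_emb, d) for d in doc_embs]
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Unlike BM25, nothing here depends on exact word matches: two sentences with no words in common can still land close together in embedding space.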
The bedrock of all our work is the BERT model which generates bi-directional representations based on both the left and right context. The BERT model takes a sentence pair as input, where each word in the sentence pair has the following embeddings:
- The token embeddings for a word
- Segment embeddings, which indicate whether a word belongs to the first or the second sentence in the sentence pair
- Position embeddings
Bi-Encoders and Cross-Encoders
We can find similar sentences in a corpus with the BERT sentence pair input but the process is computationally expensive. Here are some extracts from the SBERT paper on finding similar sentences that compare BERT to SBERT:
Finding in a collection of n = 10 000 sentences the pair with the highest similarity requires with BERT n·(n−1)/2 = 49 995 000 inference computations.
This (SBERT) reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
This is where Bi-Encoders and Cross-Encoders come in. Bi-Encoders are computationally inexpensive by comparison and can be used to generate embeddings for a sentence. Cross-Encoders are computationally expensive and, just like BERT itself, cannot generate embeddings. These fit in quite well with the proposed ML Search Engine architecture, with a Bi-Encoder being used to retrieve an initial set of results that are then re-ranked by a Cross-Encoder.
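The quoted numbers follow directly from how the two approaches scale: a pair-input model (BERT / Cross-Encoder) needs one expensive forward pass per sentence pair, while a Bi-Encoder needs one per sentence, after which similarities are cheap vector operations. The arithmetic from the paper’s example:

```python
n = 10_000  # collection size from the SBERT paper's example

# Cross-Encoder / BERT pair input: one forward pass per sentence pair
pair_forward_passes = n * (n - 1) // 2

# Bi-Encoder: one forward pass per sentence; comparisons are then cheap vector ops
single_forward_passes = n

print(pair_forward_passes)  # matches the 49 995 000 figure quoted above
```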
Notice how we need to encode our queries and sentences when using Bi-Encoders and how we can directly input a sentence pair when using Cross-Encoders. Also note how the relevant passage is scored higher in each case.
SBERT uses a Siamese network architecture, which consists of two identical networks, sharing the same weights and biases. Each sentence in the sentence pair serves as the input to one of the networks. These networks each return a representation of the input, which are compared against each other to give us a similarity score.
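A toy illustration of the Siamese property: both sentences pass through the same shared function, and only the outputs are compared. The `encode` function below is a bag-of-words counter over a made-up vocabulary, standing in for a real shared-weight network:

```python
import math

VOCAB = ["paris", "is", "beautiful", "city", "amazing"]  # toy vocabulary

def encode(sentence):
    # The single shared "network": both inputs go through this same function
    # (here just a bag-of-words count vector, standing in for a real encoder).
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in VOCAB]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Same encoder applied to both sides; the two outputs are compared for a score
score = cosine(encode("Paris is beautiful"), encode("A beautiful city"))
```

Because the two branches share all parameters, training the pair objective only ever updates one encoder, which is exactly what lets SBERT embed sentences independently at inference time.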
Summary
This post introduced how to approach unsupervised ranking problems using BM25 and context-based search / textual similarity. As far as next steps go, we can go in a couple of directions. We could improve our existing results by experimenting with different sentence embeddings, especially those with a different maximum sequence length, or we could try supervised / unsupervised methods to combine the rankings from different models. The latter is known as rank aggregation; to learn more, see this article.
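As a small taste of rank aggregation, here is a sketch of one classic scheme, Borda count. The function name and the exact point scheme are illustrative choices; many other aggregation methods exist:

```python
def borda_aggregate(rankings):
    # rankings: a list of rankings, each an ordered list of doc ids (best first).
    # Borda count: a doc in position p of an n-item ranking earns n - p points;
    # total points across all rankings determine the aggregate order.
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, if BM25 and a Bi-Encoder each produce a ranking of the same documents, feeding both lists to this function yields a single consensus ordering.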
References and Citations:
[4] Nils Reimers, Iryna Gurevych — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[6] SBERT library
[7] The flowchart called “Guide to choosing ML models for ranking use cases” was designed using Lucidchart.