Tutorial: Unsupervised Ranking Using Machine Learning
Introduction to Ranking
Every day we encounter ranking, especially machine learning-aided ranking, without even realizing it. Whether you are shopping on Amazon, looking for your next flight, searching for a show to binge-watch on Netflix or querying something on Google, you have been exposed to a ranking algorithm.
Unlike standard classification or regression algorithms, ranking algorithms try to order a bunch of possible results based on a query. In this post, we will focus on querying text and returning results that are relevant to the queried text. Here’s a diagram to help you decide what sort of ranking method you’d use given a ranking task:
Learning to Rank
If we are fortunate enough to have labels for our ranking problem (which is usually the case) then we are in the Learning To Rank (LTR or LETOR) domain. The dataset in this case usually takes the following form:
Each entry begins with a relevance score ranging from 0 to 4, with higher scores indicating higher relevance. This is followed by a unique query id. Finally, we have ~700 processed, normalized features describing the query-document pair. Feature engineering plays a huge role in ranking search engine results.
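Concretely, LTR datasets are usually distributed in an SVMlight-style text format: the relevance label, then `qid:`, then `feature:value` pairs. The line below is an invented illustration, not an entry from a real dataset; a minimal parsing sketch might look like:

```python
# Hypothetical line in the SVMlight-style format common to LTR datasets;
# the feature indices and values here are made up for illustration.
line = "2 qid:10 1:0.45 2:0.03 3:1.0"

parts = line.split()
relevance = int(parts[0])                 # graded relevance label (0-4)
query_id = parts[1].split(":", 1)[1]      # query this document was retrieved for
features = {int(k): float(v)              # feature index -> normalized value
            for k, v in (p.split(":") for p in parts[2:])}
```

Real files simply contain one such line per query-document pair.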
These features are heavily guarded industry secrets, which is why it might be hard to find out exactly how these features were computed and what they represent. Some interesting publicly available features for LTR are the number of slashes in a URL, PageRank as a feature, BM25 score as a feature, and quality score of a web page applied on different levels such as the body, title, url and document level.
Please refer to these posts (Post1, Post2) if you’d like to read more about LTR. (To add more fuel to XGBoost’s popularity fire, XGBoost has LTR capability as well.)
Unsupervised Ranking
To make the problem more interesting, let’s consider a case where we do not have any labeled data to train on. One option is applying pre-ML methods such as PageRank and BM25. Another is to try framing the ranking problem as an Information Retrieval problem and tackling it using context based search. For this example, let’s focus on unsupervised ranking using BM25 and context based search.
Typical ranking architectures using ML retrieve the top-k entries based on the input query using a relatively light model. A re-ranker, which is a heavier model, further refines the ranking of these entries. The following figure illustrates this idea:
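The same two-stage idea can be sketched in a few lines of Python. Both scoring functions below are toy stand-ins supplied by the caller; in practice the light scorer might be BM25 and the re-ranker a Cross-Encoder:

```python
def retrieve_then_rerank(query, documents, light_score, heavy_score, k=100):
    # Stage 1: a cheap model scores every document; keep only the top-k candidates
    candidates = sorted(documents, key=lambda d: light_score(query, d), reverse=True)[:k]
    # Stage 2: an expensive model re-scores just those k candidates
    return sorted(candidates, key=lambda d: heavy_score(query, d), reverse=True)

# Toy stand-in scorer: number of words shared between query and document
def word_overlap(query, doc):
    return len(set(query.split()) & set(doc.split()))

docs = ["Paris is beautiful", "Paris is big", "Can I have some coffee?"]
reranked = retrieve_then_rerank("beautiful Paris", docs, word_overlap, word_overlap, k=2)
```

The point of the split is that the expensive model only ever sees k documents, no matter how large the collection is.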
BM25
BM25 is an algorithm with a lot of mileage: it’s more than 30 years old. At the risk of oversimplification, it’s just fancy TF-IDF (Term Frequency-Inverse Document Frequency), where terms that appear too frequently across the corpus (the collection of documents being searched) are penalized accordingly.
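To make that concrete, here is a minimal pure-Python sketch of Okapi BM25. The function name and the default parameters (k1=1.5, b=0.75) are illustrative choices, and this uses one common IDF variant; production libraries add refinements:

```python
import math

def bm25_scores(query_tokens, tokenized_corpus, k1=1.5, b=0.75):
    # Minimal BM25 sketch: one score per document for the given query tokens.
    N = len(tokenized_corpus)
    avgdl = sum(len(doc) for doc in tokenized_corpus) / N
    scores = []
    for doc in tokenized_corpus:
        score = 0.0
        for term in query_tokens:
            n_t = sum(1 for d in tokenized_corpus if term in d)  # docs containing term
            idf = math.log((N - n_t + 0.5) / (n_t + 0.5) + 1)    # rarer term -> higher weight
            tf = doc.count(term)                                 # term frequency in this doc
            # Saturating TF, normalized by document length relative to the average
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

Note how a document that contains none of the query terms scores exactly 0, and how longer-than-average documents are slightly penalized via the `b` term.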
The BM25 algorithm returns a score for each document in the corpus for a particular input query with a higher score representing a better result. When a query contains multiple words we use a linearly weighted combination of scores for each word. This score is influenced by the TF-IDF as well as the length of the document. A small example follows:
from rank_bm25 import BM25Okapi

corpus = [
    "Paris is beautiful",
    "This city is amazing and I would visit again",
    "Can I have some coffee?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)
We have created a corpus and initialized the BM25 model. Each sentence in our corpus is considered to be a document. Next we define a function that will return a score for each document:
def return_bm25_scores(query):
    tokenized_query = query.split(" ")
    doc_scores = bm25.get_scores(tokenized_query)  # one BM25 score per document
    print(doc_scores)
    return doc_scores
Let’s check the BM25 scores for a couple of simple queries. (It’s trivial to convert a list of scores to a list of ranks):
As expected, the relevant document returns the highest score, and documents that do not contain the queried words return a score of 0. Let’s see what happens when we query “paris” instead of “Paris”.
Since BM25 matches terms exactly, changing the case does have an effect on the result. However, we could simply lowercase the entire corpus as well as the query to rectify this. Let’s check whether changing the order of the words has an impact:
BM25 is order agnostic and doesn’t really take context into account. This will be remedied by the Bi-Encoder and Cross-Encoder (SBERT) models.
Framing Ranking as a Contextual Text Similarity task
This process uses sentence embeddings where each sentence is represented as a vector. Similar to the process used to compute BM25 scores, we compute the sentence embedding for a query and compare that to the sentence embeddings of all the entries in the corpus. Each query-document pair is assigned a score and then ranked against one another.
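The comparison step is typically cosine similarity between embedding vectors. A minimal sketch follows; the three-dimensional vectors below are toy stand-ins, not real sentence embeddings, which typically have hundreds of dimensions:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query_emb = [0.9, 0.1, 0.0]           # toy "query" embedding
doc_embs = [[0.8, 0.2, 0.1],          # toy "document" embeddings
            [0.0, 0.1, 0.9]]

# Score every document against the query, then sort document indices by score
scores = [cosine_similarity(query_emb, d) for d in doc_embs]
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Unlike BM25, nothing here depends on exact word matches: two sentences with no words in common can still land close together in embedding space.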
The bedrock of all our work is the BERT model which generates bi-directional representations based on both the left and right context. The BERT model takes a sentence pair as input, where each word in the sentence pair has the following embeddings:
- The token embeddings for a word
- Segment embeddings, which indicate whether a word belongs to the first or the second sentence in the sentence pair
- Position embeddings
Bi-Encoders and Cross-Encoders
We can find similar sentences in a corpus with the BERT sentence pair input but the process is computationally expensive. Here are some extracts from the SBERT paper on finding similar sentences that compare BERT to SBERT:
Finding in a collection of n = 10 000 sentences the pair with the highest similarity requires with BERT n·(n−1)/2 = 49 995 000 inference computations.
This (SBERT) reduces the effort for finding the most similar pair from 65 hours with BERT / RoBERTa to about 5 seconds with SBERT, while maintaining the accuracy from BERT.
This is where Bi-Encoders and Cross-Encoders come in. Bi-Encoders are computationally inexpensive by comparison and can be used to generate embeddings for a sentence. Cross-Encoders are computationally expensive and, just like BERT itself, cannot generate embeddings. These fit in quite well with the proposed ML Search Engine architecture, with a Bi-Encoder being used to retrieve an initial set of results that are then re-ranked by a Cross-Encoder.
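The quoted numbers follow directly from how the two approaches scale: a pair-input model (BERT / Cross-Encoder) needs one expensive forward pass per sentence pair, while a Bi-Encoder needs one per sentence, after which similarities are cheap vector operations. The arithmetic from the paper’s example:

```python
n = 10_000  # collection size from the SBERT paper's example

# Cross-Encoder / BERT pair input: one forward pass per sentence pair
pair_forward_passes = n * (n - 1) // 2

# Bi-Encoder: one forward pass per sentence; comparisons are then cheap vector ops
single_forward_passes = n

print(pair_forward_passes)  # matches the 49 995 000 figure quoted above
```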
Notice how we need to encode our queries and sentences when using Bi-Encoders and how we can directly input a sentence pair when using Cross-Encoders. Also note how the relevant passage is scored higher in each case.
SBERT uses a Siamese network architecture, which consists of two identical networks, sharing the same weights and biases. Each sentence in the sentence pair serves as the input to one of the networks. These networks each return a representation of the input, which are compared against each other to give us a similarity score.
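A toy illustration of the Siamese property: both sentences pass through the same shared function, and only the outputs are compared. The `encode` function below is a bag-of-words counter over a made-up vocabulary, standing in for a real shared-weight network:

```python
import math

VOCAB = ["paris", "is", "beautiful", "city", "amazing"]  # toy vocabulary

def encode(sentence):
    # The single shared "network": both inputs go through this same function
    # (here just a bag-of-words count vector, standing in for a real encoder).
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in VOCAB]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Same encoder applied to both sides; the two outputs are compared for a score
score = cosine(encode("Paris is beautiful"), encode("A beautiful city"))
```

Because the two branches share all parameters, training the pair objective only ever updates one encoder, which is exactly what lets SBERT embed sentences independently at inference time.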
Summary
This post introduced how to approach unsupervised ranking problems using BM25 and context-based search / textual similarity. As far as next steps go, we can go in a couple of directions. We could improve our existing results by experimenting with different sentence embeddings, especially those with a different maximum sequence length, or we could try supervised / unsupervised methods to combine the rankings from different models. The latter is known as rank aggregation; to learn more, see this article.
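As a small taste of rank aggregation, here is a sketch of one classic scheme, Borda count. The function name and the exact point scheme are illustrative choices; many other aggregation methods exist:

```python
def borda_aggregate(rankings):
    # rankings: a list of rankings, each an ordered list of doc ids (best first).
    # Borda count: a doc in position p of an n-item ranking earns n - p points;
    # total points across all rankings determine the aggregate order.
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0) + (n - pos)
    return sorted(scores, key=scores.get, reverse=True)
```

For example, if BM25 and a Bi-Encoder each produce a ranking of the same documents, feeding both lists to this function yields a single consensus ordering.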
References and Citations:
[4] Nils Reimers, Iryna Gurevych — Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
[6] SBERT library
[7] The flowchart called “Guide to choosing ML models for ranking use cases” was designed using Lucidchart.