Learning to Rank 101

Getting started with building search functionality for your project is easier today than ever, from top-notch open source solutions such as Elasticsearch and Solr to fully managed cloud solutions such as Algolia, Elastic Cloud, Amazon CloudSearch, Azure Search and others.

However, relevance is still the poor relation in such solutions, but it doesn’t have to stay that way in your organization. This post will not be about how relevance can be a great thing for your project, though; instead we will introduce learning to rank, a technique that brings machine learning and search closer together.

Learning to rank is a family of supervised and semi-supervised learning methods that will, hopefully, enhance your search results based on a collection of features characterizing your queries, your user profiles and behavior, and your document collection.

If you have ever done some ML work, you might be wondering what makes this problem different from other kinds of ML problems. While other problems are about classification, prediction (regression), clustering, etc., in search the main interest is in providing the best set of results for the user, ordered by relevance.

(Image courtesy of Nikhil Dandekar)

Just imagine how powerful this technique could be for your search capabilities: wouldn’t it be awesome to provide personalization to your users, bringing the most relevant search results based on their preferences?

The most popular algorithms in learning to rank are RankNet, LambdaRank and LambdaMART, but there are other methods, such as gradient boosting variations. As a general idea, all these algorithms try to optimize the most common IR metrics, such as NDCG [3], MRR [2] and MAP.

In this article we will not go into the details of how they work; if you are interested in knowing more (and you should be), check the nice overview paper by C. Burges [1].

Let’s say you want to start experimenting with this. For the purposes of this article we will use Elasticsearch and the nice plugin [4] developed by Open Source Connections, which brings learning to rank to Elasticsearch. If you use Solr, similar functionality is available; check its documentation for more details.

Setting up Learning to Rank in Elasticsearch

The first thing you need before getting started is a judgement list. This list is an evaluation, based on human experts, of what good search results look like, complemented with feedback from user interactions.

It should look like this:

Initial judgement list
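For illustration, a minimal judgement list in this format could look like the following sketch, where each line holds a relevance grade, a query id and, as a trailing comment, the document id and the query keywords (all values here are made up):

4 qid:1 # 7555  rambo
3 qid:1 # 1370  rambo III
0 qid:1 # 26365 rambo rocky
4 qid:2 # 1366  rocky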

This is the format required by RankLib, the library used to run the algorithms; more details on the format can be found on the Lemur Project [8] wiki [5].

The next step is to collect the features needed to enhance the previous judgements, so the algorithm can do its work and perform the necessary arrangements and calculations. It is important to note that, as these are supervised (or semi-supervised) algorithms, it will be necessary to create a training set and a test set.

Using the Elasticsearch plugin, you should first register a feature set, which is the collection of features that the plugin will collect for you. This can be done with this command:

PUT localhost:9200/_ltr/_featureset/docs_features

and a body like this, with a name and the collection of features to be collected:

Feature set definition for the Elasticsearch LTR plugin
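A minimal sketch of such a body, assuming an index with title and description fields (the feature names, the fields and the keywords parameter are illustrative, not prescribed by the plugin):

{
  "featureset": {
    "name": "docs_features",
    "features": [
      {
        "name": "title_match",
        "params": ["keywords"],
        "template_language": "mustache",
        "template": {
          "match": { "title": "{{keywords}}" }
        }
      },
      {
        "name": "description_match",
        "params": ["keywords"],
        "template_language": "mustache",
        "template": {
          "match": { "description": "{{keywords}}" }
        }
      }
    ]
  }
}

Each feature is just a templated Elasticsearch query; its score against a document becomes the feature value.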

After the features have been registered, the next task is to collect them and use them to enhance the judgement list that we saw earlier. To collect features we will use the sltr query provided by the plugin, which executes the feature collection, and then log the response using the ltr_log extension.

It is important to execute the collection as a filter, so it does not affect scoring. You can do that with a query like the next one.

Log features in the Elasticsearch LTR plugin
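A sketch of such a logging request, continuing the illustrative example above (the index name, document ids and keywords are made up):

POST localhost:9200/docs/_search

{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "_id": ["7555", "1370", "26365"] } },
        {
          "sltr": {
            "_name": "logged_featureset",
            "featureset": "docs_features",
            "params": { "keywords": "rambo" }
          }
        }
      ]
    }
  },
  "ext": {
    "ltr_log": {
      "log_specs": {
        "name": "log_entry0",
        "named_query": "logged_featureset"
      }
    }
  }
}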

See how the bool query uses the document ids to retrieve the documents of interest, then applies the sltr query, selecting the expected feature set (previously created), and logs the result back to the requester.

So now, hopefully, we have an extended judgement list that will look something like this:

Enhanced judgement list in the SVM-Rank format
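Continuing the sketch from earlier, each logged feature is appended as an ordinal:value pair (the feature values below are invented for illustration):

4 qid:1 1:12.318474 2:10.573917 # 7555  rambo
3 qid:1 1:10.357876 2:11.950391 # 1370  rambo III
0 qid:1 1:7.010513  2:0.0       # 26365 rambo rocky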

As we saw earlier, this is the format supported by RankLib [5], the same used by SVM-Rank and the LETOR datasets [6]. This format can also be used by XGBoost [7] and other libraries used for ranking.

The enhanced judgement list will be the basis used to train the ML algorithm; every time we need to generate a new model we will require an updated judgement list with features.

We have come a long way. The last step before we can use learning to rank in Elasticsearch is to generate the model using RankLib and upload it to the plugin, which can be done like this:

RankLib model definition in the Elasticsearch LTR plugin
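A sketch of the upload request, using the plugin’s _createmodel endpoint (the model name here is illustrative, and the definition is abbreviated):

POST localhost:9200/_ltr/_featureset/docs_features/_createmodel

{
  "model": {
    "name": "docs_model",
    "model": {
      "type": "model/ranklib",
      "definition": "## LambdaMART\n## No. of trees = 10\n..."
    }
  }
}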

where the model definition will be the content of the file generated by the algorithm library.

Searching with Learning to Rank

At this point nearly all the hard work is done; the next, and last, step is to search using the model. For this we will again use the sltr query introduced by the Elasticsearch Learning to Rank plugin.

As we saw earlier, LTR, or learning to rank, is introduced to improve the accuracy of your search by using machine learning. The ML model is applied after the initial ranking done by the search engine; however, ML models can be pretty expensive to run, so we will rescore only the top N results using the model.

Searching with LTR can be done using a query like this one:

Search with LTR in the Elasticsearch LTR plugin
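A sketch of such a request, rescoring the top results with the illustrative model uploaded above (the index, field and window size are assumptions for the example):

POST localhost:9200/docs/_search

{
  "query": {
    "match": { "title": "thriller" }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "sltr": {
          "params": { "keywords": "thriller" },
          "model": "docs_model"
        }
      }
    }
  }
}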

This query searches for the token thriller; then, after the original query is done, we apply a rescore phase using the already familiar sltr query. The query must reference the previously uploaded model by its name. For more details please check the very detailed plugin documentation [9].

Recap

In this very long post we introduced how to use LTR in Elasticsearch using the nice plugin [4] developed by Open Source Connections. This article does not intend to be a fully descriptive experience; for that you can check the nice plugin documentation [9].

If you are more inclined to learn hands on, you can review a couple of demos, such as the ones bundled with the plugin repository [4].

For the ones like me, who like to read books, I do recommend reading:

  • Tie-Yan Liu (2009), “Learning to Rank for Information Retrieval”, Foundations and Trends in Information Retrieval, 3 (3): 225–331, http://www.springer.com/de/book/9783642142666

Hope you enjoyed the reading, and want to learn more about how to improve your search relevance.

References

[1] From RankNet to LambdaRank to LambdaMART: An Overview. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf

[2] Mean reciprocal rank Wikipedia Page. https://en.wikipedia.org/wiki/Mean_reciprocal_rank

[3] Discounted cumulative gain Wikipedia Page. https://en.wikipedia.org/wiki/Discounted_cumulative_gain

[4] The Elasticsearch LTR plugin. https://github.com/o19s/elasticsearch-learning-to-rank

[5] RankLib file format. https://sourceforge.net/p/lemur/wiki/RankLib%20File%20Format/

[6] LETOR: Learning to Rank for Information Retrieval. https://www.microsoft.com/en-us/research/project/letor-learning-rank-information-retrieval/?from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fum%2Fbeijing%2Fprojects%2Fletor%2F

[7] XGBoost, eXtreme Gradient Boosting. https://github.com/dmlc/xgboost

[8] The Lemur Project. https://sourceforge.net/projects/lemur/

[9] The Elasticsearch LTR plugin documentation. http://elasticsearch-learning-to-rank.readthedocs.io/en/latest/searching-with-your-model.html