Searching for the right information has always been difficult. Not so long ago, relevant documents were stored in physical libraries and discovering them was a lengthy and complicated process.
When documents became available through online repositories, the number of indexed documents started to grow beyond physical storage limits. The same applies to the number of products offered by e-commerce sites or to content available through online streaming services.
Users tend to prefer finding everything in one place and most of them enjoy choosing from more relevant alternatives so service providers need to adapt. Global services (such as Google, Amazon, Netflix, Spotify), where users can find almost anything are on the rise. One of the most powerful tools driving their global dominance is their highly advanced personalization powered by machine learning techniques. These techniques are recommender systems and personalized search.
Recommender systems enable users to discover relevant documents, products, or content online. Often, items that users might like the most are hidden deep among millions of other items. Users are not able to find such items directly through the search engine because they rarely know their label or might not even be aware of their existence.
On the other hand, sometimes users are looking for a specific item and are willing to help the online system by expressing their needs in order to reduce number of possible items that can be recommended.
There are several ways to assist users within expressing their needs. User experience plays a very important role here. A lot of users access online services via their mobile phones with a limited ability to show their interest. Online services should focus on filtering possible search results by utilizing all information available.
User geo-location can narrow down possible search and recommendation results significantly. In Recombee, you can for example choose to recommend only items that are within a certain radius from the user’s location. Alternatively, you can boost the probability that an item is recommended when being geographically closer to given user.
The user might wish to filter out possible search results using specific tags or categories. It usually takes just a single click to filter-out all items other than a specific category (e.g. all documents other than sci-fi novels). One should enable users to express their interests as easily as possible.
Some percentage of users (usually below 10% of active sessions, depending on the accessibility of the search bar, quality and visibility of recommendations) are willing to narrow down the search results using a text query (even if it is just by a few characters). Their intent might be to find a specific category of items or search for a specific one directly by simply already knowing the label of the item they are looking for. The text they enter is called a user query and this blogpost is discussing how to utilize a query to assist a user in finding what she/he is looking for. The blog post starts with the theoretical part followed by a practical one.
The problem of finding appropriate items for given text query has been studied for decades as the information retrieval (IR). An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single item (document) in the collection. Instead, several items may match the query, perhaps with different degrees of relevance.
Traditional methods try to match a query with documents and derive relevance based on the similarity. Machine learning methods solve the IR problem by constructing a ranking model from a training data. How such training data (for a search engine) can look like? Typically, it is a collection of “properly” ranked documents for each query.
Here is the scheme of the IR system as described in related blogpost:
The system works by matching queries and documents and computing their similarity. Most similar documents are returned ranked by similarity with the query. Similarity is computed eg. as cosine similarity of TF-IDF vectors.
Learning to rank (LTR) is an application of machine learning to rank items according to human expectation. LTR models are typically trained using human labelled data.
In the recall phase, the LTR model gets a query and subset of documents(items) produced by search engines as an input and outputs relevance for every item. Eventually, it can output a ranked list of documents (K-most relevant documents). Note that modern systems can also take user profiles as an input and perform personalized learning to rank (LTR) machine learning tasks.
What is the difference among classical predictive models, learning to rank models and recommender systems then?
- Predictive models/classifiers typically have just a few output attributes, they are not designed to rank millions of items for millions of users.
- Learning to Rank systems often return an identical list of results to given query for all users, no personalization is involved.
- Recommender systems do not use queries. They produce relevant items based on user history and similarities between users. Relevant items are computed by predicting their rating in the rating matrix or by recommending similar items based on their attributes.
Next section is useful for both LTR and recommender systems, because evaluation of models is similar as opposed to classical predictive models in machine learning.
Evaluating LTR and recommender systems
Cumulative gain measures how relevant are top k items returned by learning to rank system or a recommender system.
For example we can sum up the relevance of 6 returned items (note that item 4 is irrelevant).
The problem of CG is that it does not take position of an item into consideration. For example, the first recommendation can be displayed with a much bigger image than the other five. Also, users tend to glance at a few items on top of the list and the probability that they see items further down the list is a lot less likely. For this reason, discounted cumulative gain (DCG) is more popular than simple CG.
In DCG, relevance value is reduced logarithmically proportionally to the position of the result.
Some variants place even stronger emphasis on retrieving relevant documents on the top of the list.
Suppose a dataset contains N queries. It is a common practice to normalize the DCG score for each query and report the normalized DCG (“NDCG”) score averaged over all queries. It it pretty satisfying, having such a nice evaluation metric, but remember real world is cruel.
Traditional learning to rank algorithms
- Pointwise methods transform ranking into regression or classification of single items. The model then takes only one item at a time and either it predicts its relevance score or it classifies the item into one of the relevancy classes.
- Pairwise methods tackle the problem as a classification of item pairs, i.e. deciding whether the pair of items belongs to the class of the pairs with the more relevant document at the first position or vice versa.
- Listwise methods treat a whole item list as a learning sample. For example, by using all item scores belonging to one particular query and not by comparing pairs or single samples only.
Here are few examples of LTR algorithms:
PRank algorithm uses Perceptron (linear function) to estimate the score of a document from a feature vector of this document. The query is appended to the feature vector of the document embedding. We can also classify documents into relevancy classes (e.g. relevant/irrelevant). The function can be modelled by almost any machine learning method. Most algorithms use decision trees and forests. Modern methods utilize deep learning networks.
It is apparent that when training the model on pairs of input embeddings and the corresponding output relevance, we do not directly minimize NDCG or other above described evaluation criteria. In line with pointwise methods, pairwise and listwise methods also use proxy differentiable loss functions.
In order to understand the pairwise methods better, one should remember the cross entropy loss used in binary classification, which is heavily penalizing confident models with false prediction.
The log loss can be computed by summing up penalty for zero and one label:−(y log(p) + (1 − y) log(1 − p))
As you see, wrong confident answers get the highest loss.
Rankboost directly optimizes the classification error. It is derived from Adaboost and works on document pairs. It trains weak classifiers giving more weight to pairs that were not correctly classified by ensemble in the previous step.
RankSVM was one of the first algorithms with pairwise approach to the problem. It approaches the ranking as ordinal regression, the thresholds of the classes have to be trained as well. RankSVM employs minimization of hinge loss function. It also allows direct use of kernels for non-linearity.
Motivation for listwise methods
Pairwise methods are very good, but they have drawbacks as well. Training process is costly and there is an inherent training bias that varies largely from query to query. Also only pairwise relationships are taken into account. We would like to use a criterion that allows us to optimize full list at simultaneously taking into account relevance of all items.
How to get training data for the LTR system?
Obtaining training data for the LTR system can be a lengthy and costly process. You typically need a cohort of human annotators entering queries and judging the search results. Relevance judgement is also difficult. Human assessors estimate one of the following scores:
- Degree of relevance — Binary: relevant vs. irrelevant (good for pointwise)
- Pairwise preference — Document A is more relevant than document B.
- Total order — Documents are ranked as A,B,C,.. according to their relevance. (perfect for listwise, but exhaustive and time consuming)
It is apparent that human annotations are costly and their labels are not very reliable. One should therefore derive ranking and train the system from behavior of users on the website.
Even better approach is to replace the above described LTR algorithm with a recommender system.
Personalized search revisited
When search results are sorted according to the preferences of individual users, the overall satisfaction with the search functionality increases significantly.
Instead of applying classical learning to rank machine learning (LTR) models, as described above, Recombee’s solution combines search engine with our powerful recommender system. There are several advantages of such an approach and we will examine them in followup blog posts.