Introduction to personalized search

Published in

Recombee blog

9 min readMar 24, 2020

Searching for the right information has always been difficult. Not so long ago, relevant documents were stored in physical libraries and discovering them was a lengthy and complicated process.

When documents became available through online repositories, the number of indexed documents started to grow beyond physical storage limits. The same applies to the number of products offered by e-commerce sites or to content available through online streaming services.

Users tend to prefer finding everything in one place and most of them enjoy choosing from more relevant alternatives so service providers need to adapt. Global services (such as Google, Amazon, Netflix, Spotify), where users can find almost anything are on the rise. One of the most powerful tools driving their global dominance is their highly advanced personalization powered by machine learning techniques. These techniques are recommender systems and personalized search.

Recommender systems use a history of users interacting with items to produce ranked list of most relevant items for any given user. Search engine ranks items based on similarity with given query regardless of user history.

Recommender systems enable users to discover relevant documents, products, or content online. Often, items that users might like the most are hidden deep among millions of other items. Users are not able to find such items directly through the search engine because they rarely know their label or might not even be aware of their existence.

On the other hand, sometimes users are looking for a specific item and are willing to help the online system by expressing their needs in order to reduce number of possible items that can be recommended.

There are several ways to assist users within expressing their needs. User experience plays a very important role here. A lot of users access online services via their mobile phones with a limited ability to show their interest. Online services should focus on filtering possible search results by utilizing all information available.

User geo-location can narrow down possible search and recommendation results significantly. In Recombee, you can for example choose to recommend only items that are within a certain radius from the user’s location. Alternatively, you can boost the probability that an item is recommended when being geographically closer to given user.

The user might wish to filter out possible search results using specific tags or categories. It usually takes just a single click to filter-out all items other than a specific category (e.g. all documents other than sci-fi novels). One should enable users to express their interests as easily as possible.

Some percentage of users (usually below 10% of active sessions, depending on the accessibility of the search bar, quality and visibility of recommendations) are willing to narrow down the search results using a text query (even if it is just by a few characters). Their intent might be to find a specific category of items or search for a specific one directly by simply already knowing the label of the item they are looking for. The text they enter is called a user query and this blogpost is discussing how to utilize a query to assist a user in finding what she/he is looking for. The blog post starts with the theoretical part followed by a practical one.

Information Retrieval

The problem of finding appropriate items for given text query has been studied for decades as the information retrieval (IR). An information retrieval process begins when a user enters a query into the system. Queries are formal statements of information needs, for example search strings in web search engines. In information retrieval a query does not uniquely identify a single item (document) in the collection. Instead, several items may match the query, perhaps with different degrees of relevance.

Traditional methods try to match a query with documents and derive relevance based on the similarity. Machine learning methods solve the IR problem by constructing a ranking model from a training data. How such training data (for a search engine) can look like? Typically, it is a collection of “properly” ranked documents for each query.

Here is the scheme of the IR system as described in related blogpost:

Classical IR system is not personalized, it just returns most relevant documents for a query q. Machine learning is often not needed as the system follows predefined procedures (such as TF-IDF similarity lookup).

The system works by matching queries and documents and computing their similarity. Most similar documents are returned ranked by similarity with the query. Similarity is computed eg. as cosine similarity of TF-IDF vectors.

Search results can be improved by re-ranking (using a machine learning model). In this example search engine is also used to reduce the number of candidate items for the machine learning model so the scoring is faster.

Learning to rank (LTR) is an application of machine learning to rank items according to human expectation. LTR models are typically trained using human labelled data.

In the recall phase, the LTR model gets a query and subset of documents(items) produced by search engines as an input and outputs relevance for every item. Eventually, it can output a ranked list of documents (K-most relevant documents). Note that modern systems can also take user profiles as an input and perform personalized learning to rank (LTR) machine learning tasks.

What is the difference among classical predictive models, learning to rank models and recommender systems then?

Predictive models/classifiers typically have just a few output attributes, they are not designed to rank millions of items for millions of users.
Learning to Rank systems often return an identical list of results to given query for all users, no personalization is involved.
Recommender systems do not use queries. They produce relevant items based on user history and similarities between users. Relevant items are computed by predicting their rating in the rating matrix or by recommending similar items based on their attributes.

Next section is useful for both LTR and recommender systems, because evaluation of models is similar as opposed to classical predictive models in machine learning.

Evaluating LTR and recommender systems

Cumulative gain measures how relevant are top k items returned by learning to rank system or a recommender system.

For example we can sum up the relevance of 6 returned items (note that item 4 is irrelevant).

Items displayed to a user have seldom uniform visibility. E.g., in e-commerce, visibility of recommended items decreases sharply as most of the users do not want to scroll down the list. In the media domain, one item is often highlighted and others are harder to spot.

The problem of CG is that it does not take position of an item into consideration. For example, the first recommendation can be displayed with a much bigger image than the other five. Also, users tend to glance at a few items on top of the list and the probability that they see items further down the list is a lot less likely. For this reason, discounted cumulative gain (DCG) is more popular than simple CG.

In DCG, relevance value is reduced logarithmically proportionally to the position of the result.

Discounted cumulative gain can be easily computed as demonstrated in the above example.

Some variants place even stronger emphasis on retrieving relevant documents on the top of the list.

Suppose a dataset contains N queries. It is a common practice to normalize the DCG score for each query and report the normalized DCG (“NDCG”) score averaged over all queries. It it pretty satisfying, having such a nice evaluation metric, but remember real world is cruel.

Traditional learning to rank algorithms

Pointwise methods transform ranking into regression or classification of single items. The model then takes only one item at a time and either it predicts its relevance score or it classifies the item into one of the relevancy classes.
Pairwise methods tackle the problem as a classification of item pairs, i.e. deciding whether the pair of items belongs to the class of the pairs with the more relevant document at the first position or vice versa.
Listwise methods treat a whole item list as a learning sample. For example, by using all item scores belonging to one particular query and not by comparing pairs or single samples only.

Here are few examples of LTR algorithms:

PRank algorithm uses Perceptron (linear function) to estimate the score of a document from a feature vector of this document. The query is appended to the feature vector of the document embedding. We can also classify documents into relevancy classes (e.g. relevant/irrelevant). The function can be modelled by almost any machine learning method. Most algorithms use decision trees and forests. Modern methods utilize deep learning networks.

The final ranked list is obtained by scoring all documents and sorting them according to their predicted relevance.

It is apparent that when training the model on pairs of input embeddings and the corresponding output relevance, we do not directly minimize NDCG or other above described evaluation criteria. In line with pointwise methods, pairwise and listwise methods also use proxy differentiable loss functions.

In order to understand the pairwise methods better, one should remember the cross entropy loss used in binary classification, which is heavily penalizing confident models with false prediction.

The log loss can be computed by summing up penalty for zero and one label:−(y log(p) + (1 − y) log(1 − p))

As you see, wrong confident answers get the highest loss.

Highest loss (cost) is achieved when a model predicts a high difference in relevance and there is certainly a high difference, but the opposite one. More on gradient training algorithms for LTR systems can be found here.

Rankboost directly optimizes the classification error. It is derived from Adaboost and works on document pairs. It trains weak classifiers giving more weight to pairs that were not correctly classified by ensemble in the previous step.

RankSVM was one of the first algorithms with pairwise approach to the problem. It approaches the ranking as ordinal regression, the thresholds of the classes have to be trained as well. RankSVM employs minimization of hinge loss function. It also allows direct use of kernels for non-linearity.

Motivation for listwise methods

Pairwise methods are very good, but they have drawbacks as well. Training process is costly and there is an inherent training bias that varies largely from query to query. Also only pairwise relationships are taken into account. We would like to use a criterion that allows us to optimize full list at simultaneously taking into account relevance of all items.

Advantage of exponential ranking is that even when the model f assigns similar scores to all documents, their top one probabilities will be very different — best document close to 1 and less relevant predictions close to 0.

Here, the loss is computed for a list of documents. We do not care much about irrelevant documents Py(x)=0, the highest loss is caused by relevant documents that are predicted to be less relevant. For more info on listwise loss follow this document.

How to get training data for the LTR system?

Obtaining training data for the LTR system can be a lengthy and costly process. You typically need a cohort of human annotators entering queries and judging the search results. Relevance judgement is also difficult. Human assessors estimate one of the following scores:

Degree of relevance — Binary: relevant vs. irrelevant (good for pointwise)
Pairwise preference — Document A is more relevant than document B.
Total order — Documents are ranked as A,B,C,.. according to their relevance. (perfect for listwise, but exhaustive and time consuming)

It is apparent that human annotations are costly and their labels are not very reliable. One should therefore derive ranking and train the system from behavior of users on the website.

Even better approach is to replace the above described LTR algorithm with a recommender system.

Personalized search revisited

When search results are sorted according to the preferences of individual users, the overall satisfaction with the search functionality increases significantly.

Personalized search should also take into account user preferences, historical interactions and the interactions of similar users. Why not utilize a recommender system for this? Two users can expect very different recommendations for the same search query.

Instead of applying classical learning to rank machine learning (LTR) models, as described above, Recombee’s solution combines search engine with our powerful recommender system. There are several advantages of such an approach and we will examine them in followup blog posts.

Our approach to personalized search combines a search engine and a recommender system. First, recommended items (regardless of a query) are re-ranked by a search engine to filter out irrelevant recommendations and push items that match the query with their descriptions. Secondly, the search engine returns best matching candidate items regardless of user profile or interaction history. Then, those items are re-ranked by a recommender to better suit the taste of each particular user. The final result is produced by ranked voting of the above streams.

Test out Recombee’s full-text personalized search solution with unlimited 30-days trial or free plan. More info in our documentation.