Real-Time Evaluation of Information Retrieval (IR) Systems

by Malvina Matlis & Marcos Calvo Lance
Data Science at Microsoft · Nov 29, 2022

How hard is it to find an authentic recipe from Valencia?

Information retrieval (IR) systems are ubiquitous in our day-to-day lives. Maybe it’s settling bets by searching for the answer online, locating your favorite products on Amazon, finding an email in Outlook, or perhaps you’re just hungry and want to cook a delicious, authentic paella. Our lives would probably not be as easy (and not as much fun) without all those search boxes that allow us to find all sorts of documents and information, be it text, audio, photos, videos, or something else. However, for a user to find an IR system useful, the system needs to provide a relevant answer in a timely manner. So how do we make sure a search engine provides users with content relevant to their queries in a “cold start” scenario, i.e., when there is no history of queries? And how do we evaluate the live performance of an IR system once it is already in production?

In this blog post we will demonstrate how we chose to evaluate such a system using online monitoring and log analysis with Kusto, a Microsoft querying tool, but these concepts can be applied to other languages and tech stacks as well.

Evaluating IR systems

Evaluation of IR systems has been extensively studied and documented, and a number of standard metrics have been defined; a good overview can be found in Evaluation Metrics For Information Retrieval.
Some of these standard metrics are:

Precision
It quantifies the proportion of the results returned by the system that are relevant to the query (i.e., the search term). For example, in the image above we searched for an authentic paella recipe and the search engine retrieved 330 million documents. If we had a reference telling us that 200 million of those documents are relevant to the recipe search, then the precision for the query would be 200/330 = 60.6%.
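In equation form:

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)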

Recall
It measures the proportion of all relevant results that are actually returned by the system. Back to our example: what if the search engine missed 20 million relevant documents? Then the recall would be 200/(200+20) = 90.9%.
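Or, in equation form:

Recall = (number of relevant documents retrieved) / (total number of relevant documents)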

Reciprocal Rank (RR)
It measures where in the list of returned results the first relevant document appears. Since it is convenient to have metrics whose values range between 0 and 1, with 1 representing a better score than 0, the reciprocal rank is defined as 1 divided by the rank of the first relevant document in the returned list. To measure it over a set of queries, we simply average the reciprocal rank over the query set, obtaining the Mean Reciprocal Rank (MRR), or in equation form:
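MRR = (1 / |Q|) · Σ_{i = 1..|Q|} 1 / rank(i)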

where rank(i) is the rank position (counting from 1) of the first relevant document for the i-th query.

Since we are still hungry and are not going to go through 330 million recipes, the RR looks at the first document we marked as relevant. If that was the 5th recipe, then the RR for the query is 1/5.

Easier said than done

As with most machine learning problems, evaluation metrics are usually computed offline on a predefined, annotated, and curated evaluation dataset, which is hopefully representative enough of the system’s behavior in the wild. However, what if there is no available dataset that is representative enough of the users’ needs? Or what if those needs are unknown?

Online metrics — let’s collect feedback
An alternative to using a predefined evaluation dataset to assess the system’s quality is to leverage information about how the system is used. By collecting information about the user journey through the IR system, we are able to evaluate and monitor the returned results. As an example, let us consider the following user flow in an application:

Generic overview of a search journey. At any step users can go back to the previous steps. Progress can be monitored by the event’s timestamp. Image created by the authors.
  1. Input query: The user inputs a query (“authentic paella recipe” in our example) into the application, which is passed to the search engine.
  2. Document browsing: The system returns a list of documents, and the application displays them sorted by their relevance.
  3. Document inspecting: The user browses the returned documents in order and selects one to inspect it further.
  4. Success: If the user is happy with the inspected document, they can inform the application that the search has finished. One option is to rate the result positively. Another is to use proxy methods to infer that the search was successful, such as measuring the time the user spent inspecting the document.
  5. If the user is not happy with the inspected document, they return to browsing the results (point 2) and may select a new document to inspect.

Adding logging to our application provides us with insights into how users interact with the system. One handy tool we used is Azure AppInsights and its Python SDK. As shown in the code below, once the logging handlers are set up, logging user behavior is as simple as emitting a log record whose logging level reflects the severity of the event being logged.
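A minimal sketch of what this could look like with the opencensus-ext-azure integration for Application Insights; the helper function, the query_id and rank dimensions, and the placeholder connection string are illustrative, while the event names follow the journey described above:

```python
import logging
from typing import Optional

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)
logger.addHandler(
    AzureLogHandler(connection_string="InstrumentationKey=<your-instrumentation-key>")
)


def log_search_event(name: str, query_id: str, rank: Optional[int] = None) -> None:
    """Log a user-journey event, tagged with its query and, when relevant, the document rank."""
    dimensions = {"event": name, "query_id": query_id}
    if rank is not None:
        dimensions["rank"] = str(rank)
    # INFO is enough for regular journey events; warning/error would flag anomalous ones.
    logger.info(name, extra={"custom_dimensions": dimensions})


# Example: the user marked the document shown at rank 5 as relevant.
log_search_event("onSuccess", query_id="q-001", rank=5)
```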

Kusto to the rescue

Observability mechanisms such as logging generate a huge amount of data, which poses the additional challenge of querying it.

Logs can be inspected in the Azure AppInsights portal using Azure Data Explorer and, specifically, its Kusto query language. If you are familiar with SQL, you will recognize quite a few similarities in this language. However, there is a significant difference between Kusto and SQL: in Kusto, the operators in a query are chained together and executed in order, which makes queries shorter and more readable than their SQL equivalents.

Evaluation metrics using user event logging

Traditionally, logging has been extensively used for monitoring and engineering purposes. A few examples include knowing how many queries per unit of time the system needs to serve, how many users are querying the system, or how much time a user spends inspecting a document on average. However, Data Scientists can also benefit from appropriate logging to calculate Key Performance Indicators (KPIs) and to evaluate the system’s performance online in a continuous manner.

Mean Reciprocal Rank (MRR)

Of the IR evaluation metrics mentioned above, the Reciprocal Rank (RR) only requires the rank of the first relevant result in the returned document list, which we can determine through end-user interaction. Therefore, it can be computed on real traffic once appropriate logging is in place. In our example, when an onSuccess event is triggered it means that the user considered a result relevant. From a user experience perspective, the RR for a given query tells us how quickly the user found the first document relevant to their needs.

The MRR can then be computed by averaging the RR over a meaningful time period, for example a day, across a statistically significant user population, so that different queries and user profiles are aggregated and represented in the metric. This way we can see, for example, that users from Spain find the first relevant recipe at the 15th document on average, while users from Israel find it at the 3rd document on average.

The following Kusto query computes exactly that.

Daily Mean Reciprocal Rank
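A minimal sketch of such a query, assuming the onSuccess events land in the customEvents table with query_id and rank recorded in customDimensions (adjust the names to your own logging schema):

```kusto
customEvents
| where timestamp > ago(30d) and name == "onSuccess"
| extend QueryId = tostring(customDimensions.query_id), ReciprocalRank = 1.0 / toint(customDimensions.rank)
| summarize arg_max(ReciprocalRank, *) by QueryId        // line 4: keep, per query, the row with the highest 1/rank
| make-series DailyMRR = avg(ReciprocalRank) default=0.0 on timestamp step 1d   // line 5: daily average of the RR
| render timechart
```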

A quick Kusto to SQL cheat sheet (and remember, Kusto executes in order):

  • [Kusto] -> [SQL]
  • where -> where
  • project -> select
  • summarize -> a select with any aggregate function like sum, avg, or min, along with the group by columns
  • make-series -> a feature not available in SQL that turns the data into a time series (arrays of values aggregated over a regular time axis)
  • render -> another feature not available in SQL that plots the data
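As a small illustration of the mapping, here is a hypothetical query that counts yesterday’s events per user, with a rough SQL equivalent in the comments (table and column names are illustrative):

```kusto
customEvents
| where timestamp > ago(1d)                 // SQL: WHERE timestamp > DATEADD(day, -1, GETUTCDATE())
| summarize Events = count() by user_Id     // SQL: SELECT user_Id, COUNT(*) AS Events ... GROUP BY user_Id
```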

Let us break down line 4, as it is the most interesting one for reviewing how we computed the Reciprocal Rank metric.

Given the query “authentic paella recipe” and 10 documents returned by the search engine, we order the documents so that the one most similar to the query is at rank 1 and the least similar is at rank 10. For the sake of the example, let’s assume that the documents at ranks 3, 6, and 8 were found relevant by our users.

Documents in green are marked by the user as relevant.
arg_max (ExprToMaximize, * | ExprToReturn [, ...])

arg_max returns the row in the group that maximizes ExprToMaximize, together with the values of the columns specified in ExprToReturn. In this case, we are looking for the row that maximizes 1/Rank, and `*` returns the entire row.
In line 5 we calculate the daily average of the Reciprocal Rank, and then we plot it.

Funnel metrics

The Reciprocal Rank metric offers us a limited view of how the user interacts with the system, as it only looks at the first relevant document. We can complement it, for instance, by computing how many documents a user needs to inspect on average to find all the documents they might need. In an ideal scenario, the user would open only those documents that are indeed relevant, saving them a lot of time that would otherwise be spent going through irrelevant documents.

Using the logging introduced in this system, we can rephrase this question as a metric: of the interactions that caused an onNavigate event, what proportion also triggered an onSuccess event? Or, more simply, what percentage of the documents that the user opened turned out to be relevant? We can write it using Kusto as follows:
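A minimal sketch of such a query, again assuming the journey events land in the customEvents table (names are illustrative):

```kusto
customEvents
| where timestamp > ago(30d) and name in ("onNavigate", "onSuccess")
| summarize Inspected = countif(name == "onNavigate"), Successful = countif(name == "onSuccess") by bin(timestamp, 1d)
| extend SuccessRate = todouble(Successful) / Inspected
| project timestamp, SuccessRate
| render timechart
```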

You can refer to the Kusto to SQL cheat sheet above to help interpret the query.

Putting it all together — the dashboard

To track these and other metrics, it is convenient to build a dashboard. In our example, we monitor the metrics using the AppInsights dashboard feature. We track the monitoring metrics as well as the IR performance metrics in a single view, as all the information can be easily obtained from the system logs. All code is included in the following azure-samples repository.

AppInsights dashboard monitoring usage and performance of a search index

This dashboard allows us to track system and performance metrics simultaneously. However, it doesn’t necessarily allow us to diagnose why our metrics deteriorate. For example, if the “Daily RR” decreases over time, is it because:

  • The user needs have changed?
  • New documents have been indexed and the new content does not match the old one?
  • Our library of documents is outdated and needs to be updated with new documents?
  • Users’ definition of success has changed?
  • UX changes affect how users interact with the results? For example, a topic widget appears on the results page and displays information, eliminating the need to interact with a specific document.
  • Seasonality?
  • Other?

Having these metrics on a dashboard is a great way to be alerted when performance changes, but work still needs to be done to understand the “why” behind any performance change.

Conclusion

We were motivated to write this article because we encountered a scenario in which pre-existing annotated evaluation data for the search engine was not available. We found that tracking users’ interactions with the system over time provides hints about where additional adjustments are needed. The Reciprocal Rank shows how far the user needs to scroll in order to find the first relevant document, while the ratio between relevant and opened documents sheds light on the overall performance of the search engine and the relevance of the returned documents.

Designing appropriate application logging and analysis enables real-time tracking of the quality of the system from the minute it is running in production. This is not only useful for cases where evaluation resources are not available, but it also allows us to spot variations in the quality of the system over time, and therefore improve the user experience. And now, we’re going to make a paella.
