How To Run Efficient Search Experiments

Jamal Zabihi
Published in Loopio Tech · Feb 1, 2023

When was the last time that you searched for a piece of information? A few hours ago? Or maybe even a few minutes ago? Search has become an integral part of everyday life. Whether you want to make a new recipe, learn more about a scientific fact, or even look for an app on your phone, you rely on your favourite search tool and expect to find accurate and relevant results quickly.

Here at Loopio, we are committed to providing our customers with fast and relevant search and discovery experiences, powered by artificial intelligence (AI). To achieve this goal, we have put effort into accelerating the process of ideation, innovation, and experimentation.

Online experimentation methods are often used to evaluate the usefulness of new product ideas, but they are costly and time-consuming. Could there be an easier way to evaluate new ideas without online experimentation? Could it be done in a matter of hours instead of weeks? Well, the answer is YES. This post will describe how Loopio makes use of an experimentation framework to quickly test ideas in an offline fashion and pick the best ones to run as online experiments.

Before getting into offline experimentation, let’s take a step back and briefly discuss search engines and the metrics used to measure their performance.

Search Engines

One of the popular areas of research in the field of information retrieval is search engines. A search engine is, in essence, a piece of software that finds documents matching a query. Despite that oversimplified definition, search engines can become very complex in order to meet the different needs of the users looking for information. Google is the most popular search engine and the one that most people are familiar with. However, we use many other search engines on a daily basis without even realizing it. Every search box on any website has a search engine behind it: searching on Amazon, looking for your next binge series on Netflix, or looking for a meal on SkipTheDishes.

There are many different search engines, and we can compare them in various ways. Two of the most popular are:

  • Elasticsearch: A popular, extremely fast full-text search engine that matches documents against the text you type in (see the minimal query sketch after this list). If you want to learn more about this search engine, I recommend reading through these tutorials.
  • Solr: A popular open-source search engine which serves as the primary search engine on many websites, including Reddit. Solr offers many powerful search capabilities similar to Elasticsearch’s.
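
To make the Elasticsearch bullet above a bit more concrete, here is a minimal full-text query sketch against Elasticsearch’s standard _search REST endpoint. The host, index name, and field names are placeholders for illustration, not our actual setup.

```python
import requests

# Hypothetical host, index, and field names; adjust to your own cluster.
ES_HOST = "http://localhost:9200"
INDEX = "documents"

query = {
    "query": {
        "match": {                      # full-text match on an analyzed text field
            "content": "offline search evaluation"
        }
    },
    "size": 10,                         # return the top 10 hits
}

response = requests.post(f"{ES_HOST}/{INDEX}/_search", json=query)
for hit in response.json()["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```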

Regardless of which search engine one uses, the end goal is the same: surfacing accurate results. Taking the Google search page as an example, as a user I expect to find an answer to my question within the top-ranked results. I don’t want to go through pages of links and information to finally find an answer. That’s why people rarely move on to the next pages of Google results. A search engine enthusiast once said:

“The best place to hide a dead body is in the second page of Google search”

In order to understand the effectiveness of a search engine, we need to be able to measure its accuracy, and in order to measure its accuracy, we need a metric. So let’s go through a few popular evaluation metrics for search engines.

Search Metrics

Search systems have a particular goal: placing the relevant item(s) towards the top of the list of results. Therefore, we need rank-aware metrics to evaluate the quality of the final ranking of results.

There are quite a few rank-aware metrics used in the world of search, but the most popular ones are the following:

  • Mean Reciprocal Rank (MRR)
  • Mean Average Precision (MAP)
  • Normalized Discounted Cumulative Gain (NDCG)

Mean Reciprocal Rank (MRR)

This is the simplest of the rank-aware metrics: for each query it locates the position of the first relevant result, takes the reciprocal of that rank, and averages these reciprocal ranks across queries. This metric is useful when we have binary labels (relevant vs. irrelevant). MRR does not care about the positions of other relevant items, if there are any.

MRR example
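
As a rough illustration (not the exact implementation we use), computing MRR over a set of queries can be sketched in a few lines of Python, where each query is represented by its ranked list of binary relevance labels:

```python
def mean_reciprocal_rank(ranked_results):
    """ranked_results: one list of 0/1 relevance labels per query,
    ordered by rank (index 0 = top result)."""
    reciprocal_ranks = []
    for labels in ranked_results:
        rr = 0.0
        for position, is_relevant in enumerate(labels, start=1):
            if is_relevant:
                rr = 1.0 / position  # only the first relevant result counts
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# First relevant item at rank 2 for query 1 and rank 1 for query 2,
# so MRR = (1/2 + 1/1) / 2 = 0.75
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 1]]))
```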

Mean Average Precision (MAP)

MAP is an extension of MRR. We are still dealing with binary labels, but the difference is that MAP evaluates the positions of all relevant items, not just the first one. So when a query has more than one correct answer, placing all of the correct answers at the top of the list gives the best MAP score.

MAP example
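
Again as a sketch rather than a reference implementation, MAP averages per-query average precision, where precision is sampled at each rank that holds a relevant item:

```python
def average_precision(labels):
    """Average of precision@k taken at every rank k where a relevant item appears."""
    hits, precisions = 0, []
    for position, is_relevant in enumerate(labels, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(ranked_results):
    return sum(average_precision(labels) for labels in ranked_results) / len(ranked_results)

# Query 1: relevant items at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 ≈ 0.833
# Query 2: relevant item at rank 2         -> AP = 1/2 = 0.5
# MAP ≈ (0.833 + 0.5) / 2 ≈ 0.667
print(mean_average_precision([[1, 0, 1], [0, 1, 0]]))
```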

Normalized Discounted Cumulative Gain (NDCG)

Similar to MAP, the NDCG metric rewards placing relevant items high up the list of results. The main difference between the two is that NDCG can also work with non-binary labels. In certain use cases, we may want to use a range of labels, for example from 0 to 3, where the closer a label is to 3, the more relevant the item is for a given query. If we have data with such fine-grained ratings, we should use the NDCG metric.

  • Gain: The relevance score for each item
  • Cumulative Gain (CG): The sum of gains over the top K items
  • Discounted Cumulative Gain (DCG): Weighs each gain by the item’s position, so items at the top of the list count more
  • Normalized Discounted Cumulative Gain (NDCG): DCG divided by the DCG of the ideal ordering, which normalizes the score to fall between 0 and 1

NDCG example
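
Putting those four definitions together, a minimal NDCG@K sketch (assuming graded labels are already attached to the ranked results) looks like this:

```python
import math

def dcg_at_k(relevance_scores, k):
    """Discounted cumulative gain: higher positions receive larger weights."""
    return sum(rel / math.log2(position + 1)
               for position, rel in enumerate(relevance_scores[:k], start=1))

def ndcg_at_k(relevance_scores, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal_dcg = dcg_at_k(sorted(relevance_scores, reverse=True), k)
    return dcg_at_k(relevance_scores, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Graded labels from 0 (irrelevant) to 3 (highly relevant), in ranked order.
print(ndcg_at_k([3, 2, 0, 1], k=4))  # close to 1.0: relevant items sit near the top
```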

Online Experimentation Limitations

Now that we have a better understanding of search engines and the metrics used to evaluate them, let’s discuss the way we can experiment with such systems and evaluate the applicability of different ideas and features given the metrics above.

Online experiments are well-established as a best practice in the tech industry to measure the value of new product features. A/B tests are the most popular form of online experimentation. However, relying only on online experiments can be problematic. Here are the main limitations of online experiments:

  • Data collection is slow: In order to complete an online A/B test successfully, a certain amount of data needs to be collected. Depending on how the test is set up and the amount of data required for a conclusive experiment, it can take at least a few weeks to gather that much data. Setting up online tests for every single new idea or feature and waiting weeks to wrap up each experiment is a time-consuming process.
  • Implementation cost is high: For every single feature that we want to test in an online fashion, engineers need to implement it properly with production-ready code. This process takes a lot of time and resources and can add further delays to the whole workflow of testing new ideas.
  • Running unvetted features in online tests can be risky: Since the features being tested online have not been evaluated beforehand, there is a risk of negatively impacting the user experience.

Offline Evaluation Framework

Openness to experimentation and data-driven decision-making fuel the growth of our product at Loopio. Teams and individuals are encouraged to push beyond their comfort zone to innovate, experiment, and have a direct impact on bringing the vision of Loopio to life. To accelerate this process, we rely heavily on our in-house offline evaluation framework. In this framework, we take advantage of historical labeled data to simulate the performance of new search experiments in an offline fashion. We then narrow the list down to the most promising solutions, which are worth pursuing in an online experiment later on.
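
To make this concrete, here is a highly simplified sketch of what such an offline loop could look like; the function and variable names are illustrative, not our internal framework. The idea is to replay historical labeled queries through a candidate search configuration and score the returned rankings with a rank-aware metric such as NDCG:

```python
import math

def dcg(gains, k):
    # Position-discounted sum of graded relevance scores.
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))

def evaluate_offline(search_fn, labeled_queries, k=10):
    """search_fn: query string -> ranked list of doc IDs (any candidate configuration).
    labeled_queries: list of (query, {doc_id: graded_relevance}) pairs from historical data."""
    ndcg_scores = []
    for query, relevance in labeled_queries:
        gains = [relevance.get(doc_id, 0) for doc_id in search_fn(query)[:k]]
        idcg = dcg(sorted(relevance.values(), reverse=True), k)
        ndcg_scores.append(dcg(gains, k) / idcg if idcg > 0 else 0.0)
    return sum(ndcg_scores) / len(ndcg_scores)

# Replay the same historical queries through a baseline and a candidate tweak,
# then only take the winner forward to an online A/B test:
# baseline_score  = evaluate_offline(baseline_search, historical_labeled_queries)
# candidate_score = evaluate_offline(tweaked_search, historical_labeled_queries)
```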

There are four main advantages of having an offline evaluation framework:

  • Faster iteration of solutions: As opposed to the lengthy nature of online experiments, we are able to run offline experiments much faster. This allows us to iterate quickly and evaluate a variety of potential solutions for a given problem. The outcome of this process is higher confidence in the success of the winning solution in the next stage of our experimentation: an A/B test.
  • Alignment on the problems, hypotheses, and metrics: Another advantage of this framework is that it allows the product, design, engineering, and data teams to align on the problem, the proposed hypothesis, and the metrics tracked to measure success. Throughout this collaboration we make sure that what is tracked in the offline experiments can be tracked in online experiments as well.
  • Estimation of expected outcome in the product: Following up on the previous point, we make sure to track similar success metrics offline and online. This helps us estimate how much improvement we can expect once a feature is released to customers. Depending on the level of improvement in the offline experiments, we can decide whether investing time and resources in implementing the solution and running an A/B test for it is worthwhile. Without such a framework, we would be completely in the dark on this, which increases the risk of testing new features in an A/B test.
  • Promotion of an experimentation culture: Having this framework encourages team members to test their ideas quickly and cheaply. Furthermore, it helps us translate customers’ feedback and suggestions into offline experiments and evaluate them. Throughout this process, some experiments may not improve our tracked success metrics, but what matters most is promoting this culture of turning ideas into experiments and testing them properly. This not only helps us find the best solutions to our problems, but also ensures we do not dedicate time and resources to implementing a solution that turns out not to be impactful.

Since implementing our offline evaluation framework, we have completed approximately 150 experiments, which averages out to roughly 3–4 experiments per week. These range from simple experiments where we tweaked the settings of our search engine to more complex use cases of applying machine learning capabilities in our search workflow. Without this framework, running this many experiments would have taken a significant amount of time. From these experiments, we were able to recognize which solutions have the potential to improve our customers’ experience. We then ran A/B tests for those winning offline experiments, which resulted in the successful release of multiple impactful features to our customers. The figure below illustrates, at a high level, the process of offline and online experimentation and how they are connected.

Offline and online experimentation workflows

Conclusion

This was the first in a series of search-related blog posts. In this article, I introduced the latest addition to our experimentation toolbox: an offline evaluation framework. Relying on offline experiments and combining them with online experiments has helped Loopio accelerate the proper release of new features to our customers.

Stay tuned for the next search article where I will be focusing on one specific offline search experiment that we completed using this framework: a use case for applying machine learning models in search.

If you are interested in joining Loopio, check out our current open roles across Engineering, Product, and Design teams.
