How to Measure the Relevance of Search Engines

Nikhil Dandekar
4 min read · Jun 9, 2015

By Nikhil Dandekar, Engineering Manager, Quora

How good are we?

That’s one of the biggest questions that search engines struggle to answer early on.

A part of the answer is user-engagement metrics. If you have built a good instrumentation and logging pipeline, you can see how your users interact with your search engine. You can see users performing searches and clicking on results. You can calculate aggregate metrics from this user engagement data, like click-through rate (CTR), conversions, and abandonment rate, which give you some indication of your search quality.
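
For example, here is a minimal sketch (in Python, with a made-up log format) of how CTR and abandonment rate might be computed from such engagement data:

```python
# A minimal sketch, not any engine's actual pipeline: computing aggregate
# engagement metrics from a hypothetical search log. Each log entry is
# assumed to be a dict with a query and the ranks of the results clicked.

search_logs = [
    {"query": "python tutorial", "clicked_results": [2]},
    {"query": "weather today",   "clicked_results": []},      # abandoned search
    {"query": "best laptops",    "clicked_results": [1, 3]},
]

total_searches = len(search_logs)
searches_with_click = sum(1 for s in search_logs if s["clicked_results"])

ctr = searches_with_click / total_searches   # share of searches with at least one click
abandonment_rate = 1 - ctr                   # share of searches with no click at all

print(f"CTR: {ctr:.2%}, Abandonment rate: {abandonment_rate:.2%}")
```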

But all of these metrics only give you a myopic view of your actual quality. You still struggle to answer questions like:

  • How good is your overall relevance?
  • Where do you stand with respect to your competitors?
  • Over time, is the gap between you and your closest competitor narrowing or widening?
  • What are your biggest strengths? Are there certain types of searches where you perform better than your competitors?
  • Similarly, what are your biggest weaknesses?

Enter human relevance measurement systems.

Human Relevance Measurement

To answer these fundamental quality questions, most major search engines invest in a human relevance measurement system which acts as an oracle for correctness, and lets you measure your search quality in an objective manner.

A very simplified version of a human relevance measurement system can be built like this (a rough code sketch of the pipeline follows the list):

  1. Generate a sample of a few thousand search terms that users issue on your search engine. The idea here is to get a representative sample of the searches that you, and your competitors, expect to get.
  2. Issue those searches on your search engine and extract the top few results. Also extract results for the same searches for each of the competitors that you care about.
  3. Train a set of human raters to rate the quality of these results. A simple rating process might involve following a set of guidelines which define what an excellent/average/bad search result is. Usually, it’s much more nuanced than that, and the raters are trained to rate the searches and results based on a number of different factors. Check the Additional Resources below for the 160-page Google rating guidelines. These ratings are blind, which means that the raters don’t know which search engine the results are from. This helps eliminate bias that may arise from factors such as reputation, brand name, etc.
  4. Repeat the “extract results, then rate results” steps at regular intervals to ensure you always have a fresh set of results and ratings for them.
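
Here is a rough sketch of steps 2–3 under assumed details: fetch_top_results is a hypothetical stand-in for querying each engine, and the engine name is kept internally but hidden from raters so the ratings stay blind.

```python
import random

def fetch_top_results(engine, query, k=5):
    # Placeholder: a real pipeline would call each engine and parse its results.
    return [f"{engine}-result-{i}-for-{query}" for i in range(1, k + 1)]

def build_rating_tasks(queries, engines):
    tasks = []
    for query in queries:
        for engine in engines:
            for rank, result in enumerate(fetch_top_results(engine, query), start=1):
                tasks.append({
                    "query": query,
                    "rank": rank,
                    "result": result,
                    "engine": engine,   # kept internally, never shown to raters
                })
    random.shuffle(tasks)               # raters see tasks in random order
    return tasks

tasks = build_rating_tasks(["python tutorial", "weather today"], ["us", "competitor_a"])
# Each rater is shown only the query and the result, and asked for an
# excellent/average/bad label per the rating guidelines.
```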

This kind of human relevance measurement system gives you an oracle of what the best results for your representative set of searches are. This data helps search engines in a bunch of ways, including:

  1. You can aggregate these ratings and calculate metrics, like NDCG, that tell you how good your overall search quality is (see the NDCG sketch after this list). This is similar to how the Dow Jones Industrial Average tells you how the stock market is doing by using just a select set of representative company stocks.
  2. You can calculate the same quality metric for your competitors and see how good they are and how they compare against you.
  3. You can also calculate this quality metric on various segments of your search set (e.g. news searches vs. navigational searches vs. local intent searches) and see how each of these segments performs. This tells you what your strengths and weaknesses are. E.g. you might see that you consistently underperform for “news” searches compared to your competitors. Continuing the stock market analogy, having a targeted set for a search segment is similar to a stock index that tracks a specific sector of companies, such as the NASDAQ-100 Technology Sector Index for tech companies.
  4. The quality metric lets you see how you improve over time as you update your search ranking algorithms. It also lets you see how your competitors improve over time.
  5. This data also serves as really good training data for training and evaluating your ranking models (see the learning-to-rank sketch after this list). In layman’s terms, this means you build your search ranking algorithm so that it returns higher-rated results above lower-rated results for as many searches as possible, and in a way that generalizes to any search that can be issued on your search engine.
  6. Optimizing your ranking algorithms for these ratings also helps make your search engine more robust to manipulation by SEO practitioners. E.g., let’s say that instead of human quality ratings, the search engine uses the number of clicks on a search result as a measure of how good that result is. A “click-based” system like this could be easily exploited for SEO: websites could boost the rank of their crappy results just by paying lots of people to click on them lots of times.
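
To make the metric side concrete, here is a minimal NDCG sketch using made-up graded ratings (2 = excellent, 1 = average, 0 = bad) and assumed segment tags; it computes an overall score per engine plus a per-segment breakdown:

```python
import math
from collections import defaultdict

def dcg(rels):
    # Discounted cumulative gain over a ranked list of graded ratings.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels))

def ndcg(rels):
    # Normalize by the DCG of the ideal (best-possible) ordering.
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# ratings[engine][query] -> graded labels in the order that engine returned results
ratings = {
    "us":           {"python tutorial": [2, 1, 0], "weather today": [1, 0, 0]},
    "competitor_a": {"python tutorial": [2, 2, 1], "weather today": [0, 1, 0]},
}
segments = {"python tutorial": "navigational", "weather today": "news"}

for engine, by_query in ratings.items():
    overall = sum(ndcg(r) for r in by_query.values()) / len(by_query)
    per_segment = defaultdict(list)
    for query, rels in by_query.items():
        per_segment[segments[query]].append(ndcg(rels))
    print(engine, f"overall NDCG: {overall:.3f}",
          {seg: round(sum(v) / len(v), 3) for seg, v in per_segment.items()})
```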

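And here is a toy pairwise learning-to-rank sketch, an illustration with invented features rather than any engine’s actual system: rated results become (better, worse) pairs, and a linear model is fit so that higher-rated results score above lower-rated ones.

```python
import numpy as np
from itertools import combinations

# (query_id, feature_vector, human_rating); features are invented stand-ins
# (e.g. a text-match score and a page-quality score).
data = [
    (1, np.array([0.9, 0.7]), 2), (1, np.array([0.6, 0.4]), 1), (1, np.array([0.2, 0.1]), 0),
    (2, np.array([0.8, 0.3]), 2), (2, np.array([0.3, 0.6]), 0),
]

# Build pairwise feature differences: (better - worse) should score above zero.
pairs = []
for qid in {d[0] for d in data}:
    docs = [d for d in data if d[0] == qid]
    for a, b in combinations(docs, 2):
        if a[2] != b[2]:
            better, worse = (a, b) if a[2] > b[2] else (b, a)
            pairs.append(better[1] - worse[1])
X = np.array(pairs)

# Minimize the pairwise logistic loss with plain gradient descent.
w = np.zeros(X.shape[1])
for _ in range(500):
    margins = X @ w
    grad = -(X * (1 / (1 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= 0.5 * grad

score = lambda features: features @ w   # rank results by descending score
print("learned weights:", w)
```
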
Conclusion

A human relevance measurement system is fundamental to building a great, long-lasting search engine. With crowdsourcing platforms like Amazon Mechanical Turk and Crowdflower, building these systems has become easier and relatively low-cost for startups.

A big shortcoming of these systems is that it’s hard to measure personalized relevance. It’s relatively easy for human raters to use a universal definition of search quality, and rate results based on that. But it is really hard to rate results based on some individual’s perception of search quality. Because of this, it’s proven harder to extend human relevance measurement systems to recommendation engines, where personalization is much more important. But maybe that’s the next challenge waiting to be tackled.


Nikhil Dandekar

Engineering Manager doing Machine Learning @ Google. Previously worked on ML and search at Quora, Foursquare and Bing.