Evaluating Search: Measure It
At Twiggle, we’re all about improving the search experience. But how do we define improvement? As Lord Kelvin, one of history’s greatest scientists and engineers, said: “If you cannot measure it, you cannot improve it.”
This post, which kicks off a series on search evaluation, addresses the question of defining search effectiveness. It introduces what we measure and how we measure it. Once we have a firm grounding in ways to measure search effectiveness, we can explore how to improve it.
Precision and Recall
You know how you’re supposed to tell the truth, the whole truth, and nothing but the truth? It’s the same for search engines. They’re supposed to return all the results that are relevant and filter out any that aren’t.
Two quantities measure how well a search engine keeps this promise. Precision is the fraction of returned results that are relevant. Recall is the fraction of all relevant results that are returned. In other words, a recall of 1 means the results include the whole truth, while a precision of 1 means the results include nothing but the truth.
In practice, both usually fall short of 1. Tuning a search engine is a trade-off: changes that increase precision tend to decrease recall, and vice versa.
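To make the two definitions concrete, here is a minimal sketch for a single query. The function name, the use of result IDs, and the convention of returning 0 when a set is empty are assumptions made for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: iterable of result IDs the engine returned
    relevant: iterable of result IDs judged relevant (ground truth)
    """
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 of them relevant,
# out of 6 relevant items in the whole collection.
p, r = precision_recall(["a", "b", "c", "x"], ["a", "b", "c", "d", "e", "f"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

Returning more results can only keep recall the same or raise it, but each extra irrelevant result lowers precision, which is the trade-off in miniature.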
All search results may be equal, but some are more equal than others. Searchers pay the most attention to the first result, then the second, etc. For web search, there’s research showing that 90% of searchers don’t look past the first page, and half of searchers don’t even look beyond the first three results.
Hence, precision matters most for the top-ranked results. We can capture this idea by refining the precision metric. For example, we can compute the precision for the top 10 results, which we call precision @ 10. If 5 of the top 10 results are relevant, then precision @ 10 is 50%.
In general, computing precision @ k for an appropriate value of k more closely models precision as the user will experience it. We can go further by averaging precision @ 1, precision @ 2, … up to precision @ k. This average precision at k gives the most weight to the relevance of top-ranked results. For example, if the first 5 of 10 results are relevant, then average precision at 10 is (1 + 1 + 1 + 1 + 1 + 5/6 + 5/7 + 5/8 + 5/9 + 5/10) / 10 = 0.82.
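The two position-aware metrics above can be sketched in a few lines. This follows the definitions given here (average precision at k as the mean of precision @ 1 through precision @ k); the function names and the 0/1 list representation are assumptions. If the ranking has fewer than k results, the missing positions count as not relevant.

```python
def precision_at_k(relevance, k):
    """Fraction of the top k results that are relevant.

    relevance: list of 0/1 judgments in ranked order, top result first.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def average_precision_at_k(relevance, k):
    """Mean of precision @ 1, precision @ 2, ..., precision @ k."""
    return sum(precision_at_k(relevance, i) for i in range(1, k + 1)) / k


# The example from the text: the first 5 of 10 results are relevant.
ranking = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
print(precision_at_k(ranking, 10))                    # 0.5
print(round(average_precision_at_k(ranking, 10), 2))  # 0.82
```

Some references define average precision differently, averaging only at the ranks where a relevant result appears, so it is worth checking which variant a given tool reports.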
A similar measure is discounted cumulative gain (DCG), which discounts the relevance of each result based on the logarithm of its position. DCG is useful when relevance isn’t binary, since it can represent gradations of relevance between completely relevant and completely irrelevant.
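As a rough sketch, one common formulation of DCG divides each result's graded relevance by the base-2 logarithm of its position plus one. The exact discount and the 0 to 3 grading scale below are assumptions, not something this post prescribes.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the top k results.

    gains: graded relevance scores in ranked order, top result first
           (assumed scale: 0 = irrelevant ... 3 = completely relevant).
    """
    return sum(
        gain / math.log2(position + 1)
        for position, gain in enumerate(gains[:k], start=1)
    )


# Hypothetical graded judgments: the same three results in two orders.
good_order = [3, 2, 0]
bad_order = [0, 2, 3]
print(dcg_at_k(good_order, 3) > dcg_at_k(bad_order, 3))  # True
```

Placing the most relevant result first scores higher than burying it in third position, which is the behavior the metric is meant to reward.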
The above metrics are useful, but they require human supervision: people have to supply ground truth and compare the search engine’s results against it. We’ll discuss human evaluation in future posts, but suffice it to say for now that it can be expensive and tricky to collect robust human relevance judgments.
An alternative is to measure searcher behavior, such as clicks. We can treat clicks as implicit relevance judgments — that is, we can treat the searcher’s decision to click on a result as an implicit judgment of the result’s relevance to the search. We can then measure the clickthrough rate (CTR) as the fraction of searches that receive clicks and look at the mean reciprocal rank (MRR) of clicks to give more weight to clicks at earlier positions. We can also look at conversions: they will be sparser than clicks, but they can represent an even stronger relevance signal.
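Here is a minimal sketch of both click metrics computed from a click log. The log format (one list of clicked positions per search, 1-based) is hypothetical, and two conventions are assumed: MRR uses the highest-ranked click in each search, and a search with no clicks contributes 0.

```python
def clickthrough_rate(click_log):
    """Fraction of searches that received at least one click."""
    if not click_log:
        return 0.0
    return sum(1 for positions in click_log if positions) / len(click_log)


def mean_reciprocal_rank(click_log):
    """Mean of 1 / (position of the highest-ranked click) across searches.

    Assumption: a search with no clicks contributes 0 to the mean.
    """
    if not click_log:
        return 0.0
    total = sum((1 / min(positions)) if positions else 0.0 for positions in click_log)
    return total / len(click_log)


# Hypothetical log: four searches, clicked positions are 1-based.
click_log = [[1], [3, 5], [], [2]]
print(clickthrough_rate(click_log))               # 0.75
print(round(mean_reciprocal_rank(click_log), 2))  # 0.46
```

Excluding clickless searches from the MRR average is an equally reasonable convention; whichever is chosen should be applied consistently when comparing rankings.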
It’s a lot easier and cheaper to measure searcher behavior than to collect explicit human relevance judgments, especially at scale. But measures like clicks are only a proxy for relevance, and treating them as ground truth can introduce bias into evaluation.
In future posts, we’ll discuss the tradeoffs between collecting explicit human judgments and treating clicks and other search behavior as implicit judgments.
The Big Picture
Measures like precision and recall are useful signals, but they don’t capture the holistic search experience. It’s important to measure the overall search experience, as well as the effectiveness of the individual elements that comprise it.
Measuring the overall search experience is hard. If searchers have tasks (e.g., looking to buy a new dress for a holiday party), we’d like to measure whether and how efficiently searchers are able to complete those tasks using the search engine. While it’s possible to measure task completion in a laboratory environment, it’s often better, and easier, to measure behavior, such as what fraction of sessions result in a purchase.
It’s also valuable to measure the individual elements that comprise the search experience. For example, we can measure the recall and precision of spelling correction based on how often the search engine detects misspelled queries and how often it succeeds in correcting them. These fine-grained measurements can drive incremental product improvements.
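One way to read the spelling-correction example is sketched below. It assumes a hand-labeled sample of queries and takes recall to be the share of misspelled queries the engine tried to correct, and precision to be the share of attempted corrections that were right. The `JudgedQuery` record and its fields are hypothetical, and other definitions are possible.

```python
from dataclasses import dataclass


@dataclass
class JudgedQuery:
    misspelled: bool        # ground truth: the query contains a spelling error
    corrected: bool         # the engine attempted a correction
    correction_right: bool  # the correction matched the intended query


def spelling_precision_recall(queries):
    """Return (precision, recall) of spelling correction on a labeled sample."""
    attempted = [q for q in queries if q.corrected]
    misspelled = [q for q in queries if q.misspelled]
    right = sum(1 for q in attempted if q.correction_right)
    detected = sum(1 for q in misspelled if q.corrected)
    precision = right / len(attempted) if attempted else 0.0
    recall = detected / len(misspelled) if misspelled else 0.0
    return precision, recall


sample = [
    JudgedQuery(True, True, True),     # misspelled, fixed correctly
    JudgedQuery(True, True, False),    # misspelled, fixed wrongly
    JudgedQuery(True, False, False),   # misspelled, missed
    JudgedQuery(False, True, False),   # fine, but "corrected" anyway
    JudgedQuery(False, False, False),  # fine, left alone
]
precision, recall = spelling_precision_recall(sample)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.33 recall=0.67
```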
Measure and Improve
If you cannot measure it, you cannot improve it. By using a combination of analytics and human judgments to measure search effectiveness, we can identify the best opportunities to improve it. In the next posts, we’ll dive into the details of how to do so.