Evaluating Search: Using Human Judgement

Originally posted on the Twiggle blog.

In the previous post, we looked at measuring searcher behavior in order to evaluate search engine performance. Measuring searcher behavior is valuable, but a robust evaluation process also involves collecting explicit human judgments. In this post, we’ll look at the use of explicit human judgments for evaluation.

The premise of using human judgments for evaluation is simple. Human evaluators perform a set of tasks in which they’re given a search query and a search result, and asked whether the result is relevant to the query. The judgment can be binary (i.e., relevant vs. not relevant), or it can allow for varying gradations of relevance (e.g., a scale from 1 to 4). Using binary vs. graded relevance is a trade-off which we’ll discuss in a moment.

By collecting these judgments on search results for a representative sample of search queries, we can establish robust estimates of precision @ k or discounted cumulative gain (see previous post for explanations of these).
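
As a rough illustration, here is how the judgments for a single query might be turned into those two metrics. This is a minimal sketch, not a definitive implementation: the function names and the sample judgment lists are made up for illustration, precision is divided by k even when fewer than k results were judged, and the DCG uses the plain "grade divided by log2(rank + 1)" form, which may differ from the variant you prefer.

```python
import math

def precision_at_k(binary_judgments, k):
    """Fraction of the top k results judged relevant (labels are 1 or 0)."""
    if k <= 0:
        return 0.0
    return sum(binary_judgments[:k]) / k

def dcg_at_k(graded_judgments, k):
    """Discounted cumulative gain over the top k graded judgments.

    Assumes the linear-gain form: grade / log2(rank + 1), ranks starting at 1.
    """
    return sum(
        grade / math.log2(rank + 1)
        for rank, grade in enumerate(graded_judgments[:k], start=1)
    )

# Hypothetical judgments for one query, listed in ranked order.
binary = [1, 1, 0, 1, 0]   # relevant vs. not relevant
graded = [3, 2, 0, 1, 0]   # graded relevance, higher is better

p_at_3 = precision_at_k(binary, 3)   # two of the top three are relevant
dcg_at_3 = dcg_at_k(graded, 3)
```

Averaging these per-query values over the sampled queries gives the overall estimate.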

Benefits of Human Judgments

A key benefit of human judgments is that they are explicit relevance signals. When we measure clicks and conversions, we’re using these as proxies for relevance judgments. But sometimes those proxies mislead us. Searchers may click on irrelevant results out of curiosity, or they may decide not to click because they learn all they need to know from the search results page. Conversions have fewer false positives as a relevance signal, but they have many false negatives: a lack of conversion doesn’t mean the result was irrelevant.

Explicit judgments eliminate the noise introduced by using behaviors as proxies. Explicit judgments also make it possible to test a new search engine before releasing it, avoiding embarrassment or worse.

Another benefit of human judgments is that we can use them to evaluate individual parts of the search engine in isolation. For example, we can use human judgments to judge how well the search engine is understanding the query, separately from how well it ranks results. Or we can use human judgments to evaluate spelling correction. It’s much harder to perform this kind of fine-grained evaluation using behavior, since behavior tends to reflect the end-to-end relevance rather than any one part of the search process.

Disadvantages of Human Judgments

Given these advantages, it’s tempting to rely entirely on human judgments. But human judgments have their disadvantages.

While collecting data about search behavior is essentially free, human judgments cost money, whether the evaluators are in-house employees or crowdsourced workers on platforms like Crowdflower and Mechanical Turk. And there are costs associated with task design and quality assurance. Specifically, quality assurance requires assigning each task to multiple evaluators, which significantly increases costs.
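
One common way to use multiple evaluators per task is to take a majority vote over their labels. The sketch below assumes that approach; the function name and the label strings are hypothetical, and returning None on a tie (so the task can be sent to another evaluator) is just one possible policy.

```python
from collections import Counter

def majority_label(labels):
    """Return the most common label, or None if there is no clear winner."""
    counts = Counter(labels).most_common()
    if not counts:
        return None  # no judgments collected for this task
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # tie between the top two labels
    return counts[0][0]

# Hypothetical labels from three evaluators for one (query, result) task.
label = majority_label(["relevant", "relevant", "not relevant"])
```

Note that using three evaluators per task triples the number of judgments you pay for, which is why redundancy drives up cost.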

The other problem with human judgments is that an evaluator may not be able to figure out what the searcher was looking for. For example, the evaluator’s understanding of “fancy shirt” might not match the searcher’s. Unless the evaluator knows more about the searcher, human judgment relies on the evaluator being able to make objective relevance judgments just by looking at the query.

Best Practices

Human judgments are useful, but using them effectively can be tricky. Here are some best practices:

  • Try to use binary relevance. In general, binary relevance minimizes the cognitive load for evaluators and leads to a higher-quality signal. More granularity means more effort per task for evaluators and more opportunity for variance, which leads to more need for quality assurance and higher costs. If you really need graded relevance to capture differences, use as few grades as possible, and make sure your evaluators agree on them.
  • Keep the tasks objective. Don’t ask your evaluators to arbitrate matters of personal taste. There’s no such thing as pure objectivity, but a high variance among evaluators is usually a sign that you’re asking them a subjective question. One of the advantages of binary relevance is that it helps discourage subjectivity: it encourages evaluators to see the task as black or white.
  • Use human judgments to isolate components. As discussed earlier, human judgments are a great way to evaluate specific search engine components, like query understanding and spelling correction. Having regular access to such judgments can help prioritize investments in improving search quality.
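
To spot tasks where evaluators disagree, one simple option is to measure how often pairs of evaluators gave the same label. This is a minimal sketch under assumptions: the function names, the task keys, and the 0.5 threshold are all illustrative, and more principled agreement statistics exist.

```python
from itertools import combinations

def pairwise_agreement(labels):
    """Fraction of evaluator pairs that gave the same label for a task."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        return None  # fewer than two evaluators, so agreement is undefined
    return sum(a == b for a, b in pairs) / len(pairs)

def flag_low_agreement(task_labels, threshold=0.5):
    """Return the tasks whose pairwise agreement falls below the threshold."""
    flagged = []
    for task, labels in task_labels.items():
        score = pairwise_agreement(labels)
        if score is not None and score < threshold:
            flagged.append(task)
    return flagged

# Hypothetical binary labels from four evaluators per task.
task_labels = {
    "fancy shirt / result 17": [1, 0, 1, 0],
    "iphone case / result 3": [1, 1, 1, 1],
}
flagged = flag_low_agreement(task_labels)
```

Tasks that get flagged are candidates for rewording, or a sign that the question being asked is subjective.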

It’s easier and cheaper to measure searcher behavior than to collect explicit human relevance judgments. But explicit human judgments serve as more robust signals of relevance, and they’re especially useful for evaluating individual components of the search process in isolation.

So use both! A robust evaluation process combines measuring searcher behavior with collecting explicit human judgments.

Previous post: Evaluating Search: Measuring Searcher Behavior