Human labels vs clicks for training a machine learned ranking model

Nikhil Dandekar
3 min read · Jun 16, 2016


Say you want to train a machine learned model to perform a ranking task. A common starter question is: Should I use human relevance judgments or clicks as the target variable for training the model?

The answer depends on a number of factors, such as what you are trying to optimize for, how mature your system is, and how many resources you have.

Let’s run through the advantages / disadvantages of each approach.

Using clicks (or other online metrics) for training

  1. If you have an existing site, you are probably already logging clicks and other user actions. This means you can start using clicks for training very quickly: it’s a fast, low-cost solution.
  2. However, clicks are often the wrong metric to optimize for. They are heavily influenced by things like your UI and the “attractiveness” of the result. Clicks are essentially a “shallow” metric: they don’t tell you whether the user actually liked the result.
  3. If you can derive a metric from user actions that correlates well with a good result, you should use that instead of clicks. E.g. for an e-commerce site, “conversions” (whether the user added an item to the cart or bought an item) is a much better online metric to optimize for than clicks. A sketch of turning such a log into training labels follows this list.
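
To make points 1 and 3 above concrete, here is a minimal sketch (in Python with pandas, not from the original post) of turning a hypothetical action log into per-(query, result) training labels, preferring conversions over raw clicks where they are available. The log format and column names (query_id, result_id, clicked, converted) are illustrative assumptions, not a real system’s schema.

```python
# Minimal sketch: deriving training labels from a hypothetical log of user
# actions. Column names and the 0/1 encoding are illustrative assumptions.
import pandas as pd

log = pd.DataFrame({
    "query_id":  ["q1", "q1", "q1", "q2", "q2"],
    "result_id": ["r1", "r2", "r1", "r3", "r4"],
    "clicked":   [1, 0, 1, 1, 0],
    "converted": [0, 0, 1, 1, 0],  # e.g. added to cart or purchased
})

# Aggregate individual impressions into one row per (query, result) pair.
labels = (
    log.groupby(["query_id", "result_id"])
       .agg(click_rate=("clicked", "mean"),
            conversion_rate=("converted", "mean"))
       .reset_index()
)

# Prefer the "deeper" signal as the target: conversion rate rather than
# click rate, per point 3 above.
labels["target"] = labels["conversion_rate"]
print(labels)
```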

Using explicit relevance judgments for training

  1. Getting good explicit relevance judgments is usually not easy. You need to come up with clear guidelines on what makes a good result and train your judges on those guidelines. Even then, you can’t assume that your judges will follow the guidelines correctly, so you need to build your systems to be robust to that. E.g. you might have to continuously audit and remove bad judges, or use multiple judges per rating (a sketch of aggregating multiple judges’ ratings follows this list). All of this takes time and resources to implement, which might be hard if you are short on either. On top of that, if your results are personalized or contextualized in any way, getting explicit relevance judgments for them is much, much harder.
  2. However, if you can solve all the challenges from point 1 above, explicit relevance judgments give you a higher-quality label to optimize for than clicks. So if you really care about relevance, and you can’t measure relevance properly using online metrics, investing in an explicit relevance judgment system is usually worth it.
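
As a rough illustration of point 1, here is a minimal sketch (plain Python, not from the original post) of aggregating multiple judges’ ratings into a consensus label and flagging judges whose ratings deviate from the consensus too often. The 0–4 relevance scale, the identifiers, and the deviation threshold are all assumptions made up for the example.

```python
# Minimal sketch: aggregating multiple judges' ratings per (query, result)
# pair and flagging judges who disagree with the consensus. The 0-4 scale,
# the IDs and the threshold below are assumptions for illustration.
from collections import defaultdict
from statistics import median

# (query_id, result_id, judge_id, rating on an assumed 0-4 relevance scale)
judgments = [
    ("q1", "r1", "judgeA", 4), ("q1", "r1", "judgeB", 3), ("q1", "r1", "judgeC", 0),
    ("q1", "r2", "judgeA", 1), ("q1", "r2", "judgeB", 1), ("q1", "r2", "judgeC", 1),
]

by_pair = defaultdict(list)
for qid, rid, judge, rating in judgments:
    by_pair[(qid, rid)].append((judge, rating))

labels = {}                     # consensus label per (query, result) pair
deviations = defaultdict(list)  # per-judge distance from the consensus

for pair, votes in by_pair.items():
    consensus = median(rating for _, rating in votes)
    labels[pair] = consensus
    for judge, rating in votes:
        deviations[judge].append(abs(rating - consensus))

# Judges whose average deviation exceeds a (made-up) threshold get audited.
suspect_judges = {judge for judge, devs in deviations.items()
                  if sum(devs) / len(devs) > 1.0}

print(labels)          # {('q1', 'r1'): 3, ('q1', 'r2'): 1}
print(suspect_judges)  # {'judgeC'}
```

Using the median as the consensus is one simple way to keep a single careless judge from dragging the label, and it pairs naturally with the continuous auditing mentioned above.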

Given this, a good rule of thumb for choosing the right target variable for your ranking model is:

  1. If you can come up with a good, unambiguous online metric to optimize for (e.g. “conversions” for an e-commerce site), use that for training.
  2. If you can’t do 1, but can overcome the challenges of building an explicit relevance judgment system described above, use explicit relevance judgments.
  3. If you can’t do either 1 or 2, use clicks.

Originally published on Quora.


Nikhil Dandekar

Engineering Manager doing Machine Learning @ Google. Previously worked on ML and search at Quora, Foursquare and Bing.