How to evaluate Search performance

First, let me start with some gyan (background knowledge) on search.

Information retrieval systems are based on various models, such as the Boolean, vector space, probabilistic, and graphical models. Evaluation is required in order to test and improve a search engine's ‘relevance’, that is, how well the content of the records shown in the result set matches what the user was looking for.

Relevance: a measure of how effectively the user’s information need was met. Assessing relevance requires human judgement, and it answers the following questions:

– How useful were the results?
– How many of the retrieved results were useful?
– Were there any useful results that were missed for a particular query?
– Did the order of the results make the user’s search easier or harder?

Several evaluation strategies are used to test the relevance of a search engine.

Below, I discuss set-based evaluation, its problems, and Mean Average Precision, which is widely used by TREC.

Set based evaluation

This is the most fundamental evaluation strategy. The metrics used for evaluation are:

Precision: this tells us what fraction of the retrieved results were useful.

Precision = #(relevant items retrieved) / #(retrieved items)
 = TP / (TP + FP)
 = P(relevant | retrieved)

True Positive (TP): A retrieved document is relevant
False Positive (FP): A retrieved document is not relevant

Recall: this tells us what fraction of the useful pages were retrieved, that is, whether any useful pages were left out.

Recall = #(relevant items retrieved) / #(relevant items)
 = TP / (TP + FN)
 = P(retrieved | relevant)


– True Positive (TP): A retrieved document is relevant
– False Negatives (FN): A relevant document is not retrieved

It is important to note that precision decreases when false positives increase (these are called Type I errors in statistical hypothesis testing). Recall decreases when false negatives increase (these are called Type II errors in statistical hypothesis testing).
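The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of any toolkit: the function name precision_recall and the document IDs are made up for the example, and relevance is assumed to be known in advance from human judgement.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: IDs of the documents the search engine returned
    relevant:  IDs of the documents judged relevant by a human
    """
    retrieved = set(retrieved)
    relevant = set(relevant)

    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved

    # Guard against empty sets to avoid division by zero.
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 results returned, 3 documents are truly relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were retrieved (recall of about 0.67).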

You can read the complete article here.

Precision and recall can also be understood from the image below.

Precision & Recall explained

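Precision and recall usually trade off against each other: for a given ranking, showing more results tends to raise recall and lower precision, so it is hard to improve both at the same time.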
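One way to see how precision and recall pull against each other is to cut the same ranked list at different depths. The sketch below is only an illustration: the function name precision_recall_at_k, the ranked list, and the relevance judgements are all assumptions made up for the example.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall when only the top k results are shown."""
    top_k = ranked[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked output and human relevance judgements.
ranked = ["d1", "d7", "d3", "d9", "d5", "d8"]
relevant = {"d1", "d3", "d5"}

curve = {k: precision_recall_at_k(ranked, relevant, k) for k in (1, 3, 6)}
for k, (p, r) in curve.items():
    print(f"k={k}: precision={p:.2f} recall={r:.2f}")
```

With this made-up data, showing only the first result gives perfect precision but recall of one third, while showing all six results reaches full recall at a precision of 0.5.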

Implementing search evaluation tool kit

If you still have the patience, read on to see how to use Google Analytics event tracking to measure precision.

Implement GA event tracking for each position clicked in the search results (on both the auto-suggest list and the search result page).

The code will look something like this:

Standard signature (classic ga.js): _trackEvent(category, action, opt_label, opt_value, opt_noninteraction)

Once this is implemented, you can easily track the number of people clicking at each position.

If auto-suggest precision improves, the share of clicks at the top positions {1, 2, 3} should increase.

To measure improvement in the results shown on the search result page (SRP), another metric that can be tracked is the number of PDP (product detail page) views originating from search. The PDP can be an ad page for a classifieds site, a song page for an online music site, or a product page for an e-commerce site.

To summarize

  1. Track % of clicks at top positions {1,2,3}
  2. Track # of PDP originating from search result page (SRP)
  3. Track position of click events from search result page
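As a rough sketch of the first and third points, the click positions exported from event tracking can be turned into a per-position distribution and a top-3 share. The export format is an assumption here: the example simply takes a list of 1-based click positions, and the function name click_position_report and the sample clicks are hypothetical.

```python
from collections import Counter


def click_position_report(click_positions, top_n=3):
    """Return (share of clicks per position, share of clicks in the top_n positions).

    click_positions: 1-based result positions, one entry per click event.
    """
    counts = Counter(click_positions)
    total = sum(counts.values())
    if total == 0:
        return {}, 0.0

    share = {pos: counts[pos] / total for pos in sorted(counts)}
    top_clicks = sum(c for pos, c in counts.items() if pos <= top_n)
    return share, top_clicks / total


# Hypothetical export: one entry per tracked click on a search result.
clicks = [1, 1, 2, 3, 1, 5, 2, 8, 1, 4]
share, top_share = click_position_report(clicks)

for pos, s in share.items():
    print(f"position {pos}: {s:.0%}")
print(f"top-3 share: {top_share:.0%}")
```

In this made-up sample, 7 of the 10 clicks land on positions 1 to 3, so the top-3 share is 70%.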

Now, to measure precision, one may adopt either of two approaches:

a) Before and after: roll the change out to 100% of users, and measure CTR and clicks at each position before and after each version or change is implemented.

b) A/B test: serve the new search to a test group while the control group stays on the older search. The difference in click distribution between the test and control groups will show the improvement.

This methodology is an easy, fast, and effective way to measure precision.
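When comparing the test and control groups, it helps to check that a difference in top-3 click share is larger than what chance alone would produce. The article does not prescribe a statistical test; the sketch below uses a standard two-proportion z-test as one possible choice, and the function name and all click counts are made up for illustration.

```python
import math


def two_proportion_z_test(hits_a, total_a, hits_b, total_b):
    """Compare two click shares; returns (lift, z, two-sided p-value).

    hits_*:  clicks on the top positions in each group
    total_*: all search-result clicks in each group
    """
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    pooled = (hits_a + hits_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        return p_b - p_a, 0.0, 1.0

    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Hypothetical counts: control (old search) vs. test (new search).
lift, z, p_value = two_proportion_z_test(4200, 10000, 4500, 10000)
print(f"lift={lift:+.3f} z={z:.2f} p={p_value:.5f}")
```

A small p-value suggests the change in top-3 click share is unlikely to be noise. The same comparison can be applied to the before and after approach, with the caveat that seasonality and traffic mix may also differ between the two periods.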

Measuring recall through client-side events is very difficult. For this, it is best to do an internal evaluation with help from team members who have a deep understanding of the underlying data (songs, products, etc.) and prior knowledge of the expected outcome.

For example, “mobile lmuia 720” should show results for “Nokia Lumia 720”, or

“chala jata hoon” should be able to display the song “chala jata hun”.

The evaluation document should cover:

  1. Search term
  2. Expected search result
  3. Actual search result
  4. Was search able to provide correct result from the search term (Y/N)
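The four columns above can be filled in and summarized with a small script. This is a sketch under assumptions: the row layout, the function names, and the "actual" results are hypothetical (the two search terms come from the examples above, but the results listed for them are invented), and a match is counted only when the expected item appears among the actual results after lowercasing and whitespace cleanup.

```python
def normalize(text):
    """Lowercase and collapse whitespace so comparisons are not too strict."""
    return " ".join(text.lower().split())


def evaluate_judgements(rows):
    """rows: (search term, expected result, list of actual results).

    Returns (per-term rows with a Y/N flag, share of terms answered correctly).
    """
    results = []
    for term, expected, actual in rows:
        found = normalize(expected) in [normalize(a) for a in actual]
        results.append((term, expected, "Y" if found else "N"))

    correct = sum(1 for row in results if row[2] == "Y")
    hit_rate = correct / len(results) if results else 0.0
    return results, hit_rate


# Hypothetical evaluation sheet; the actual results are invented.
rows = [
    ("mobile lmuia 720", "Nokia Lumia 720", ["Nokia Lumia 720", "Nokia Lumia 520"]),
    ("chala jata hoon", "chala jata hun", ["chalte chalte"]),
]
results, hit_rate = evaluate_judgements(rows)

for term, expected, flag in results:
    print(f"{term!r} -> expected {expected!r}: {flag}")
print(f"share of terms answered correctly: {hit_rate:.0%}")
```

The share of terms marked Y gives a simple proxy that can be tracked across search versions, using the same list of terms each time.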

I could not find an easier or faster approach to measuring search recall; I am open to learning if there is one.

Hope the article is useful. You may reach out to me for any clarification.

Image credit — wikipedia and