If You’re Not Keeping Score, You’re Only Practising

Wilson Wong
Practical AI Coalition
3 min read · Jun 2, 2020


Search improvement is an empirical field of work. This means we make headway by analysing observations and experiences, and then acting on the findings. To illustrate, imagine you have just prepared green curry for the first time and you plan to sell it to make a living. You need to know whether you have the ingredients and the cooking steps right. What do you do? You can invite some friends over to try it out. During and after the meal, you can do a number of things: observe their body language, monitor the leftovers in their individual portions and casually quiz them on their experience. What you are doing is essentially collecting the data you need to find out whether the green curry was good and which aspects of the recipe or the cooking process need improvement. That, in essence, is empirical work.

Photo by Isaac Smith on Unsplash

Why is measurement and evaluation important?

In the context of search, it should not be hard to imagine why most, if not all, research and development activities have to follow the search quality improvement lifecycle (aka continuous measurable improvement) above. The cycle essentially shows how measurement and evaluation relate to relevance improvement. Relevance improvement without proper evaluation is like shooting in the dark, and evaluation without relevance improvement is aiming without ever firing a shot. Measurement and evaluation is the stage where the empirical work happens, where we gather data and advance our understanding of our search products. It gives us the ability to aim, fire and find out whether we hit the target.

How to evaluate?

There are several ways of evaluating whether our initiatives are hitting the target. As with the earlier green curry tasting example, we can have the search results rated manually, rely on data about the usage of our search products, and survey customers for their feedback. It is important to note that none of these approaches is adequate in isolation. Often, a more holistic view of evaluation that covers a range of approaches is required. For manual relevance judgements, tools are required to allow members of a team, or the crowd, to create test collections. These collections are used to compute judgement-based metrics such as precision, recall and NDCG, which quantify potential improvements to our search results before the revised algorithm is released to production. As for the online approach, tools for tracking, storing and managing query and click logs are important. They enable us to compute absolute metrics from click data to determine whether our search results are helping users find what they need.
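To make this concrete, here is a minimal sketch in Python of how the judgement-based metrics named above (precision, recall, NDCG) and a simple click-through rate from query logs could be computed. The document ids, graded judgements, helper names and log format are invented for illustration; a real setup would read these from your test collections and click logs.

```python
import math

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)

def dcg(gains, k):
    """Discounted cumulative gain: relevance higher up the ranking counts more."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranking, judgements, k):
    """DCG of the actual ranking, normalised by the DCG of the ideal ranking."""
    gains = [judgements.get(doc, 0) for doc in ranking]
    ideal = sorted(judgements.values(), reverse=True)
    ideal_dcg = dcg(ideal, k)
    return dcg(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def click_through_rate(sessions):
    """Share of search sessions with at least one click, from query/click logs."""
    if not sessions:
        return 0.0
    return sum(1 for s in sessions if s["clicks"]) / len(sessions)

# Hypothetical judgements for one test query: 0 = irrelevant, 3 = perfect.
judgements = {"doc1": 3, "doc3": 2, "doc4": 1, "doc2": 0}
relevant = {doc for doc, grade in judgements.items() if grade > 0}

# What the revised algorithm returned for that query.
ranking = ["doc2", "doc1", "doc3", "doc5"]

print(f"P@3    = {precision_at_k(ranking, relevant, 3):.3f}")   # 0.667
print(f"R@3    = {recall_at_k(ranking, relevant, 3):.3f}")      # 0.667
print(f"NDCG@3 = {ndcg_at_k(ranking, judgements, 3):.3f}")      # ~0.607

# Hypothetical click log: one entry per search session.
sessions = [{"query": "green curry", "clicks": ["doc1"]},
            {"query": "red curry", "clicks": []}]
print(f"CTR    = {click_through_rate(sessions):.2f}")            # 0.50
```

In practice you would average such metrics over every query in the test collection, and compare the averages before and after an algorithm change rather than looking at a single query.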

Why do we need an evaluation framework in a business where search is core?

There are many business benefits that come with a proper evaluation framework. For one, evaluation lets us find out what is not working properly and which aspects of our search products our customers most need fixed. These insights enable us to target our efforts where they matter most, and to better prioritise initiatives and allocate resources. Secondly, proper evaluation promotes transparency and confidence in the initiatives that we undertake. We use success metrics as yardsticks to measure the impact of our relevance improvement initiatives and to gauge the quality of our search results along the way. The metrics also allow us to communicate progress, learnings and successes.

Conclusion

You can keep cooking green curry all you want. However, unless you are measuring along the way, tracking customer feedback and comparing against data from your previous attempts, you are just practising. In this article, we discussed some common approaches for measuring and evaluating the impact of search algorithm changes on relevance and effectiveness. Having a proper process for improving relevance through insights gained from measurement and evaluation may seem excessive to some, but it is a necessary enabler for the continuous improvement of search products.

