Improving search relevance — but without hosting new models

Ivan B
Data Science at Microsoft
6 min read · Jul 20, 2021

If your business relies on customers finding the right products on your website, you may want to optimize your search engine. For example, suppose you are running a startup that has hundreds of products listed on your website. When a potential customer visits your site, you’d like them to find what they’re looking for quickly and seamlessly so that they are more likely to make a purchase. This challenge is addressed by search relevance, i.e., the quality of the results returned in response to a searched phrase (entered via voice or text).

To illustrate, suppose that you own a café that primarily sells coffee beans but also sells coffee-related merchandise, including books. Now suppose that one of the most common search phrases on your café’s online shop is “best espresso.” Because your primary product consists of coffee beans — and because few customers buy your books on coffee — ideal search results are likely to be your top-selling varieties of beans, not books. So how do you help customers get to where you think they want to go — and where you can be of the most service to them — regardless of what you’re selling? The answer is search relevance.

This article demonstrates a straightforward way to optimize search relevance using the Azure Cognitive Search (ACS) service. If search relevance and ACS are already familiar, you may wish to go “directly to the code” in this online GitHub tutorial.

The Azure Cognitive Search service — how it works

For a given search term, ACS ranks each article by its relevance. In this context, an article is any online document managed by your business that external users can access, such as a product detail page or a self-help page. Relevance measures how well an article is associated with a search term. Relevance scores may rely on an article’s popularity (such as the number of clicks it receives) as well as its text contents, including its title and primary keywords. Such content-level attributes are often beneficial for improving search relevance.

For example, Microsoft Docs is a repository of tutorials, documentation, and self-help pages that uses ACS to compute search relevance and retrieve meaningful articles in response to searched phrases. Searching “power bi” results in relevant articles such as developer documentation and an overview of the Power BI service. These particular articles are shown because of a combination of how often they have been clicked on, the thumbs-up counts they’ve received, and how their respective titles, descriptions, and keywords correlate with the phrase “power bi.”

In this scenario, relevance may be mathematically expressed as:

relevance(article | search_term) =
    title_wt * corr(article_title, search_term) + descr_wt * corr(article_description, search_term)

On the left side of the equation, search_term is the phrase that is searched, and article is a candidate document in the Azure Cognitive Search index. On the right side, the weights title_wt and descr_wt are numbers reflecting how important the title and description are, respectively. Finally, the function corr is a linear correlation of the word embeddings of its two arguments: the higher its value, the more closely the article’s title (or description) is related to the searched phrase.
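To make the formula concrete, here is a minimal sketch in Python. It assumes the title, description, and search term have already been converted into embedding vectors (the embedding step is not shown), and the weight values are purely illustrative:

import numpy as np

def corr(vec_a, vec_b):
    # Pearson correlation between two embedding vectors.
    return float(np.corrcoef(vec_a, vec_b)[0, 1])

def relevance(title_vec, descr_vec, term_vec, title_wt=2.0, descr_wt=1.0):
    # Weighted relevance of one article for one search term.
    return title_wt * corr(title_vec, term_vec) + descr_wt * corr(descr_vec, term_vec)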

So what does this all have to do with optimizing search relevance? A practical challenge with the equation above is that title_wt and descr_wt are independent of any specific [search term, article] pair. By default, the weights they represent apply to all [search term, article] combinations.

For example, when searching for “teams” on Microsoft Docs, one of the returned items is this admin documentation page, titled “Microsoft Teams admin documentation.” The title of that document may be less relevant, however, than its description, “Learn how to roll out and manage Teams, and prepare your users for Teams,” especially for someone looking to get started with Teams rather than wanting to administer it. This example shows that because the weights apply to all [search term, article] combinations, it is unclear what a “best” set of weights should be.

This challenge is further complicated when 1) a search index has thousands of articles and 2) there are additional article attributes to consider as well, such as page views, key phrases, or word count. So, the fundamental question is, “How do I choose the appropriate weights for attributes like title and description?”

It is possible to manually set these weights through the ACS scoring profile interface. If you have an Azure account, this process is as easy as visiting the Azure portal, accessing the search instance, and manually changing the profile’s weights. At the present time, ACS does not offer functionality to determine whether a set of weights is optimal. To tackle this challenge, a principled and automatic way of choosing the weights is presented below. Note that the following approach does not require developing or hosting new models or tools.
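For reference, scoring profile weights can also be updated programmatically. The following is a minimal sketch using the azure-search-documents Python SDK; the endpoint, key, index name, and field names are placeholders and would need to match your own search index:

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import ScoringProfile, TextWeights

client = SearchIndexClient("https://<your-service>.search.windows.net",
                           AzureKeyCredential("<admin-key>"))

# Fetch the existing index and attach a scoring profile that boosts the
# title field more heavily than the description field.
index = client.get_index("<your-index-name>")
index.scoring_profiles = [
    ScoringProfile(
        name="weighted-profile",
        text_weights=TextWeights(weights={"title": 2.0, "description": 1.0}),
    )
]
client.create_or_update_index(index)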

Step 1: Choose your metric. If your primary goal is to increase the click-through rate (CTR), then historical CTR at the [search term, article] level is a great place to start. On the other hand, if engagement is a more relevant metric, then considering how long a user stays on a page (known as dwell time) can be useful. Finally, a composite metric like CTR + dwell time may be more appropriate, depending on your business scenario. Additional ideas may be derived from this documentation on search-based KPIs.
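As an illustration of Step 1, the sketch below computes CTR, average dwell time, and a simple composite score from a hypothetical search log; the file name and column names are assumptions and would differ in your own telemetry:

import pandas as pd

# Hypothetical search log: one row per (search_term, article_id) impression.
logs = pd.read_csv("search_logs.csv")  # columns: search_term, article_id, clicked, dwell_seconds

metrics = (
    logs.groupby(["search_term", "article_id"])
        .agg(ctr=("clicked", "mean"), avg_dwell=("dwell_seconds", "mean"))
        .reset_index()
)

# Composite metric: scale each component to [0, 1] and average them.
for col in ["ctr", "avg_dwell"]:
    metrics[col + "_norm"] = metrics[col] / metrics[col].max()
metrics["score"] = metrics[["ctr_norm", "avg_dwell_norm"]].mean(axis=1)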

Step 2: Gather historical data to form a judgment list. Apply your metric to historical search data to determine the best-performing search results. For example, suppose your process revealed the top 100 most frequently searched terms and their respective five best-ranked articles. These 100 x 5 = 500 [article, search term] pairs can act as the ideal results to tune weights like title_wt and descr_wt mentioned above. The collection of such pairs is known as a “judgment list” because it forms a guide for tuning the general search experience. In practice, the size of the judgment list is determined by factors such as the number of business-relevant search terms and the volume of historical data.
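Continuing the sketch from Step 1 (under the same assumed column names), a judgment list of the 100 most searched terms and their five best-scoring articles could be formed along these lines:

# Keep the 100 most frequently searched terms, then the top 5 articles per term,
# ranked by the composite score computed in Step 1.
top_terms = logs["search_term"].value_counts().head(100).index

judgment_list = (
    metrics[metrics["search_term"].isin(top_terms)]
        .sort_values(["search_term", "score"], ascending=[True, False])
        .groupby("search_term")
        .head(5)                     # 100 terms x 5 articles = 500 ideal pairs
        .reset_index(drop=True)
)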

Step 3: Offline optimization. Because the Azure Cognitive Search mechanism that determines relevance is a proprietary “black box,” we cannot directly apply classical optimization techniques (like gradient descent). Using the judgment list, however, enables a clever trick: different combinations of weights like title_wt and descr_wt can be applied to Azure Cognitive Search to see how well the articles it returns match their respective rankings in the judgment list. One standard way to evaluate this matching is normalized discounted cumulative gain (NDCG). An NDCG score of 1 is achieved when the set of weights used in Azure Cognitive Search results in a perfect match for all [article, search term] rankings. To carry out this optimization, sequential model-based methods like Bayesian optimization can be applied.
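To illustrate Step 3, here is a sketch of the offline loop using Optuna as the Bayesian optimizer (the tutorial may use a different library). The helper search_topk, which would query ACS with a scoring profile built from the trial’s weights, is hypothetical, and judgment_list is the data frame from the Step 2 sketch:

import numpy as np
import optuna

# Reshape the judgment list into {search_term: {article_id: gain}}.
ideal_by_term = {
    term: dict(zip(grp["article_id"], grp["score"]))
    for term, grp in judgment_list.groupby("search_term")
}

def dcg(gains):
    return sum(g / np.log2(i + 2) for i, g in enumerate(gains))

def ndcg(returned_gains, ideal_gains):
    ideal = dcg(sorted(ideal_gains, reverse=True))
    return dcg(returned_gains) / ideal if ideal > 0 else 0.0

def objective(trial):
    weights = {
        "title": trial.suggest_float("title_wt", 0.1, 10.0),
        "description": trial.suggest_float("descr_wt", 0.1, 10.0),
    }
    scores = []
    for term, ideal in ideal_by_term.items():
        returned = search_topk(term, weights, k=5)       # hypothetical ACS query helper
        gains = [ideal.get(doc_id, 0.0) for doc_id in returned]
        scores.append(ndcg(gains, list(ideal.values())))
    return float(np.mean(scores))                        # average NDCG over the judgment list

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=250)
best_weights = study.best_params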

As a tutorial, this public GitHub repo illustrates how to perform Steps 1–3 above.

Step 4: Live A/B testing. Equipped with NDCG-optimized scoring profile weights, it’s time to put the newly generated scoring profile to the test. One strategy for conducting an A/B experiment is to expose half of your users to search results driven by the previous scoring profile (the control) and the other half to results driven by the newly optimized scoring profile (the test). Azure Front Door may be used to split users into these control and test groups. Your search traffic volume can help determine the duration of the experiment.
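The 50/50 routing itself is delegated to Azure Front Door in this setup, but for intuition, a deterministic user split can be sketched in a few lines (the experiment name and user ID are placeholders):

import hashlib

def assign_group(user_id, experiment="scoring-profile-ab"):
    # Hash the user and experiment name so assignment is stable across visits.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "test" if int(digest, 16) % 2 == 0 else "control"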

Checking whether you’re on the right track

It’s a good idea to compare your optimization results to a baseline NDCG. As demonstrated in the tutorial, if the judgment list or number of iterations is too small in Step 3, then this metric may oscillate about the baseline. As also shown in the tutorial, with a sufficiently large judgment list and enough optimization steps, average NDCG eventually exceeds the baseline and reaches convergence. See the figure below for an example. In this case, the best set of weights can be used for a scoring profile experiment as described in Step 4.

Figure 1: Example optimization history demonstrating that optimization exceeds the baseline and tends to stabilize. The total number of optimization trials is 250. The maximum observed average NDCG in offline optimization (Step 3) is 0.54 and the baseline is 0.46, resulting in a 17-percent gain in NDCG. A rule of thumb is that an NDCG gain of at least five percent is sufficient to warrant an online A/B experiment as described in Step 4.
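Using the numbers in the figure caption, the relative gain and the rule-of-thumb check work out as follows:

baseline_ndcg = 0.46
best_ndcg = 0.54                                  # best average NDCG from Step 3

gain = (best_ndcg - baseline_ndcg) / baseline_ndcg
print(f"NDCG gain: {gain:.0%}")                   # ~17%
run_ab_test = gain >= 0.05                        # rule of thumb: at least a 5 percent gain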

Conclusion

This article has aimed to provide an informative overview of how search results may be optimized using Azure Cognitive Search. This approach does not require any extra tools (beyond open source Python packages) or hosting models. Interested readers are encouraged to leverage this tutorial and conduct subsequent A/B experiments to improve their customers’ search experience.

Ivan Barrientos is on LinkedIn.
