A typical approach for processing search queries is to retrieve a set of matching documents and then rank them with a relevance scoring function. This simple approach generally works well for unambiguous, specific search queries.
But sometimes this approach breaks down. When a search query is broad (e.g., “shirts”), it isn’t clear how to decide which matching results are the most relevant ones. Even worse, when the query is ambiguous (e.g., “mixers”), it isn’t even clear how to determine the matching results, let alone rank them.
Recognizing When Search Results Need Diversification
Can a search engine automatically determine when a search query is broad or ambiguous? No approach is perfect, but here are some useful signals:
- Number of results. Specific queries tend to have small result sets. Conversely, broad and ambiguous queries tend to have large result sets. But a large result set may simply reflect an aggressive matching strategy. A more nuanced approach is to count the results with high relevance scores. If this number is high, then the query is probably broad or ambiguous.
- Variance of results. A stronger signal than the size of the result set is its variance. This variance can be computed from pairwise result similarity (e.g., cosine distance using a word embedding model), or from a histogram that summarizes the result set (e.g., the entropy of the category distribution). A high variance indicates a broad or ambiguous query.
- Distinctiveness of results. Another signal is the distinctiveness of the results relative to those of the overall document collection, typically measured using Kullback–Leibler divergence. For a deeper dive into this and related approaches, I recommend Claudia Hauff’s dissertation on “Predicting the Effectiveness of Queries and Retrieval Systems”.
- Query analysis. Short search queries tend to be broad, and they are also more likely to be ambiguous. Processing the query with a part-of-speech or entity recognition tagger can yield a more precise analysis. Hauff discusses these kinds of strategies in her section on pre-retrieval predictors. A more modern approach would take advantage of word embeddings, e.g., comparing the query with a collection of queries of known specificity.
- Historical searcher behavior. For frequent queries, the search engine can learn from historical searcher behavior. Specific queries tend to have high click-through rates, and the clicks tend to land on top-ranked results. In contrast, broad and ambiguous queries have lower click-through rates and fewer clicks on top-ranked results. Broad and ambiguous queries also have higher rates of pagination, query refinement, and query reformulation. Finally, it’s possible to use labeled queries to train a machine learning model that recognizes broad and ambiguous queries, though any approach based on historical searcher behavior is vulnerable to presentation bias.
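As a rough illustration of two of these signals, here is a minimal Python sketch that computes the entropy of a result set's category distribution and the Kullback–Leibler divergence between the result set and the overall collection. The function names, the choice of category labels as the unit of comparison, and the smoothing constant are my assumptions for illustration; the same idea applies to term distributions.

```python
import math
from collections import Counter


def category_entropy(categories):
    """Shannon entropy (in bits) of the category labels of a result set."""
    counts = Counter(categories)
    total = sum(counts.values())
    # p * log2(1 / p) summed over the observed categories
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def kl_divergence(result_categories, collection_categories, smoothing=1e-9):
    """KL divergence D(results || collection), in bits.

    `smoothing` is an assumed small constant that avoids division by zero
    when a result category never appears in the collection sample.
    """
    res = Counter(result_categories)
    col = Counter(collection_categories)
    res_total = sum(res.values())
    col_total = sum(col.values())
    divergence = 0.0
    for category, n in res.items():
        p = n / res_total
        q = col.get(category, 0) / col_total + smoothing
        divergence += p * math.log2(p / q)
    return divergence


# Hypothetical category labels for two result sets
specific = ["t-shirts"] * 8
broad = ["t-shirts"] * 4 + ["dress shirts"] * 4
print(category_entropy(specific), category_entropy(broad))
```

A result set that falls into a single category has an entropy of 0 bits, while one split evenly between two categories has an entropy of 1 bit. What counts as "high" is something to calibrate against your own queries rather than a fixed constant.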
Broad Queries vs. Ambiguous Queries
All of these signals are ways to identify broad and ambiguous search queries. But these two classes of queries have important differences.
Broad queries are unambiguous but underspecified. For example, the broad query “shirts” includes shirts for men, women, and children; t-shirts, polo shirts, and dress shirts; shirts of all colors and materials; etc. In contrast, “mixers” is ambiguous because it could denote kitchen appliances, sound equipment, or several kinds of industrial machines. All shirts fall into the same general class, but the different kinds of mixers fall into distinct classes.
Here are some signals that can help distinguish broad queries from ambiguous ones:
- Modality of distribution. The results for a broad query center around a single mode that represents the “average” result. In contrast, an ambiguous query returns a mixture of results with two or more modes. There are various statistical tests to measure the modality of a distribution.
- Top-level vs. lower-level category variance. A broad query generally has results within a single top-level category, e.g., shirts are all in clothing. The results vary within the children of that top-level category. In contrast, results for an ambiguous query split among multiple top-level categories, e.g., mixers are split among kitchen appliances, audio equipment, etc.
- Entity recognition. Entity recognition for an unambiguous query typically yields a single sequence of tags with a high confidence score. In contrast, the lack of a single dominant tag sequence indicates an ambiguous query.
- Historical searcher behavior. If the query is frequent, then it’s possible to apply previously cited statistical tests for the modality of distribution to the results that searchers have historically engaged with.
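The category-based signal above can be sketched in a few lines of Python. This is only one way to operationalize it: I assume each result carries a hypothetical (top-level category, lower-level category) pair, and the entropy thresholds are placeholder values that would need tuning on real data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of hashable labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def classify_query(result_paths, top_threshold=0.5, leaf_threshold=0.5):
    """Label a query from its results' (top_level, lower_level) category pairs.

    Both thresholds are assumed values for illustration only.
    """
    top_entropy = entropy([top for top, _ in result_paths])
    leaf_entropy = entropy(result_paths)
    if top_entropy > top_threshold:
        # Results split across top-level categories: distinct interpretations
        return "ambiguous"
    if leaf_entropy > leaf_threshold:
        # One top-level category, but spread across its children
        return "broad"
    return "specific"


shirts = [("clothing", "t-shirts")] * 5 + [("clothing", "dress shirts")] * 5
mixers = [("kitchen appliances", "stand mixers")] * 5 + [("audio equipment", "audio mixers")] * 5
print(classify_query(shirts), classify_query(mixers))
```

Checking the top-level entropy first mirrors the distinction in the text: variation across top-level categories points to ambiguity, while variation confined to the children of one category points to breadth.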
Search User Interface Implications
All of the discussion so far has been about recognizing broad and ambiguous queries. But what should a search engine do differently if it does recognize such a query?
Disambiguate Ambiguous Queries
If a query is ambiguous, the search engine cannot reliably determine the searcher’s intent. The best way to resolve this ambiguity is through a clarification dialogue. The search engine should present the searcher with unambiguous queries that represent the most probable interpretations, with examples to communicate the distinct alternatives. In our “mixers” example, the suggested queries might include “kitchen mixer” and “audio mixer”.
Refine Broad Queries
If a query is broad, then the search engine should suggest refinements that guide the search towards more specific queries. These typically include category suggestions, such as refining from shirts to t-shirts, dress shirts, etc. They may also include faceted refinements that suggest useful attributes to narrow the result set.
It’s important to remember that disambiguation comes before refinement. If a query is ambiguous, the search engine’s first priority is to disambiguate it. Then, if the disambiguated query is still broad, the search engine should help the searcher refine it.
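To make the ordering concrete, here is a minimal Python sketch of a response policy that tries disambiguation first and refinement second. The function name, the `min_share` cutoff, and the way suggestions are phrased are all hypothetical; a real system would generate suggested queries from its own taxonomy and query logs.

```python
from collections import Counter


def suggest(query, result_paths, min_share=0.2, max_suggestions=3):
    """Return (action, suggestions) for results given as
    (top_level_category, lower_level_category) pairs.

    `min_share` is an assumed cutoff: a category must hold at least this
    fraction of the results to count as a plausible interpretation.
    """
    total = len(result_paths)

    # Step 1: disambiguate if several top-level categories are well represented.
    top_counts = Counter(top for top, _ in result_paths)
    senses = [top for top, n in top_counts.most_common() if n / total >= min_share]
    if len(senses) > 1:
        return "disambiguate", [f"{query} ({s})" for s in senses[:max_suggestions]]

    # Step 2: otherwise, refine if several lower-level categories are well represented.
    leaf_counts = Counter(leaf for _, leaf in result_paths)
    leaves = [leaf for leaf, n in leaf_counts.most_common() if n / total >= min_share]
    if len(leaves) > 1:
        return "refine", leaves[:max_suggestions]

    # Step 3: neither applies, so fall back to plain ranking.
    return "rank", []


mixers = [("kitchen appliances", "stand mixers")] * 6 + [("audio equipment", "audio mixers")] * 4
print(suggest("mixers", mixers))
```

Once the searcher picks an interpretation, the same function can be applied to the narrower result set, at which point it may fall through to the refinement step.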
Many search queries only require the traditional approach of ranking a set of matching results. But some queries require a more complex approach because they are either broad or ambiguous. It’s important for a search engine to detect such queries, as well as to distinguish broad queries from ambiguous ones. Fortunately, a variety of signals make this possible, allowing the search engine to help the searcher disambiguate or refine the query as appropriate.