Searching for Goldilocks

Many search queries don’t have a single right answer or even an objective best answer. For example, if you’re looking for mailboxes or pictures of kittens, then the best answer — if there is one — is highly subjective. For searches like these, the ideal search experience is a set of results that you can scan, compare, and ultimately narrow down to a favorite or small set of favorites.

Importantly, the goal of the search engine is not to minimize the time you take to find a single result, but rather to help you effectively and enjoyably explore enough options to satisfy your information need. For exploratory search, unlike for known-item retrieval, engaging with more results should be a sign of a positive search experience.

Because there is no best result, even an ideal ranking algorithm will assign practically indistinguishable scores to the top-ranked results. Moreover, the ranking function may harm the search experience by obscuring the diversity of relevant results.

Exploratory searches call for diversity in the search results. But what is the optimal level of result diversity? How, like in the story of Goldilocks, do we provide diversity that’s not too much or too little, but just right? And how do we measure our success?

I can’t answer this question precisely, but I’ll try to articulate a framework that offers guidance.

Query Interpretation

While some amount of diversity in search results is a good thing, we shouldn’t conflate it with query ambiguity. For example, a search for images of “bows” could be an attempt to find decorative knots, weapons for shooting arrows, or people bowing. But no reasonable searcher would be looking for more than one of the above.

A result set that spans all of these disjoint query interpretations in the name of diversity will be mostly wrong. Some searchers will be lucky and find relevant results; others will reformulate their queries to clarify their intent. But many searchers will simply quit in frustration.

Result diversification is not an effective solution for query ambiguity. Rather, the search engine should guide searchers to unambiguous queries through an interactive search experience. That interaction may be a clarification dialog after the user has entered a query, like Wikipedia’s disambiguation pages. Or it may be part of the autocomplete mechanism, e.g., “jumpe” on Amazon suggests “jumper in Baby” and “jumper in Women’s Clothing” as completions.

Ultimately, the search engine should guide the searcher to a set of search results that may be broad (clarifying that a bow is a decorative knot still leaves a lot of room for exploration) but is consistent with a single query interpretation.

Too Much

Let’s now assume that the search results are consistent with a single query interpretation. There may still be too much diversity, even to support exploratory search intents. How much diversity is too much?

If each search result feels like a unique snowflake, then even an exploratory searcher will struggle to scan and compare them. It’s a form of the curse of dimensionality: the searcher is trying to assimilate a set of results into a cognitive judgment space, and he or she can only do so if the space doesn’t have too many dimensions. If the results are too diverse (e.g., a search for recipes), the searcher will reach cognitive capacity after scanning a relatively small number of results, prematurely settling for the best choices among them.

In other words, the searcher may find a result quickly, but the exploratory search experience will be neither effective nor enjoyable.

How should a search engine handle too much diversity in the result set? It can suggest refinements to guide the searcher to a query with less diversity. Even better, it can associate those refinements with clear information scent. Both of these approaches guide the searcher to a more manageable amount of diversity.

Too Little

Let’s turn to the other extreme: what is the risk of having too little diversity?

As we discussed earlier, the ranking function can quash diversity in the top-ranked results, since similar results tend to have similar scores (aka the cluster hypothesis). For example, a search for “shirts” on an ecommerce site shouldn’t give the impression that the site only carries women’s blouses.

The risk of too little diversity is that searchers will never discover the content they’re looking for. At best they’ll only discover certain segments of the relevant space. Moreover, the lack of variation means they’ll quickly get bored as they scan the results, which again means that they’ll stop after scanning a relatively small number of results.

A search engine can’t necessarily prevent searchers from specifying a low-diversity result set, especially if the searchers decide to type long and highly specific queries. But the search engine can avoid presenting search suggestions that lead to low-diversity result sets, particularly in autocomplete. The paths of lowest friction shouldn’t quash diversity, at least not for exploratory search.

Just Right

Like Goldilocks, the searcher wants an amount of diversity that’s not too much, not too little, but just right. Enough diversity to make the exploration interesting, but not so much as to bring on the curse of dimensionality and the ensuing cognitive overload.

How much diversity is just right? Surprisingly, I couldn’t find any definitive answers to this question in the information retrieval literature. But fortunately my friend Jeremy Pickens pointed me to a paper that Pengfei Zhao and Dik Lun Lee presented at last month’s SIGIR 2016 conference, entitled “How Much Novelty is Relevant?: It Depends on Your Curiosity”. That paper, while tackling a somewhat different use case for recommender systems, pointed me to the Wundt Curve.

Wilhelm Wundt, one of the founding figures of modern psychology, observed a non-monotonic relationship between complexity and stimulation. People are more stimulated by greater complexity up to a certain point, after which the experience becomes overwhelming and the stimulation diminishes. He characterized this relationship between complexity and stimulation as a bell-shaped (or inverted-U) curve, called the Wundt Curve.

Wundt died in 1920, a bit early to apply his research to modern information retrieval. But I believe he would have instantly recognized our exploratory search scenario as a special case of his curve. Optimizing the search experience feels a lot like the problem of finding the peak of the Wundt Curve — the amount of stimulation that is just right.

Finding the Peak

How do we find the peak of the Wundt Curve? We need to quantify the diversity of the result set and then vary it until we find the amount that maximizes users’ effective and enjoyable exploration of the results.

There are lots of ways of defining search result diversity — dare I say, a diversity of definitions. Rodrygo Santos, Craig Macdonald, and Iadh Ounis have written a great survey of the topic. No definition will be perfect, but we need to settle on a simple metric (e.g., average pairwise dissimilarity using a vector representation of the documents) that will serve to estimate the amount of diversity in the results shown to the user.
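As a rough illustration of the simple metric mentioned above, here is a minimal sketch of average pairwise dissimilarity. It assumes each result is already represented as a plain list of numbers, and it uses one minus cosine similarity as the dissimilarity. Both choices are mine for the sake of the example, as are the function names; other vector representations and distance measures would work just as well.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_pairwise_dissimilarity(vectors):
    """Mean of (1 - cosine similarity) over all unordered pairs of results.

    Assumption for this sketch: a result set with fewer than two items
    has no pairs to compare, so its diversity is reported as 0.0.
    """
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(u, v) for u, v in pairs) / len(pairs)
```

A result set of near-duplicates scores close to 0, while a set of mutually unrelated results scores close to 1 (for vectors with non-negative components).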

Now that we have a definition of diversity, how do we vary it? A common technique is to initially rank results with a scoring function and then rerank them to increase diversity. No approach is perfect: as Santos et al. point out, the problem is NP-hard. Still, reranking provides an effective lever.
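To make the rerank step concrete, here is a sketch of one well-known greedy heuristic, maximal marginal relevance (MMR). The post doesn’t prescribe a particular reranker, so treat this as one possible lever rather than the recommended one. The `trade_off` parameter, the `category_similarity` function, and the toy `shirts` data are all hypothetical, invented for the example.

```python
def mmr_rerank(candidates, similarity, trade_off=0.7, k=10):
    """Greedy MMR-style rerank.

    candidates: list of (item, relevance_score) pairs from the initial ranking.
    similarity: function (item, item) -> value in [0, 1].
    trade_off:  1.0 keeps the pure relevance order; lower values
                penalize items similar to those already selected.
    """
    remaining = list(candidates)
    selected = []

    def marginal_score(candidate):
        item, relevance = candidate
        # Similarity to the closest already-selected item (0.0 if none yet).
        max_sim = max((similarity(item, chosen) for chosen, _ in selected), default=0.0)
        return trade_off * relevance - (1.0 - trade_off) * max_sim

    while remaining and len(selected) < k:
        best = max(remaining, key=marginal_score)
        remaining.remove(best)
        selected.append(best)
    return [item for item, _ in selected]


def category_similarity(a, b):
    """Toy similarity: items are (name, category); same category means similar."""
    return 1.0 if a[1] == b[1] else 0.0


# Hypothetical results for a "shirts" query, with made-up relevance scores.
shirts = [
    (("blouse A", "blouse"), 0.90),
    (("blouse B", "blouse"), 0.88),
    (("t-shirt", "tee"), 0.80),
    (("polo", "polo"), 0.70),
]
```

With `trade_off=0.5`, the second blouse drops from second place to last, so the top of the list no longer suggests that the site only carries blouses. Varying `trade_off` is what lets us dial the amount of diversity up or down.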

Finally, we have to decide what we are optimizing for. Defining search happiness in general is beyond the scope of this post. But a proxy we should track is the number of items with which the user engages. For non-exploratory searches, fewer is better. But for exploratory searches, engaging with more results can be a sign that the user is enjoying the process. If that’s the case, then maximizing the number of results with which the user engages could represent the peak of the Wundt Curve.
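Putting the pieces together, a crude way to look for the peak is to show different sessions result sets with different amounts of diversity, log how many results each searcher engaged with, and pick the level with the highest average. The sketch below assumes exactly that kind of experiment. The `(diversity_level, engaged_count)` log format and the function name are hypothetical, and a real analysis would also need to account for sample sizes and noise.

```python
from collections import defaultdict
from statistics import mean


def peak_diversity(observations):
    """Return (best_level, mean_engagement_by_level).

    observations: iterable of (diversity_level, engaged_count) pairs,
    one per exploratory search session. Raises ValueError if empty.
    """
    counts_by_level = defaultdict(list)
    for level, engaged_count in observations:
        counts_by_level[level].append(engaged_count)

    mean_engagement = {
        level: mean(counts) for level, counts in counts_by_level.items()
    }
    # The level with the highest mean engagement is our estimate of the peak.
    best_level = max(mean_engagement, key=mean_engagement.get)
    return best_level, mean_engagement
```

If engagement really does follow an inverted-U shape, the averages should rise and then fall as the diversity level increases, and the returned level approximates the top of the curve.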

Now What?

As per my earlier disclaimer, I don’t have a silver bullet here, just a framework to offer guidance. Defining diversity, measuring search happiness, and providing just the right amount of the former to optimize for the latter: these are all hard problems. But I hope I’ve succeeded in calling your attention to some of the unique challenges of exploratory search, and that the framework I’ve offered proves useful to those of you working on search engines.

Remember, ranking is important, but explorers deserve love too.