Autosuggest Ranking

Supercharge your search box with Statistics, Supervised Learning, Bayesian Inference, PCA, Clustering, Word Embeddings, and more! Oh my!

Ebisu, Japanese god of luck, hard at work ranking your fortune.

In the last post we talked about retrieval: given a prefix, producing a list of candidate suggestions that we can offer to users in a search box. In this post, we’ll talk about ordering this list of candidates.

After the retrieval phase returns a set of candidate suggestions, we apply a ranking algorithm to re-sort them.

In practice, most production systems rank in two phases. The first phase is usually very cheap and may be based on a single static score or a handful of factors. It’s often part of the retrieval pass. A second phase of ranking is then performed on the top N, usually leveraging more complex models and a larger set of features. Often this second phase is when personalization is incorporated.

Where the line is drawn between the two is a function of performance and technology. It’s usually too expensive to perform complex ranking on a large set of candidates, e.g. the full set of suggestions that start with a short prefix like “b”. That first phase of ranking is usually happening inside a generalized search engine like Solr or Elasticsearch, and it’s difficult to run arbitrary code in those contexts. Further, if you are performing retrieval using a system that supports early termination, like the Finite State Transducers (FSTs) we discussed in the previous post, it’s tremendously advantageous to leverage a static score during the first pass.

Goals

As we discussed in the first post from this series, the goals of an autosuggest system are usually a mixture of prediction and performance. We want to suggest queries that users are likely to accept, but we should make sure users are going to have an engaging experience if they issue those queries. Suggesting a very probable query that has no results, for instance, is a tremendous bummer.

As we talk about optimizing ranking, you can conceptualize the task as finding some ordering that maximizes the right blend of prediction and performance for you. Or you can simply try to enforce some minimum threshold of performance (e.g. filter out any queries that yielded no results), and focus on prediction. There’s a lot of ground to cover on the prediction side, so we’ll assume that strategy and focus on prediction for the sake of this conversation.

Two Paths

Practically speaking, we’re talking about designing a ranking function, which takes a prefix and a suggestion and produces a numeric score we can sort by (usually descending):

A ranking function takes the prefix and suggestion as input and produces a numeric score.

There are other ways to design the problem, but this framing is common. From a prediction perspective, you can think of the objective as estimating the probability P(suggestion|prefix). A simple estimate of that probability might look like:

A simple calculation of the probability of a query given a prefix.

Every candidate suggestion will have the same denominator, so essentially we’re sorting by popularity, which turns out to be a surprisingly effective baseline, and a good choice for static scores used in first-pass ranking. If you don’t have query logs you can start by measuring popularity in your corpus of documents [1].
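As a sketch, here’s what that popularity baseline might look like over a toy query log (the log and candidates are made up for illustration):

```python
from collections import Counter

# Hypothetical mini query log: one entry per issued query.
query_log = ["hella", "hello", "hello", "harry potter", "hello"]

def rank_candidates(prefix, candidates, log):
    """Score candidates by P(suggestion | prefix) estimated from log counts."""
    counts = Counter(log)
    # Every candidate shares the same denominator (total queries starting
    # with the prefix), so sorting by count alone yields the same order.
    denom = sum(c for q, c in counts.items() if q.startswith(prefix))
    scored = [(q, counts[q] / denom) for q in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_candidates("h", ["hella", "hello", "harry potter"], query_log)
# "hello" accounts for 3 of the 5 "h"-prefixed queries, so it ranks first.
```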

Of course, popularity isn’t everything. We’re not considering, for instance, how probable a query is for a given user. What if everyone is into “high waisted shorts” but you typed “h” because you really wanted a “hootie and the blowfish t-shirt”? If we really want to get you that Hootie, our job is to come up with a richer estimate of this probability.

A probabilistic approach to the ranking problem tries to calculate these probabilities directly based on observed data, or often leverages a statistical model to estimate them. The popularity approach described above falls into this category.

You can also formulate this as a supervised learning problem. This might take the form of training a binary classifier with (prefix + suggestion, label) pairs, like (“h” + “harry potter”, 1) or (“h” + “hella cool”, 0). The implication is that “harry potter” is a good suggestion for the prefix “h”, and “hella cool” isn’t. After showing the model a ton of these examples, it will learn to separate one class from the other, and given a new prefix + suggestion, it will generate a numeric value for how good or bad it is:

An example LTR model takes a prefix and a suggestion as input and produces a numeric score we can use to rank.

The application of supervised learning to ranking is often called Learning to Rank (LTR).

Usually training data is harvested from user behavior. Say a user issued the query “hella”. For each prefix of “hella” we would create a positive example associated with that query, so [(“h” + “hella”, 1), (“he” + “hella”, 1), (“hel” + “hella”, 1), …]. For each prefix, we might retrieve the set of candidates from our first-pass system and use them as negative examples. So for instance, if “hello” was a candidate for “h”, we would generate the training instance (“h” + “hello”, 0)[2].

Typically you don’t literally give the model the prefix and suggestion — you show the model a vector of features about the two. You can include features about the prefix (how long was it?), about the suggestion (how popular is it?), about their intersection (what’s the edit distance between the two?), and about the context within which the suggestion was shown (what time of year is it? What was the user’s geographic location?). A feature vector might look like this: {prefix_length: 2, suggestion_volume: 200, edit_distance: 10, ….}.
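Here’s a rough sketch of both steps: harvesting training triples from one observed query, and turning a (prefix, suggestion) pair into a toy feature vector. The `retrieve` function stands in for a hypothetical first-pass retrieval system, and the features are illustrative, not exhaustive:

```python
def training_examples(accepted_query, retrieve):
    """Generate (prefix, suggestion, label) LTR training triples
    from one query a user actually issued."""
    examples = []
    for i in range(1, len(accepted_query) + 1):
        prefix = accepted_query[:i]
        examples.append((prefix, accepted_query, 1))  # positive example
        for candidate in retrieve(prefix):  # first-pass candidates
            if candidate != accepted_query:
                examples.append((prefix, candidate, 0))  # negative example
    return examples

def featurize(prefix, suggestion, suggestion_volume):
    """A toy feature vector; production systems use many more features."""
    return {
        "prefix_length": len(prefix),
        "suggestion_length": len(suggestion),
        "suggestion_volume": suggestion_volume,
        "prefix_matches": int(suggestion.startswith(prefix)),
    }

# Hypothetical first-pass retrieval: prefix -> candidate suggestions.
def retrieve(prefix):
    return ["hello"] if prefix == "h" else []

examples = training_examples("hella", retrieve)
# Includes ("h", "hella", 1) and ("h", "hello", 0), among others.
```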

Both the probabilistic and LTR approaches end up utilizing a lot of the same factors, and contextual features are the most popular, so that’s what we’ll focus on for the remainder of this post.

Generally a probabilistic approach is easier to get started with, has intuitive explanations, and requires less data. LTR may ultimately yield better performance at the cost of complexity and possibly opacity. The other big advantage of LTR is that it’s straightforward to incorporate many factors in a principled way — just add them to your feature vector.

Query Logs, User Data, & Privacy

In a previous post, we discussed the decision between using query logs or documents as the source for your corpus of suggestions. Even if you chose documents, you can still leverage query logs to inform suggestion ranking.

A lot of the research and techniques we’ll discuss today depend on query logs. It’s hard to predict when you have nothing to predict against. But as the world learns about the crimes of Cambridge Analytica and the tech industry braces for the impact of GDPR, it would behoove us to address the responsibility we have while leveraging user data for a project like this. You should give users transparency (i.e. inform them how their data is being used) and control (i.e. give them the ability to decide how or whether their data is used). You need to make sure their data is securely stored, with access restrictions and auditing, and you should offer them the right to be forgotten. Addressing this in a comprehensive way is beyond the scope of this post, but suffice to say it’s something you should be thinking about.

Time Features

When we talked about sorting by popularity, there was implicitly some notion of time. Are we talking about popularity across all time? The last month? Some other time-frame?

We’re trying to predict the probability P(suggestion|prefix), ideally at the moment the user is typing the prefix, which will be at some point in the future after we’ve designed/trained our ranking function. Essentially, we’re trying to forecast the future volume of a given suggestion, and there’s a large body of statistical techniques we can draw from to accomplish this.

Recency

Recency refers to the idea that recent events are more important than the distant past, or that they are better indicators of the future. A year ago “Colin Powell”, “Colin Kaepernick”, and “Cardi B” had virtually indistinguishable query volumes, but these days Cardi B is reigning queen:

Plotting the Google query volume of “Cardi B”, “Colin Kaepernick”, and “Colin Powell”. You can tell who’s making money moves.

How quickly things turn over depends highly on the nature of your product. Twitter trends change on the order of minutes or hours, but even slower-paced products tend to have recency effects. You can think of Twitter’s original feed algorithm of sorting by time descending as wholeheartedly embracing recency as a signal.

An easy way to incorporate recency is to only consider the popularity of queries in the recent past (say the last month or two). But by doing so, we would be throwing away all of the valuable data we have about long-term trends. How do we find a balance between recent events and the past?

A common approach is to use an exponential moving average or exponential smoothing[3]. The idea is pretty simple:

The formula for computing an exponential moving average. It’s recursive!

S_t is the smoothed (people also use the term estimated or forecasted) query volume at time t. Time is defined as equal-sized steps, say days, months, or years. So if our steps were months, t=5 would mean the fifth month since the beginning of our history. Y_t is the observed query volume at a given time step. Notice that S_t is recursive, i.e. it depends on S_{t-1}, so our current estimate depends on the previous estimate, which depends on the penultimate estimate, and so on until the start of time. We’re essentially encoding the idea of recency by giving every step backwards in time exponentially less impact on the current estimate, by a factor of (1−α). α is a parameter between 0 and 1 that lets us trade off between recent trends and the past. The higher the α, the more important recent events are.

How do you choose an α? You can try out a few values and see which one yields an St that’s close to the observed values Yt. This can be quantified using a number of metrics including the Sum of Squared Errors (SSE). There are also optimization techniques you can use rather than manually trying a bunch of α’s.
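A minimal sketch of exponential smoothing plus a brute-force grid search over α (the volumes here are made up):

```python
def exponential_smoothing(series, alpha):
    """S_t = alpha * Y_t + (1 - alpha) * S_{t-1}, seeded with the first observation."""
    smoothed = [series[0]]
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed

def sse(series, alpha):
    """Sum of squared one-step-ahead errors: S_{t-1} is our forecast for Y_t."""
    smoothed = exponential_smoothing(series, alpha)
    return sum((y - s) ** 2 for s, y in zip(smoothed, series[1:]))

volumes = [100, 120, 90, 200, 210, 205]  # hypothetical monthly query volumes
best_alpha = min((a / 10 for a in range(1, 10)), key=lambda a: sse(volumes, a))
```

In practice you’d use a proper optimizer (or a library’s built-in fit) rather than a coarse grid, but the idea is the same: pick the α that best forecasts the held-out observations.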

Once you’ve tuned α, you can use St in place of query volume to calculate your probabilities:

Using predicted query volume to compute suggestion probabilities.

Seasonality

Plotting the Google search volume of “mothers day” against “mugs”.

Seasonality is another important temporal feature that is often useful in ranking functions. If you’re running an e-commerce shop, and it’s almost Mother’s Day, you might want to show folks more Mom stuff. Seasonality can operate on many time scales — holidays occur on yearly cycles, but you could just as easily observe monthly trends (people tend to move on the first of the month), weekly trends (people browse on desktop less on weekends), etc. Again what’s important really depends on your users.

To model this you could use triple-exponential smoothing[3], which really just builds on vanilla exponential smoothing to incorporate an estimate for the seasonality on a predetermined time-scale (yearly, monthly, etc). Going over that in detail is beyond the scope of this blog post, but here’s an AMAZING walk-through.

Again you can use these smoothed volumes directly to estimate your suggestion probabilities.

LTR

If you’re taking an LTR approach, it’s easy to just incorporate raw volume figures for different time scales as model features, e.g. {volume_last_month: 200, volume_last_year: 150, volume_this_month_last_year: 300, …}. But if you take the time to model recency or seasonality effects, it’s often effective to incorporate the predicted volume as a feature[4]. This is probably because you’ve gone through the trouble of modeling the temporal aspects directly and finding good parameters, so the model doesn’t have to do that while juggling a ton of other factors.

User Features

As we alluded to earlier, which query a user is likely to issue is widely different for different types of users. So it shouldn’t be surprising that user-based features are some of the most effective you can leverage in an autosuggest ranking function.

Demographics

Folks will often start on user features by looking at demographics: coarse-grained criteria that group together lots of users, like location, gender, device type, etc.

The probabilities of issuing the queries “imdb” versus “instagram” on Yahoo, broken down by age and gender. [2]

Canadian Thanksgiving (eh!) is in October, while American Thanksgiving is in November. It’s particularly popular to use coarse location (city, country, etc.) as a feature because it can be extracted from HTTP headers or inferred from IP addresses (while still being privacy-friendly), so you don’t need rich user profiles. It’s also a signal that can help you separate local tastes, trends, and seasonal effects, like when to show that Thanksgiving stuff.

If you’re taking a probabilistic approach you can simply segment your probabilities or forecasts by demographic. So if the user is in the US, only look at US query volume. Or you can combine this segmented probability with global trends using Bayes rule. We can treat our previously estimated probability (perhaps using temporal modeling) for the query “foo” as our prior probability P(“foo”), and we will now compute a posterior probability of the query “foo” given the user’s location is the USA, P(“foo”|USA):

Combining global suggestion probabilities with location-specific probabilities using Bayes Rule.

In this manner you can incrementally incorporate new factors in a principled way. Simply take your previous estimate for a query, treat it as the prior, and then apply Bayes rule using the new evidence. This is often called performing a Bayesian Update.
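As a sketch, here’s a single Bayesian update with made-up numbers: suppose “foo” has a prior probability of 0.01, 60% of “foo” traffic comes from the US, and the US accounts for 30% of all traffic:

```python
def bayesian_update(prior, likelihood, evidence):
    """Bayes rule: P(query | location) = P(location | query) * P(query) / P(location)."""
    return likelihood * prior / evidence

# P("foo") = 0.01, P(USA | "foo") = 0.6, P(USA) = 0.3 (all hypothetical).
posterior = bayesian_update(prior=0.01, likelihood=0.6, evidence=0.3)
# posterior is about 0.02: "foo" is twice as likely for a US user as it is globally.
```

To fold in the next factor, you’d treat this posterior as the new prior and apply the rule again with the new evidence.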

You will almost certainly run into situations where you get probabilities of zero — e.g. the query “foo” was never issued in Estonia. Zeros are a bad trip because they unreasonably discard the information in your prior. It’s not that a user in Estonia would never issue the query “foo”, you probably just don’t have enough traffic from Estonia to know. Realistically, the probability isn’t zero, it’s probably some small number we don’t have enough information to accurately estimate. Instead, you can use probability smoothing, which just turns those zeros into really small numbers.
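Additive (Laplace) smoothing is one simple way to do this; here’s a sketch with hypothetical counts:

```python
def smoothed_probability(count, total, vocab_size, k=1.0):
    """Additive (Laplace) smoothing: pretend every query in the vocabulary
    was seen k extra times, so unseen queries get a small nonzero probability."""
    return (count + k) / (total + k * vocab_size)

# "foo" was never observed in Estonia: 0 of 50 queries, over a
# hypothetical vocabulary of 1000 known queries.
p_unseen = smoothed_probability(0, 50, 1000)  # small, but not zero
```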

On the LTR side of the equation, the most straightforward way of including user demographics is to make demographic-specific versions of existing features. So for instance, if a query was issued globally 500 times, and was issued 200 times in the user’s location, you might end up with something like {global_query_volume: 500, local_query_volume: 200, …}.

Behavior

The next step in sophistication is considering how users behave in your product and using those signals to inflect your ranking function. This family of signals is naturally the most privacy antagonistic, so you should proceed with caution.

I would start by clustering users based on behavior — i.e. creating behavior-based demographics. For instance, you could create a giant matrix where the rows are users, and the columns are the terms from your inventory. Initially the matrix is all zeros. For each item a user interacted with, set the terms from that item to 1 for that user’s row:

A user-term matrix, which we’ll decompose and use to create user-clusters. Ara is apparently the only user into “camo”.

This is similar to a typical setup for collaborative filtering. If you have V terms in your vocabulary, you have a V-dimensional representation of a user. That’s probably very large and very sparse. You can use dimensionality-reduction techniques, like Principal Component Analysis (PCA), to project this representation down to a smaller space with, say, a few hundred dimensions. Now with this much more compact representation of users, you can cluster them using something like K-Means Clustering. The cluster a user belongs to can become a demographic that you segment by, use to update your probabilities, or use in cluster-specific model features, e.g. { user_cluster_query_volume: 500 }. These clusters should end up representing broad interests, like separating folks that like “boho chic” from the people into “athleisure”.
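Here’s a toy sketch of that pipeline using plain NumPy: a tiny user-term matrix, PCA via SVD, and a 2-dimensional projection you could feed into K-Means (the vocabulary and interactions are invented, and real matrices would be far larger and sparse):

```python
import numpy as np

# Toy user-term matrix: rows are users, columns are the terms
# ["boho", "chic", "athleisure", "camo"] (hypothetical vocabulary).
X = np.array([
    [1, 1, 0, 0],  # user 0: interacted with "boho chic" items
    [1, 1, 0, 0],  # user 1: same tastes as user 0
    [0, 0, 1, 1],  # user 2: "athleisure" + "camo"
    [0, 0, 1, 0],  # user 3: "athleisure" only
], dtype=float)

# PCA via SVD: center the columns, decompose, keep the top 2 components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
embedding = Xc @ Vt[:2].T  # each user is now a 2-d vector

# With real data you'd now run K-Means on `embedding`; in this toy example
# the two interest groups already separate along the first component.
```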

You can just as easily look at query terms instead of item terms (and a number of other dimensions). Both might be useful, but the latter, not surprisingly, tends to have more data you can leverage.

A more direct and personalized approach is to compute some measure of the similarity between a query and a user’s past behavior. A simple place to start is computing the ratio of query terms that were present in the user’s past queries:

A simple similarity measure is the ratio of terms in a suggestion that were present in a user’s history.

So for the query “harry potter tshirt”, if the user has searched for “harry potter” before but never “t-shirt”, the ratio would be ⅔. You can also compute a similar ratio using n-grams instead.
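A sketch of that ratio, treating terms as lowercased, whitespace-separated tokens (real tokenization would be fancier):

```python
def history_term_ratio(suggestion, past_queries):
    """Fraction of a suggestion's terms that appear anywhere in the
    user's past queries (naive whitespace tokenization)."""
    history_terms = {t for q in past_queries for t in q.lower().split()}
    terms = suggestion.lower().split()
    if not terms:
        return 0.0
    return sum(t in history_terms for t in terms) / len(terms)

# "harry" and "potter" are in the history, "tshirt" isn't: ratio is 2/3.
ratio = history_term_ratio("harry potter tshirt", ["harry potter"])
```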

You can use this measure directly as your ranking function, which would be more of a heuristic approach, or incorporate it into your LTR model, e.g. { user_history_n_gram_similarity: 0.082 }.

You can consider the user’s history across a number of time frames — across all history, recent behavior, within the same session, etc. Often users will slowly evolve their query by looking at results and adding or removing terms from the query, a process called query reformulation. Through this process the user is learning the language of what’s available, and is learning the ins and outs of your search engine. Looking at in-session behavior allows you to pick up on query reformulation, but carries the danger that you will put too much stock in recent behavior and not understand when a user is changing directions completely.

Combining Queries

We briefly mentioned the problem of data sparsity. This is particularly troublesome for autosuggest because search is a long-tail game. While there is often a head of very popular queries, there is an enormous set of queries that are only ever issued a handful of times. For instance, on Etsy “harry potter” was a very popular query, but you can imagine that “harry potter t-shirt size small red” was very rare by comparison.

Rarity is difficult because it means our probabilities for that suggestion will be close to zero, or our LTR models will rarely see that suggestion. Yet those long-tail queries might be precisely what the user is looking for, especially in the midst of query reformulation.

Just as we clustered users, it can be useful to group together queries so we can share information about them, e.g. {query_cluster_volume_last_month: 1500}[6]. This can give long-tail queries an extra boost, and when combined with in-session personalization, can be a powerful mechanism to aid query reformulation.

If you can generate a vector representation or embedding for a query, you can simply run a clustering algorithm in that space to generate query clusters. The most obvious choices are:

  1. PCA — You could perform a matrix decomposition (similar to what we did earlier with users and items) where the rows are queries and the columns are users that issued them, the terms in the queries, or the items that were shown or interacted with.
  2. Neural Query Embeddings — You could leverage word vectors from a neural network model like word2vec or GloVe, pre-trained or trained on your own query corpus. If a query has multiple terms, the most straightforward way to generate a query-level embedding is to sum or average the vectors for each term in the query, but there are more principled approaches (like Facebook’s StarSpace).
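The averaging approach in option 2 can be sketched with made-up 2-dimensional word vectors standing in for real word2vec/GloVe embeddings:

```python
import numpy as np

# Hypothetical 2-d word vectors; in practice these come from a trained model.
word_vectors = {
    "harry": np.array([0.9, 0.1]),
    "potter": np.array([0.8, 0.2]),
    "tshirt": np.array([0.1, 0.9]),
}

def query_embedding(query):
    """Average the word vectors of a query's terms, skipping unknown terms."""
    vecs = [word_vectors[t] for t in query.lower().split() if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else None

emb = query_embedding("harry potter")  # mean of the "harry" and "potter" vectors
```

Once every query has a vector like this, clustering them is the same exercise as clustering users.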

You could also group queries by:

  1. Term Co-Occurrence — You could treat any queries that share at least one term as a cluster. This is roughly equivalent to doing PCA with query terms as the columns.
  2. Query Classifications — It’s often useful to classify queries into broad classes of inventory that the user might be interested in like “Clothing” vs “Furniture”. Query classifiers are useful for ranking, disambiguation, etc. If you happen to have a query classifier, this might be a useful grouping mechanism to experiment with.

Coming Up

Phew! That was a lot. I hope this whirlwind tour of autosuggest ranking was interesting, though I’m sure it can feel overwhelming. The best place to start is just using query popularity across a fixed time range. That’s more or less where everyone starts. I presented the factors/techniques roughly in order from coarsest/easiest to most-granular/sophisticated, so I would suggest slowly trying things in that order. You don’t have to go too crazy to get something decent.

In an upcoming post, we’ll talk about the autosuggest user experience.

Liked this post? Follow us on Medium. Need help with your autosuggest, search, or with data in general? That’s what we do! Get in touch (gio@relatedworks.io).

Acknowledgements, Endnotes, and References

A huge thanks to Steve Mardenfeld and Nishan Subedi for reading this unbelievably long post and giving thoughtful feedback. Y’all are the best.

Yet again, I would recommend Daniel Tunkelang’s work on the subject, and this survey was a critical source for this post.

[1]— The danger with measuring document popularity is that you are mistaking supply for demand. Just because you have a ton of documents that mention buttresses doesn’t mean your users are actually interested in them.

[2] — Milad Shokouhi. 2013. Learning to personalize query auto-completion. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval (SIGIR ‘13). ACM, New York, NY, USA, 103–112. http://doi.acm.org/10.1145/2484028.2484076

[3] — C. C. Holt. Forecasting seasonals and trends by exponentially weighted moving averages. International Journal of Forecasting, 2004.

[4] — Fei Cai and Maarten de Rijke (2016), “A Survey of Query Auto Completion in Information Retrieval”, Foundations and Trends® in Information Retrieval: Vol. 10, No. 4, pp. 273–363. http://dx.doi.org/10.1561/1500000055

[5] — Cai, Fei & de Rijke, Maarten. (2016). Learning from homologous queries and semantically related terms for query auto completion. Information Processing & Management. 52. 10.1016/j.ipm.2015.12.008.

[6] — [5] demonstrates the impact of considering homologous queries, especially in combination with personalization. I’m generalizing this concept to any set of query groupings or relations that we can utilize to share information across queries.