Bootstrapping Autosuggest

Autosuggest is critical to the modern search experience. Users expect it, depend on it, and it’s particularly important on mobile devices where typing is cumbersome and error-prone. It’s also a ton of fun to work on because it’s a complex product that blends a bunch of different disciplines: interaction design, data structures, front-end engineering, information retrieval, natural language processing, AI, etc.

This is the first post in a series describing the work that I did to add autosuggest to one of our products. I’ll talk about the techniques we employed, but along the way I will also spend some time speaking more broadly about the challenges involved and the spectrum of approaches people use to solve them.

Goals

Before you write a single line of code I think it’s important to establish what your goals are for the product. Who is your target audience? What are they trying to do?

At Related Works we’re currently working on a platform to host digital collections for libraries, museums, archives, and other cultural institutions. You will find all sorts of treasures in these collections: historical maps, photographs of the famous and even the mundane, digitized manuscripts and documents, audio and video from important events, etc.

Items you may find in Digital Collections

For this iteration of autosuggest our target audience is the public: a high-school student doing a history project about the Croton Dam, a podcast producer working on an episode about the LGBTQ community in NYC in the 60s, a researcher writing a book about the Syrian colony in lower Manhattan at the turn of the century, etc. Searches are usually oriented around a particular theme like “Worlds Fair”, “Gay Pride 60s”, etc., but these collections are filled with unique and unimaginable things, so users aren’t quite sure what they might find.

You could describe these as exploratory searches: the user doesn’t have a particular objective and is using search to browse and get a sense of what’s available. This is in contrast to known-item searches, where the user knows exactly which document or object they are looking for. Our goal, then, is to make it easy for users to dive into fairly broad searches that will offer them a ton of material to look at. If you were designing autosuggest for a power-user tool like, say, Asana, your goal might be to direct users to individual items/objects/tasks, or even to suggest some advanced query syntax. I can imagine at some point we will offer a more specialized autosuggest on the internal version of our platform that institutions use to manage their content.

Sentry is a good example of autosuggest as a power-user tool

The other thing to bear in mind is that latency and UX are especially important for our target audience. We may be only a single stop on a very long research journey for these folks. If our experience is slow or janky, they will probably move on. Users of internal/power-tools are usually more patient as long as you can point them in precisely the direction they need/want to go in.

Measurement

Basics

You can’t talk about goals without talking about metrics. How do we know if we’re actually getting anywhere? Not all goals are easily measurable, and even if they are, measurement is often a subtle and troublesome art. But that shouldn’t dissuade us from trying to understand how we’re doing.

Academic research tends to benchmark autosuggest engines using Mean Reciprocal Rank (MRR). MRR is a score between 0 and 1 that measures how high in the results the ultimately selected query was, on average. The higher the MRR, the better. Let’s look at an example:

Calculating Reciprocal Rank

Let’s say the user typed the letter “c”, and selected “cats”, the third result. The reciprocal rank for the query would be ⅓. If they had selected “caterpillar”, the first result, instead, the reciprocal rank would have been 1. The higher the selected query sits in the list, the higher the reciprocal rank. To understand how you are performing across many queries, you take an average of the reciprocal rank across that population.
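Here is a minimal sketch of that calculation in Python. The function names and the shape of the logged data (a list of displayed suggestions plus the query the user picked) are my own assumptions for illustration, as is the middle suggestion “cars”. Scoring a selection that never appeared in the list as 0 is a common convention, not something dictated by the metric itself.

```python
from typing import Sequence, Tuple


def reciprocal_rank(suggestions: Sequence[str], selected: str) -> float:
    """Return 1 / position of the selected suggestion (positions start at 1).

    Assumption: a selection that was never displayed scores 0.0.
    """
    for position, suggestion in enumerate(suggestions, start=1):
        if suggestion == selected:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    sessions: Sequence[Tuple[Sequence[str], str]]
) -> float:
    """Average the reciprocal rank over (suggestions, selected) pairs."""
    if not sessions:
        return 0.0
    total = sum(reciprocal_rank(shown, picked) for shown, picked in sessions)
    return total / len(sessions)


# Hypothetical suggestions shown for the prefix "c".
shown_for_c = ["caterpillar", "cars", "cats"]

print(reciprocal_rank(shown_for_c, "cats"))         # third result
print(reciprocal_rank(shown_for_c, "caterpillar"))  # first result
print(mean_reciprocal_rank([
    (shown_for_c, "cats"),
    (shown_for_c, "caterpillar"),
]))
```

Averaging the two selections above gives (⅓ + 1) / 2 = ⅔.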

I think MRR is useful and I would recommend measuring it, but it’s not the first metric I would look at because it’s a bit opaque. It’s useful for comparing different autosuggest strategies, which makes it useful for benchmarking and offline evaluation, but it doesn’t really give you a visceral sense for what’s happening on your website/app. And if you are adding autosuggest to your site for the first time it’s not really an option.

The first thing I would look at is what I call Acceptance Rate: the percentage of searches that come from autosuggest. If 60% of searches are coming from autosuggest, your autosuggest is probably pretty useful.
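As a rough sketch, assuming your search logs record whether each search was submitted by picking a suggestion, Acceptance Rate is just a ratio. The SearchEvent record and its from_autosuggest flag are hypothetical names, not part of any particular logging system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SearchEvent:
    query: str
    # Hypothetical flag: True when the search was issued by selecting a suggestion.
    from_autosuggest: bool


def acceptance_rate(events: Sequence[SearchEvent]) -> float:
    """Fraction of all searches that came from an autosuggest selection."""
    if not events:
        return 0.0
    accepted = sum(1 for event in events if event.from_autosuggest)
    return accepted / len(events)


# Made-up log: 3 of 5 searches came from autosuggest.
log = [
    SearchEvent("croton dam", True),
    SearchEvent("worlds fair", True),
    SearchEvent("gay pride 60s", False),
    SearchEvent("historical maps", True),
    SearchEvent("manuscripts", False),
]
print(f"{acceptance_rate(log):.0%}")
```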

Another useful measure is Average Keystrokes (or Average Prefix Length). Let’s say a user typed “so” and then selected the query “solange”. The prefix length would be 2. Averaging this figure across many searches lets us understand how many characters users typically have to type before finding something useful. The fewer characters, the better.
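A small sketch of that average, assuming each accepted suggestion is logged as a (typed prefix, selected query) pair. The function name and the sample data beyond the “so” / “solange” example are made up for illustration.

```python
from typing import Sequence, Tuple


def average_prefix_length(selections: Sequence[Tuple[str, str]]) -> float:
    """Mean number of characters typed before a suggestion was selected.

    Each item is a (typed_prefix, selected_query) pair.
    """
    if not selections:
        return 0.0
    return sum(len(prefix) for prefix, _ in selections) / len(selections)


# Prefix lengths here are 2, 1 and 3, so the average is 2.0.
selections = [
    ("so", "solange"),
    ("c", "cats"),
    ("wor", "worlds fair"),
]
print(average_prefix_length(selections))
```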

Beyond the Search Threshold

All of the metrics we’ve talked about so far are focused on prediction: how quickly can we figure out the query you want to type, and how high in the rankings can we get it? But what if the query you want to issue is actually a dead-end?

As designers, product owners, cultural stewards, etc., our ultimate goal isn’t for users to select precisely the query they had in mind, it’s to help them find something useful. So I like to think about autosuggest as part prediction, and part recommendation. It’s also why I prefer the term autosuggest rather than autocomplete. It’s a shortcut, but it’s also an opportunity to steer users in a good direction and even to help them learn what you have to offer.

To that end, I think it’s useful to think about the outcomes after a user makes an autosuggest selection. If your endgame is easily measurable (like, say, purchases on an e-commerce website), you can try measuring how often people who accept autosuggest selections make it to that success stage. Measuring any engagement in search (e.g. clicks or CTR) is not a terrible option, but clicks can be notoriously misleading (a topic for another blog post). If measuring engagement/success is hard, you can try measuring the inverse: how often do users who use autosuggest end up on a page with few or no results? How often do they just abandon the site without doing anything else? In our case success isn’t as obvious, so we’ll be leaning heavily in this direction.
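To make the “inverse” metrics concrete, here is a sketch under some assumptions: that for every accepted suggestion you can log how many results the resulting page showed, and whether the user did anything afterwards. The SuggestOutcome record, its field names, and the min_results threshold for what counts as “few” results are all hypothetical, and you would need to pick a threshold that makes sense for your own collection.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SuggestOutcome:
    # Hypothetical fields logged after a user accepts a suggestion.
    result_count: int          # results shown on the page they landed on
    had_followup_action: bool  # any click, new search, etc. afterwards


def dead_end_rate(outcomes: Sequence[SuggestOutcome], min_results: int = 1) -> float:
    """Fraction of accepted suggestions that led to fewer than min_results results."""
    if not outcomes:
        return 0.0
    dead_ends = sum(1 for o in outcomes if o.result_count < min_results)
    return dead_ends / len(outcomes)


def abandonment_rate(outcomes: Sequence[SuggestOutcome]) -> float:
    """Fraction of accepted suggestions followed by no further activity."""
    if not outcomes:
        return 0.0
    abandoned = sum(1 for o in outcomes if not o.had_followup_action)
    return abandoned / len(outcomes)


# Made-up outcomes for four accepted suggestions.
outcomes = [
    SuggestOutcome(result_count=0, had_followup_action=False),
    SuggestOutcome(result_count=2, had_followup_action=True),
    SuggestOutcome(result_count=40, had_followup_action=True),
    SuggestOutcome(result_count=15, had_followup_action=False),
]
print(dead_end_rate(outcomes))                 # pages with no results at all
print(dead_end_rate(outcomes, min_results=3))  # pages with fewer than 3 results
print(abandonment_rate(outcomes))
```

Lower is better for both numbers, and tracking them over time is probably more informative than any single reading.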

Asking Humans

We are big believers in UX research at Related Works. For autosuggest I think these studies would be useful for homing in on the interactions, which can be surprisingly tricky, especially if you are working on the super-user, precise, advanced-operator style of autosuggest.

I wouldn’t suggest UX research for understanding the effectiveness of one autosuggest algorithm over another (or at least not in isolation). Most search experiences have a long tail of unique or infrequently issued queries, so it’s remarkably hard to get a sense of your performance from just a handful of cases.

Coming Up

In the next blog post we’ll break down all the technical components in an autosuggest product and start talking about building a suggestion corpus.

Feel free to reach out with suggestions, questions, corrections, insults, accusations, etc. We are, after all, tiny meaningless specks in a vast and incomprehensible cosmos.

If you’re interested in reading more I would recommend Daniel Tunkelang’s work on the subject, and this survey as good jumping-off points.

Like what you read? Give Giovanni Fernandez-Kincade a round of applause.
