Building an Autosuggest Corpus, Part 1

In the last post we covered the goals of our autosuggest product and different measurement strategies. In this post we’ll walk through what an autosuggest system looks like and start talking about how to build an autosuggest corpus.

Anatomy of Autosuggest

Here are the components of our autosuggest system:

  1. Autosuggest Corpus — A static dataset of suggestions that we can offer users. This is computed offline at a scheduled interval and loaded into the engine.
  2. Engine — A REST API responsible for Retrieval (getting the list of candidate suggestions for a given user input) and Ranking (ordering the retrieved suggestions).
  3. Client — The client application that runs in your browser, asks the engine for results, and displays them to the user.
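The engine’s two responsibilities, Retrieval and Ranking, can be sketched in a few lines. This is an illustrative toy, not our actual implementation: the class and method names are hypothetical, retrieval here is a naive prefix scan (a real engine would use a trie or an index), and ranking just prefers shorter suggestions as a stand-in for a real scoring function.

```python
class Engine:
    """Toy autosuggest engine: retrieval + ranking over a static corpus."""

    def __init__(self, corpus):
        # corpus: a list of suggestion strings, computed offline and loaded in
        self.corpus = list(corpus)

    def retrieve(self, prefix):
        # Retrieval: collect candidate suggestions matching the user's input.
        prefix = prefix.lower()
        return [s for s in self.corpus if s.lower().startswith(prefix)]

    def rank(self, candidates):
        # Ranking: order the retrieved candidates. Shortest-first is a
        # placeholder; a real system would score by popularity, CTR, etc.
        return sorted(candidates, key=len)

    def suggest(self, prefix, limit=10):
        return self.rank(self.retrieve(prefix))[:limit]


engine = Engine(["maps", "map making", "medieval maps", "photographs"])
print(engine.suggest("ma"))  # → ['maps', 'map making']
```

The client would call something like `suggest()` on every keystroke; keeping the corpus static and precomputed is what lets retrieval stay fast.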

We’ll be focusing on the corpus for the next two posts.

Finding Suggestions

So where do we look to find the suggestions we can offer to users?

Query Logs

The first option is to harvest your query logs, i.e. your record of what queries users have issued in the past. In my experience and in most benchmarks, query logs are the most effective source of suggestions.

But this path is not without its difficulties. You will need to have query logs in the first place, and a reasonable means of programmatically accessing them. Further, any time you are leveraging user data, you should strongly consider the privacy and security implications of doing so. Is it okay if my search history is used to provide suggestions for someone else? Do users expect that information to be private? Even if they don’t, do they sometimes search for potentially sensitive information? Can an adversary maliciously pollute your logs in order to game your autosuggest? What about misspellings? Potentially offensive material? You may find yourself spending a ton of time trying to scrub this dataset of PII, or just deciding it’s not worth the trouble and risk.
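If you do go this route, the first pass usually looks like counting normalized queries and keeping only the frequent ones. A minimum-frequency threshold doubles as a crude privacy guard: queries issued only once or twice are much more likely to contain something personal. This is a sketch under those assumptions, with a hypothetical function name; it is not a substitute for real PII scrubbing.

```python
from collections import Counter


def harvest_query_logs(queries, min_count=5):
    """Turn raw query-log entries into suggestion candidates.

    Normalizes (trim + lowercase), counts occurrences, and drops
    anything rarer than min_count. The threshold filters out noise,
    one-off misspellings, and many rare-but-sensitive queries.
    """
    counts = Counter(q.strip().lower() for q in queries if q.strip())
    return {q: n for q, n in counts.items() if n >= min_count}


logs = ["maps"] * 6 + ["Maps"] + ["my home address 123 elm st"]
print(harvest_query_logs(logs))  # → {'maps': 7}
```

Thresholding by frequency is necessary but not sufficient: you would still want blocklists for offensive terms and a review of whatever survives before it ships to users.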

Documents

The second option is to harvest your documents, i.e. the content that users are ultimately searching for. On an e-commerce site this would be the items for sale. On our digital collections product, it’s the maps, photographs, and other cultural artifacts that institutions are making available for consumption.

The reason this source typically underperforms compared to query logs is that your users and your documents are usually speaking different languages. My favorite example from my time at Etsy was a shop that was selling beautiful handmade cloches (a type of hat). This tremendously talented seller was so deep in the world of millinery (hat making) that it didn’t occur to them to tag their items with the term “hat”. We did a poor job of handling this discrepancy at the time, so it was hard for them to be found. The people or processes that produce content are usually subject-matter experts compared to the average consumer. If you build your autosuggest corpus from your documents, you may find that it’s filled with words your users aren’t likely to type, and missing some important terms that they are.

Ceci n’est pas une hat

Now the astute reader will probably ask: what do you mean it underperforms? Underperforms on what metric? This brings us back to the prediction vs. performance dichotomy. You can probably expect that a document-based autosuggest will have a lower Acceptance Rate, but users who accept suggestions may ultimately be more successful (i.e. higher CTR, fewer searches with empty results, etc.). After all, if suggestions are coming from the documents themselves, the search results are likely to have plenty of stuff to look at. Although I’ve historically been biased towards query logs as a source, I’ve spoken to organizations that have run successful experiments showing that document-based suggestions are better for their bottom line.

It’s a choice.

There’s no right answer here. What makes sense for you will depend on your content, the goals of your system, the types of users you have and their expectations, your privacy policy, etc. You’ll have to weigh the trade-offs of each approach, and if both are viable, you may want to compare their outputs or even their performance using live A/B experiments.

In our case, we used our documents. Patron privacy is a critical requirement for cultural institutions, one we deeply respect. Often these institutions don’t have query logs, or getting access to them is tremendously painful. So query logs were, by default, not an option.

Coming Up

The nice thing about query logs is they are already split out into bite-sized queries users might type. One of the challenges with using your documents is turning a blob of data into a small set of discrete suggestions. In the next post, we’ll talk about some simple techniques to extract suggestions from your data.