Autosuggest Retrieval Data Structures & Algorithms

From Brute Force to Tries, Automata, and beyond.

Giovanni Fernandez-Kincade
Related Works
15 min read · Mar 6, 2018


Picture of the Coulee Dam construction.

In the last post we talked about constructing an autosuggest corpus — the list of search phrases we can offer to users when they start typing in the search box. In this post, we’ll talk about building a search engine to serve up those suggestions.

Given a prefix, an autosuggest engine produces a list of completions.

You can think of any search engine as working in two phases:

  1. Retrieval — Given some input, find a suitable set of candidate results.
  2. Ranking — Given a set of candidates, order them.

We’ll be focusing on the former. In the case of autosuggest, the input is a prefix, like someone typing “b” in the text box, and the candidates are the suggestions from our corpus that begin with that prefix, like “building”, “buttress”, and “buffalo bill”.

It’s estimated that 10–15% of web search traffic is misspelled¹ and misspellings are even more common on mobile. Although it’s not necessarily a requirement, it would be nice if our autosuggest system had built-in spell-checking or typo-tolerance. So if I searched for “bitt” it should still retrieve “buttress”.

Preamble

This blog post is heavy on code, algorithms, data structures, and asymptotic analysis. If that kind of thing excites you, carry on!

Accompanying code can be found here.

Some of these techniques are simple enough that you may implement them yourself directly for production use. Others are more complex and chances are good you should be leveraging them through existing libraries like Elasticsearch or Solr. This post will give you a glimpse into what those systems are doing underneath the hood.

Brute Force

The dumbest thing we could possibly do is keep a list of suggestions and manually check each one to see if the prefix matches:
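(A minimal sketch, assuming the corpus is just a Python list of strings.)

    def brute_force_search(suggestions, prefix):
        # Scan every suggestion and keep the ones that start with the prefix.
        return [s for s in suggestions if s.startswith(prefix)]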

The space complexity of this approach is O(N) where N is the size of the autosuggest corpus, and the run-time complexity of search is also O(N). We’re not terribly concerned with indexing complexity because typically this is done offline. As long as it’s not a total disaster, it’s not super important.

The Levenshtein distance, or edit distance, of two strings x and y is the minimum number of single-character edits (insertions, deletions, substitutions) you would need to turn x into y. For instance, the edit distance between “bar” and “baz” is 1, a single character substitution. If we want to find suggestions with some degree of typo-tolerance we could frame the problem as “find suggestions that are within an edit distance of n from the user-provided prefix”.

The naive approach to achieving this is to just check the edit distance between the beginning of each suggestion and the prefix:
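(A sketch, using the standard dynamic-programming formulation of edit distance; the max_edits default is arbitrary.)

    def edit_distance(x, y):
        # Classic dynamic-programming Levenshtein distance, keeping
        # only the previous row of the table around.
        previous = list(range(len(y) + 1))
        for i, cx in enumerate(x, start=1):
            current = [i]
            for j, cy in enumerate(y, start=1):
                current.append(min(
                    previous[j] + 1,               # deletion
                    current[j - 1] + 1,            # insertion
                    previous[j - 1] + (cx != cy),  # substitution
                ))
            previous = current
        return previous[-1]

    def fuzzy_brute_force_search(suggestions, prefix, max_edits=1):
        # Compare the prefix against the beginning of every suggestion
        # and keep the ones that are close enough.
        return [
            s for s in suggestions
            if edit_distance(prefix, s[:len(prefix)]) <= max_edits
        ]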

The dynamic-programming approach to edit-distance calculation would be O(P²) for a prefix of length P, which leaves us with a total complexity of O(NP²). Ouch.

Binary Search

The next obvious thing to try is keeping the list sorted and performing a binary search to find where in the list the prefix belongs. Anything that starts with the prefix should be to the right of that position.

We can perform a binary search to locate where in the list the prefix belongs.

First we make sure to sort the data after indexing:
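(A sketch; the “index” is still just a list of strings.)

    def build_sorted_index(suggestions):
        # Sorting is the only indexing work we need to do.
        return sorted(suggestions)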

Then we binary search to find the insertion point, and check all the entries that come right afterwards. Conveniently, Python’s standard library has a bisect_left that does the heavy binary search lifting:
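(A sketch, searching the sorted list produced above.)

    import bisect

    def search(sorted_suggestions, prefix):
        # bisect_left gives us the insertion point; everything that
        # starts with the prefix sits immediately to its right.
        results = []
        i = bisect.bisect_left(sorted_suggestions, prefix)
        while i < len(sorted_suggestions) and sorted_suggestions[i].startswith(prefix):
            results.append(sorted_suggestions[i])
            i += 1
        return results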

Again our space complexity is O(N), but now our time complexity is O(log(N)).

At this point you may think this exercise is completely academic. As it turns out, Etsy’s autosuggest is actually powered by binary search (combined with edge n-gramming, which we’ll talk about in the next section) thanks to some pretty amazing work by Keyur Govande.

Adding typo-tolerance here would require scanning the index again, which would bring us back to a brute force approach. No bueno.

Edge N-Grams

An n-gram is a sequence of n characters that appears in a larger string. So all of the 2-grams (or bigrams) of “search” are [“se”, “ea”, “ar”, “rc”, “ch”]. Note: the term n-gram is sometimes used to denote sequences of other linguistic atoms like words, syllables, etc. In the context of this blog post (unlike the previous), we’re always talking about characters.

An edge n-gram is Elasticsearch parlance for an n-gram that starts at the beginning of the string, i.e. a prefix. So all the edge n-grams of “search” are [“s”, “se”, “sea”, “sear”, “searc”, “search”].

For each suggestion in our corpus, we’ll produce all of the edge n-grams, and add them to a dictionary. The value in the dictionary is the set of words that begin with that prefix:
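(A sketch; the helper names are my own.)

    from collections import defaultdict

    def edge_ngrams(s):
        # All the prefixes of s: "foo" -> ["f", "fo", "foo"].
        return [s[:i] for i in range(1, len(s) + 1)]

    def build_index(suggestions):
        index = defaultdict(set)
        for suggestion in suggestions:
            for prefix in edge_ngrams(suggestion):
                index[prefix].add(suggestion)
        return index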

This is pretty close to what we call an inverted index in the field of Information Retrieval (IR). In that context we would call the keys in the dictionary the “term vocabulary”, and the list of suggestions or documents attached to each key the “postings list”.

Instead of a dictionary, you can imagine shoving these in a database, cache, etc. Prior to using Etseek (i.e. a binary-search-powered KV store), Etsy used Redis, and before that we used the same sharded MySQL databases that we used to serve the rest of the site.

When you search for a prefix, all we have to do is look it up in our dictionary.
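Something like this, continuing the sketch above:

    def search(index, prefix):
        # A single dictionary lookup; unknown prefixes return nothing.
        return index.get(prefix, set())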

This technique trades space for time. For a string of length M, we’ll produce M prefixes — one ending on each character. So we end up with O(MN) entries in our dictionary, where M is the length of the longest suggestion.

For each entry, we have to keep the list of suggestions, which could be the entire corpus. So theoretically this is O(MN*N) or O(MN²) 😬

Realistically you won’t have every suggestion appearing for every prefix. But the length of the dictionary entries can be a problem for short prefixes. A single letter like “b” is likely to have a very long list of suggestions. This is expensive in terms of storage, but perhaps more importantly, if we have to do complex scoring with every suggestion for “b”, that will translate into more CPU and latency. Short prefixes are the first and most common interaction of your users with an autosuggest system. You don’t want that experience to be a bummer.

An easy solution to this problem is to only keep the top z suggestions, where z is the number you would ideally like to show the user. You’ll need a way to determine which suggestions are the best, which we’ll talk about in more depth in the next post. For now, assume we can assign a weight to each suggestion, and use that to sort and keep the top z. That will bring us back into O(MN) land, and ensure that we only have to rank precisely the number that the user is interested in seeing. In IR, this process of stopping as soon as you have z matches, rather than processing all of the potential matches, is called early termination, and we’ll see it come up again soon.
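Sketched out, assuming each suggestion arrives with a precomputed weight (and reusing the edge_ngrams helper from above):

    from collections import defaultdict

    def build_truncated_index(weighted_suggestions, z=10):
        # weighted_suggestions is a list of (suggestion, weight) pairs.
        index = defaultdict(list)
        for suggestion, weight in weighted_suggestions:
            for prefix in edge_ngrams(suggestion):
                index[prefix].append((weight, suggestion))
        # Keep only the top-z suggestions per prefix, best weights first.
        return {
            prefix: [s for _, s in sorted(entries, reverse=True)[:z]]
            for prefix, entries in index.items()
        }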

In return for all this space we get constant-time search. Neat.

Unfortunately, it’s difficult to add typo-tolerance to this scheme without a combinatorial explosion of dictionary entries or going back to scanning.

N-grams (aka Fuzzy Search)

Using a similar technique with normal n-grams, we can achieve a great deal of typo-tolerance. This is often how fuzzy search in IDEs works, and the same data structure powers Hound.

Remember that the n-grams of a string x are just a sliding window of all the n-sized substrings of x. So ngrams(“food”, 2) == [“fo”, “oo”, “od”]. We’re going to take these and index them just like we did with edge n-grams.

One hiccup with this style of n-gramming is the beginning and the end of the string. What happens if the user searches for “f”? We only know about “fo”! In order to allow matches at the beginning or the end of the string, people will often pad the string with a special character, e.g. ngrams(“food”, 2) == [“_f”, “fo”, “oo”, “od”, “d_”]. Since we’re using n-grams for autosuggest, and thus are searching for prefixes, we’ll only pad the beginning of the string (i.e. we’ll keep “_f” and ditch “d_”):
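(A sketch; the pad character is arbitrary.)

    def ngrams(s, n=2, pad_char="_"):
        # Pad the front so short prefixes like "f" can still match.
        padded = pad_char * (n - 1) + s
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]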

Let’s try it out:
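(Using the sketch above.)

    >>> ngrams("food", 2)
    ['_f', 'fo', 'oo', 'od']
    >>> ngrams("f", 2)
    ['_f']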

Now for each suggestion in our index, let’s generate the n-grams, and for each n-gram we’ll store the list of documents that have that n-gram:
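(A sketch, reusing the padded ngrams helper from above.)

    from collections import defaultdict

    def build_ngram_index(suggestions, n=2):
        index = defaultdict(set)
        for suggestion in suggestions:
            for gram in ngrams(suggestion, n):
                index[gram].add(suggestion)
        return index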

This is what our dictionary looks like after indexing a few words:
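(An illustrative example: indexing “foo”, “food”, and “bar” with bigrams.)

    {
        '_f': {'foo', 'food'},
        'fo': {'foo', 'food'},
        'oo': {'foo', 'food'},
        'od': {'food'},
        '_b': {'bar'},
        'ba': {'bar'},
        'ar': {'bar'},
    }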

The search procedure for an input prefix p is as follows:

  1. Produce the set of n-grams of p (e.g. “foe” turns into [“_f”, “fo”, “oe”]).
  2. For each n-gram, retrieve the list of matching suggestions (like “foo”).
  3. For each suggestion, keep track of the number of matching n-grams.
  4. Return any suggestion whose share of matching n-grams meets the match_percentage threshold. This parameter allows us to control how fuzzy the search is.

In code:
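(A sketch; the default match_percentage is arbitrary, and ngrams is the padded helper from above.)

    from collections import Counter

    def search(index, prefix, n=2, match_percentage=0.5):
        grams = ngrams(prefix, n)
        # Tally, for each candidate suggestion, how many of the
        # prefix's n-grams it contains.
        counts = Counter()
        for gram in grams:
            for suggestion in index.get(gram, ()):
                counts[suggestion] += 1
        # Keep candidates that match enough of the prefix's n-grams.
        threshold = match_percentage * len(grams)
        return [s for s, count in counts.items() if count >= threshold]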

Now let’s try it out:
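(Using the sketches above; sets don’t have a stable order, hence the sorted calls.)

    >>> index = build_ngram_index(["foo", "food", "bar"], n=2)
    >>> sorted(search(index, "foe", match_percentage=0.5))
    ['foo', 'food']
    >>> sorted(search(index, "bar", match_percentage=0.5))
    ['bar']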

This is similar to what’s happening underneath the hood in Elasticsearch when you use the NGram Token Filter along with the minimum_should_match parameter at search time (example).

This percentage-based matching scheme is not the only way of doing things. Often folks will use the n-gram index to identify a set of candidate matches, and then perform more expensive filtering on these candidates.

Details

There are a number of details to consider and knobs to tune with this strategy.

First, obviously, there is the question of how large your n-grams should be. It’s a classic memory/storage vs CPU trade-off. Larger n-grams will take up more space, but in turn you will have to evaluate fewer candidates at search-time. Smaller n-grams will take up less space, but require evaluating more candidates. N-gram size will also affect fuzziness — larger n-grams mean more precision and smaller n-grams mean more fuzziness. In practice, trigrams are a common sweet-spot.

The padding at the beginning of our n-grams allows us to match short prefixes and gives some extra match weight to the beginning of a suggestion. But a match doesn’t have to occur at the beginning of the suggestion. You could, for instance, search for “fo” and find “tinfoil”, which is a little strange for autosuggest.

Ideally we should incorporate some notion of position: where in the query string or the suggestion string a particular n-gram appears. You could incorporate position information into the postings list, you could partition your n-gram index by position, or you could simply do some additional search-time processing. For instance, you could go back to checking the edit distance between the prefix and the start of the suggestion:
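(A sketch layered on top of the n-gram search above, reusing the edit_distance helper from the brute force section; max_edits is arbitrary.)

    def search_with_distance_check(index, prefix, n=2,
                                   match_percentage=0.5, max_edits=1):
        candidates = search(index, prefix, n, match_percentage)
        # Only keep candidates whose beginning is close to the prefix.
        return [
            s for s in candidates
            if edit_distance(prefix, s[:len(prefix)]) <= max_edits
        ]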

Or you could do something as simple as enforcing that the first character always matches. This is a blunt approach so you’ll miss out on misspellings in the first position, but it’s cheap and offers a reasonable blend of prefix-searching and typo-tolerance.

Complexity

Setting aside some of these details, let’s think about complexity for a second.

How many n-grams could there possibly be? If we have an alphabet (or character set) of size A, for every position in an n-gram there are A possible characters. For bigrams, we have A choices in the first position, and A choices in the second, which means there are A² combinations. In general, for n-grams of size Z, we will end up with A^Z possible n-grams.

So we have A^Z possible entries in our n-gram index. In the worst-case, each of the N suggestions are attached to each n-gram. So that’s O(NA^Z).

There will be a large disparity between worst case and average case complexity because we’re not indexing random strings — we’re indexing natural language which has a non-random distribution of characters and thus a non-random distribution of n-grams. In fact, those patterns are so far from random that examining character/n-gram distributions is a good language detection strategy². This worst case is definitely unrealistic but it gives us some important directional information:

  1. N-Gram indexes can get big
  2. Individual N-Gram index entries can be large, i.e. the short-prefixes problem

When we talked about edge n-grams we dealt with #2 by only keeping the top suggestions for each dictionary entry. In this case, that strategy isn’t viable because if you search for a prefix p with a match_percentage m, and p turns into n-grams a and b, we need to keep track of each suggestion attached to either a OR b in order to determine whether or not they satisfy m. It’s impossible to know at index-time, for a given n-gram, which suggestions to keep.

Which brings us to run-time. For a prefix of length P, we will generate P n-grams (thanks to the front padding, there’s one ending at each character of the prefix). For each n-gram, we need to accumulate potentially N matches. Then we need to process each of the N potential matches and evaluate the match_percentage. So that’s O(PN + N), which is just O(PN).

So we’re definitely paying a price for typo-tolerance on both the storage and the computation side of the equation, but there are many production systems that have successfully scaled these techniques.

Tries

You can’t talk about autocomplete without talking about Tries. Etsy Search used to ask interviewees to implement an autocomplete system as a take-home assignment, and the most common implementation was using a Trie.

The idea is pretty simple. Rather than storing our suggestions in a sorted array, or a dictionary of prefixes/n-grams, let’s put them in a tree. The root of the tree is empty and represents the start of a suggestion. The leaves represent complete suggestions that exist in our corpus. The edges between nodes are labelled with characters. The path from root to leaf is the sequence of characters that make up that suggestion. The k-th level in the tree corresponds to the k-th character across all the suggestions.

A Trie containing the words “foo”, “bar”, and “baz”.

Let’s try a recursive version for funzies. Each node has a dictionary of children, keyed on the subsequent character. When we insert a string, we pop the first character off, find or initialize a node for that child, and then ask the child node to insert the remaining characters:
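(A sketch. I’m also marking the nodes that complete a suggestion, so that something like “cat” isn’t lost when “category” is in the corpus too.)

    class TrieNode:
        def __init__(self):
            # children maps a single character to the next node.
            self.children = {}
            # True if the path from the root to this node spells out
            # a complete suggestion.
            self.is_suggestion = False

        def insert(self, s):
            if not s:
                self.is_suggestion = True
                return
            head, rest = s[0], s[1:]
            child = self.children.setdefault(head, TrieNode())
            child.insert(rest)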

Indexing just amounts to inserting every suggestion on the root node:
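(Continuing the sketch.)

    def build_trie(suggestions):
        root = TrieNode()
        for suggestion in suggestions:
            root.insert(suggestion)
        return root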

To perform a search, we find the node that corresponds to a prefix, and then find all the leaves we can reach from that node:
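(A sketch; find_node and collect_suffixes are defined below.)

    def search(root, prefix):
        node = find_node(root, prefix)
        if node is None:
            return []
        # Every suggestion below this node starts with the prefix.
        return [prefix + suffix for suffix in collect_suffixes(node)]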

To find the node for a prefix, we recursively pop off the characters from the prefix and retrieve the corresponding child node:
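(A sketch.)

    def find_node(node, prefix):
        if not prefix:
            return node
        head, rest = prefix[0], prefix[1:]
        child = node.children.get(head)
        if child is None:
            # Nothing in the corpus starts with this prefix.
            return None
        return find_node(child, rest)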

To find the leaves, we traverse depth-first from that node to each leaf, accumulating characters along the way:
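(A sketch; here a “leaf” is any node that completes a suggestion.)

    def collect_suffixes(node, acc=""):
        suffixes = []
        if node.is_suggestion:
            suffixes.append(acc)
        for char, child in node.children.items():
            suffixes.extend(collect_suffixes(child, acc + char))
        return suffixes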

Lucene’s TSTLookup is an implementation of a Trie variant (a ternary search tree), which is exposed for autosuggest in both Elasticsearch and Solr.

Complexity

The worst-case for a trie is that none of the words overlap, so you end up with one node for each character in every suggestion:

In the worst-case for a Trie, your terms share no prefixes.

If M is the length of the longest suggestion, your space complexity is O(MN). Tries exploit the fact that words often start with the same prefix to save on storage, e.g. if you already have “category” in your Trie, adding “cat” costs you nothing. So in practice they can be much more compact than other representations, and large indexes are easily stored in memory.

The run-time complexity for finding the node corresponding to a prefix of length P is O(P), which is fantastic. But the expensive part is finding all the leaves. This can be O(MN) — this is effectively the complexity of finding all the leaves from the root, and hints at the recurring problem of dealing with short-prefixes that can have a disproportionate number of suggestions.

To improve on the performance and storage overhead of Tries, and find some way to mitigate the pain of short-prefixes, we need to talk about DFAs.

Deterministic Finite Automata (DFAs) and Finite State Transducers (FSTs)

A simple DFA that accepts the strings “foo” and “bar”. Look familiar?

It’s not a perfect metaphor, but I like to think of a Deterministic Finite Automaton (DFA) as a Set<String>. It takes a sequence of symbols (or characters) as input, and tells you whether or not the sequence belongs to the language (or set). Underneath the covers, a DFA is a state machine. There is a start state, one or more end (accepting) states, states in a graph between them, and labelled transitions between states. A DFA has a single current state (which is why it’s called deterministic). As it processes the input, it updates its state according to the transition graph. If at the end of the input we’ve reached an end state, the input is valid, i.e. it belongs to the language/set.
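To make that concrete, here’s a toy version of the DFA above, written out as a transition table (a sketch; the state names are made up, and a real implementation would be far more compact):

    # Each state maps an input character to the next state.
    TRANSITIONS = {
        "start": {"f": "f", "b": "b"},
        "f":     {"o": "fo"},
        "fo":    {"o": "done"},
        "b":     {"a": "ba"},
        "ba":    {"r": "done"},
        "done":  {},
    }

    def accepts(s, state="start"):
        for char in s:
            state = TRANSITIONS[state].get(char)
            if state is None:
                return False
        return state == "done"

    # accepts("foo") and accepts("bar") are True; accepts("baz") is False.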

This should look familiar. A Trie is pretty similar to a DFA, and in fact you can represent a Trie as a DFA. This is useful because DFAs are a well-studied phenomenon: there are tons of known algorithms for doing interesting things with them. For instance, there are algorithms for compressing the state-graph of a DFA to use the fewest states/edges necessary. This is called DFA minimization:

A minimal DFA uses the fewest possible states. Image taken from “Finite State Automata in Lucene”

We know that Tries take advantage of shared prefixes to save space. But words also often have shared suffixes, e.g. “zoology” and “biology” share “ology”. We’re wasting a lot of space duplicating “ology” everywhere. By treating our Trie as a DFA and minimizing it, we can regain that space:

Minimizing a Trie/DFA allows us to save space for shared suffixes.

If a DFA is a Set<String>, a Finite State Transducer (FST) is a Map<String, Number>. It’s a DFA that allows us to associate a numeric (or really just commutative) value with a valid input sequence. For autosuggest that’s useful because it allows us to associate a weight or score with a suggestion:

An FST allows us to associate a weight to a path. In this example, “bar” has a weight of 5 and “baz” has a weight of 10.

There are many ways we could represent the path weights. Instead of placing them at the end, we could place them at the beginning:

Shifting path weights to the beginning allows us to perform early termination.

This allows us to do early termination again by doing a depth-first traversal in weighted order. In the example above, if the user searches for “b” we can first check the 10-weighted branch. If we only need a single suggestion, we’ll find “baz” and then we’re done. If you have a high-cardinality or real-numbered set of weights, you can just make weight buckets at the beginning of the FST.

Lucene’s FSTCompletionLookup is based on FSTs and is exposed in Solr and Elasticsearch to power autosuggest. Dawid Weiss has a great presentation on the subject of DFAs/FSTs in Lucene, which inspired some of my illustrations.

It turns out that given some string s and a maximum edit distance d, you can construct a Levenshtein Automaton that will only accept strings that are within d edits of s. We can use this to produce typo-tolerant suggestions. As we depth-first traverse our FST, we can feed our steps into the Levenshtein Automaton. If it accepts the input character, we proceed. If it doesn’t, we can ignore that whole branch and backtrack. There’s an amazing writeup of the process here. Automata, talking to each other!

Portrait of Lucene developers trying to decipher automata papers.

Tangent: If you ever feel like you’re a dumb imposter, go read about how all of these ridiculously smart, veteran Lucene committers thrashed around desperately trying to translate a DFA paper into working code.

I’ll leave the complexity analysis as an exercise for the reader 😜 but benchmarks suggest FSTs yield better latency/throughput figures for short prefixes while using an order of magnitude less memory.

Coming Up

I hope that was a fun tour through some autosuggest data structures and algorithms. Barring the obviously terrible brute force option, I’ve seen every approach utilized in production at scale. I would recommend the FST facilities offered by Solr and Elasticsearch, but don’t be afraid to start simple if you’d like to avoid adding a new system or just don’t have experience operating Solr/ES.

In the next post, we’ll talk about ranking.

You can find the reference implementations of these algorithms here.

Thanks to Steve Mardenfeld, Nishan Subedi, Stan Rozenraukh, and Keyur Govande for reading and giving feedback!

️❤️ Please feel free to reach out with any comments, questions, corrections, etc. ️❤️

[1] S. Cucerzan and E. Brill. Spelling correction as an iterative process that exploits the collective knowledge of web users. In EMNLP, 2004.

[2] T. Dunning. Statistical identification of language. Technical Report MCCS 94-273, New Mexico State University, 1994.

