Writing a space-ignoring autocompleter with ElasticSearch
If you followed the ElasticSearch Guide you can build a decent autocompleter. If you had an index of companies near you, typing “Pizza” would probably land you “Pizza Hut”, “Pizza My Heart”, “Pizza Express”, etc.
If you try to narrow those choices by typing in “Pizzah” and forget that oh-so-critical space, you’ll come up empty.
Let’s find out why by looking at the autocomplete analyzer:
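As a minimal sketch reconstructed from the description in this post, the index settings for such an analyzer might look like the Python dict below, which serializes to the JSON body you would send when creating the index. The `min_gram` and `max_gram` values are assumptions; the post only implies that one-character prefixes are produced and that tokens can be at least eight characters long.

```python
import json

# Sketch of index settings for the autocomplete analyzer described in the text.
# The gram sizes are assumed values, not taken from the original post.
AUTOCOMPLETE_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,   # assumption: "p" appears as a token
                    "max_gram": 20,  # assumption: any value >= 8 fits the examples
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "autocomplete_filter"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    # Print the JSON body that would go into the index-creation request.
    print(json.dumps(AUTOCOMPLETE_SETTINGS, indent=2))
```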
When this indexes “Pizza Hut” it first uses the standard tokenizer to turn this into two tokens: Pizza and Hut. Each token is then run through a few filters. The result of the lowercase filter is pretty obvious: pizza and hut. These tokens are then pushed into our custom autocomplete_filter, which produces edge n-grams: p, pi, piz, pizz, pizza, h, hu and hut. Note that pizzah is missing, which is why “Pizzah” matches nothing.
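To make that concrete, here is a rough pure-Python imitation of the pipeline. It is not Elasticsearch code: the `\w+` regex is only a stand-in for the standard tokenizer, and the function names and gram sizes are made up for illustration.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token, from min_gram up to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def standard_autocomplete(text):
    """Approximate: standard tokenizer -> lowercase -> edge n-grams."""
    # \w+ is only a rough stand-in for the standard tokenizer.
    words = re.findall(r"\w+", text)
    tokens = []
    for word in words:
        tokens.extend(edge_ngrams(word.lower()))
    return tokens


if __name__ == "__main__":
    tokens = standard_autocomplete("Pizza Hut")
    print(tokens)
    # No prefix ever crosses the word boundary, so this prints False.
    print("pizzah" in tokens)
```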
If we skip that initial standard tokenization and treat the whole name as one token, ignoring the space, we’d have a final tokenization of p, pi, piz, pizz, pizza, pizzah, pizzahu and pizzahut. This isn’t bad, and the keyword tokenizer gets us close: it emits the entire input as a single token, although it keeps the space, so a filter would still have to remove it. I don’t like this approach because we get into trouble when people don’t search by the first word. Imagine a place called “Bob’s Pizza Mart” that people just happen to call “Pizza Mart”. Typing “Pizza” wouldn’t even pull up “Bob’s Pizza Mart”.
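The first-word problem is easy to see in a small Python imitation of the keyword-tokenizer variant. This is an illustrative sketch with made-up function names, not Elasticsearch code.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token, from min_gram up to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def keyword_autocomplete(text):
    """Approximate: keyword tokenizer -> lowercase -> edge n-grams."""
    # The keyword tokenizer emits the whole input, spaces included, as one token.
    return edge_ngrams(text.lower())


if __name__ == "__main__":
    tokens = keyword_autocomplete("Bob's Pizza Mart")
    print(tokens[:5])
    # Every token starts with "b", so a search for "pizza" finds nothing.
    print("pizza" in tokens)
```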
The trick is to get the best of both worlds. The keyword tokenizer does a great job of doing nothing. We just need to add a filter that splits the input into words and also emits all the words concatenated together. We can do that with the Word Delimiter Token Filter by setting the catenate_all attribute:
We can add this filter to our autocomplete analyzer:
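A sketch of such a filter definition, written as a Python dict that serializes to the JSON that goes under `analysis.filter`. The name `word_joiner` is borrowed from the walkthrough later in this post; every other option is left at its default, which is an assumption.

```python
import json

# Sketch of a word_delimiter filter that also emits the concatenation of all
# word parts. Only catenate_all is set; the remaining options are assumed to
# stay at their defaults.
WORD_JOINER_FILTER = {
    "word_joiner": {
        "type": "word_delimiter",
        "catenate_all": True,
    }
}

if __name__ == "__main__":
    print(json.dumps(WORD_JOINER_FILTER, indent=2))
```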
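Put together, the settings might look roughly like this sketch. The filter order follows the step-by-step walkthrough in this post; the gram sizes are assumptions.

```python
import json

# Sketch of the combined settings: keyword tokenizer, then lowercase,
# word_joiner and autocomplete_filter, in that order.
SPACE_IGNORING_SETTINGS = {
    "settings": {
        "analysis": {
            "filter": {
                "autocomplete_filter": {
                    "type": "edge_ngram",
                    "min_gram": 1,   # assumption
                    "max_gram": 20,  # assumption
                },
                "word_joiner": {
                    "type": "word_delimiter",
                    "catenate_all": True,
                },
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": ["lowercase", "word_joiner", "autocomplete_filter"],
                }
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(SPACE_IGNORING_SETTINGS, indent=2))
```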
So now we can take our example of “Pizza Hut”:
- keyword tokenizer (does nothing): Pizza Hut
- lowercase filter: pizza hut
- word_joiner filter: pizza, hut, pizzahut
- autocomplete_filter: p, pi, piz, pizz, pizza, h, hu, hut, p, pi, piz, pizz, pizza, pizzah, pizzahu and pizzahut.
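The steps above can be imitated in a few lines of Python. This is a simplified sketch: `word_joiner` here only splits on non-alphanumeric characters and appends the concatenation, which is far less than the real Word Delimiter filter does, and the token order Elasticsearch emits may differ.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token, from min_gram up to max_gram characters."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def word_joiner(token):
    """Simplified stand-in for word_delimiter with catenate_all set."""
    parts = re.findall(r"[a-z0-9]+", token)
    if len(parts) > 1:
        # Emit each word part, then all parts joined together.
        return parts + ["".join(parts)]
    return parts


def autocomplete(text):
    """Approximate: keyword tokenizer -> lowercase -> word_joiner -> edge n-grams."""
    token = text.lower()  # the keyword tokenizer leaves the input as one token
    tokens = []
    for part in word_joiner(token):
        tokens.extend(edge_ngrams(part))
    return tokens


if __name__ == "__main__":
    tokens = autocomplete("Pizza Hut")
    print(tokens)
    print("pizzah" in tokens)  # True: the space no longer matters
    print("hu" in tokens)      # True: later words are still searchable
```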
If you add this analyzer to your index you can use curl and jq like so:
$ curl -s -XGET 'localhost:9200/companies-test/_analyze?analyzer=autocomplete&pretty=1' -d "Pizza Hut" | jq -r '.tokens[].token'
This will let you see just how things get tokenized.
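If you would rather do the same from Python, a standard-library sketch could look like the following. It assumes the same host, index and analyzer names as the curl command, and it sends the analyzer and text as a JSON body, which is an assumption about your Elasticsearch version; the curl example passes the analyzer in the query string instead.

```python
import json
import urllib.request


def build_analyze_request(text, index="companies-test", analyzer="autocomplete",
                          host="http://localhost:9200"):
    """Build (but do not send) a request to the index's _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{host}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_strings(response):
    """Pull the token text out of a parsed _analyze response."""
    return [entry["token"] for entry in response["tokens"]]


def analyze(text, **kwargs):
    """Send the request and return the list of token strings."""
    with urllib.request.urlopen(build_analyze_request(text, **kwargs)) as resp:
        return token_strings(json.load(resp))


if __name__ == "__main__":
    # Requires a running Elasticsearch with the index and analyzer in place.
    for token in analyze("Pizza Hut"):
        print(token)
```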
In informal tests, this solution also seems to handle the opposite problem: searching for “Bat Man” in order to find “Batman.”
Thanks to Rob Hudson, Will Kahn-Greene and Erik Rose at Mozilla who patiently told me I was doing it wrong and helped me come up with this solution.