Writing a space-ignoring autocompleter with ElasticSearch

Dave Dash
Mar 12, 2015 · 2 min read

If you’ve followed the ElasticSearch Guide, you can build a decent autocompleter. If you had an index of companies near you, typing “Pizza” would probably land you “Pizza Hut”, “Pizza My Heart”, “Pizza Express”, etc.

If you try to narrow those choices by typing “Pizzah” and forget that oh-so-critical space, you’ll come up empty.

Let’s find out why by looking at the autocomplete analyzer:
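A sketch of that analyzer, along the lines of the one the Guide builds — the index settings and gram sizes here are illustrative, and the filter name autocomplete_filter matches the one used below:

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}
```

This is the settings body you’d send when creating the index: the standard tokenizer splits on whitespace first, and only then are the edge n-grams generated from each piece.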

When this analyzer indexes “Pizza Hut”, it first uses the standard tokenizer to split the text into two tokens: Pizza and Hut. Each token is then run through a few filters. The result of the lowercase filter is pretty obvious: pizza and hut. These tokens are then pushed into our custom autocomplete_filter, which produces edge n-grams: p, pi, piz, pizz, pizza, h, hu and hut. Note that pizzah is missing.

If we skip that initial standard tokenization, we’d end up with a final token stream of p, pi, piz, pizz, pizza, pizzah, pizzahu and pizzahut. This isn’t bad, and it’s easily achieved by choosing the keyword tokenizer. I don’t like this approach, though, because people don’t always search by the first word. Imagine a place called “Bob’s Pizza Mart” that people just happen to call “Pizza Mart”. Typing “Pizza” wouldn’t even pull up “Bob’s Pizza Mart”.

The trick is to get the best of both worlds. The keyword tokenizer does a great job of doing nothing, so we just need a filter that both splits the words and concatenates all of them. We can do that with the Word Delimiter Token Filter by setting the catenate_all attribute:
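A sketch of that filter definition — the name word_joiner is the one referenced in the token walkthrough below:

```json
"filter": {
  "word_joiner": {
    "type": "word_delimiter",
    "catenate_all": true
  }
}
```

With catenate_all set, the filter emits the split words and a token that concatenates all of them, so “pizza hut” yields pizza, hut and pizzahut.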

We can add this filter into our autocomplete analyzer:
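Something like this, as a sketch: the keyword tokenizer is swapped in, and word_joiner sits between lowercase and the edge n-gram filter so the joined token gets n-grammed too:

```json
"analyzer": {
  "autocomplete": {
    "type": "custom",
    "tokenizer": "keyword",
    "filter": ["lowercase", "word_joiner", "autocomplete_filter"]
  }
}
```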

So now we can take our example of “Pizza Hut”:

  1. keyword tokenizer (does nothing): Pizza Hut
  2. lowercase filter: pizza hut
  3. word_joiner filter: pizza, pizzahut, hut
  4. autocomplete_filter: p, pi, piz, pizz, pizza, p, pi, piz, pizz, pizza, pizzah, pizzahu, pizzahut, h, hu and hut.

If you add this analyzer to your index you can use curl and jq like so:

$ curl -s -XGET 'localhost:9200/companies-test/_analyze?analyzer=autocomplete&pretty=1' -d "Pizza Hut"|jq ".tokens[]|.token" -r
p
pi
piz
pizz
pizza
p
pi
piz
pizz
pizza
pizzah
pizzahu
pizzahut
h
hu
hut

This will let you see just how things get tokenized.

In informal, non-real-world tests, this solution also seems to handle the opposite problem: searching for “Bat Man” in order to find “Batman.”

Thanks to Rob Hudson, Will Kahn-Greene and Erik Rose at Mozilla who patiently told me I was doing it wrong and helped me come up with this solution.
