Writing a space-ignoring autocompleter with ElasticSearch

Dave Dash
Mar 12, 2015

If you follow the ElasticSearch Guide, you can build a decent autocompleter. If you had an index of companies near you, typing “Pizza” would probably land you “Pizza Hut”, “Pizza My Heart”, “Pizza Express”, etc.

If you try to narrow those choices by typing “Pizzah” and forget that oh-so-critical space, you’ll come up empty.
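To make that concrete, here’s roughly the query an autocompleter would send. The index, type and field names (my_index, companies and name) are placeholders I’m using for illustration, not anything from the guide:

    GET /my_index/companies/_search
    {
      "query": {
        "match": {
          "name": "Pizzah"
        }
      }
    }

Against the guide’s setup, this comes back with zero hits.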

Let’s find out why by looking at the autocomplete analyzer:
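In case you don’t have the guide open, the index settings look something like this (my_index is a placeholder; the min_gram and max_gram values are the ones the guide uses):

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "filter": {
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 20
            }
          },
          "analyzer": {
            "autocomplete": {
              "type": "custom",
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "autocomplete_filter"
              ]
            }
          }
        }
      }
    }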

When this analyzer indexes “Pizza Hut”, it first uses the standard tokenizer to turn the name into two tokens: Pizza and Hut. Each token is then run through a few filters. The result of the lowercase filter is pretty obvious: pizza and hut. These tokens are then pushed into our custom autocomplete_filter, which produces edge n-grams: p, pi, piz, pizz, pizza, h, hu and hut. Note that pizzah is nowhere in that list.
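You can check this yourself with the _analyze API. On the 1.x clusters of this era, the analyzer and text can go right in the query string:

    GET /my_index/_analyze?analyzer=autocomplete&text=Pizza+Hut

The response lists exactly the eight tokens above, and pizzah isn’t one of them.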

If we skip that initial standard tokenization (and strip out the space, as in the sketch below), we’d end up with a final token stream of p, pi, piz, pizz, pizza, pizzah, pizzahu and pizzahut. That isn’t bad, and it’s easily achieved with the keyword tokenizer. I don’t like this approach, though, because we can get into trouble when people don’t search by the first word. Imagine a place called “Bob’s Pizza Mart” that people just happen to call “Pizza Mart”. “Pizza” wouldn’t match anything, since every edge n-gram is anchored to the start of the full name.
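For completeness, here’s a sketch of that keyword-tokenizer variant. The keyword tokenizer alone would keep the space (“pizza hut”, not pizzahut), so I’ve added a pattern_replace char filter to strip whitespace before tokenization; that filter and the analyzer name are my own additions, not the guide’s:

    PUT /my_index
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "strip_spaces": {
              "type": "pattern_replace",
              "pattern": "\\s+",
              "replacement": ""
            }
          },
          "filter": {
            "autocomplete_filter": {
              "type": "edge_ngram",
              "min_gram": 1,
              "max_gram": 20
            }
          },
          "analyzer": {
            "autocomplete_keyword": {
              "type": "custom",
              "char_filter": ["strip_spaces"],
              "tokenizer": "keyword",
              "filter": ["lowercase", "autocomplete_filter"]
            }
          }
        }
      }
    }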
