Fuzzy… Queries

Choosing sane defaults in Elasticsearch

Oyster Pail
3 min readJun 2, 2014

This post recaps a few of the initial Elastic Search Queries in place at Wampum, and ends with what I consider to be a sensible default that can hopefully serve as a model for other projects looking to get up and running fast with Elastic Search across their data set.

A few links have been sprinkled around Quora and Reddit over the last week to get some intial traffic to Wampum. It has been incredibly helpful to see what people are interested in searching for! With increased traffic come a helpful host of examples, and it has been enlightening to see where the bolts have needed to be tightened down!

Prefix Query

To start out, Wampum was using a simple prefixQuery, which would return results as long as the few characters that were typed into the search bar were also included at the beginning of the words indexed in ElasticSearch. This worked great intially, as typing in just “so”, would yield “socks” or “solar panel”. The problem with this approach is that over the weekend a search for “car” would bring up hits for “cards” and “caring”, and things started to get confusing fast.

Term Query

The initial fix was to just implement a term query, which is the Elasticsearch equivalent of a scalpel that will only cut out the terms that match exactly what you searched for. If you typed in “car”, you would just see results for “car” and nothing to do with “cards”, and if you typed “cards”, you would be shielded from all of the vehicle related results. This approach was a nice reminder that simplicty is key, and while it is nice to guess what the user might have meant, there are still many cases when they know exactly what they want, and its important to make sure that the results give it to them.

Fuzzy Query

Fuzzy queries seemed liked they might be an interesting feature to check out. To find out more about the underlying mechanism behind them, check outLevenshtein distance. If you clicked on the Wikipedia link, you now already know that the levenshtein distance between the words “books” and “boots” is 1. This is just another way of saying that all you have to do to make your pair of “boots” a pair of “books” is to swap out a “t” for a “k”. Let’s say you wanted to swap out “boots” for just “book”, then you would have a levenshtein distance of 2: 1 for changing the “T” to a “K”, and one for dropping the trailing “S”.

While this seemed like the perfect swiss army knife for the job, I found the way that Elasticsearch implements Levenshtein distance to be a bit counterintuitive. What Elasticsearch does before executing the query is find all of the terms that satisfy the Levenshtien distance count you pass up in the parameters, and then execute a query for all of those as if they are equals. So while fuzzy queries are perfect for catching mispellings, in the above example a search for boots has an equal probability of returning books as the first hit in your boots search, as it does as returning a location where you can drop off your boots.

Multiple Queries: 1 Term, 1 Fuzzy. Score the Term Higher

Putting together the best parts of the above, the problem can be solved by using two queries, one for an exact search, and one for a fuzzy search. From what I’ve been trying so far, Elasticsearch seems to handle the scoring pretty appropriately such that the exact term you entered is the highest result, followed by the fuzzier ones. Enough with the chatter, here is the working gist of an implementation.

This seems like a pretty sane default for now, and is defintely a step up over the prefixQuery that was in place. Many thanks to everyone who has been searching already as it has been a great impetus to drive the underlying technology forward!

--

--