Elasticsearch Series: Let It Work for You

This is the second entry in a series on Elasticsearch and how we use it in our applications. See the previous entry on Rebuilding Indices with No Downtime.

Elasticsearch has a vast number of ways to query the data in your index. How you query your data is very much dependent upon where it is being used, so in this post we’re going to talk about how we use it in our instructor search application.

Historically, most of our searches were performed on denormalized MySQL tables. This meant an arc distance formula embedded in a SQL statement, a ton of LIKE clauses, and rudimentary ordering based on the user’s input. As you no doubt have guessed, this wasn’t performant and the results were not very good.

The results weren’t relevant.

Relevancy is key when doing any sort of search, and it is incredibly difficult to accomplish with a database that wasn’t designed for search. It’s up to the engineer to figure out what is relevant to the user based on what was searched. If the user searched for an instructor by name, it seems logical to sort the results alphabetically by name. What if they searched for an instructor by name who is within 10 miles of their current location? Do you sort by distance from the user and then alphabetically?

These are tough questions to answer, and even if you do find an answer, ordering the results in any structured or weighted way is very difficult with this kind of setup.

Lucene Lights the Way

Elasticsearch is built upon a search engine called Lucene. It is the engine that takes the query you provide Elasticsearch and gets the data from the index. One very important function it serves is a calculation of “score”. This is a calculation that codifies how relevant a particular result is to the original query.

When we first migrated our search from MySQL to Elasticsearch, it was really tempting to sort the same way we did with the original MySQL query, but we soon learned that the results returned were just as bad as (if not worse than) the ones from MySQL. We had to learn how to shape our query to take advantage of the “score”.

Show Me the Query!

First things first, we have a distance filter available in the application: let’s say we’re trying to find instructors in New York, NY within 5 miles of a central point. We might do something like this:

{
  "query": {
    "bool": {
      "filter": [{
        "geo_distance": {
          "distance": "8.04672km",
          "address.geo": {
            "lat": "40.7643358",
            "lon": "-73.9849351"
          }
        }
      }]
    }
  }
}

This would find instructors within 5 miles of the center of New York, NY:

  1. Chris Saylor — 1.4 miles away
  2. Aaron Sonders — 0.5 miles away
  3. Bethany Smith — 3 miles away
  4. Chris Saylor — 0.2 miles away
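
The 8.04672km value in the filter is simply 5 miles expressed in kilometers. As a minimal sketch of that conversion in Python, assuming a hypothetical helper named geo_distance_filter (not part of any Elasticsearch client) and the address.geo field from the query above:

```python
import json

KM_PER_MILE = 1.609344  # exact, by definition of the international mile


def geo_distance_filter(lat, lon, miles, field="address.geo"):
    """Build a geo_distance filter clause for a radius given in miles.

    Hypothetical helper for illustration only; the field name is taken
    from the query shown in the post.
    """
    km = round(miles * KM_PER_MILE, 5)
    return {
        "geo_distance": {
            "distance": f"{km}km",
            field: {"lat": lat, "lon": lon},
        }
    }


clause = geo_distance_filter("40.7643358", "-73.9849351", miles=5)
print(json.dumps(clause, indent=2))
```

The resulting dictionary can be dropped into the "filter" list of a bool query as-is.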

Notice the results are scattered at varying distances from that center, because this query carries no relevancy. Wait…what? You just said that’s what Elasticsearch does for you! It comes down to filtering versus querying. A filter is binary: a document either matches and stays in the results, or it doesn’t and is dropped. As such, no score is calculated for it. A query, on the other hand, also measures how well each document matches, and this is where the score is calculated. It would be tempting (and easy) to sort by the distance from the center, but that would only work if distance were the only factor.

Now, let’s look for an instructor named “Chris Saylor” within 5 miles of New York, NY:

{
  "query": {
    "bool": {
      "must": [{
        "query_string": {
          "default_field": "full_name",
          "query": "chris saylor"
        }
      }],
      "filter": [{
        "geo_distance": {
          "distance": "8.04672km",
          "address.geo": {
            "lat": "40.7643358",
            "lon": "-73.9849351"
          }
        }
      }]
    }
  }
}

  1. Chris Saylor — 1.4 miles away
  2. Chris Saylor — 0.2 miles away
  3. Anthony Saylor — 1.6 miles away
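
In an application, a request body like the one above is usually assembled from user input rather than written by hand. The following is a rough Python sketch of that idea, not our production code: the function name build_search_body and its parameters are hypothetical, and only the field names come from the query above. The scored name clause is added only when the user actually typed a name, while the distance filter is always applied.

```python
import json


def build_search_body(lat, lon, distance, name=None):
    """Assemble a bool query: an optional scored name match plus a geo filter.

    Hypothetical helper for illustration; field names follow the post.
    """
    bool_query = {
        # Filter clause: yes/no match, does not contribute to the score.
        "filter": [{
            "geo_distance": {
                "distance": distance,
                "address.geo": {"lat": lat, "lon": lon},
            }
        }]
    }
    if name:
        # Scored clause: this is where relevance for the name comes from.
        bool_query["must"] = [{
            "query_string": {"default_field": "full_name", "query": name}
        }]
    return {"query": {"bool": bool_query}}


body = build_search_body("40.7643358", "-73.9849351", "8.04672km",
                         name="chris saylor")
print(json.dumps(body, indent=2))
```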

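The full name query has relevancy (thanks Lucene): both instructors named Chris Saylor rank above Anthony Saylor, who matches only on the last name because query_string combines terms with OR by default. The distances are still scattered, though, because the geo distance filter doesn’t contribute to the score. Let’s make it contribute using an Elasticsearch mechanism called function scoring: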

{
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [{
            "query_string": {
              "default_field": "full_name",
              "query": "chris saylor"
            }
          }],
          "filter": [{
            "geo_distance": {
              "distance": "8.04672km",
              "address.geo": {
                "lat": "40.7643358",
                "lon": "-73.9849351"
              }
            }
          }]
        }
      },
      "functions": [{
        "linear": {
          "address.geo": {
            "origin": {
              "lat": "40.7643358",
              "lon": "-73.9849351"
            },
            "scale": "8.04672km"
          }
        }
      }]
    }
  }
}

  1. Chris Saylor — 0.2 miles away
  2. Chris Saylor — 1.4 miles away
  3. Anthony Saylor — 1.6 miles away

What’s going on here? We introduced a linearly decaying function that calculates a value between 1 and 0, shrinking the further a result is from the central point. By default, Elasticsearch multiplies that value with the score of the name query, so the relevance of the distance is now combined with the relevance of the name we searched. Since there is more than one instructor named “Chris Saylor” in that location, the instructor closer to that central point will appear higher in the results.
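
To make the decay concrete, here is a small Python sketch of the linear decay formula as described in the Elasticsearch documentation. It assumes the defaults apply, since the query above sets neither decay (default 0.5) nor offset (default 0); the function name linear_decay is ours, and the distances are the ones from the example results.

```python
KM_PER_MILE = 1.609344


def linear_decay(distance_km, scale_km, decay=0.5, offset_km=0.0):
    """Linear decay factor for a document at distance_km from the origin.

    The factor is 1.0 at the origin, equals `decay` at `scale_km`, and
    reaches 0.0 at scale_km / (1 - decay).
    """
    s = scale_km / (1.0 - decay)
    return max(0.0, (s - max(0.0, distance_km - offset_km)) / s)


scale_km = 5 * KM_PER_MILE  # the "8.04672km" scale from the query

for miles in (0.2, 1.4, 1.6, 5.0):
    factor = linear_decay(miles * KM_PER_MILE, scale_km)
    print(f"{miles} miles -> {factor:.2f}")
```

With these assumptions the two instructors named Chris Saylor get factors of about 0.98 (0.2 miles) and 0.86 (1.4 miles). Their name scores are the same, so the closer one wins. A result at the 5 mile edge of the filter would still keep half of its name score.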


This post barely scratches the surface of the power that Elasticsearch has for complex queries of data, but I hope it has given you some insight into how we use it in a real-world application. Give it a shot and search for an instructor near you and perhaps even take one of their classes.