Making your search not suck with Elasticsearch — Part 5: Are we still doing phrasing?

Alex Denton
5 min read · Jun 13, 2017


This is part 5 of a multi-part series. In this series I will be explaining important concepts in Elasticsearch and using a demo app I built to demonstrate these concepts. You can follow me for updates as they come out over the next few weeks. If you’d like to start at the beginning click here. You can also find a full index of the series at the end of this post.

In my last post I showed how we can use multiple indexes with different analyzers to improve our search results. Specifically, we combined the snowball analyzer with the standard analyzer so that Elasticsearch would still return non-exact matches but would rank exact matches higher. Doing so did improve our results, but it also exposed some other problems with the default settings in Elasticsearch. Namely, we saw that by default Elasticsearch does not take into account the proximity of the words we’re searching for. In other words, it doesn’t take into account the “phrasing”. For example, when searching for “star wars” with a simple match query, a document which contains the phrase “star wars” and another document that contains the phrase “star of wars” are basically equivalent to Elasticsearch. For that reason it’s entirely possible for Elasticsearch to rank a document containing “star of wars” higher than one containing “star wars”, even though that’s clearly wrong to our human eyes. Today we’re gonna talk about how we can fix that.

Luckily the solution is once again very simple: use phrase matching instead of just a simple match. With phrase matching, Elasticsearch doesn’t just check that a document contains the query terms. It also measures how close the terms are to one another and weights documents in which the terms are closer together over those in which the terms are farther apart. More precisely, when we say “how far apart the terms are” we mean how many times we have to move a term in the document to make it exactly match our query. For example, if we search for “star wars” and there is a document that contains the phrase “star of wars”, it takes one step to make them match exactly:

query: star wars
document: star of wars
Step 1: star wars of
Complete!

For a slightly more complicated example, if the document contains the phrase “wars that happen to be near a star”, it takes 7 steps:

query: star wars
document: wars that happen to be near a star
Step 1: wars that happen to be near star a
Step 2: wars that happen to be star near a
Step 3: wars that happen to star be near a
Step 4: wars that happen star to be near a
Step 5: wars that star happen to be near a
Step 6: wars star that happen to be near a
Step 7: star wars that happen to be near a
Complete!

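You get the idea. The smaller this “distance” between the terms is, the higher the relevance score Elasticsearch will give the document. In search engine parlance this distance is called slop, and a phrase query is configured with a maximum slop factor: a document only counts as a match if its terms can be lined up within that many moves. For example, it’s probably safe to assume that if two terms are a thousand words apart in a document, they are not part of the same “phrase”. (One caveat: Lucene, which does the counting under the hood, treats swapping two adjacent terms as two moves rather than one, so the reversed example above strictly needs a slop of 8, not 7. The intuition is the same.)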
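If you want to play with the idea, here is a minimal Python sketch of the distance calculation. It is my own simplification, not Elasticsearch code: the function name `required_slop` is made up, it splits on whitespace instead of running a real analyzer, and it assumes the query does not repeat a term. It follows Lucene’s way of counting, in which swapping two adjacent terms costs two moves, which is why the reversed example comes out as 8 here.

```python
from itertools import product
from typing import Optional


def required_slop(query: str, document: str) -> Optional[int]:
    """Smallest slop at which `query` would phrase-match `document`.

    Simplified model: shift each term's document position back by its
    offset in the query; the slop is the spread of the shifted values.
    Returns None if a query term does not occur in the document.
    """
    query_terms = query.lower().split()
    doc_terms = document.lower().split()

    # Every document position at which each query term occurs.
    positions = [
        [i for i, term in enumerate(doc_terms) if term == q]
        for q in query_terms
    ]
    if any(not p for p in positions):
        return None

    best = None
    # Try every combination of occurrences and keep the tightest one.
    for combo in product(*positions):
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        spread = max(shifted) - min(shifted)
        if best is None or spread < best:
            best = spread
    return best


print(required_slop("star wars", "star wars"))                           # 0
print(required_slop("star wars", "star of wars"))                        # 1
print(required_slop("star wars", "wars that happen to be near a star"))  # 8
```

A document would then count as a phrase match whenever this number is less than or equal to the maximum slop you configured.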

In my application I have a pretty generous maximum slop factor of 50. The good news is that this will improve our search results, but there are some downsides as well.
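For reference, a phrase query with a maximum slop looks roughly like this when written as a Python dictionary that you would send as the search body. This is a sketch rather than the exact query from my app: the helper `phrase_query` and the field name `title` are placeholders I picked for illustration, while `match_phrase`, `query` and `slop` are the Elasticsearch query DSL names.

```python
import json


def phrase_query(field: str, text: str, slop: int = 50) -> dict:
    """Build a search body that phrase-matches `text` within `slop` moves."""
    return {
        "query": {
            "match_phrase": {
                # `field` is whatever text field you want to search.
                field: {"query": text, "slop": slop}
            }
        }
    }


# "title" is a hypothetical field name.
body = phrase_query("title", "star wars")
print(json.dumps(body, indent=2))
```

Leaving `slop` out of a `match_phrase` query means a slop of 0, so only the exact phrase matches.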

First, the extra computation needed to calculate the slop will slow our query down a bit. In most cases, on most datasets, this will be hardly noticeable, but it is something to keep in mind. The second is more of an experience issue. With phrase matching, if the query terms are farther apart than the maximum slop allows, Elasticsearch will not consider the document a match at all. For example, if your document contains the phrase “wars that happen to be near a star” and your slop factor is only set to a maximum of 2, Elasticsearch will not consider the document a match because the “distance” is too great.

A long time ago, in an Elasticsearch cluster far, far away….

You might find this surprising and unhelpful, or it could be totally fine depending on your use case. Just something to consider. Once again, you can pretty easily solve this issue in much the same way we solved our standard/snowball index dilemma. By combining a simple match and a phrase match in the same search, you make sure that Elasticsearch still finds all documents that contain your query terms, while documents whose phrasing most closely resembles your query get a boost toward the top. The only cost is, you guessed it, a small hit in performance.
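One common way to express that combination is a `bool` query: the simple match goes in `must`, so it decides which documents are returned, and the phrase match goes in `should`, so it only adds to the score of documents that also match as a phrase. The sketch below is an assumption about how you might wire it up, not a copy of my app’s query, and `match_with_phrase_boost` and the `title` field are hypothetical names.

```python
def match_with_phrase_boost(field: str, text: str, slop: int = 50) -> dict:
    """Return everything a simple match finds, ranking close phrasing higher."""
    return {
        "query": {
            "bool": {
                # Required clause: the document must contain the query terms.
                "must": [
                    {"match": {field: {"query": text}}}
                ],
                # Optional clause: documents that also match as a phrase
                # (within `slop`) get extra score but are not required to.
                "should": [
                    {"match_phrase": {field: {"query": text, "slop": slop}}}
                ],
            }
        }
    }


# "title" is a hypothetical field name.
combined = match_with_phrase_boost("title", "star wars")
```

With this shape, a document like “wars that happen to be near a star” is still returned even if the slop is small, it just misses out on the phrase bonus.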

All of that is well and good, but let’s get back to our original problem. Using a simple match query, we noticed that when we searched for “star wars” the first document in our search results was “Star of Wars”. That doesn’t seem right. Let’s see if we can fix this issue by using phrase matching:

Notice that we are using a standard index, a snowball index, and phrase matching in this example. As you can see, “Star of Wars” is no longer the first result. Awesome! Now the bad news is these results are still not what you and I expect when we search for “star wars”. Probably when you search for “star wars” you expect the movies (e.g. “Star Wars: The Empire Strikes Back”, “Star Wars: The Phantom Menace”, etc.) to be the top results. Fortunately, there is once again a relatively simple way to solve this problem.

However, before I talk about that solution I want to explain in more detail why we’re getting the results we’re currently getting. In particular I want to explain how it is that Elasticsearch computes relevance scores. I’ll do that in the next installment of this series “Making your search not suck with Elasticsearch — Part 6: Totally irrelevant”.

The “Making your search not suck with Elasticsearch” series:
Part 1: What is an index?
Part 2: Elasticsearch is not magic
Part 3: Analysis Paralysis
Part 4: Overanalyzing it
Part 5: Are we still doing phrasing?
Part 6: Totally irrelevant
Part 7: Machines that learn
