Making your search not suck with Elasticsearch — Part 4: Overanalyzing it

Alex Denton
4 min read · Jun 5, 2017


This is part 4 of a multi-part series. In this series I will be explaining important concepts in Elasticsearch and using a demo app I built to demonstrate these concepts. You can follow me for updates as they come out over the next few weeks. If you’d like to start at the beginning click here. You can also find a full index of the series at the end of this post.

Well, my pace with these has been more like once a month than once a week. At this rate I should be done sometime around 2020. Just in time for something to completely replace Elasticsearch and make all of this totally irrelevant. Perfect!

At any rate, in my last post we started improving our search results by using non-default settings. Specifically, we used the snowball analyzer instead of the standard analyzer so that non-exact matches would still be returned in our search results (for example, searching for “star war” would still return results related to “star wars”). However, in doing so we realized that because the snowball analyzer stems all the words in the index down to their root word (“wars”, “warring”, and “warred” all became just “war”) we lost the information needed to appropriately rank exact matches over non-exact matches. That left us with a dilemma we haven’t yet resolved: if we use the standard analyzer then no non-exact matches will ever be returned, but if we use the snowball analyzer then non-exact matches could be ranked over exact ones. Today we will resolve that dilemma.

In fact the solution is very simple: use both. This is such a simple concept but when I first started using Elasticsearch it was not obvious to me at all:

It is not only possible but pretty much expected that you will index your data multiple times with multiple analyzers in Elasticsearch.

If you’re smarter than me this might’ve been completely obvious to you, and if that’s the case then I’m very happy for you, because I spent way too long banging my head against this problem. Because I didn’t understand this, I was creating a bunch of Frankenstein’s monster-like analyzers that were really wacky combinations of stemming token filters with edge n-gram token filters and maybe some phonetic token filters sprinkled in for good measure. Suffice it to say, I didn’t know what I was doing and succeeded only in creating horrifying monstrosities that yielded very surprising search results.

The right way to accomplish this is not by creating one giant analyzer with a whole bunch of character filters, tokenizers, and token filters and indexing your data once, but instead by creating several analyzers, each with a handful of character filters, tokenizers, and token filters, and indexing your data multiple times. Elasticsearch provides a pretty convenient way of doing this with multi-fields. Multi-fields allow you to take a given field such as movie.name and add many “sub-fields”. In our case we added movie.name.standard and movie.name.snowball.
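To make the multi-field idea concrete, here is a minimal sketch of what such a mapping could look like, written as a Python dict that you would send as the body of an index-creation request. The demo app's actual mapping isn't shown in this post, so the layout is an assumption: I'm guessing that `movie` is an object field with a `name` property (which produces the movie.name.standard and movie.name.snowball paths), and I'm using the typeless mapping format of Elasticsearch 7 and later. On older versions the properties would sit under a mapping type. The function name `build_movie_mapping` is made up for this example.

```python
import json


def build_movie_mapping():
    """Return an index-creation body where movie.name has two sub-fields.

    Assumed layout (not taken from the demo app): `movie` is an object
    field, and `name` is a text field with `standard` and `snowball`
    multi-fields, each analyzed by the analyzer of the same name.
    """
    return {
        "mappings": {
            "properties": {
                "movie": {
                    "properties": {
                        "name": {
                            "type": "text",
                            # Each entry under "fields" indexes the same
                            # source text again with a different analyzer.
                            "fields": {
                                "standard": {
                                    "type": "text",
                                    "analyzer": "standard",
                                },
                                "snowball": {
                                    "type": "text",
                                    "analyzer": "snowball",
                                },
                            },
                        }
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON you would send when creating the index.
    print(json.dumps(build_movie_mapping(), indent=2))
```

The document itself still only carries one name value. Elasticsearch does the extra indexing for each sub-field behind the scenes.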

These paths represent the indexes of the field that have been analyzed by the standard and snowball analyzers respectively. With these paths we can query the different indexes separately. I haven’t covered relevance and scoring yet, but the short story is that for each result in a query a numerical score is returned which is meant to represent its relevance. There’s a lot that goes into computing that score, but for now all you need to understand is that, for a given search, if a result matches on multiple queries the score contributions from each query are summed together.
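One way to query both paths at once is a `bool` query with two `should` clauses, one `match` per sub-field. A `bool` query adds up the scores of the clauses that match, which is the summing behavior described above. This is a sketch of the request body only, not necessarily the query the demo app sends. The helper name `build_name_query` and its `field` parameter are made up for this example.

```python
def build_name_query(text, field="movie.name"):
    """Build a search body that matches `text` against both sub-fields.

    With only `should` clauses, a document has to match at least one of
    them, and the scores of all matching clauses are added together.
    """
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact-ish matching: tokens are only lowercased.
                    {"match": {field + ".standard": text}},
                    # Stemmed matching: "wars" and "warring" both hit "war".
                    {"match": {field + ".snowball": text}},
                ]
            }
        }
    }


if __name__ == "__main__":
    import json

    print(json.dumps(build_name_query("star wars"), indent=2))
```

A `multi_match` query of type `most_fields` over the same two paths is another common way to get a similar effect.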

How does all of this help us improve our search results? Well, let’s go back to our original problem. Consider that, for every query term, if something matches on the standard analyzer it is guaranteed to match on the snowball analyzer for the same token. For example, if you search for “wars” and it finds a match in the standard index, then it will also match “war” in the snowball index. By searching with both analyzers we know that all the non-exact matches will still be found, and because the exact matches will match in both the standard and snowball indexes, the relevance scores from both queries will be summed and those exact matches are guaranteed to be ranked higher.
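If you want to convince yourself of the arithmetic, here is a toy simulation. Everything in it is a stand-in: `standard_tokens` just lowercases and splits on whitespace, `toy_stem` is a crude suffix stripper that only covers the words in this example (it is not the real Snowball algorithm), and the "score" is a plain count of matching query terms rather than Elasticsearch's actual relevance formula. All the function names are made up for this example.

```python
def standard_tokens(text):
    """Toy stand-in for the standard analyzer: lowercase, split on spaces."""
    return text.lower().split()


def toy_stem(token):
    """Crude stemmer for this example only; NOT the real Snowball stemmer."""
    for suffix in ("ing", "ed", "s"):
        stem = token[: -len(suffix)]
        if token.endswith(suffix) and len(stem) >= 3:
            # Collapse a doubled final consonant, e.g. "warr" -> "war".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token


def snowball_tokens(text):
    """Toy stand-in for the snowball analyzer: standard tokens, then stem."""
    return [toy_stem(token) for token in standard_tokens(text)]


def field_score(query_tokens, doc_tokens):
    """Count how many query tokens appear among the document's tokens."""
    doc = set(doc_tokens)
    return sum(1 for token in query_tokens if token in doc)


def combined_score(query, title):
    """Sum the per-field scores, mimicking a query against both sub-fields."""
    standard = field_score(standard_tokens(query), standard_tokens(title))
    snowball = field_score(snowball_tokens(query), snowball_tokens(title))
    return standard + snowball


if __name__ == "__main__":
    for title in ("Star Wars", "Stars Warring"):
        print(title, combined_score("star wars", title))
    # prints:
    # Star Wars 4
    # Stars Warring 2
```

The exact match scores in both fields while the stemmed-only match scores in just one, so the exact match comes out ahead. Real scores depend on much more than term counts, so treat this as an illustration of the idea rather than a prediction of actual rankings.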

Brilliant! That’s how it works in theory at least. Recall that previously, using only the snowball analyzer, the first result was “Stars Warring”. Let’s see how this brilliant solution works using both the snowball and the standard analyzer:

And you can see that “Stars Warring” is no longer the first result. Our results are getting progressively better but are far from perfect. In particular you might notice that the first result is now “Star of Wars” which I think you will agree is not expected to be the first result when searching “star wars”.

Why? Well, the short story is by default Elasticsearch has no regard for search “phrases”. The only thing it cares about is that a given document contains the individual query terms — not whether they are in close proximity to one another. I’ll explain this problem more and the solution in my next post Making your search not suck with Elasticsearch — Part 5: Are we still doing phrasing?

The “Making your search not suck with Elasticsearch” series:
Part 1: What is an index?
Part 2: Elasticsearch is not magic
Part 3: Analysis Paralysis
Part 4: Overanalyzing it
Part 5: Are we still doing phrasing?
Part 6: Totally irrelevant
Part 7: Machines that learn
