Making your search not suck with Elasticsearch — Part 3: Analysis Paralysis

Alex Denton
Apr 30, 2017


This is part 3 of a multi-part series. In this series I will be explaining important concepts in Elasticsearch and using a demo app I built to demonstrate these concepts. You can follow me for updates as they come out over the next few weeks. If you’d like to start at the beginning, click here. You can also find a full index of the series at the end of this post.

It’s been a while since the last installment in this series, and this post is long overdue. Last time we started actually using Elasticsearch and we learned that the default settings aren’t so good. Today we’re gonna try using some different settings and see how they affect our search results. Specifically, we’re going to learn about analysis and how we can use it to make our search results better.

When we tried searching with the default settings we got some surprising results. For example, when we searched for “star war” instead of “star wars” nothing related to the Star Wars films came back in the top results. This is because Elasticsearch is quite literal about searching and the word “war” does not match the word “wars”. How can we fix that? Well, that’s where analysis comes in.
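If you want to see that literal-mindedness for yourself, Elasticsearch’s _analyze API will show you exactly what tokens a piece of text turns into. Here’s a rough sketch (it assumes a local Elasticsearch on localhost:9200 and isn’t part of the demo app):

```python
import requests  # third-party HTTP client: pip install requests

# Ask Elasticsearch how the default (standard) analyzer breaks a title into tokens.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"analyzer": "standard", "text": "Star Wars: The Empire Strikes Back"},
)
print([t["token"] for t in resp.json()["tokens"]])
# Roughly: ['star', 'wars', 'the', 'empire', 'strikes', 'back']
# "wars" is indexed as-is, so the query token "war" never matches it.
```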

Analysis is the process of taking text and turning it into tokens. Roughly speaking, it’s the set of rules you apply when taking documents like this:

1. The quick brown fox jumped over the lazy dog
2. The Cat in the Hat
3. Fox in Socks
4. Star Wars: The Empire Strikes Back

And turning them into an index like this:
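Roughly speaking, that index is an inverted index: each term points back at the documents that contain it. A simplified sketch (the real thing also stores positions, frequencies, and so on):

```python
# Simplified inverted index: each (lowercased) term -> the documents containing it.
# Document numbers refer to the four titles listed above.
index = {
    "the":     [1, 2, 4],
    "quick":   [1],
    "brown":   [1],
    "fox":     [1, 3],
    "jumped":  [1],
    "over":    [1],
    "lazy":    [1],
    "dog":     [1],
    "cat":     [2],
    "in":      [2, 3],
    "hat":     [2],
    "socks":   [3],
    "star":    [4],
    "wars":    [4],
    "empire":  [4],
    "strikes": [4],
    "back":    [4],
}
```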

How does this help us improve our search results? Well, there are a lot of different rules we can apply when transforming the raw data into an index. Suppose that for each word we encounter while building our index we chopped off the end to get the root word, so that “jumped” becomes “jump”, “socks” becomes “sock”, and “wars” becomes “war”. If we did that for all the documents we would end up with a new index that looks like this:
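Something like this, in the same simplified form (a real stemmer is a bit more aggressive than plain end-chopping, as the comments note):

```python
# The same inverted index after chopping words down to their roots:
# "jumped" -> "jump", "socks" -> "sock", "wars" -> "war", "strikes" -> "strike".
# (A real stemmer like snowball goes further, e.g. "lazy" -> "lazi" and
# "empire" -> "empir", but the idea is the same.)
stemmed_index = {
    "the":    [1, 2, 4],
    "quick":  [1],
    "brown":  [1],
    "fox":    [1, 3],
    "jump":   [1],
    "over":   [1],
    "lazy":   [1],
    "dog":    [1],
    "cat":    [2],
    "in":     [2, 3],
    "hat":    [2],
    "sock":   [3],
    "star":   [4],
    "war":    [4],
    "empire": [4],
    "strike": [4],
    "back":   [4],
}
```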

Let’s go back to our original example of searching for “star war”. Using this new index, we first look for the word “star” and (just like before) we find it; then we look for the word “war” and this time we find it! Things related to the Star Wars films should now be returned as top results. Eureka! Problem solved!

This process of chopping off the end of words is called stemming. It’s one of the many rules you can apply as part of the analysis step in Elasticsearch. To use stemming on your index you just need to use an analyzer that includes stemming as one of its rules. In general, to apply different rules during analysis you use different analyzers. In my demo app I’ve added the option of searching with the built-in snowball analyzer. Let’s see how using it affects our search results.
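Under the hood, opting into snowball just means the field (and the query) get analyzed with that analyzer instead of the standard one. Here’s a minimal sketch of what that switch can look like; the index name is made up, it assumes a local Elasticsearch 7+, and it isn’t the demo app’s actual code:

```python
import requests  # pip install requests

# Create an index whose "title" field is analyzed with the built-in snowball analyzer.
requests.put(
    "http://localhost:9200/movies-snowball",  # hypothetical index name
    json={
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "snowball"}
            }
        }
    },
)

# Peek at what snowball does to a title at index time.
resp = requests.post(
    "http://localhost:9200/movies-snowball/_analyze",
    json={"field": "title", "text": "Star Wars: The Empire Strikes Back"},
)
print([t["token"] for t in resp.json()["tokens"]])
# Roughly: ['star', 'war', 'empir', 'strike', 'back']
# The query "star war" analyzes to ['star', 'war'], which now matches.
```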

As you can see, the results are better but not perfect. We’re at least getting things related to the Star Wars films in our top results. I’ll explain exactly why we’re getting these particular results a little later. For now, I want to explain a bit more about how analyzers work in Elasticsearch.

Analyzers in Elasticsearch are composed of three parts: character filters, tokenizers, and token filters. Each of these parts feeds its output as input to the next.

It starts with the character filters, and they pretty much do what they say on the tin. They take the blob of text from your document and filter out any undesirable characters. Say, for example, you’re indexing HTML content and you don’t want the HTML tags to affect your search results; a character filter can strip the tags out before anything else happens.
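You can watch a character filter work in isolation with the _analyze API. A small sketch using the built-in html_strip character filter (same local-Elasticsearch assumption as before; the keyword tokenizer keeps the output as one token so the filtering is easy to see):

```python
import requests

# Run just a character filter: strip HTML before any tokenizing happens.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "char_filter": ["html_strip"],
        "tokenizer": "keyword",
        "text": "<p>The <b>quick</b> brown fox</p>",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# Roughly: ['\nThe quick brown fox\n'] -- the tags are gone.
```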

Next come the tokenizers. They take the stream of characters output by the character filters and break it into tokens. For human-readable content this typically means splitting the stream into words based on punctuation and whitespace. Also pretty straightforward.
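You can try a tokenizer on its own the same way; for example, the standard tokenizer (a quick sketch):

```python
import requests

# Run just a tokenizer: split the character stream into tokens.
resp = requests.post(
    "http://localhost:9200/_analyze",
    json={"tokenizer": "standard", "text": "Star Wars: The Empire Strikes Back"},
)
print([t["token"] for t in resp.json()["tokens"]])
# Roughly: ['Star', 'Wars', 'The', 'Empire', 'Strikes', 'Back']
# Note the case is untouched: lowercasing is a token filter's job, not the tokenizer's.
```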

Lastly come the token filters, and these are a little more complicated. They take the stream of tokens, and for each token they can add, remove, or modify tokens. If you’re adding a token, it’s probably because there’s some sort of related word or synonym that you want to associate with the original token. For example, when searching for “difficult” you might also want to bring back results that contain the word “challenging”. If you’re removing a token, it’s because you don’t want to be able to search by that word. It could be a very common word like “the” that you just don’t want to affect search results, or another word that for business reasons shouldn’t be searchable. Then there’s modification. This is actually how the snowball analyzer stems our words before putting them into the index. For example, it takes the token “jumped” and turns it into “jump”.
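The same trick works for token filters. Here’s a sketch that chains all three kinds: a synonym filter that adds tokens, a stop filter that removes them, and the snowball stemmer that modifies them. The inline synonym list is made up purely for illustration:

```python
import requests

resp = requests.post(
    "http://localhost:9200/_analyze",
    json={
        "tokenizer": "standard",
        "filter": [
            "lowercase",
            {"type": "synonym", "synonyms": ["difficult, challenging"]},  # add
            {"type": "stop", "stopwords": "_english_"},                   # remove
            {"type": "snowball", "language": "English"},                  # modify
        ],
        "text": "The difficult fox jumped",
    },
)
print([t["token"] for t in resp.json()["tokens"]])
# Roughly: ['difficult', 'challeng', 'fox', 'jump']
```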

Those are the basics of analyzers. To apply different rules during analysis, you just use different analyzers with different combinations of character filters, tokenizers, and token filters. Elasticsearch comes with built-in analyzers such as the standard and snowball analyzers, but you can also define your own.
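As a concrete (and entirely hypothetical) example, defining your own analyzer looks roughly like this: you wire the three parts together in the index settings and then point a field at it. The names here are made up and it assumes Elasticsearch 7+:

```python
import requests

# Build a custom analyzer out of the three building blocks and attach it to a field.
requests.put(
    "http://localhost:9200/movies-custom",  # hypothetical index name
    json={
        "settings": {
            "analysis": {
                "analyzer": {
                    "my_analyzer": {                                  # hypothetical name
                        "type": "custom",
                        "char_filter": ["html_strip"],                # 1. character filters
                        "tokenizer": "standard",                      # 2. tokenizer
                        "filter": ["lowercase", "stop", "snowball"],  # 3. token filters
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "title": {"type": "text", "analyzer": "my_analyzer"}
            }
        },
    },
)
```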

Now that we understand analyzers a bit better, let’s go back to the search we did earlier with the snowball analyzer and see if we can understand why we got the results we did. To illustrate the problem I’m going to search again, but this time I’ll actually search for “star wars” instead of “star war”.

Notice that we get exactly the same results despite changing our query. This is a clue as to what’s going on. Let’s look in particular at the second result, “Stars Warring”. Why would “Stars Warring” be returned above “Saving Star Wars”, which contains the exact query we searched for? Well, remember what our new stemmed index looks like.

To the index, the words “wars”, “warring”, and every other variation of “war” are now exactly the same. By using the snowball analyzer we basically lost the information we needed to rank exact matches above non-exact matches. That seems to leave us with a dilemma: with the standard analyzer non-exact matches won’t be returned at all, but with the snowball analyzer non-exact matches can be ranked above exact ones.
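You can see that information loss directly by running a few titles and queries through the snowball analyzer (a sketch, same local setup as before; “Stars Warring” and “Saving Star Wars” are titles from the demo data):

```python
import requests

for text in ["star wars", "star war", "Stars Warring", "Saving Star Wars"]:
    resp = requests.post(
        "http://localhost:9200/_analyze",
        json={"analyzer": "snowball", "text": text},
    )
    print(f"{text!r:20} -> {[t['token'] for t in resp.json()['tokens']]}")

# Roughly:
#   'star wars'        -> ['star', 'war']
#   'star war'         -> ['star', 'war']
#   'Stars Warring'    -> ['star', 'war']
#   'Saving Star Wars' -> ['save', 'star', 'war']
# Once everything is stemmed, an exact "star wars" looks no different from
# "stars warring", so exact matches get no ranking boost.
```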

Luckily, there is a pretty simple solution to this dilemma, but we’ll save that for next time! It’s a simple fix, but it leads to an insight about analyzers that wasn’t immediately obvious to me when I first started using Elasticsearch.

Look out for that solution in my next post: “Making your search not suck with Elasticsearch — Part 4: Overanalyzing it”.

Until then, happy searching!

The “Making your search not suck with Elasticsearch” series:
Part 1: What is an index?
Part 2: Elasticsearch is not magic
Part 3: Analysis Paralysis
Part 4: Overanalyzing it
Part 5: Are we still doing phrasing?
Part 6: Totally irrelevant
Part 7: Machines that learn
