Elasticsearch Analyzers — A Brief Introduction

Mateus Forgiarini da Silva
Published in CWI Software
6 min read · Oct 16, 2018


If you are new to Elasticsearch, you have probably run a query and not understood why the results were not what you expected. If this has never happened to you, do not worry, because the day will come. The biggest challenge for those who come from the relational world is to understand how queries work inside a cluster. This is because Elasticsearch searches for documents based on the inverted index that is generated when documents are added to the index. What about updates? In fact, there are no updates: all data is immutable, so an update is, under the hood, a delete followed by an add. We need to keep this concept in mind because we do not search for documents based on their content, but on the terms that are in the inverted index.

To illustrate the situation better, suppose we have an index containing all Instagram comments and we want to find the trends on Instagram. We could simply look for comments that contain the hash symbol.

GET comments/default/_search
{
  "query": {
    "match_phrase_prefix": {
      "content": "#"
    }
  }
}

However, when you run the query above, you get no results. Suspicious, right? Let’s run another query, then, to bring back all the documents.

GET comments/default/_search

And voilà, it returns documents with the hash symbol.

"hits": [
  {
    "_index": "comments",
    "_type": "default",
    "_id": "xHmSb2YBxPCL3T_xVs4p",
    "_score": 1,
    "_source": {
      "content": "#tbt #food",
      "user_id": "3223N90d2v0Xrqjm"
    }
  },
  {
    "_index": "comments",
    "_type": "default",
    "_id": "xXmSb2YBxPCL3T_xec5m",
    "_score": 1,
    "_source": {
      "content": "#tbt #travel #love",
      "user_id": "3223N90334343212v0Xrqjm"
    }
  },
  {
    "_index": "comments",
    "_type": "default",
    "_id": "w3mSb2YBxPCL3T_xMc4z",
    "_score": 1,
    "_source": {
      "content": "#tbt #love",
      "user_id": "3223N9LCGYB9x0d2v0Xrqjm"
    }
  }
]

Why does it happen?

This happens because, for text fields, the inverted index is generated by analyzers. To explain this process better, we need an overview of the cluster architecture.

An index is nothing more than a collection of documents spread among shards. Therefore, if an index is bigger than the disk space of a single node, Elasticsearch distributes the documents across shards located on other nodes (if there is more than one node). This is what makes Elasticsearch scalable.

A shard is a Lucene index, and each shard is composed of segments that hold the inverted index; it is over these segments that our queries run.

The comments of the example above would be stored in the inverted index as follows:

Term     Documents
food     1
love     2, 3
tbt      1, 2, 3
travel   2

So when we search in Elasticsearch, we are in fact searching for terms stored in the inverted index of each Lucene segment, and those terms tell us which documents match our search.

When we look at our inverted index, we see that the hash symbol is not there. That is why our query did not return any results. But how is that possible? We know we have a comment with the exact content we are searching for.

This happens because, before a text field is indexed, its content goes through an analyzer that transforms it into the terms that will be stored in the inverted index.

Analyzer

An analyzer is made of three steps: character filters, a tokenizer, and token filters.

A character filter receives a string and returns another string, adding or removing characters. A good example is the html_strip character filter, which removes HTML tags from the text. An analyzer can have more than one character filter.

POST _analyze
{
  "tokenizer": "keyword",
  "char_filter": [ "html_strip" ],
  "text": "<p>Hello <strong>world!</strong></p>"
}
Result: Hello world!

A tokenizer receives a string and transforms it into tokens. For example, the standard tokenizer converts a string into an array of strings, splitting the terms according to the Unicode Text Segmentation algorithm, which basically removes symbols and punctuation and splits on whitespace.

POST _analyze
{
  "tokenizer": "standard",
  "text": "Hello world."
}
Result: ["Hello", "world"]

Lastly, there are the token filters, which the result of the previous two steps goes through. A token filter receives the tokens generated by the tokenizer and converts them into other tokens.

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "Hello world"
}
Result: ["hello", "world"]

When we do not specify in the mapping which analyzer a field should use, the standard analyzer is applied by default; its tokenizer is the standard tokenizer.
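We can check this directly with the _analyze API by running the content of one of our comments through the standard analyzer; the hash symbols are dropped before the terms ever reach the inverted index:

POST _analyze
{
  "analyzer": "standard",
  "text": "#tbt #food"
}
Result: ["tbt", "food"]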

Another interesting thing happens when we run a match query: under the hood, our search content goes through the same analyzer that is mapped for the field. In our case, we were not searching for anything at all, since, as we just saw, the standard tokenizer removes the symbols from a string. So now it is clear why our query did not return any results.

Query Time vs. Index Time

It is essential to understand how analyzers work in order to manage performance in our cluster, because we can choose to pay the performance cost either at query time or at index time.

In the real world, our example of looking for trends on Instagram would be terribly slow, because we were relying on a query-time solution. But how does this actually work?

Elasticsearch gives us some flexibility to search for partial matches at query time (prefix, wildcard, regexp, and match_phrase_prefix). However, this has a downside, because we are always looking for terms in the inverted index. Since the standard analyzer stores the whole word, a query-time solution has to go through all the terms of the inverted index to check whether each term partially matches what we are searching for. Imagine doing that on an index with 15 million terms! Clearly, the speed advantage of an inverted index is lost with this approach.
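As an illustration of such a query-time solution (a sketch against our comments index; the search term is just an example), a wildcard query like the one below forces Elasticsearch to scan the terms of the inverted index looking for partial matches:

GET comments/default/_search
{
  "query": {
    "wildcard": {
      "content": "*love*"
    }
  }
}

The leading wildcard is the worst case, since it prevents Elasticsearch from narrowing down the candidate terms by prefix.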

Fortunately, analyzers can speed up searches according to the business rules of our project: Elasticsearch ships with built-in analyzers but also lets us build custom ones. For example, there is a very useful token filter named “stemmer”, which reduces words to their root form. Thus, if we need to search for a word without considering its variations, we can simply search for its root form instead of enumerating every variation the word can have.

POST _analyze
{
  "tokenizer": "standard",
  "filter": [{
    "type": "stemmer",
    "language": "english"
  }],
  "text": "calling called call"
}

Result:

{
  "tokens": [
    {
      "token": "call",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "call",
      "start_offset": 8,
      "end_offset": 14,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "call",
      "start_offset": 15,
      "end_offset": 19,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

In addition, there is the edge_ngram tokenizer, which breaks words into n-grams anchored at the beginning of the word, making it useful for search-as-you-type queries (although we could also use a completion suggester).
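To see what the edge_ngram tokenizer produces, we can test it through the _analyze API (the min_gram and max_gram values below are just example settings):

POST _analyze
{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 4
  },
  "text": "love"
}
Result: ["l", "lo", "lov", "love"]

With these prefixes stored at index time, a plain term lookup on "lo" already matches the document, with no need to scan the inverted index at query time.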

Obviously, with an index-time solution we lose performance when inserting documents. This is why it is so important to know your business rules and your users’ priorities, as query-time and index-time solutions target the same problems with different costs.
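Coming back to our hashtag problem, an index-time solution could be a custom analyzer that preserves the hash symbol. The sketch below (the analyzer name "hashtag_analyzer" is my own, and the whitespace tokenizer is one possible choice, since it splits only on spaces and keeps "#" in the tokens) shows how it could be wired into the mapping at index creation:

PUT comments
{
  "settings": {
    "analysis": {
      "analyzer": {
        "hashtag_analyzer": {
          "tokenizer": "whitespace",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "default": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "hashtag_analyzer"
        }
      }
    }
  }
}

With this mapping, "#tbt #food" would be stored as the terms "#tbt" and "#food", so our original match_phrase_prefix query for "#" would find them.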

My point in this post was to show the importance of analyzers in the building of the inverted index. Basic concepts are easily overlooked, but we should always take the time to look into them to gain a stronger knowledge of the tools we use.

This article was based on the Elasticsearch documentation; I strongly advise you to take a look at it. :)
