Deconstructing the Elastic Search normalizer and analyzer

Krishi Bisht
Walmart Global Tech Blog
6 min read · Jul 8, 2024

Elastic Search has gained enormous popularity in recent times. It has become the go-to search engine due to its capability for lightning-fast searches. It also provides many enhancements that a developer can use to customize their own Elastic Search indices. As such, one can easily get lost in the vast sea of features that Elastic Search provides.

Two such features that Elastic Search provides are the normalizer and the analyzer, which are used at the time of indexing documents. Before moving on to the main topic, let’s first understand why such preprocessing is required.

Pre-Processing of Documents in Elastic Search

Documents in Elastic Search are indexed to decrease query time. This is done using a data structure called the inverted index, which we can better understand with an example.

Document 1-> The deadliest animal is the tiger.

Document 2-> The deadliest fish is the shark.

The inverted index:

Term — — Document

The — — 1, 2

deadliest — — 1, 2

animal — — 1

fish — — 2

is — — 1, 2

the — — 1, 2

tiger — — 1

shark — — 2

Using this inverted index, Elastic Search provides a quick lookup for search queries. This alone is not the sole reason for Elasticsearch’s speed, but this understanding of inverted indexes is enough for us to discuss normalizers and analyzers.

Indexing documents in such a manner is necessary for quick lookup, but it cannot happen on its own: each document must be processed before its terms enter the inverted index.

To learn more about indexing in depth, you can go through the following article: Demystifying the Elastic search indexing and synonym search.

Tokenization:

The full text in the document is split into individual words; this process is carried out by the tokenizer. Depending on the tokenizer chosen, the text can be split on word boundaries, on whitespace, or on a custom delimiter that can be provided.

Example:
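A minimal sketch of such a query (assuming the standard tokenizer and the first example sentence, not the author's exact request) uses the _analyze API:

POST _analyze
{
  "tokenizer": "standard",
  "text": "The deadliest animal is the tiger."
}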

Run in Kibana’s Dev Tools console, the above request returns one token per word: The, deadliest, animal, is, the, tiger.

Normalization:

Once we have the tokenizer output, we can normalize each token/word by transforming and modifying it before all the tokens enter the inverted index. Common examples of this step are lowercasing, stemming and removing stop words.

Example:
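A minimal sketch of this step (again assuming the standard tokenizer, with the lowercase token filter added on top) looks like this:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "The deadliest animal is the tiger."
}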

Run in Kibana, the above request returns the lowercased tokens: the, deadliest, animal, is, the, tiger.

Finally, after both of these processes have completed, the inverted index is updated with the resulting terms.

Analyzer

In essence, the analyzer performs analysis on a text input, producing one or more tokens depending on the input string.

Analyzers combine the tokenization and normalization aspects into one component, which is applied to the documents.

The above is a summary of what the Elastic Search docs tell us about the analyzer. Let’s dive a bit deeper to understand its functionality.

Various stages of the Elastic Search Analyzer

Let’s try to understand what each stage in the analyzer does:

1. Character Filters:

In this stage, unwanted characters are removed from (or replaced in) the input text string before tokenization.

2. Tokenizer:

This is the tokenization bit which we have explained above. This stage gives us each token/word in the input text.

3. Token Filter:

This is the normalization bit which we explained above. This stage further processes the tokens provided by the tokenizer (a sketch combining all three stages is shown below).
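As an illustrative sketch, the three stages can be wired together into a custom analyzer when creating an index. The index, field and analyzer names below are assumptions for illustration only:

PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "my_custom_analyzer"
      }
    }
  }
}

Here html_strip removes HTML markup (character filter), standard splits the text into words (tokenizer), and lowercase normalizes each token (token filter).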

Elastic Search provides different built-in analyzers that one can use depending on their use case, such as the standard, simple and whitespace analyzers, and many more that one can explore in the Elastic Search docs.

With this, we have a good grasp of analyzers in Elastic Search and why and how they are used. Now let’s move on to normalizers and see how they differ from analyzers.

Normalizer

Since we have a good understanding of analyzers, we can try and understand normalizers by drawing similarities with them.

In layman’s terms, a normalizer has all the stages of an analyzer except the tokenizer, so its output is always a single token that has passed through the character filters and token filters. There is a further restriction on which filters can be used: because a normalizer emits a single token, only filters that operate character by character (rather than on whole tokens) are allowed.

With so many restrictions on normalizers, one very important question arises: if normalizers are so restricted, why use them in the first place? The answer is simple and lies in where analyzers can be applied.

Analyzers can only be used with text fields. Keyword fields are therefore not analyzed while indexing.

Even when we create a new index, Elastic Search does not use any normalizer by default, so keywords are indexed as-is.

For use cases where we wish to apply normalization to keywords before indexing, we can instruct Elastic Search to use a normalizer of our choice.

Uses

Even though the above features make Elastic Search ideal for querying large amounts of data, it still retains flexibility: Elastic Search allows us to define custom analyzers and normalizers depending on the search queries a developer needs to support. We’ll explore some examples of how we can use custom normalizers and analyzers for different use cases.

1. Case-sensitive and case-insensitive search / aggregations:

The user may want to run aggregations on a certain field, depending on the requirements they have. In such cases a normalizer comes in handy: we can define our own custom normalizer and have our index catered to the specific search capability we wish to serve. This behavior is showcased below.

Let’s say in our index we have 3 documents, each having only one attribute, ‘id’, which is stored both as text and as a keyword.
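As an illustration (the values are hypothetical, chosen only to differ in case), the three document sources could look like this:

{ "id": "ABC123" }
{ "id": "abc123" }
{ "id": "Abc123" }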

Now, to achieve case-insensitive search / aggregation capability, we will use a custom normalizer named ‘lowercase_normalizer’ while creating the index, with settings and mappings along the following lines.
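A minimal sketch of such an index definition (the index name and field layout are assumptions) that attaches the lowercase_normalizer to the keyword sub-field:

PUT ids-index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      }
    }
  }
}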

Here the lowercase filter in the normalizer converts every keyword to lowercase before indexing, which gives us case-insensitive search. A drawback is that, since the values are converted to lowercase, their exact values are lost on the keyword field, so one must weigh the pros and cons before deciding what sort of normalizer to employ.

Running a terms aggregation on the id attribute shows that Elastic Search treats each keyword the same according to our normalizer.
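A sketch of such a terms aggregation on the keyword sub-field (using the hypothetical index and documents above):

GET ids-index/_search
{
  "size": 0,
  "aggs": {
    "ids": {
      "terms": { "field": "id.keyword" }
    }
  }
}

With lowercase_normalizer in place, the three differently-cased values fall into a single bucket (key "abc123", doc_count 3); without it, they would form three separate buckets.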

In the above case, I would also like to point out that by simply removing the ‘lowercase’ filter while creating the index, we can achieve case-sensitive search. This shows us how easy it is to customize Elasticsearch to fulfill our requirements.

2. Custom processing on incoming data:

Another requirement a developer may have is that, in the incoming data, some specific special characters need to be removed while others are retained, or one character needs to be replaced with another.

Elasticsearch can also remove stop words, which we can provide via a text file.

Elasticsearch gives us the ability to apply custom character filters, which help us achieve the above results and many more.
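As a sketch under assumed names (the index, filter and analyzer names are illustrative, and the stop-word file is assumed to exist under the node's config/analysis directory), a mapping character filter can replace one character with another, a pattern_replace character filter can strip the remaining special characters while keeping selected ones, and a stop token filter with stopwords_path can load stop words from a text file:

PUT notes-index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "amp_to_and": {
          "type": "mapping",
          "mappings": ["& => and"]
        },
        "strip_specials": {
          "type": "pattern_replace",
          "pattern": "[^\\p{L}\\p{N}\\s+]",
          "replacement": ""
        }
      },
      "filter": {
        "file_stopwords": {
          "type": "stop",
          "stopwords_path": "analysis/stopwords.txt"
        }
      },
      "analyzer": {
        "cleaned_text": {
          "type": "custom",
          "char_filter": ["amp_to_and", "strip_specials"],
          "tokenizer": "standard",
          "filter": ["lowercase", "file_stopwords"]
        }
      }
    }
  }
}

Character filters run in the order listed, so '&' is first rewritten to 'and', and only then are the remaining special characters (everything except letters, digits, whitespace and '+') stripped.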

Conclusion

From the knowledge gained above, we can conclude that Elastic Search is not only a very powerful and fast search engine, but it also gives developers the ability to customize searches on their documents according to their needs. This customization, packaged with a fast search engine, makes Elastic Search a very versatile and dynamic tool and opens endless possibilities for it to be utilized in the real world.
