Elasticsearch — Analyzers, Tokens, Filters

Señorita Developer
Oct 4, 2021 · 5 min read

What are Elasticsearch’s Analyzers, Tokens, Filters and How to Implement Custom Ones


I needed to convert Zimbra’s indexing flow to a modern version (with Spring Boot and newer dependency versions) for an e-mail archive project, so I had the chance to learn more about analyzers, tokenizers and filters.

First, I came across the “createMappingInfo” method, from which I hopped to the custom analysis folder, and I had to convert those definitions to my own versions.

I created my index mapping and settings config files first (final versions are below).

The settings file is where we define analyzers, tokenizers and filters.

The mapping file defines the fields (“properties”) that will be in the document. You will notice analyzers being assigned to fields here. There are also “store” and “norms” settings; I will give more information about them in another post.
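To make this concrete, a single mapping entry can look roughly like the sketch below; the “subject” field name and the exact option values are only illustrative, not copied from my actual mapping file.

"properties": {
  "subject": {
    "type": "text",
    "analyzer": "standard_cjk",
    "store": true,
    "norms": false
  }
}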

IndexCreationService (given below) uses these configurations to create the index, using the IndexConstants.INDEX_NAME constant value as its name. IndexConstants.MAIL_ITEM_INDEX_MAPPING and IndexConstants.MAIL_ITEM_INDEX_SETTING define the paths of these files.

What is “Analyzer”?

An “analyzer” defines how a field value should be indexed and searched. It “analyzes” the field according to the given instructions, which are a combination of a tokenizer and filters.

There are built-in analyzers that Elasticsearch provides; you can have a look at the list in the official documentation.

I used “whitespace” and “standard_cjk” with the “cjk_width” filter. The CJK part normalizes CJK characters and was obviously aimed at Chinese, Japanese or Korean mail content (Zimbra’s implementation is here).

You can write custom analyzers using Elasticsearch’s built-in tokenizers/filters, or with your own custom tokenizers/filters.

You can see the configuration parameters and an example in the custom analyzer documentation.
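As an illustration, a custom analyzer is declared under “analysis” in the index settings by combining a tokenizer with a list of filters. Below is a minimal sketch of a CJK-aware analyzer in that spirit; the exact composition is an assumption and may differ from my final settings file.

"analysis": {
  "analyzer": {
    "standard_cjk": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": [ "cjk_width", "lowercase" ]
    }
  }
}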

What is “Tokenizer”?

A “tokenizer” breaks the field value into parts called “tokens” according to a pattern, specific characters, etc.

Just like analyzers, Elasticsearch has built-in tokenizers; the full list is in the tokenizer reference.

You can see “addr_char” and “filename_char” as custom tokenizers defined under “tokenizer” in the index settings file.

If you want to split according to your own regex, you can set “type” to “pattern” and define your regex in “pattern”.

"addr_char": {
"type": "pattern",
"pattern": "(\\s+)|([<>,'\"]+)|(\\)+)|(\\(+)|(]+)|(\\[+)"
}

If you want to split whenever one of a set of special characters occurs, you can set “type” to “char_group” and give the list of characters in “tokenize_on_chars”.

"filename_char" : {
"type": "char_group",
"tokenize_on_chars": [
",",
" ",
"\r",
"\n",
"."
]
}

You can find details about these two types under the “Structured Text Tokenizers” title in the Elasticsearch documentation linked above.
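If you want to see how a tokenizer actually splits a value, you can run it through the _analyze API. Below is a minimal sketch using an inline definition equivalent to “filename_char” above; the sample text is made up.

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [ ",", " ", "\r", "\n", "." ]
  },
  "text": "quarterly report.final,v2.pdf"
}

This returns the tokens “quarterly”, “report”, “final”, “v2” and “pdf”.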

There is also language-specific stopword support, which removes common words defined for that language (like “if”, “because”, “so”, “and” for English) and may be useful.

What is “Filter”?

There are two types of filters called “Token Filter” and “Character Filter”. “Token Filter” applies changes after tokenization whereas “Character Filter” applies changes before tokenization.

What is “Token Filter”?
A token filter receives tokens from the tokenizer and performs the given operations on them (such as converting to lowercase or removing specific characters/words).

You can check Elasticsearch Token Filter reference (all types are listed in the navigation menu on the right side).

If you do not want to “tokenize” at all but only convert your input to something else (like all-lowercase characters), you can use the “keyword” tokenizer combined with token filters.
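A minimal sketch of such an analyzer is below; the “lowercase_keyword” name is made up for illustration.

"analyzer": {
  "lowercase_keyword": {
    "type": "custom",
    "tokenizer": "keyword",
    "filter": [ "lowercase" ]
  }
}

With this, an input like “Foo.Bar@Example.COM” is kept as a single token and indexed as “foo.bar@example.com”.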

I also used “icu_folding”, which is one of the filters provided by the ICU analysis plugin.

You have to install the “analysis-icu” plugin in Elasticsearch if you want to use ICU (International Components for Unicode).

bin/elasticsearch-plugin install analysis-icu

The ICU Analysis plugin integrates the Lucene ICU module into Elasticsearch, adding extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.
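Once the plugin is installed, “icu_folding” can be referenced like any other token filter. A minimal sketch (the “folded_text” analyzer name is illustrative):

"analyzer": {
  "folded_text": {
    "type": "custom",
    "tokenizer": "standard",
    "filter": [ "icu_folding" ]
  }
}

For example, a token like “Señorita” would be folded to “senorita”.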

In my index settings file, under “filter”, you can see I used “pattern_replace” to replace “.” with “” (an empty string).

"contact_token" : {
"type" : "pattern_replace",
"pattern": ".",
"replacement": "",
"script" : {
"source" : "token.getTerm().length() > 1"
}
}

You may have also noticed a “script” definition. This is for conditional token filtering according to the given script; the script above says to apply the filter only if the token is longer than one character.
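For reference, Elasticsearch also has a dedicated “condition” token filter that makes this explicit: it wraps another filter and applies it only when the script returns true. The sketch below shows the same idea expressed that way; the “dot_strip” and “contact_token_conditional” names are made up, and the dot is escaped so it matches a literal period.

"filter": {
  "dot_strip": {
    "type": "pattern_replace",
    "pattern": "\\.",
    "replacement": ""
  },
  "contact_token_conditional": {
    "type": "condition",
    "filter": [ "dot_strip" ],
    "script": {
      "source": "token.getTerm().length() > 1"
    }
  }
}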

What is “Character Filter”?
A character filter operates on characters before the value is passed to the tokenizer for tokenization. Character filters are usually used to convert language-specific letters to ASCII or to get rid of unwanted characters (add/remove/change characters).
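For example, the built-in “mapping” character filter can rewrite language-specific letters before tokenization. The sketch below is not part of my settings file, just an illustration with a few Turkish characters.

"char_filter": {
  "tr_to_ascii": {
    "type": "mapping",
    "mappings": [
      "ç => c",
      "ğ => g",
      "ş => s",
      "ı => i"
    ]
  }
},
"analyzer": {
  "ascii_text": {
    "type": "custom",
    "char_filter": [ "tr_to_ascii" ],
    "tokenizer": "standard"
  }
}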

Happy Coding!
