Custom analyzer building in Elasticsearch

Phase 02 — indexing, mapping and analysis — Blog 09

Arun Mohan
Dec 14, 2017

Introduction

In the previous blog in this phase, I explained the general structure of an analyzer and the function of each of its components. In this blog we will see the implementation side, by building a custom analyzer and then querying with it to see the difference.

Case for a custom analyzer

So let us consider a case for our custom analyzer. Suppose the text we are inputting to Elasticsearch contains the following:

1. HTML tags
HTML tags can occur in our text while indexing, and in most cases they are not needed, so we need to remove them.
2. Stopwords
Words like "the", "and", "or" etc. are less significant when it comes to searching the content and are generally called stopwords.
3. Uppercase characters.
4. Short forms like H2O, $, %
In some cases, short forms like these should be replaced with their original English words.

A sample text that includes all of the above cases would look like this:
Arun has 100 $ which accounts to 3 % of the total <h2> money </h2>

So let us see how this text is analyzed, parsed and indexed in Elasticsearch using a custom analyzer.

Applying the custom analyzer

In the above sample text, the operations that need to be done and the corresponding components of the custom analyzer are given in the table below:
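
Operation needed                                  Analyzer component
Strip the HTML tags                               "html_strip" character filter
Replace $ and % with "dollar" and "percentage"    "mapping" character filter
Split the text into tokens                        tokenizer ("standard" is used here)
Lowercase all tokens                              "lowercase" token filter
Remove the stopwords                              "stop" token filter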

Custom analyzers are built based on requirements like the above. Here we have a total of five things to take care of, as shown in the table above.

In the index settings of Elasticsearch there is a section called "analysis" under the "settings" object, where we can define the individual components we are going to use. Under "analysis" there is another section called "analyzer", where we name our custom analyzer(s), and inside each custom analyzer we pick the components we want it to use.

After this, during mapping, we give the name of this custom analyzer in the "analyzer" field of the required field's mapping, so that the text in that field undergoes the required analysis.

The hierarchy in the “settings” would look like below:
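
settings
  analysis
    char_filter   -> custom character filters are defined here
    tokenizer     -> custom tokenizers, if any, are defined here
    filter        -> custom token filters are defined here
    analyzer
      custom_analyzer_01   -> our analyzer is named here and composed from the components above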

Applying all the components

Now, applying all of the above components to create a custom analyzer will look like below:
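
Here is a sketch of the index-creation request. The index name "my_index", the field name "description" and the exact stopword list are assumptions chosen to match the tokens removed later, and the request uses the current typeless mapping syntax:

# Run in the Kibana Dev Tools console.
# "my_index", "description" and the stopword list are placeholder choices.
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "symbol_mapping": {
          "type": "mapping",
          "mappings": [
            "$ => dollar",
            "% => percentage"
          ]
        }
      },
      "filter": {
        "custom_stopwords": {
          "type": "stop",
          "stopwords": [ "has", "which", "to", "of", "the" ]
        }
      },
      "analyzer": {
        "custom_analyzer_01": {
          "type": "custom",
          "char_filter": [ "html_strip", "symbol_mapping" ],
          "tokenizer": "standard",
          "filter": [ "lowercase", "custom_stopwords" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "analyzer": "custom_analyzer_01"
      }
    }
  }
}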

This will create the index with the custom analyzer, named "custom_analyzer_01". In this request, the "char_filter" and "filter" sections under "analysis" define the custom components, the "analyzer" section wires the character filters, tokenizer and token filters together, and the "mappings" section applies the analyzer to the field.

Token generation using the custom analyzer

The tokens generated by this analyzer can be inspected using the analyze API, like below:
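
# Test the custom analyzer on the sample text (index name as assumed above)
POST my_index/_analyze
{
  "analyzer": "custom_analyzer_01",
  "text": "Arun has 100 $ which accounts to 3 % of the total <h2> money </h2>"
}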

The token list is as below:
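
With the settings sketched above, the analyzer would emit tokens along these lines (positions and offsets omitted for brevity):

1. arun
2. 100
3. dollar
4. accounts
5. 3
6. percentage
7. total
8. money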

Here you can make some observations:

  1. Tokens number 3 and 6 were originally "$" and "%", but they were replaced by "dollar" and "percentage", as we specified in the char_filter section.
  2. The HTML tags <h2> and </h2> were also removed from the token list by the html_strip character filter.
  3. The terms "to", "the", "which", "has" etc., which were included in the stopwords filter, have been removed from the token list.
  4. Token number 1 would originally have been "Arun", but it has been lowercased by the lowercase filter we applied.

Conclusion

In this blog we have seen how a custom analyzer is built and applied to a field in Elasticsearch. With this blog, I intend to conclude Phase 02 of the series (indexing, mapping and analysis). This phase is one of the foundational pieces in understanding Elasticsearch, as we will be using the inputs from this phase for many purposes from now on. From Phase 03, I will introduce you to the query DSL world of Elasticsearch.
