Full Text Search in Elasticsearch and Understanding Analyzers — Part 2

David Cargatser
Sep 26, 2018 · 4 min read


You can read the first part here

All of the following is correct for Elasticsearch 6.4 (September 2018)

Let’s look closely at what we did in part I:

  1. We defined settings — specified which analyzers the index has
  2. We defined mappings — bound fields of the index to analyzers
  3. We added data to search

An analyzer is a package with 3 subgroups:

  • Character filters — zero or more
  • Tokenizer — exactly one
  • Token filters — zero or more

The text is first processed by the character filters, then passed to the tokenizer to make tokens, which are finally modified by the token filters.
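You can watch all 3 subgroups work together with the _analyze API, which accepts an ad hoc combination of character filters, a tokenizer and token filters. A minimal sketch using only built-in components:

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<b>Quick Brown Foxes</b>"
}
'

The html_strip character filter removes the tags, the standard tokenizer splits the text into Quick, Brown, Foxes, and the lowercase token filter turns them into quick, brown, foxes.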

In part I, I described how you can build a custom analyzer using these 3 subgroups together.
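If you don’t have the index from part I at hand, its definition looked roughly like this (a sketch with trigram settings; the exact numbers in part I may differ):

curl -X PUT "localhost:9200/books?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "my_ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3
        }
      }
    }
  }
}
'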

Let’s look at a few more things.

Index Time Analyzer vs Search Time Analyzer

For every search we can specify exactly which analyzer we want to use. Most of the time you will want the analyzer from the mappings, but what if we want something else? Elasticsearch determines the search-time analyzer by the following priorities:

  • An analyzer specified in the query itself. (highest)
  • The search_analyzer mapping parameter.
  • The analyzer mapping parameter.
  • An analyzer in the index settings called default_search.
  • An analyzer in the index settings called default.
  • The standard analyzer. (lowest)

This means that if we didn’t define any analyzers and used the text field datatype, we end up with the standard analyzer (the last option).
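The second and third priorities come from the mapping itself. Here is a sketch of what that looks like, with a hypothetical books_example index and title field just to illustrate the shape:

curl -X PUT "localhost:9200/books_example?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "standard",
          "search_analyzer": "whitespace"
        }
      }
    }
  }
}
'

With this mapping, standard is used at index time and whitespace at search time, unless a query specifies its own analyzer.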

Note that we still have tokens in the index! We will not reindex our tokens if we define another analyzer. Elasticsearch will generate search tokens by the rules we describe and match these tokens against the existing ones. So let’s define an analyzer at search time:

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "author" : {
        "query" : "Aleksandr Solzhenitsyn",
        "analyzer" : "keyword"
      }
    }
  }
}
'

What will happen? First the query will be analyzed by the analyzer from the search, keyword, and then we will get tokens. Well, in this particular case a single token, actually. Let’s check it:

curl -X POST "localhost:9200/books/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "keyword",
  "text": "Aleksandr Solzhenitsyn"
}
'

and the result:

{
  "tokens" : [
    {
      "token" : "Aleksandr Solzhenitsyn",
      "start_offset" : 0,
      "end_offset" : 22,
      "type" : "word",
      "position" : 0
    }
  ]
}

After this, the token will be used to search among our index-time tokens (you can see them in part I). Naturally, we will not find anything.

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}
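To see why, regenerate the index-time tokens with the analyzer from our mappings, my_custom_ngram_analyzer. Assuming the trigram settings sketched earlier, the index holds many short tokens like ale, lek, eks, and none of them equals the single long token produced by keyword:

curl -X POST "localhost:9200/books/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "analyzer": "my_custom_ngram_analyzer",
  "text": "Aleksandr Solzhenitsyn"
}
'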

Fields

OK, but what if we do want to find a match with the last search? How can we do it? One of the ways is to define two or more analyzers for a field in the mappings. We can do it with the fields keyword:

curl -X PUT "localhost:9200/books/_mapping/_doc?pretty" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "author": {
      "type": "text",
      "analyzer": "my_custom_ngram_analyzer",
      "fields": {
        "raw": {
          "type": "keyword"
        },
        "keyword_analyzered_field": {
          "type": "text",
          "analyzer": "keyword"
        }
      }
    }
  }
}
'

Here we defined two additional rules for the author field: we can get the keyword via author.raw, and we can get all tokens of the keyword analyzer via author.keyword_analyzered_field (any analyzer you wish would work there). Let’s add the data and try.
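Sub-fields are only populated when a document is indexed, so the document has to be (re)indexed after the mapping change. A sketch, assuming the single-field document from part I:

curl -X PUT "localhost:9200/books/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
  "author": "Aleksandr Solzhenitsyn"
}
'

Now the search finds it: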

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "author.raw" : "Aleksandr Solzhenitsyn"
    }
  }
}
'

"hits" : {
  "total" : 1,
  "max_score" : 0.2876821,
  "hits" : [
    {
      "_index" : "books",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.2876821,
      "_source" : {
        "author" : "Aleksandr Solzhenitsyn"
      }
    }
  ]
}

We specify where we are looking with a dot and the sub-field name (.raw, for example).
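You can also check which tokens a given sub-field produces by passing field instead of analyzer to the _analyze API; for author.raw it returns the whole string as one token:

curl -X GET "localhost:9200/books/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "field": "author.raw",
  "text": "Aleksandr Solzhenitsyn"
}
'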

So this query will fail, because none of the index-time tokens matches the search token:

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "author" : {
        "query" : "Aleksandr Solzhenitsyn",
        "analyzer" : "keyword"
      }
    }
  }
}
'

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

and this one will not fail, because we have an exact match for the token:

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "author.raw" : {
        "query" : "Aleksandr Solzhenitsyn",
        "analyzer" : "keyword"
      }
    }
  }
}
'

"hits" : {
  "total" : 1,
  "max_score" : 0.2876821,
  "hits" : [
    {
      "_index" : "books",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.2876821,
      "_source" : {
        "author" : "Aleksandr Solzhenitsyn"
      }
    }
  ]
}

and neither will this one, because with the keyword analyzer we still have basically the same token:

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "match" : {
      "author.keyword_analyzered_field" : {
        "query" : "Aleksandr Solzhenitsyn",
        "analyzer" : "keyword"
      }
    }
  }
}
'

"hits" : {
  "total" : 1,
  "max_score" : 0.2876821,
  "hits" : [
    {
      "_index" : "books",
      "_type" : "_doc",
      "_id" : "1",
      "_score" : 0.2876821,
      "_source" : {
        "author" : "Aleksandr Solzhenitsyn"
      }
    }
  ]
}

Conclusion

Full text search is made possible by analyzers. We define them in the settings and bind fields from the data model to analyzers in the mappings. Elasticsearch creates tokens at index time. It’s possible to search with different tokens at search time by defining an analyzer or a search_analyzer. It’s also possible to attach several analyzers to a text field, and if none is defined at all, the standard analyzer will be used.
