Full Text Search in Elasticsearch and Understanding of Analyzer — part 1

All of the following is correct for Elasticsearch 6.4 (September 2018).

One of the most popular features of Elasticsearch is full text search. But what exactly is it?

Full text search is a search that analyzes your data and produces the most appropriate results much faster than most databases, and the results themselves are much closer to what we actually want. I'll give a few examples.

Let's say that we're looking for a book we saw a few years ago, but we don't remember the author's name precisely. Assume we are looking for Aleksandr Solzhenitsyn's book. If you search for Alexandr Solzhenitsyn, Aleksander Solzhenitsyn, Aleksandr Solgenitsyn, Aleksandr Soljenitsyn, Alexandr Solzhenitsin and so on, you will run into a big problem with a regular RDBMS or many NoSQL databases. A wildcard will help, but it makes fetching results too slow in most databases. Meanwhile, Elasticsearch will find Aleksandr Solzhenitsyn in all of these cases, and it will do so swiftly.

Another example could be a filter that converts Hindu-Arabic numerals (٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789).
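As a quick sketch of that second example (assuming the built-in decimal_digit token filter that Elasticsearch 6.4 ships with), the _analyze API lets you try it without creating an index:

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"tokenizer": "keyword",
"filter": ["decimal_digit"],
"text": "٠١٢٣٤٥٦٧٨٩"
}
'

This should return a single token, 0123456789.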


So how does this happen? The answer is analysis. We analyze our text at index time, and an analyzer defines what "analyze" means. The analyzer processes the given text and produces tokens (terms) that are later used in search queries. We use these tokens to find matches. For example, the standard analyzer applied to the text "John Doe, Andy Shooter and friends vs New York City!" will give the tokens [john, doe, andy, shooter, and, friends, vs, new, york, city], while the english analyzer applied to the same text will give [john, doe, andi, shooter, friend, vs, new, york, citi]. Pay attention that all words are lowercased, the commas and the exclamation mark are gone, and city became citi just as andy became andi. Now if I search for "Andy" I will get the following result:


"hits" : [
{
"_index" : "cases",
"_type" : "_doc",
"_id" : "3",
"_score" : 0.2876821,
"_source" : {
"title" : "John Doe, Andy Shooter and friends vs New York City!"
}
}
]
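By the way, you can reproduce those token lists yourself: the _analyze API runs any built-in analyzer against a piece of text, for example:

curl -X POST "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"analyzer": "english",
"text": "John Doe, Andy Shooter and friends vs New York City!"
}
'

Swap "english" for "standard" to compare the two outputs.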

Let's dig deeper into analyzers. An analyzer is just a package which contains three lower-level building blocks: character filters, a tokenizer, and token filters. The standard analyzer or a language analyzer (english) is simply a predefined combination of a tokenizer, character filters and token filters. We can build our own combination and apply it to text, so let's do it.

Settings

curl -X PUT "localhost:9200/books/?pretty" -H 'Content-Type: application/json' -d'
{
"settings": {
"analysis": {
"analyzer": {
"my_custom_ngram_analyzer": {
"type": "custom",
"tokenizer": "my_tokenizer",
"filter": ["lowercase"]
}
},
"tokenizer": {
"my_tokenizer": {
"type": "ngram",
"min_gram": 3,
"max_gram": 4
}
}
}
}
}
'

Here "books" is the name of our index. Under "analyzer" I define the analyzer itself: a custom analyzer that produces tokens according to the tokenizer and filters every token according to the filter array. I used the ngram tokenizer, which splits text into tokens of length 3 to 4. The whole request creates an index with the given settings. We can check it by simply typing "localhost:9200/books" in the browser.
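Or, from the command line, the _settings endpoint shows just the part we configured:

curl -X GET "localhost:9200/books/_settings?pretty"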

Now we have to tell Elasticsearch which fields to apply the analyzer to; we do that with a mapping.

Mapping

curl -X PUT "localhost:9200/books/_mapping/_doc?pretty" -H 'Content-Type: application/json' -d'
{
"properties": {
"author": {
"type": "text",
"analyzer": "my_custom_ngram_analyzer"
}

}
}
'

In this case we tell Elasticsearch to map the field "author" with the "my_custom_ngram_analyzer" analyzer.
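We can confirm that the mapping was applied:

curl -X GET "localhost:9200/books/_mapping?pretty"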

Let’s add some data

Add data

curl -X PUT "localhost:9200/books/_doc/1?pretty" -H 'Content-Type: application/json' -d'
{
"author": "Aleksandr Solzhenitsyn"
}
'
curl -X PUT "localhost:9200/books/_doc/2?pretty" -H 'Content-Type: application/json' -d'
{
"author": "George Orwell"
}
'
curl -X PUT "localhost:9200/books/_doc/3?pretty" -H 'Content-Type: application/json' -d'
{
"author": "George R.R. Martin"
}
'
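As a side note, the same three documents could be indexed in a single request with the _bulk API (a sketch of its newline-delimited format):

curl -X POST "localhost:9200/books/_doc/_bulk?pretty" -H 'Content-Type: application/x-ndjson' -d'
{ "index": { "_id": "1" } }
{ "author": "Aleksandr Solzhenitsyn" }
{ "index": { "_id": "2" } }
{ "author": "George Orwell" }
{ "index": { "_id": "3" } }
{ "author": "George R.R. Martin" }
'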

Now we have a text field author that will be tokenized and filtered. Before we start to query, let's see these tokens!

Tokens

curl -X POST "localhost:9200/books/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"analyzer": "my_custom_ngram_analyzer",
"text": "Aleksandr Solzhenitsyn"
}
'

The result is:

{
"tokens" : [
{
"token" : "ale",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "alek",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "lek",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "leks",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "eks",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 4
},
{
"token" : "eksa",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 5
},
{
"token" : "ksa",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 6
},
{
"token" : "ksan",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 7
},
{
"token" : "san",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 8
},
{
"token" : "sand",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 9
},
{
"token" : "and",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 10
},
{
"token" : "andr",
"start_offset" : 5,
"end_offset" : 9,
"type" : "word",
"position" : 11
},
{
"token" : "ndr",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 12
},
{
"token" : "ndr ",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 13
},
{
"token" : "dr ",
"start_offset" : 7,
"end_offset" : 10,
"type" : "word",
"position" : 14
},
{
"token" : "dr s",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 15
},
{
"token" : "r s",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 16
},
{
"token" : "r so",
"start_offset" : 8,
"end_offset" : 12,
"type" : "word",
"position" : 17
},
{
"token" : " so",
"start_offset" : 9,
"end_offset" : 12,
"type" : "word",
"position" : 18
},
{
"token" : " sol",
"start_offset" : 9,
"end_offset" : 13,
"type" : "word",
"position" : 19
},
{
"token" : "sol",
"start_offset" : 10,
"end_offset" : 13,
"type" : "word",
"position" : 20
},
{
"token" : "solz",
"start_offset" : 10,
"end_offset" : 14,
"type" : "word",
"position" : 21
},
{
"token" : "olz",
"start_offset" : 11,
"end_offset" : 14,
"type" : "word",
"position" : 22
},
{
"token" : "olzh",
"start_offset" : 11,
"end_offset" : 15,
"type" : "word",
"position" : 23
},
{
"token" : "lzh",
"start_offset" : 12,
"end_offset" : 15,
"type" : "word",
"position" : 24
},
{
"token" : "lzhe",
"start_offset" : 12,
"end_offset" : 16,
"type" : "word",
"position" : 25
},
{
"token" : "zhe",
"start_offset" : 13,
"end_offset" : 16,
"type" : "word",
"position" : 26
},
{
"token" : "zhen",
"start_offset" : 13,
"end_offset" : 17,
"type" : "word",
"position" : 27
},
{
"token" : "hen",
"start_offset" : 14,
"end_offset" : 17,
"type" : "word",
"position" : 28
},
{
"token" : "heni",
"start_offset" : 14,
"end_offset" : 18,
"type" : "word",
"position" : 29
},
{
"token" : "eni",
"start_offset" : 15,
"end_offset" : 18,
"type" : "word",
"position" : 30
},
{
"token" : "enit",
"start_offset" : 15,
"end_offset" : 19,
"type" : "word",
"position" : 31
},
{
"token" : "nit",
"start_offset" : 16,
"end_offset" : 19,
"type" : "word",
"position" : 32
},
{
"token" : "nits",
"start_offset" : 16,
"end_offset" : 20,
"type" : "word",
"position" : 33
},
{
"token" : "its",
"start_offset" : 17,
"end_offset" : 20,
"type" : "word",
"position" : 34
},
{
"token" : "itsy",
"start_offset" : 17,
"end_offset" : 21,
"type" : "word",
"position" : 35
},
{
"token" : "tsy",
"start_offset" : 18,
"end_offset" : 21,
"type" : "word",
"position" : 36
},
{
"token" : "tsyn",
"start_offset" : 18,
"end_offset" : 22,
"type" : "word",
"position" : 37
},
{
"token" : "syn",
"start_offset" : 19,
"end_offset" : 22,
"type" : "word",
"position" : 38
}
]
}

It's a large set of tokens, but all of them are lowercased and split into 3- and 4-letter ngrams. Elasticsearch will use them to figure out the best match. Let's check:

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match" : {
"author" : "Solgenitsyn"
}
}
}
'
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 2.8768208,
"hits" : [
{
"_index" : "books",
"_type" : "_doc",
"_id" : "1",
"_score" : 2.8768208,
"_source" : {
"author" : "Aleksandr Solzhenitsyn"
}

}
]
}
}

In _source we can see our data. Elasticsearch did find a hit, with a score. Why? Because Elasticsearch analyzes the given string ("Solgenitsyn") as well, applying the analyzer from the mapping to it. We can use the _analyze API to see the tokens.
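The request is analogous to the earlier _analyze call, presumably something like this (same custom analyzer, but with the misspelled query string as the text), which produces the tokens below:

curl -X POST "localhost:9200/books/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
"analyzer": "my_custom_ngram_analyzer",
"text": "Solgenitsyn"
}
'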

{
"tokens" : [
{
"token" : "sol",
"start_offset" : 0,
"end_offset" : 3,
"type" : "word",
"position" : 0
},
{
"token" : "solg",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 1
},
{
"token" : "olg",
"start_offset" : 1,
"end_offset" : 4,
"type" : "word",
"position" : 2
},
{
"token" : "olge",
"start_offset" : 1,
"end_offset" : 5,
"type" : "word",
"position" : 3
},
{
"token" : "lge",
"start_offset" : 2,
"end_offset" : 5,
"type" : "word",
"position" : 4
},
{
"token" : "lgen",
"start_offset" : 2,
"end_offset" : 6,
"type" : "word",
"position" : 5
},
{
"token" : "gen",
"start_offset" : 3,
"end_offset" : 6,
"type" : "word",
"position" : 6
},
{
"token" : "geni",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 7
},
{
"token" : "eni",
"start_offset" : 4,
"end_offset" : 7,
"type" : "word",
"position" : 8
},
{
"token" : "enit",
"start_offset" : 4,
"end_offset" : 8,
"type" : "word",
"position" : 9
},
{
"token" : "nit",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 10
},
{
"token" : "nits",
"start_offset" : 5,
"end_offset" : 9,
"type" : "word",
"position" : 11
},
{
"token" : "its",
"start_offset" : 6,
"end_offset" : 9,
"type" : "word",
"position" : 12
},
{
"token" : "itsy",
"start_offset" : 6,
"end_offset" : 10,
"type" : "word",
"position" : 13
},
{
"token" : "tsy",
"start_offset" : 7,
"end_offset" : 10,
"type" : "word",
"position" : 14
},
{
"token" : "tsyn",
"start_offset" : 7,
"end_offset" : 11,
"type" : "word",
"position" : 15
},
{
"token" : "syn",
"start_offset" : 8,
"end_offset" : 11,
"type" : "word",
"position" : 16
}
]
}

Now we can see how many tokens are the same!
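By default the match query combines these tokens with OR, so even partial overlap produces a hit. If that feels too loose, the query can be tightened with minimum_should_match (a sketch; the percentage is just an illustrative value):

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match" : {
"author" : {
"query": "Solgenitsyn",
"minimum_should_match": "60%"
}
}
}
}
'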

Let’s see another example

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match" : {
"author" : "Alexandr Soljentsin"
}
}
}
'
{
"took" : 6,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 3.4521852,
"hits" : [
{
"_index" : "books",
"_type" : "_doc",
"_id" : "1",
"_score" : 3.4521852,
"_source" : {
"author" : "Aleksandr Solzhenitsyn"
}

}
]
}
}

The same will, of course, work with other authors:

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
"query": {
"match" : {
"author" : "Jeorge Orvel"
}
}
}
'
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 3.1645029,
"hits" : [
{
"_index" : "books",
"_type" : "_doc",
"_id" : "2",
"_score" : 3.1645029,
"_source" : {
"author" : "George Orwell"
}

},
{
"_index" : "books",
"_type" : "_doc",
"_id" : "3",
"_score" : 2.0137746,
"_source" : {
"author" : "George R.R. Martin"
}

}
]
}
}

We can see that both Georges were found, and that the score is much higher for George Orwell.
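If you are curious which ngrams contributed to each score, the search API accepts an "explain" flag (a sketch of the same query with it enabled):

curl -X GET "localhost:9200/books/_search?pretty" -H 'Content-Type: application/json' -d'
{
"explain": true,
"query": {
"match" : {
"author" : "Jeorge Orvel"
}
}
}
'

Each hit then carries an _explanation section describing how its score was computed.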

See part 2: https://medium.com/@cargatser/full-text-search-in-elasticsearch-and-understanding-of-analyzer-part-2-6163e7545748