A Guide to Performing Full-Text Search in Elasticsearch

Sindhuri Kalyanapu · inspiringbrilliance · 4 min read · Apr 17, 2020

Written by Krupa and Sindhuri Kalyanapu

In the previous part, we walked through a detailed example to help you move from MongoDB to Elasticsearch and get started with Elasticsearch mappings. In this post, we will delve into analyzers and full-text search.

Full-Text Search

Moving ahead to full-text search (FTS): searching for documents that contain a specific word, or part of a word (a substring), is termed full-text search. For example, to fetch all user documents whose fullName contains 'agarwal' or the partial word 'joh', a full-text search on fullName has to be performed.

In Elasticsearch, by default, all fields of a document are indexed with the standard analyzer. The standard analyzer splits the text into words and lowercases them, and a search then has to match one of those words exactly. For example, when the field value of a document is 'john' and a search is performed for 'john', the document is returned, but searching for 'joh' yields no result.
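
You can see this behaviour with the _analyze API; the sketch below runs the standard analyzer over a made-up sample text.

curl --location --request POST 'http://localhost:9200/_analyze' \
--header 'Content-Type: application/json' \
--data-raw '{
  "analyzer": "standard",
  "text": "John Doe"
}'

The response contains only the whole-word tokens 'john' and 'doe', which is why a query for 'joh' has nothing to match.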

But if partial matching is required, for example if 'joh' should also match the above document, it can be achieved by indexing the documents with other analyzers such as the n-gram analyzer or the edge-ngram analyzer.

N-gram analyzer: if a field (say fullName) is indexed with the n-gram analyzer, the text value of the field is split into n-gram tokens. For example, with min_gram 1 and max_gram 4, the value 'john' is split into the following tokens: 'j', 'jo', 'joh', 'john', 'o', 'oh', 'ohn', 'hn', 'h', 'n'. A search for any of these tokens then matches the document.
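
To reproduce this token list, you can pass an ad-hoc n-gram tokenizer to the _analyze API; this is a sketch assuming a recent Elasticsearch version that accepts an inline tokenizer definition, with the gram lengths chosen to match the list above.

curl --location --request POST 'http://localhost:9200/_analyze' \
--header 'Content-Type: application/json' \
--data-raw '{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 1,
    "max_gram": 4
  },
  "text": "john"
}'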

Edge-ngram analyzer (prefix search): this is similar to the n-gram analyzer, but it only builds grams anchored at the beginning of the token. For the same field value 'john', the edge-ngram analyzer produces the tokens 'j', 'jo', 'joh', 'john'. Searching for 'ohn' would therefore return no matches.
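
Changing the tokenizer type in the previous _analyze sketch to edge_ngram shows the difference:

curl --location --request POST 'http://localhost:9200/_analyze' \
--header 'Content-Type: application/json' \
--data-raw '{
  "tokenizer": {
    "type": "edge_ngram",
    "min_gram": 1,
    "max_gram": 4
  },
  "text": "john"
}'

This time only the prefix tokens 'j', 'jo', 'joh' and 'john' come back.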

Understanding both your data and the features you are making available to your customers is crucial in determining the right analyzer. While edge-ngram generates fewer tokens and therefore needs less index space, the analyzer that suits our requirement is the n-gram analyzer, because we need partial matches for a substring not only at the start of the string but anywhere inside it.

Hence we change the mapping in the following way to enable partial matching in full-text search using the n-gram analyzer.

curl --location --request PUT 'http://localhost:9200/users' \
--header 'Content-Type: application/json' \
--data-raw '{
  "settings": {
    "index.max_ngram_diff": 10,
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [{
      "anything": {
        "match": "*",
        "mapping": {
          "index": true,
          "type": "text",
          "analyzer": "ngram_analyzer",
          "search_analyzer": "standard"
        }
      }
    }],
    "properties": {
      "email": {
        "type": "text",
        "index": true,
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      },
      "fullName": {
        "type": "text",
        "index": true,
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      },
      "gender": {
        "type": "text",
        "index": true,
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      },
      "mongoId": {
        "type": "text",
        "index": false
      },
      "login": {
        "type": "integer",
        "index": false
      }
    }
  }
}'

In the above mapping, all fields (fullName, email, gender, and any dynamically added fields) are indexed with the n-gram analyzer, since full-text search is to be performed on every field except mongoId and login.
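
Once the index is created, the custom analyzer can be sanity-checked with the _analyze API; 'john' below is just a sample value.

curl --location --request POST 'http://localhost:9200/users/_analyze' \
--header 'Content-Type: application/json' \
--data-raw '{
  "analyzer": "ngram_analyzer",
  "text": "john"
}'

With min_gram 3 and max_gram 10 this returns the tokens 'joh', 'john' and 'ohn'.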

A few aspects to watch out for:

  • min_gram is the minimum length of a gram. The default value is 1. Be careful when setting this value: single-character and two-character tokens match so many things that the results are often not helpful, especially against a large dataset, so it is good practice to keep min_gram at 3.
  • max_gram is the maximum length of a gram. The default value is 2. Set it according to the longest partial match you need, and keep memory usage in mind: larger values produce many more tokens and a bigger index. For the example data above, 10 is sufficient.
  • index.max_ngram_diff: this index-level setting controls the maximum allowed difference between max_gram and min_gram. The default value is 1, so if your difference is larger (it is 7 in this mapping), index.max_ngram_diff has to be set explicitly.
  • With n-gram indexing, the same analyzer should usually not be applied at both index time and search time, because the search term would itself be split into n-grams, which causes spurious matches and extra latency. In the mapping above, the n-gram analyzer is used while indexing and the standard analyzer while searching (via search_analyzer).

FTS sample queries
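
For the queries that follow, assume at least one user document has been indexed; the values below are made up purely for illustration.

curl --location --request POST 'http://localhost:9200/users/_doc' \
--header 'Content-Type: application/json' \
--data-raw '{
  "fullName": "Richard Roe",
  "email": "richard.roe@example.com",
  "gender": "male",
  "mongoId": "5e9d88c1aa11bb22cc33dd44",
  "login": 1024
}'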

The below query searches for the documents whose fullName field contains the word ‘ric’.

curl --location --request POST 'http://localhost:9200/users/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": {
    "match": {
      "fullName": "ric"
    }
  }
}'

The below query searches for the documents where any field contains the word ‘ric’.

curl --location --request POST 'http://localhost:9200/users/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": {
    "multi_match": {
      "query": "ric",
      "fields": ["*"]
    }
  }
}'

In the next part, we will cover aggregations in Elasticsearch.
