How to fix NGram Tokenizer autosuggestion with Edge NGram Tokenizer

Ricardo Heck
Published in Dev Ticks
May 24, 2017

This article shows how to solve a common problem that developers run into when they start using the NGram Tokenizer. If you came here without reading the article about the NGram Tokenizer, I recommend reading it first.

The NGram Tokenizer is a great solution for developers who need fragment (partial-word) matching in a full-text search. But as the implementation moves forward and testing begins, some problems show up in the results.

For example, if we have the following documents indexed:

Document 1, Document 2 and Mentalistic

Searching for the fragment “ment” returns all three entries, because the fragment is present in the indexed tokens of every one of them. But keep in mind that in an auto-suggestion feature the user usually starts typing a word from its beginning, and depending on your query, documents containing the word “Document” can come back with a higher score than “Mentalistic”. In this case we should return “Mentalistic” with a higher score, or not return the other documents at all.
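To make the problem concrete, here is a minimal Python sketch of how plain n-gram splitting behaves. It is only a simulation, not Elasticsearch code: the helper names `ngrams` and `matches` are hypothetical, the gram lengths 4 to 10 are assumed, and lowercasing is assumed so that “ment” and “Ment” compare equal.

```python
def ngrams(word, min_gram=4, max_gram=10):
    """Return every substring of `word` with a length from min_gram to max_gram."""
    word = word.lower()  # assumption: tokens are lowercased before matching
    grams = []
    for start in range(len(word)):
        for size in range(min_gram, max_gram + 1):
            if start + size <= len(word):
                grams.append(word[start:start + size])
    return grams


def matches(title, fragment):
    """True if any whitespace-separated word of `title` yields `fragment` as an n-gram."""
    fragment = fragment.lower()
    return any(fragment in ngrams(word) for word in title.split())


titles = ["Document 1", "Document 2", "Mentalistic"]
matching = [title for title in titles if matches(title, "ment")]
print(matching)
```

Because “ment” is a substring of both “Document” and “Mentalistic”, every title ends up in `matching`, which is the behavior described above.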

To solve that problem, we will configure the Edge NGram Tokenizer, a variant of NGram that only produces fragments anchored at the beginning of each word, each one character longer than the last. With a minimum length of 4 and a maximum of 10, the words are split as follows:

Mentalistic:
[Ment, Menta, Mental, Mentali, Mentalis, Mentalist, Mentalisti]
Document:
[Docu, Docum, Docume, Documen, Document]
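The split above can be reproduced with a short Python sketch. This is an illustration rather than the tokenizer's actual implementation; the function name `edge_ngrams` is hypothetical, and the lengths 4 and 10 are taken to match the token lists shown.

```python
def edge_ngrams(word, min_gram=4, max_gram=10):
    """Return the prefixes of `word` with lengths from min_gram up to max_gram."""
    longest = min(len(word), max_gram)
    return [word[:size] for size in range(min_gram, longest + 1)]


mentalistic_tokens = edge_ngrams("Mentalistic")
document_tokens = edge_ngrams("Document")

print(mentalistic_tokens)
print(document_tokens)

# Only prefixes are produced, so "ment" never appears among the tokens of "Document".
print("Ment" in mentalistic_tokens, "ment" in document_tokens)
```

Since every token starts at the first character of the word, a fragment taken from the middle or end of “Document” has nothing to match against.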

Consider the following Edge NGram configuration:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 4,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

Next, when we execute a search query in Elasticsearch against the field "edgengram", the result is the following:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.088273734,
    "hits": [
      {
        "_index": "edgengramtest",
        "_type": "document",
        "_id": "AVw7QkzR4b6ynYg_AMHR",
        "_score": 0.088273734,
        "_source": {
          "ID": 3,
          "title": "Mentalistic"
        }
      }
    ]
  }
}

This makes more sense for an auto-suggestion mechanism, where the user starts typing words from the beginning.

To summarize, Edge NGram serves as a complement to NGram itself. It can replace NGram or be used together with it, in which case a simple boost on the Edge NGram field gives prefix matches more weight.
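As a rough sketch of the combined approach, the Python snippet below builds a `multi_match` query body that searches both fields and boosts the Edge NGram one. The function name `build_suggest_query`, the field names `title.ngram` and `title.edgengram`, and the boost value are all assumptions made for illustration; adapt them to your own mapping.

```python
import json


def build_suggest_query(text, edge_boost=2):
    """Build a query body that searches both fields and boosts the edge n-gram one."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [
                    "title.ngram",                      # hypothetical NGram sub-field
                    f"title.edgengram^{edge_boost}",    # hypothetical Edge NGram sub-field, boosted
                ],
            }
        }
    }


body = build_suggest_query("Ment")
print(json.dumps(body, indent=2))
```

With a query like this, documents that match only through the NGram field can still be returned, while documents that also match as a prefix should rank higher.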
