How to fix NGram Tokenizer autosuggestion with Edge NGram Tokenizer

Published in Dev Ticks, May 24, 2017

This article shows how to solve a problem developers often run into when they start using the NGram Tokenizer. If you came here without reading the article about the NGram Tokenizer, I recommend reading that one first.

The NGram Tokenizer is a good fit for developers who need to add fragment (partial-word) matching to a full-text search. But as the implementation moves forward and testing begins, some problems show up in the results.

For example, if we have the following documents indexed:

Document 1, Document 2 and Mentalistic

A search for the fragment “ment” returns all three entries, because that fragment is among the indexed n-grams of every one of them. In an auto-suggestion feature, however, users usually type a word from its beginning, and depending on your query, the documents containing the word “Document” can even come back with a higher score than “Mentalistic”. In this case we want “Mentalistic” to score higher, or the other documents not to be returned at all.
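To make the problem concrete, here is a minimal Python sketch of what a plain n-gram split produces. It is not Elasticsearch code: the function name `ngram_tokens` is hypothetical, the gram sizes 4 to 10 are an assumption, and the sketch lowercases the text, which also assumes a lowercase step in the analysis chain.

```python
def ngram_tokens(text, min_gram=4, max_gram=10):
    """Collect every substring of each word with length min_gram..max_gram."""
    grams = set()
    # Assumption: words are split on whitespace and lowercased before splitting.
    for word in text.lower().split():
        for size in range(min_gram, max_gram + 1):
            # Slide a window of `size` characters over the word.
            for start in range(len(word) - size + 1):
                grams.add(word[start:start + size])
    return grams


titles = ["Document 1", "Document 2", "Mentalistic"]
for title in titles:
    # True for all three titles: "ment" is inside "document" too.
    print(title, "ment" in ngram_tokens(title))
```

Because the window slides over the whole word, the fragment at the end of “Document” is indexed just like the one at the start of “Mentalistic”.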

To solve this problem, we will configure the Edge NGram Tokenizer, a variation of NGram that splits each word incrementally, always starting from its first character. The words are then split as follows:

Mentalistic:
 [Ment, Menta, Mental, Mentali, Mentalis, Mentalist, Mentalisti]

Document:
 [Docu, Docum, Docume, Documen, Document]
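The split above can be reproduced with a few lines of Python. This is only an illustrative sketch of the idea, not the tokenizer's actual implementation; the function name `edge_ngrams` is hypothetical, and the limits of 4 and 10 characters are taken from the lists above.

```python
def edge_ngrams(word, min_gram=4, max_gram=10):
    """Return the prefixes of `word` with lengths from min_gram to max_gram."""
    # A word shorter than max_gram simply stops at its own length.
    longest = min(len(word), max_gram)
    return [word[:size] for size in range(min_gram, longest + 1)]


print(edge_ngrams("Mentalistic"))
# ['Ment', 'Menta', 'Mental', 'Mentali', 'Mentalis', 'Mentalist', 'Mentalisti']
print(edge_ngrams("Document"))
# ['Docu', 'Docum', 'Docume', 'Documen', 'Document']
```

Every token is anchored at the first character, so no token of “Document” starts with “ment”, and a prefix search for that fragment can only match “Mentalistic”.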

Consider the following Edge NGram configuration:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 4,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

Next, when we run a search query in Elasticsearch against the field "edgengram", the result is the following:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.088273734,
    "hits": [
      {
        "_index": "edgengramtest",
        "_type": "document",
        "_id": "AVw7QkzR4b6ynYg_AMHR",
        "_score": 0.088273734,
        "_source": {
          "ID": 3,
          "title": "Mentalistic"
        }
      }
    ]
  }
}

Only “Mentalistic” is returned, which makes more sense for an auto-suggestion mechanism where the user types words from the beginning.

To summarize, Edge NGram complements NGram. It can replace NGram or be used together with it; when both are used, a simple boost on the Edge NGram field makes prefix matches stand out.
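As a rough sketch of the combined approach, the Python snippet below builds a `multi_match` request body that searches two fields and boosts one of them with the `^` syntax. The field names `title.edgengram` and `title.ngram`, the helper `build_suggest_query`, and the boost value of 3 are all assumptions for illustration; they are not part of the setup shown in this article, and the right boost depends on your data.

```python
import json


def build_suggest_query(text, edge_boost=3):
    """Build a multi_match body that favours the edge n-gram field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                # Hypothetical field names; "^N" multiplies that field's score.
                "fields": [f"title.edgengram^{edge_boost}", "title.ngram"],
            }
        }
    }


# Print the JSON body that would be sent to the _search endpoint.
print(json.dumps(build_suggest_query("ment"), indent=2))
```

With a body like this, documents that match only through the plain n-gram field can still be returned, while those that match from the beginning of a word are expected to rank higher.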

If you want to know more about it, see the official documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html

Written by Ricardo Heck