Anatomy of an N-Gram Search

Full-text search is an extremely useful tool for many applications. One popular feature is partial word searching. To support partial word searches, we can store variable-length sequences of characters drawn from each word in a search engine. Another name for such a sequence is an n-gram.

If we break up the word “sun” into n-grams, we get: s, su, sun, u, un, n. If we store these n-grams in a search engine and associate them all with the word “sun”, we’ll be able to find any document containing the word “sun” by searching for any of these n-grams.

Elasticsearch is a great search engine we can use to perform an n-gram search. To install Elasticsearch, you can follow the official installation instructions or use Homebrew.
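If you’re on a Mac with Homebrew, the install is typically a one-liner. The formula name and the way you start the server vary between Elasticsearch versions, so treat this as a sketch:

brew install elasticsearch

# start a node; by default it listens on localhost:9200
elasticsearch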

Document Tokenization

A token is a sequence of characters separated by delimiters. Any character or sequence of characters can be used as a delimiter, but for our purposes we’ll say that any non-letter character is a delimiter. The first step in indexing a document is extracting its tokens.

After tokenization, the tokens can be run through filters to alter them further. One common filter converts all uppercase characters to lowercase, which is necessary for case-insensitive searches.

According to these rules, tokenizing “How now brown cow” produces the tokens:

how
now
brown
cow

And tokenizing the email address “user@example.com” produces the tokens:

user
example
com

Elasticsearch provides this style of tokenization, with lowercasing built in, via its lowercase tokenizer.
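We can see this tokenizer in action with the _analyze API. The query-parameter form below matches the older Elasticsearch versions the rest of this post assumes; newer versions expect a JSON body instead:

curl "http://localhost:9200/_analyze?tokenizer=lowercase&pretty" -d 'How now brown cow'
# returns the tokens: how, now, brown, cow

curl "http://localhost:9200/_analyze?tokenizer=lowercase&pretty" -d 'user@example.com'
# returns the tokens: user, example, com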

N-Gram Filtering

Now that we have tokens, we can break them apart into n-grams. To accomplish this with Elasticsearch, we can create a custom filter that uses the ngram filter.

Minimum gram size

We’ll want to set a minimum gram size of 1; this is necessary for a query like “john d”. A minimum gram size greater than 1 would result in the “d” token not being indexed.

If we specified a minimum of 2, this query would match any john, because no n-gram consisting of the single letter “d” would have been indexed.

Maximum gram size

The maximum gram size must also be specified; however, it should be kept to a small number, because the larger the n-grams, the more space is required to store them.

As an example, I’ve seen 70k documents, each with 6 n-gram filtered text fields of at most 30 characters and a maximum gram size of 8, result in an index size of 44 MB. Increasing the maximum gram size to 10 grew the index to 51 MB.
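Your mileage will vary, of course. If you want to measure this for your own data, the index stats API reports the on-disk size of an index; the exact response structure differs slightly between versions:

curl "http://localhost:9200/users/_stats?pretty"
# look for indices.users.primaries.store.size_in_bytes in the response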

Running the following command will create an index with the tokenization we’ve discussed:

curl -X PUT "http://localhost:9200/users" -d '{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"max_gram": 8,
"min_gram": 1
}
},
"analyzer": {
"ngram_index_analyzer": {
"tokenizer": "lowercase",
"filter": ["ngram_filter"],
"type": "custom"
}
}
}
}
}'
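With the index created, we can verify the analyzer behaves as expected by running a value through it with the _analyze API (again using the older query-parameter form):

curl "http://localhost:9200/users/_analyze?analyzer=ngram_index_analyzer&pretty" -d 'Doe'
# returns the n-grams of "doe": d, do, doe, o, oe, e (ordering varies by version)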

Partial Word Queries

Now that we’ve configured the index, we need to properly prepare the query.

For one, we’ll need to convert the query to lowercase tokens, again using the lowercase tokenizer.

Also, because we are limiting the size of grams we index, we’ll need to truncate all tokens in the query to the maximum gram size. If we don’t do this, searches for tokens greater than the maximum gram size will return no results. Elasticsearch makes this easy to do with the truncate filter.

Now we have all the settings we need to tokenize and filter documents and queries.
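One housekeeping step first: we already created the users index above, so it has to be deleted before it can be recreated with the full settings and mappings:

curl -X DELETE "http://localhost:9200/users"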

curl -X PUT "http://localhost:9200/users" -d '{
"settings": {
"analysis": {
"filter": {
"ngram_filter": {
"type": "nGram",
"max_gram": 8,
"min_gram": 1
},
"truncate_filter": {
"type": "truncate",
"length": 8
}
},
"analyzer": {
"ngram_index_analyzer": {
"tokenizer": "lowercase",
"filter": ["ngram_filter"],
"type": "custom"
},
"ngram_search_analyzer": {
"tokenizer": "lowercase",
"filter": ["truncate_filter"],
"type": "custom"
}
}
}
},
"mappings": {
"user": {
"properties": {
"email": {
"type": "string",
"index_analyzer": "ngram_index_analyzer",
"search_analyzer": "ngram_search_analyzer"
},
"name": {
"type": "string",
"index_analyzer": "ngram_index_analyzer",
"search_analyzer": "ngram_search_analyzer"
}
}
}
}
}'
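As before, the _analyze API can confirm that the search analyzer truncates long tokens to the maximum gram size:

curl "http://localhost:9200/users/_analyze?analyzer=ngram_search_analyzer&pretty" -d 'Elasticsearch'
# returns the single token: elastics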

Lastly, we want to use an “and” operator for all tokens in a search. We do this so that a search for “john d” only returns results for johns with a last name containing the letter “d”.

This also allows for interesting results when searching email addresses. Searching for “john doe”, for instance, will return a match for “johndoe@example.com”, because the “johndoe” token extracted from the email address matches both “john” and “doe”.
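Running the email address through the index analyzer shows why: among the many grams produced from the “johndoe” token are “john” and “doe”:

curl "http://localhost:9200/users/_analyze?analyzer=ngram_index_analyzer&pretty" -d 'johndoe@example.com'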

Let’s now add a document for John Doe to the index:

curl -X POST "http://localhost:9200/users/user" -d '{
"email" : "johndoe@example.com",
"name" : "John Doe"
}'
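Newly indexed documents only become searchable after a refresh, which happens automatically about once a second by default. When experimenting, it can help to force one:

curl -X POST "http://localhost:9200/users/_refresh"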

And search by email:

curl "http://localhost:9200/users/user/_search?pretty=1" -d '{
"query": {
"match": {
"email": {
"query": "john doe",
"operator": "and"
}
}
}
}'

Or search by name:

curl "http://localhost:9200/users/user/_search?pretty=1" -d '{
"query": {
"match": {
"name": {
"query": "john d",
"operator": "and"
}
}
}
}'"
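Both searches should return the John Doe document. Heavily abridged, and ignoring metadata that varies between versions, the response looks roughly like this:

{
  "hits": {
    "total": 1,
    "hits": [
      { "_source": { "email": "johndoe@example.com", "name": "John Doe" } }
    ]
  }
}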

To Conclude

Examined under a microscope, the anatomy of an n-gram search reveals fairly simple concepts. Applying those concepts in practice can still be a bit daunting the first time. The hope of this post is to impart some knowledge of how it all works, and to explain the implications of the smaller details.