Elasticsearch: Optimizing your search (2)

Ying Ray Lu
DeepQ Research Engineering Blog
7 min read · Mar 1, 2018

In the previous article, we gained an initial understanding of how to handle documents with Elasticsearch. However, if you want to improve search quality, especially for Mandarin, there is still plenty of work to do. In this article, I'd like to talk about what a "term" is in Elasticsearch and share some solutions for improving search.

Inverted Index

The inverted index is the core data structure of Elasticsearch. It is designed to allow very fast full-text searches. To understand the inverted index, let me borrow the example from the official documentation: we first create two documents, each with a content field containing the following:

  1. The quick brown fox jumped over the lazy dog.
  2. Quick brown foxes leap over lazy dogs in summer.
PUT /ii_example/_doc/1
{
  "content": "The quick brown fox jumped over the lazy dog."
}
PUT /ii_example/_doc/2
{
  "content": "Quick brown foxes leap over lazy dogs in summer."
}
GET /ii_example/_doc/_search

Terms

At index time, the analyzer splits the content of each document into separate words (which we call terms, or tokens):

Term      Doc_1  Doc_2
-------------------------
quick   |   X   |   X
brown   |   X   |   X
dog     |   X   |
dogs    |       |   X
fox     |   X   |
foxes   |       |   X
in      |       |   X
jumped  |   X   |
lazy    |   X   |   X
leap    |       |   X
over    |   X   |   X
summer  |       |   X
-------------------------
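
By the way, if you want to inspect which terms were actually indexed for a particular document, the _termvectors API can list them. A quick sketch, assuming the two documents above have been indexed:

GET /ii_example/_doc/1/_termvectors
{
  "fields": [ "content" ]
}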

Now, if we want to search for a lazy dog, Elasticsearch analyzes the query into the terms lazy and dog, looks them up in the inverted index, and finds the matching documents.

Query terms   ===>   Get matching document IDs

The request body:

GET /ii_example/_doc/_search
{
  "query": {
    "match": {
      "content": "a lazy dog"
    }
  }
}

This same analysis process is applied to the query string at search time: the query text is converted into terms of the same form as those stored in the inverted index. The result looks like this:

Term      Doc_1  Doc_2
-------------------------
lazy    |   X   |   X
dog     |   X   |
-------------------------
Total   |   2   |   1
{
  "_index": "ii_example",
  "_type": "_doc",
  "_id": "1",
  "_score": 0.5753642,
  "_source": {
    "content": "The quick brown fox jumped over the lazy dog."
  }
},
{
  "_index": "ii_example",
  "_type": "_doc",
  "_id": "2",
  "_score": 0.2876821,
  "_source": {
    "content": "Quick brown foxes leap over lazy dogs in summer."
  }
}

For matching exact values in the inverted index, Elasticsearch also provides term-level queries, which do not analyze the query text.
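
For instance, a term query takes the given text as a single, un-analyzed term and looks it up in the inverted index directly. A minimal sketch against the index above: searching for dog matches Doc_1 only, while searching for Dog would match nothing because the indexed terms are lowercased:

GET /ii_example/_doc/_search
{
  "query": {
    "term": {
      "content": "dog"
    }
  }
}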

Analysis

Analysis is the process of converting text into terms which are added to the inverted index for searching.

Analysis is performed by an analyzer, which can be either a built-in analyzer or a custom analyzer defined per index. An analyzer is simply a package of three lower-level building blocks: character filters, a tokenizer, and token filters.

Analyzer Pipeline
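
To see the whole pipeline at once, the _analyze API accepts a character filter, a tokenizer, and token filters in a single request. A minimal sketch using only built-in components (the sections below look at each building block in turn):

GET _analyze
{
  "char_filter": [ "html_strip" ],
  "tokenizer": "whitespace",
  "filter": [ "lowercase" ],
  "text": "<b>The QUICK Brown Fox!</b>"
}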

Character filters (Optional)

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Chinese numerals (零一二三) into their Arabic equivalents (0123), or to strip HTML elements like <b> from the stream.

Let's use the built-in html_strip character filter:

# Try it first!
GET _analyze
{
  "tokenizer": "whitespace",
  "text": "<h1>Quick</h1> brown fox!"
}
# Then add the char_filter!
GET _analyze
{
  "tokenizer": "whitespace",
  "char_filter": [ "html_strip" ],
  "text": "<h1>Quick</h1> brown fox!"
}

Tokenizers (Must have exactly one)

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens (terms).

# whitespace tokenizer
"Quick brown fox!" ===> [Quick, brown, fox!]

Elasticsearch provides the _analyze API for testing and viewing the terms produced by an analyzer. Let's use it to test the whitespace tokenizer:

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "Quick brown fox!"
}

And the response body will look like this:

{
  "tokens": [
    {
      "token": "Quick",
      "start_offset": 0,
      "end_offset": 5,
      "type": "word",
      "position": 0
    },
    {
      "token": "brown",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "fox!",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 2
    }
  ]
}

As the result shows, the tokenizer is also responsible for recording the order (position) of each term, as well as the start and end character offsets of the original word that each term represents.

Token filters (Optional)

A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like the from the token stream, and a synonym token filter introduces synonyms into the token stream.

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase" ],
  "text": "The Quick brown fox!"
}
GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase", "stop" ],
  "text": "The Quick brown fox!"
}

Let’s try some cool stuff!

Customize the analyzer

Usually, when we create a new index, we define the custom analyzer under the analysis settings, and, just as important, we must not forget to reference that analyzer in the field mappings:

PUT my_cool_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "cool_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "analyzer": "cool_analyzer"
        }
      }
    }
  }
}

The custom analyzer can also be referred to when running the _analyze API on a specific index, either by name or via a field that uses it:

GET my_cool_index/_analyze
{
  "analyzer": "cool_analyzer",
  "text": "Is this déjà vu?"
}
GET my_cool_index/_analyze
{
  "field": "my_field",
  "text": "Is this déjà vu?"
}

Customize the synonym token filter

Synonyms can replace existing tokens or be added to the token stream. Before using the synonym token filter, we need to define the synonyms in the index's analysis settings:

PUT my_synonym_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      },
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "british, english",
            "queen, monarch"
          ]
        }
      }
    }
  }
}
PUT my_synonym_index/_doc/_mappings
{
  "properties": {
    "my_field": {
      "type": "text",
      "analyzer": "synonym_analyzer"
    }
  }
}

Let's test our analyzer with the _analyze API:

GET my_synonym_index/_analyze
{
  "analyzer": "synonym_analyzer",
  "text": "Elizabeth is the English queen"
}

Also, let's index a document and test synonym_analyzer with a query. Searching for british matches the document because the synonym filter added british to the token stream alongside english at index time:

PUT my_synonym_index/_doc/1
{
  "my_field": "Elizabeth is the English queen"
}
GET my_synonym_index/_doc/_search
{
  "query": {
    "match": {
      "my_field": "british"
    }
  }
}

Mandarin search


If your service provides search in Mandarin, you probably know that the analysis step is not easy to get right. In this part, I use ik-analyzer, a well-known plugin that is well integrated with Elasticsearch.

Installation

Following the previous articles, let's reuse the same environment: docker exec into the container and install the ik-analyzer plugin. After the installation succeeds, we restart the container.

# Open a shell in the Elasticsearch container
docker exec -it handsonelasticsearch_elasticsearch_1 bash
# Inside the container, install the ik-analyzer plugin
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.1/elasticsearch-analysis-ik-6.2.1.zip
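
Then restart the container so Elasticsearch loads the new plugin (using the same container name as above):

docker restart handsonelasticsearch_elasticsearch_1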

Play with IK-analyzer

Before trying it, let's first see how the built-in standard analyzer handles Mandarin via the _analyze API:

GET _analyze
{
  "analyzer": "standard",
  "text": "本日的Elasticsearch教程為(如何最佳化您的搜索。)"
}

The standard analyzer simply breaks Chinese text into individual characters, which is rarely useful for search. The elasticsearch-analysis-ik plugin provides two analyzers: ik_smart, which produces the coarsest-grained segmentation, and ik_max_word, which produces the finest-grained segmentation. Let's try both:

GET _analyze
{
  "analyzer": "ik_smart",
  "text": "本日的Elasticsearch教程為(如何最佳化您的搜索。)"
}

and

GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "本日的Elasticsearch教程為(如何最佳化您的搜索。)"
}

Add the synonym token filter

Let's try a different configuration of the synonym token filter. This time, we create a file named synonyms.txt and put it under the config/analysis path. Since the Docker setup from the earlier installation maps the config folder into the container, we can edit synonyms.txt from outside the container. In the file, a line with => maps the terms on the left to the term(s) on the right, while a plain comma-separated line expands each term into all of the listed synonyms.

# synonyms.txt
ipod, i-pod, i pod => ipod
番茄, 西紅柿
台大 => 台灣大學

Then, we create an index and set up the analyzer:

PUT mandarin
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "mandarin_analyzer": {
            "tokenizer": "ik_smart",
            "filter": [
              "synonym"
            ]
          }
        },
        "filter": {
          "synonym": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          }
        }
      }
    }
  }
}
PUT mandarin/_doc/_mappings
{
  "properties": {
    "my_field": {
      "type": "text",
      "analyzer": "mandarin_analyzer"
    }
  }
}

After setting up the analyzer, we can try it. Since 番茄 and 西紅柿 are defined as synonyms of each other, both terms should appear in the output:

GET mandarin/_analyze
{
  "analyzer": "mandarin_analyzer",
  "text": "番茄"
}
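
As with the English synonyms earlier, we can also check the behaviour at search time. A small sketch, assuming ik_smart keeps 番茄 and 西紅柿 as single tokens so the synonym mapping applies (the document ID 1 is just for illustration):

PUT mandarin/_doc/1
{
  "my_field": "番茄"
}
GET mandarin/_doc/_search
{
  "query": {
    "match": {
      "my_field": "西紅柿"
    }
  }
}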

Conclusion

In this article, we looked at the inverted index, a key part of Elasticsearch. We then learned what a term is and how Elasticsearch produces terms. Finally, we played with the built-in analyzers, created several custom analyzers, and installed a plugin to meet our requirements. For a deeper understanding of analyzers and plugins, I provide some useful tools and documents for your reference below.

Also, you can find the full source code in my GitHub repo. Thanks for reading, and hopefully this tutorial is helpful for you! :D
