Elasticsearch Query Optimization for improving search relevance — Part II

Varun Subramanian · Published in CodeX · 10 min read · Apr 25, 2022

In Part I, we learnt how to draft a series of search queries for various scenarios, starting with simple ones and building up to much more complex ones. In this article, our focus shifts to the explain API: how it can help us write efficient search queries and understand the internals of the Elasticsearch scoring algorithm.

By the end of this article, you will be able to:

  1. Understand the key metrics mentioned in the explain API output.
  2. Comprehend how scoring works in Elasticsearch search queries.
  3. Use the explain API to write everything from simple to complex search queries.
  4. Refine your search query based on the explain metrics.

Recap:

In Part I, we created a book catalogue that allowed users to search for relevant books based on certain criteria.

An example of a book record looks like the following:

{
"title": "Designing Hard Software",
"isbn": "133046192",
"pageCount": 350,
"publishedDate": "1997-02-01T00:00:00.000-0800",
"shortDescription": "\"This book is well written ... The author does not fear to be controversial. In doing so, he writes a coherent book.\" --Dr. Frank J. van der Linden, Phillips Research Laboratories",
"longDescription": "Have you ever heard, \"I can't define a good design but I know one when I see it\" Designing Hard Software discusses ways to develop software system designs that have the same tangibility and visibility as designs for hard objects like buildings or computer hardware. It emphasizes steps called \"essential tasks\" which result in software specifications that show how each requirement, including robustness and extensibility, will be satisfied. All software developers and managers seeking to develop \"hard\" software will benefit from these ideas. There are six essential tasks necessary for a good design: User (run-time) requirements Development sponsor (build-time) requirements Domain information Behavior identification and allocation Behavior description Software system architecture Designing Hard Software goes beyond the standard software development methodologies such as those by Booch, Rumbaugh, Yourdon, and others, by providing techniques for a complete system architecture as well as explicit measures of the goodness of design. So, \"you define a good design.\"",
"status": "PUBLISH",
"authors": [
"Douglas W. Bennett"
],
"categories": [
"Object-Oriented Programming",
"S"
]
}

Note: All records used for this example are available here

Books have been indexed using the below mapping:

PUT books
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "longDescription": {
        "type": "text"
      },
      "shortDescription": {
        "type": "text"
      }
    }
  }
}

Note: In the interest of focus, we have created mappings only for the fields for which search will be performed.

Overview of Elasticsearch scoring algorithm:

Elasticsearch originally used TF-IDF as its default similarity algorithm, but has used BM25 (Best Matching 25) ever since the introduction of Lucene 6.

Given a query Q with keywords q₁…qₙ, the BM25 similarity score of a document D is defined as (not for the faint-hearted):

    score(D, Q) = Σᵢ IDF(qᵢ) · f(qᵢ, D) · (k1 + 1) / (f(qᵢ, D) + k1 · (1 - b + b · |fieldLen| / avgFieldLen))

Where:

  • f(qᵢ,D) : term frequency of qᵢ in document D
  • |fieldLen| : length of document in words
  • k1 : the term frequency saturation parameter, with a default value of 1.2. It limits how much a single query term can affect the score of a given document
  • b : the length normalization parameter. The higher the value of b, the more the effect of document length relative to the average length is amplified. Defaults to 0.75
  • avgFieldLen : average document length in the text collection
  • IDF : inverse document frequency, defined as log(1 + (docCount - f(qᵢ) + 0.5) / (f(qᵢ) + 0.5)), where:
  • f(qᵢ) : number of documents containing qᵢ
  • docCount : total number of documents in the collection

tf (term frequency) is calculated as:

freq / (freq + k1 * (1 - b + b * fieldLen / avgFieldLen))
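The two formulas above can be sketched in Python to make the moving parts concrete (a minimal illustration of the math, not Lucene's actual implementation; the sample values plugged in below come from the explain output discussed later in this article):

```python
import math

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """Term-frequency component of BM25, with document-length normalization."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

def bm25_idf(n, N):
    """Inverse document frequency: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

# Values from the explain output for the longDescription field
# of the "Subversion in Action" document:
tf = bm25_tf(freq=1.0, dl=136.0, avgdl=191.19148)  # ≈ 0.51541185
idf_open = bm25_idf(n=41, N=235)                   # the term "open", ≈ 1.7381384
idf_source = bm25_idf(n=56, N=235)                 # the term "source", ≈ 1.4295912
```

Note that Lucene's `log` here is the natural logarithm, which is why `math.log` reproduces the explain output exactly.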

For more information on the algorithm, refer to the Elasticsearch documentation on BM25.

Let the search begin:

Continuing from the previous article, we will be searching for the phrase Open Source in the books index. The search will focus on three fields: categories, longDescription, and shortDescription.

GET books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword"
            ],
            "type": "phrase"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories.keyword": {}
    }
  }
}

We get 36 matched documents with a max_score of 8.118633.

Sample Response:

{
"took": 746,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 36,
"relation": "eq"
},
"max_score": 8.118633,
"hits": [
{
"_index": "books",
"_type": "_doc",
"_id": "TjEgUYABjdRWGf7Bqv0J",
"_score": 8.118633,
"_source": {
"title": "Subversion in Action",
"isbn": "1932394478",
"pageCount": 356,
"publishedDate": "2004-12-01T00:00:00.000-0800",
"thumbnailUrl": "https://s3.amazonaws.com/AKIAJC5RLADLUMVRPFDQ.book-thumb-images/machols.jpg",
"shortDescription": "Learn all about this new open source version control application and why it is replacing CVS as the standard. Examples demonstrate how to customize features to deal with day-to-day problems.",
"longDescription": "A new-generation version control tool, Subversion is replacing..",
"status": "PUBLISH",
"authors": [
"Jeffrey Machols"
],
"categories": [
"Java"
]
},
"highlight": {
"longDescription": [
"A new-generation version control tool, Subversion is replacing the current <em>open</em> <em>source</em> standard, CVS."
],
"shortDescription": [
"Learn all about this new <em>open</em> <em>source</em> version control application and why it is replacing CVS as the standard"
]
}
}
...
]
}
}

Let us dive deeper and try to understand the ranking of the results. By adding the “explain”: true option to the search query, Elasticsearch will provide detailed information on how the score was computed for each search result.

GET books/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": []
          }
        },
        ...
      ]
    }
  }
}

Below is the output for the search result with the explanation. Don’t be alarmed by the response because we will be breaking it down for better understanding in the upcoming sections.

"_explanation": {
"value": 8.118633,
"description": "max of:",
"details": [
{
"value": 3.591908,
"description": """weight(longDescription:"open source" in 152) [PerFieldSimilarity], result of:""",
"details": [
{
"value": 3.591908,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 3.1677296,
"description": "idf, sum of:",
"details": [
{
"value": 1.7381384,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 41,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 235,
"description": "N, total number of documents with field",
"details": []
}
]
},
{
"value": 1.4295912,
"description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details": [
{
"value": 56,
"description": "n, number of documents containing term",
"details": []
},
{
"value": 235,
"description": "N, total number of documents with field",
"details": []
}
]
}
]
},
{
"value": 0.51541185,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"value": 1.0,
"description": "phraseFreq=1.0",
"details": []
},
{
"value": 1.2,
"description": "k1, term saturation parameter",
"details": []
},
{
"value": 0.75,
"description": "b, length normalization parameter",
"details": []
},
{
"value": 136.0,
"description": "dl, length of field (approximate)",
"details": []
},
{
"value": 191.19148,
"description": "avgdl, average length of field",
"details": []
}
]
}
]
}
]
},
{
"value": 8.118633,
"description": """weight(shortDescription:"open source" in 152) [PerFieldSimilarity], result of:""",
"details": [
{
"value": 8.118633,
"description": "score(freq=1.0), computed as boost * idf * tf from:",
"details": [
{
"value": 2.2,
"description": "boost",
"details": []
},
{
"value": 6.77204,
"description": "idf, sum of:",
"details": [
{
"scoring": "omitted for brevity"
}
]
},
{
"value": 0.54493,
"description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details": [
{
"scoring": "omitted for brevity"
}
]
}
]
}
]
}
]
}

Understanding the Scoring from Search Results:

In this section, we will try to understand how the scoring actually works under the hood.

To begin with, consider the first document, with isbn 1932394478, from the above search result. As the explanation shows, the phrase Open Source was found in both the longDescription and shortDescription fields.

As we know,

   tf = freq / (freq + k₁ * (1 - b + b * fieldLen / avgFieldLen))

In our case, for the document with isbn 1932394478, the PerFieldSimilarity values for the phrase Open Source in longDescription are as follows:

1. freq = 1 since Open Source occurs only once in this document.

2. k₁ = 1.2; default value

3. b = 0.75; default value

4. dl = 136; the approximate length of the longDescription field in this document

5. avgdl = 191.19148; the average length of the longDescription field in shard [books][0]

Plugging the values into the formula, we arrive at tf = 0.51541185

With tf calculated, the next step is to compute the idf (inverse document frequency) using the formula:

              idf = log(1 + (N - n + 0.5) / (n + 0.5))

For the word Open, N = 235 (in shard [books][0]) and n = 41; we arrive at idf = 1.7381384

For the word Source, N = 235 (in shard [books][0]) and n = 56; we arrive at idf = 1.4295912

Summing the scores gives the idf as 3.1677296

The default boost score is set as 2.2 and comes from the value (k1 + 1) as specified in the BM25 formula.

So, the final score in our case for this search phrase in longDescription is:

    tf * idf * (k1+1) = 0.51541185 * 3.1677296 * 2.2 = 3.591908

Similarly, scores are calculated for each field given in the search params, and the score for shortDescription is:

       tf * idf * (k1+1) = 0.54493 * 6.77204 * 2.2 = 8.118633
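As a quick sanity check, we can reproduce both final scores from the tf and idf values reported in the explain output:

```python
k1 = 1.2
boost = k1 + 1  # 2.2, the default query boost

# longDescription: tf and summed idf from the explain output
score_long = 0.51541185 * 3.1677296 * boost   # ≈ 3.591908

# shortDescription: tf and summed idf from the explain output
score_short = 0.54493 * 6.77204 * boost       # ≈ 8.118633
```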

Since we are using the type phrase, the document's final score is the max across the queried fields (note the “max of:” in the explanation), and the highest-scoring document is returned first.

Boosting Our Search:

Sometimes we would want to give more preference to certain fields over the others. For such scenarios, we need to make use of the boosting functionality and try to influence the search result.

Now let us apply boosting to the categories field and see how it impacts the search results.

GET books/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword^50"
            ],
            "type": "phrase"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories.keyword": {}
    }
  }
}

Sample Response

{
"took": 15,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 36,
"relation": "eq"
},
"max_score": 250.57788,
"hits": [
{
"_index": "books",
"_type": "_doc",
"_id": "tjEgUYABjdRWGf7BqvwI",
"_score": 250.57788,
"_source": {
"title": "Unlocking Android",
"isbn": "1933988673",
"pageCount": 416,
"publishedDate": "2009-04-01T00:00:00.000-0700",
...
]
}
}

Notice that the final score for the topmost document is now score * additional boost = 250.57788. The topmost document also differs from the one in the previous search query, since boosting has changed the ranking.

This approach is useful when users want to give more weight to selected fields for better search results.
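A field boost multiplies that field's BM25 score before the per-field max is taken, which is how it can change which field, and therefore which document, comes out on top. A toy sketch (the per-field scores below are made-up numbers for illustration only):

```python
def best_field_score(field_scores, boosts):
    """Per-field boosts multiply each field's BM25 score;
    a phrase multi_match then takes the max across fields."""
    return max(score * boosts.get(field, 1.0)
               for field, score in field_scores.items())

# Hypothetical per-field BM25 scores for one document:
scores = {"shortDescription": 8.1, "longDescription": 3.6, "categories.keyword": 5.0}

unboosted = best_field_score(scores, {})                        # shortDescription wins
boosted = best_field_score(scores, {"categories.keyword": 50})  # categories.keyword wins
```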

Extending it further

So far we have been returning search results based on matching phrases, but what if we want to return partial matches as well?

GET books/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Source",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword^50"
            ],
            "type": "phrase"
          }
        },
        {
          "multi_match": {
            "query": "Open Source",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories^40"
            ],
            "type": "most_fields"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories": {},
      "categories.keyword": {}
    }
  }
}

Sample Response

{
"took": 27,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 61,
"relation": "eq"
},
"max_score": 530.05334,
"hits": [
{
"_shard": "[books][0]",
"_node": "PiILupiCSs21etzGXXfN0g",
"_index": "books",
"_type": "_doc",
"_id": "ETEgUYABjdRWGf7Bqv0J",
"_score": 530.05334,
"_source": {
"title": "Lucene in Action, Second Edition",
"isbn": "1933988177",
...
]
}
}

The max_score for the topmost document would be 530.05334 for the document with isbn 1933988177.

The results would vary depending on the type selected and it is important to understand every type before starting to write complex search queries.

For example, if we had chosen the type best_fields instead of most_fields, the document with isbn 1933988673 would have been returned first.

This is because best_fields takes the max of the per-field scores, whereas most_fields sums them.
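The difference between the two can be sketched with hypothetical per-field scores (made-up numbers; only the combination rule matters):

```python
# Hypothetical per-field scores for two documents:
doc_a = [9.0, 1.0, 1.0]   # one very strong field
doc_b = [5.0, 4.0, 3.5]   # several moderately strong fields

# best_fields ranks by the single strongest field; most_fields rewards
# documents that match the query in many fields at once.
best_fields = {"a": max(doc_a), "b": max(doc_b)}  # doc_a ranks first
most_fields = {"a": sum(doc_a), "b": sum(doc_b)}  # doc_b ranks first
```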

For more information on the different supported types, refer to the Elasticsearch multi_match documentation.

Summary:

Writing search queries can be complex. However, understanding how scoring works can help achieve the desired results.

Here is a guideline for writing search queries:

Step 1 — Identify the Requirement: Determine the expected search results.

Step 2 — Planning: After identifying the requirements, plan by identifying the fields for which the search query will be performed.

Step 3 — Boosting: Give preference to certain fields using the boost function. Be cautious in choosing the boost value as over-boosting can lead to undesired results.

Step 4 — Writing the Query: Write the actual query considering Steps 1 to 3.

Step 5 — Explain API: Use the explain API to understand the result and make necessary adjustments.

Step 6 — Repeat: If the result is not satisfactory, return to Step 2 and repeat the process until the expected result is achieved.

For a better understanding, use a sample dataset and follow these guidelines to check if you can achieve the desired ranking and search results.

This two-part series aims to help you understand how to write search queries and the Elasticsearch internal scoring algorithm.

Keep the following in mind while developing your search queries:

  1. Keyword field types are not intended for partial matches as they can negatively impact search results. Use keyword field types only when searching for a single word for an exact word match.
  2. Limit the use of keyword fields as overutilization can lead to improper search results.
  3. Only boost fields when absolutely necessary. Over-boosting can lead to undesired results.
  4. Use field types in match queries based on the need and requirement.
  5. It’s often better to filter first, then perform a search operation.
  6. Use the function score to reduce the score and relevance of older documents.
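For instance, tip 6 could be sketched with a function_score query that applies a gauss decay on publishedDate (a sketch only; the scale and decay values are assumptions you would tune for your data):

```
GET books/_search
{
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "Open Source",
          "fields": ["shortDescription", "longDescription"]
        }
      },
      "functions": [
        {
          "gauss": {
            "publishedDate": {
              "origin": "now",
              "scale": "1825d",
              "decay": 0.5
            }
          }
        }
      ],
      "boost_mode": "multiply"
    }
  }
}
```

Here a book published roughly 1825 days (about five years) before now would have its relevance score halved, so older documents sink in the ranking without being filtered out.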
