Elasticsearch Query optimisation techniques for relevant search results — Part 1

Published in CodeX, Feb 5, 2022

Elasticsearch, as the name implies, is primarily used for its incredibly fast search capabilities. But have you ever wondered how scoring happens in the background and why some documents score higher than others? In this two-part series, we will first focus on how to construct simple to complex search queries. Following that, we will explain the calculations that occur behind the scenes using the explain API.

Search Results Ranking Basics:

Elasticsearch previously used TF-IDF as its default similarity algorithm, but switched to BM25 (Best Matching 25) in Elasticsearch 5.0, which is built on Lucene 6.

Here’s a simple explanation of how search results are scored, based on the following criteria:

Term Frequency (TF): The number of times the search term appears in the field being searched. More occurrences raise the score, with diminishing returns.
Document length: The length of the field containing the search term relative to the average length of that field across the index. A match in a shorter field scores higher than the same match in a longer one.
Inverse Document Frequency (IDF): The ratio of the number of documents that contain a value for the search field to the number of documents containing the search term in that field. In simpler terms, it measures how rare the search term is across the documents in the index. The score is higher if rare search terms are present.
Elasticsearch sums the scores computed for each search term.
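To make these factors concrete, here is a minimal Python sketch of a BM25-style score for a single field. It assumes the commonly documented defaults k1 = 1.2 and b = 0.75 and a Lucene-style IDF. The function names and the statistics are made up for illustration, and the constant factors that Elasticsearch reports can differ between versions, so read the numbers as relative rather than exact.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """Rarity of a term: higher when fewer documents contain it."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf(freq, field_len, avg_field_len, k1=1.2, b=0.75):
    """Saturating term frequency, normalised by field length."""
    norm = 1 - b + b * field_len / avg_field_len
    return freq / (freq + k1 * norm)

def bm25_score(terms, field_len, avg_field_len, doc_count):
    """Sum the per-term scores.

    `terms` maps each search term to a tuple of
    (occurrences in this field, number of documents containing the term).
    """
    return sum(
        bm25_idf(doc_count, doc_freq) * bm25_tf(freq, field_len, avg_field_len)
        for freq, doc_freq in terms.values()
    )

# Made-up statistics: "search" is rarer than "open" in a 400-document index.
stats = {"open": (2, 120), "search": (1, 15)}
short_field = bm25_score(stats, field_len=40, avg_field_len=100, doc_count=400)
long_field = bm25_score(stats, field_len=300, avg_field_len=100, doc_count=400)
print(short_field > long_field)  # the same matches count for more in a shorter field
```

The sketch only mirrors the three criteria listed above; real scores also depend on boosts and on per-shard statistics.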

Learning with a scenario:

Let us assume we have a books catalogue (books index) where category, a short description and a long description are primarily used for finding relevant books. We start by writing a simple search query to find the relevant books from the catalogue.

{
  "title": "Unlocking Android",
  "isbn": "1933988673",
  "pageCount": 416,
  "categories": [
    "Open Source",
    "Mobile"
  ],
  "shortDescription": "Unlocking Android: A Developer’s Guide provides concise, hands-on instruction for the Android operating system and development tools. This book teaches important architectural concepts in a straightforward writing style and builds on this with practical and useful examples throughout.",
  "longDescription": "Android is an open source mobile phone platform based on the Linux operating system and developed by the Open Handset Alliance, a consortium of over 30 hardware, software and telecom companies that focus on open standards for mobile devices. Led by search giant, Google, Android is designed to deliver a better and more open and cost effective mobile experience. Unlocking Android: A Developer’s Guide provides concise, hands-on instruction for the Android operating system and development tools. This book teaches important architectural concepts in a straightforward writing style and builds on this with practical and useful examples throughout. Based on his mobile development experience and his deep knowledge of the arcane Android technical documentation, the author conveys the know-how you need to develop practical applications that build upon or replace any of Androids features, however small. Unlocking Android: A Developer’s Guide prepares the reader to embrace the platform in easy-to-understand language and builds on this foundation with re-usable Java code examples. It is ideal for corporate and hobbyists alike who have an interest, or a mandate, to deliver software functionality for cell phones. WHAT’S INSIDE: * Android’s place in the market * Using the Eclipse environment for Android development * The Intents — how and why they are used * Application classes: o Activity o Service o IntentReceiver * User interface design * Using the ContentProvider to manage data * Persisting data with the SQLite database * Networking examples * Telephony applications * Notification methods * OpenGL, animation & multimedia * Sample Applications",
  "status": "PUBLISH",
  "authors": [
    "W. Frank Ableson",
    "Charlie Collins",
    "Robi Sen"
  ]
}

Note: A step-by-step guide on how to reproduce the queries is provided at the end of the article.

Let’s begin our search journey:

We start by trying to find the phrase “Open Search” in the books index.

Scenario 1: Simple search:

Search for the phrase “Open Search” in the title field.

GET books/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Open Search"
      }
    }
  },
  "highlight": {
    "fields": {
      "title": {}
    }
  }
}

The search returns 100+ hits for “Open Search”. The results might seem relevant at first glance, but none of them is, since they don’t talk about Lucene (a well-known open source search library).

In reality, the phrase “Open Search” can also appear in the short description and long description fields, which this query does not consider at all. In the upcoming sections, we look at how different fields influence the search results.

Scenario 2: Adding more fields to the query:

Extending Scenario 1, let us now search for the same phrase “Open Search” in multiple fields: categories, longDescription and shortDescription.

GET books/_search
{
  "explain": false,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Search",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword"
            ],
            "type": "phrase"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories.keyword": {}
    }
  }
}

The results change considerably: 36 documents match, with a max_score of 8.118633. Surprisingly, the document that ranked first in Scenario 1 now appears somewhere in the middle.

Note: The type “phrase” runs a phrase match against each field and uses the score of the best-matching field as the document’s score.

Scenario 3: Boosting our Search:

In certain use cases, we may want to influence the results by giving preference to certain fields. To achieve this, we use the boost functionality. In simple terms, boosting gives more weight to selected fields so that matches in those fields rank higher.

GET books/_search
{
  "explain": false,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Search",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword^50"
            ],
            "type": "phrase"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories.keyword": {}
    }
  }
}

Note: To boost a field, append the caret symbol (^) followed by the boost factor to the field name, for example categories.keyword^50.
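As an illustration of the caret syntax, the following Python sketch builds a multi_match request body from a mapping of field names to boost factors. The helper names are hypothetical and the boost of 50 is just an example value; the resulting dictionary can be serialised to JSON and sent to the _search endpoint with any HTTP client.

```python
import json

def boosted_fields(boosts):
    """Turn {"field": boost} into multi_match field strings such as "field^50"."""
    return [
        name if boost == 1 else f"{name}^{boost}"
        for name, boost in boosts.items()
    ]

def phrase_query(text, boosts):
    """Build a phrase-type multi_match body with per-field boosts."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": boosted_fields(boosts),
                "type": "phrase",
            }
        }
    }

body = phrase_query(
    "Open Search",
    {"shortDescription": 1, "longDescription": 1, "categories.keyword": 50},
)
print(json.dumps(body, indent=2))
```

Keeping the boosts in one mapping makes it easier to experiment with different weights without editing the query by hand.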

Scenario 4: Extending it further:

Now let us say we also want to return partial matches for the phrase “Open Search” in addition to exact matches.

GET books/_search
{
  "explain": true,
  "query": {
    "bool": {
      "should": [
        {
          "multi_match": {
            "query": "Open Search",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories.keyword^50"
            ],
            "type": "phrase"
          }
        },
        {
          "multi_match": {
            "query": "Open Search",
            "fields": [
              "shortDescription",
              "longDescription",
              "categories^40"
            ],
            "type": "most_fields"
          }
        }
      ]
    }
  },
  "highlight": {
    "fields": {
      "shortDescription": {},
      "longDescription": {},
      "categories": {},
      "categories.keyword": {}
    }
  }
}

The topmost document is now the one with ISBN 1933988177 (Lucene in Action, Second Edition), with a max_score of 530.05334.

The results vary depending on the type selected, so it is important to understand each type before writing complex search queries. For example, if we had chosen “best_fields” instead of “most_fields”, the document with ISBN 1933988673 would have been returned on top. This is because “best_fields” uses the score of the single best-matching field (the max), whereas “most_fields” combines the scores of all matching fields (the sum). The multi_match query supports other types as well, such as cross_fields, phrase_prefix and bool_prefix.
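The difference between taking the max and taking the sum can be shown with a toy Python sketch. The per-field scores below are invented for illustration and do not come from the books index; the two functions only mimic how each type combines field scores and ignore options such as tie_breaker.

```python
def best_fields(field_scores):
    """Score a document by its single best-matching field."""
    return max(field_scores.values(), default=0.0)

def most_fields(field_scores):
    """Score a document by adding up the scores of all matching fields."""
    return sum(field_scores.values())

def rank(docs, combine):
    """Return document ids ordered from highest to lowest combined score."""
    return sorted(docs, key=lambda doc_id: combine(docs[doc_id]), reverse=True)

# Hypothetical per-field scores for two documents.
docs = {
    "book_a": {"shortDescription": 6.0, "longDescription": 0.5, "categories": 0.0},
    "book_b": {"shortDescription": 3.0, "longDescription": 3.0, "categories": 2.5},
}

print(rank(docs, best_fields))  # book_a first: one very strong field wins
print(rank(docs, most_fields))  # book_b first: several moderate matches add up
```

A document that matches moderately in many fields can overtake one that matches strongly in a single field once the type changes, which is the kind of reordering described above.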

Summary:

In summary, delving deeper into the tuning of Elasticsearch (ES) queries and field boosting reveals the significance of each parameter. When developing your search queries, consider the following:

Keyword fields are not analysed, so they are not designed for partial matches and can negatively affect search results. Use keyword fields only when you need an exact match on the whole field value.
Keyword fields are best suited to a limited set of distinct values, such as categories. Overusing them can lead to incorrect search results.
Only boost fields when absolutely necessary. Over-boosting can yield undesired results.
The multi_match type used in a query should align with your specific needs and requirements.
It’s often better to filter first, then perform a search operation. Filter clauses do not contribute to the score and can be cached.
Use a function_score query to reduce the score and relevance of older documents.
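The last two points can be combined in one request. The Python sketch below builds such a body under two assumptions that are not part of the mapping in this article: that status has a status.keyword sub-field (as dynamic mapping would create) and that the documents carry a date field, here hypothetically named publishedDate. The scale and decay values are arbitrary starting points.

```python
import json

def recent_books_query(text, status="PUBLISH", date_field="publishedDate"):
    """Filter by status first, score the text match, then decay older documents."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Filter context: narrows the candidates without affecting the score.
                        "filter": [{"term": {"status.keyword": status}}],
                        # Query context: these clauses produce the relevance score.
                        "must": [
                            {
                                "multi_match": {
                                    "query": text,
                                    "fields": ["shortDescription", "longDescription"],
                                }
                            }
                        ],
                    }
                },
                # Gaussian decay: the multiplier shrinks as the date moves away from now.
                "functions": [
                    {"gauss": {date_field: {"origin": "now", "scale": "365d", "decay": 0.5}}}
                ],
                "boost_mode": "multiply",
            }
        }
    }

body = recent_books_query("Open Search")
print(json.dumps(body, indent=2))
```

With boost_mode set to multiply, a document one scale (365 days) away from the origin keeps roughly half of its text score.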

In Part 2, we will be focusing on explaining what is happening under the hood by using the explain API.

Exercise:

Step 1: Download the sample books dataset from the following link

Step 2: Create an index called books with the mappings given below. For the purposes of this article, we have created mappings only for the fields against which the search is performed.

PUT books
{
  "mappings": {
    "properties": {
      "categories": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "longDescription": {
        "type": "text"
      },
      "shortDescription": {
        "type": "text"
      }
    }
  }
}

Note: It is important to create explicit mappings for better performance and less overhead.

Step 3: Upload the JSON into Elasticsearch using the bulk API and start your search query journey.
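The bulk API expects newline-delimited JSON in which every document is preceded by an action line. As a rough sketch, assuming the dataset has been loaded into a Python list of dictionaries, the body can be prepared like this. The helper name is hypothetical and the two sample records are trimmed to a couple of fields.

```python
import json

def to_bulk_body(docs, index="books"):
    """Convert a list of documents into an NDJSON body for the _bulk endpoint."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(doc))                           # document source
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"

sample = [
    {"title": "Unlocking Android", "isbn": "1933988673"},
    {"title": "Lucene in Action, Second Edition", "isbn": "1933988177"},
]
body = to_bulk_body(sample)
print(body)
```

The resulting string can then be sent as a POST request to the _bulk endpoint with the Content-Type header set to application/x-ndjson.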