Querying Elasticsearch documents — Part 2

Published in

Betacom

7 min readFeb 8, 2021

Introduction

In this article, we will continue to learn how to query documents in Elasticsearch. If you are not familiar with this topic, please make sure to check Querying Elasticsearch documents — Part 1 before reading this one.

In the first section we will cover bool queries and learn how to add the boolean logic to the research we do, whereas in the last two sections we will discuss join relationships and how to query that kind of documents.

Bool queries

The bool query is a compound query in which we can use the boolean logic to combine other queries. It is built using one or more boolean clauses, each one with a typed occurrence which can be one of the following:

“must” indicates that the query must appear in matching documents and will contribute to the final score;
“filter” says that the query must appear in matching documents and is executed in filter context, meaning that scoring is ignored;
“should” means that if the clause appears in the matching documents then their score is higher than the one of documents which don’t match the clause;
“must_not” indicated that the clause must not appear in the matching documents and is executed in filter context therefore scoring is ignored.

Note that the “must”, “must_not” and “filter” clauses correspond to the SQL “where” clause since there is no concept of score in SQL. For the same reason, the “should” clause has not an SQL correspondence.

There also is a minimum_should_match parameter which can be used to specify the number or percentage of “should” clauses that the returned documents must match. If the bool query includes at least one “should” clause and no “must” or “filter” clauses, the default value for this parameter is 1. Otherwise, its default value is 0.

Before digging into some examples, let’s create a new index named “recipe” and insert into it some documents. You can easily do it by running the request available here into the Kibana Dev Tool. The documents will represents recipes and have the following fields:

“created” contains the creation timestamp,
“description” is the recipe description,
“ingredients” is an array of ingredients, each of one containing the fields “name” and “quantity”,
“preparation_time_minutes” contains the minutes needed to prepare the recipe,
“ratings” is an array containing the recipe ratings,
“servings” is an objects containing the “max” and “min” fields which represents the maximum and minimum number of servings we can get with the recipe,
“steps” is an array of all the steps to follow to prepare the recipe,
“title” is the recipe name.

An example of bool query could be searching for recipes whose list of ingredients contains parmesan but not tuna and which will be ready in less than 15 minutes:

GET /recipe/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "ingredients.name": "parmesan"
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "ingredients.name": "tuna"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "preparation_time_minutes": {
              "lte": 15
            }
          }
        }
      ]
    }
  }
}

Now that you learned bool queries, you should know that the match query we studied in the previous article internally builds a bool query: the text provided is analyzed and the analysis process constructs a boolean query from the provided text. It means that a match query corresponds to a bool query of type “should” (“must” if the match query has "operator": "and") in which there is a term query for each word of the text we are looking for. Thus the following two queries are equivalent:

GET /recipe/_search
{
  "query": {
    "match": {
      "title": "pasta carbonara"
    }
  }
}
GET /recipe/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "title": "pasta"
          }
        },
        {
          "term": {
            "title": "carbonara"
          }
        }
      ]
    }
  }
}

Joining queries

Performing full SQL-style joins in a distributed system like Elasticsearch is prohibitively expensive. Instead, Elasticsearch offers two forms of join which are designed to scale horizontally: nested query and has_child and has_parent queries. In this section we will se them in detail.

The nested query is used for nested data types since they cannot be queried by boolean queries. The request parameters are:

“path”, which indicates which field contains the objects we are querying,
“query”, which contains the query we would like to perform.

Let’s create a new index named “department” and try some examples. You can find the request for index and documents creation here. The documents will have the following fields:

“name” is the department name,
“employees” is an array of nested objects representing the employees, each of them contains the fields “name”, “age”, “gender” and “position”.

Let’s now run a nested query to look for all the female employees which are interns:

GET /department/_search
{
  "query": {
    "nested": {
      "path": "employees",
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "employees.position": "intern"
              }
            },
            {
              "term": {
                "employees.gender.keyword": {
                  "value": "F"
                }
              }
            }
          ]
        }
      }
    }
  }
}

Note that if we add "inner_hits": {} to the previous query, we can see the nested queries results in the output.

The has_child and has_parent queries are based on a join field relationship which can be specified at index creation time in the mapping properties:

"fieldName": {
    "type": "join",
    "relations": {
        "parentFieldName": "childFieldName"
    }
}

Before writing an example, we need to delete the “department” index and re-create it defining the join relationship:

DELETE departmentPUT /department
{
  "mappings": {
    "properties": {
      "join_field": { 
        "type": "join",
        "relations": {
          "department": "employee"
        }
      }
    }
  }
}

Let’s now index some documents. We will first add the departments and then the employees for departments. You can find the requests to do it here. Please note that when indexing the child documents we need to specify the routing parameter since both parent and children documents have to be saved in the same shard.

There are two ways to query documents by parent. The first one is to specify the parent_id:

GET /department/_search
{
  "query": {
    "parent_id": {
      "type": "employee", # child field name
      "id": 1 # parent document id
    }
  }
}

The second way is using the has_parent query. It allows you to specify the criteria to look for the parent document and returns its children. For example, we can look for all the employees in the “Development” department by using this request:

GET /department/_search
{
  "query": {
    "has_parent": {
      "parent_type": "department", # parent field name
      "query": {
        "term": {
          "name.keyword": "Development"
        }
      }
    }
  }
}

Please note that the score of the parent research is ignored by default, but you can ask to count it by adding "score": true to your request.

There is obviously a has_child query which returns the parent of the documents matching the given query. For example, let’s look for the department that has employees older than 50 who should be males:

GET /department/_search
{
  "query": {
    "has_child": {
      "type": "employee",
      "query": {
        "bool": {
          "must": [
            {
              "range": {
                "age": {
                  "gte": 50
                }
              }
            }
          ],
          "should": [
            {
              "term": {
                "gender.keyword": "M"
              }
            }
          ]
        }
      }
    }
  }
}

The relevance score of the child query can be taken into account in different ways. We can specify which way to use via the score_mode parameter:

“min” considers the minimum score,
“max” considers the maximum score,
“sum” sums of all scores,
“avg” computes the score average,
“none” (default) means no score is added.

We can also specify the minimum (maximum) number of children that have to match the query in order to insert their parent into the result. It can be done using the min_children (max_children) parameter.

Multi-level relationships

Let’s start with an example.

Suppose we have a company containing departments which in turn have employees. The company is also linked to different suppliers, as shown in the picture. In order to represent it in Elasticsearch, we first need to create a new index “company” and define all the relations we need. You can find all the requests here.

Now that the index is defined and documents are created, we can run an example query to look for companies that have a department in which a person named John Doe works:

GET /company/_search
{
  "query": {
    "has_child": {
      "type": "department",
      "query": {
        "has_child": {
          "type": "employee",
          "query": {
            "term": {
              "name.keyword": "John Doe"
            }
          }
        }
      }
    }
  }
}

Please note that using multiple levels of relations to replicate a relational model is not recommended since each level adds an overhead at query time in terms of memory and computation.

Join limitations and performances

The limitations we can encounter using the join relationship can be summarized as follows.

The documents we want to join have to be on the same index. If it wans’t like that, queries would have very low performances.
All children documents have to be indexed on the same shard as their parent.
It is possible to have at most one join field per index. That should not be a problem since we can map as many join relationships as we need for that field. Please remember that a child document can be added only after the parent document has been indexed.
Each document can have multiple children but at most one parent.

Join relationships are expensive and should be avoided when possible. The cost in term of query execution obviously depends on the amount of data we are working with: the more documents the lower the queries are executed.

A situation where using a join field is a good choice is when you have a 1 to many relationship between two kinds of documents and there are much more documents of one type than the other.

Conclusion

You know learnt two types of query: the boolean and the join ones. You should now be able to query your data in different ways.

As always, we recommend to spend some time writing queries in order to exercise yourself and be confident with them.