Introduction to Elasticsearch Queries

Published in elasticsearch · 7 min read · Oct 12, 2019

Phase 03 — Elasticsearch queries intro — Blog 11

In the last blog, we saw how Kibana can be used as a dev tool and how sample data can be loaded through it.
From this blog onwards, we will look into the query DSL of Elasticsearch, which is powerful and indispensable knowledge for any Elasticsearch user.

Elasticsearch query types

The queries in Elasticsearch can be broadly classified into two categories:

1. The leaf queries

Leaf queries look for specific values in one or more fields. They can be used on their own. Examples include the match, term, and range queries.

2. The compound queries

Compound queries wrap other leaf or compound queries. They combine multiple queries to produce the desired result.

The broad classification of the two query types is roughly shown in the diagram below:

As you can see in the above picture, there are many more categories within the leaf and compound classifications. We will visit most of the queries and query types in the figure in much more detail in the coming blogs.

Basic query samples

Now let us familiarize ourselves with two basic leaf queries and one compound query to get things started.

1. The simple “match” query

Consider the documents indexed in the previous blog. Let us try a simple match query on the field “country” for the search keyword “China”. The query looks like this:

POST employees/_search
{
  "query": {
    "match": {
      "country": "China"
    }
  }
}

The above query returns all the documents whose country is China.

2. Range query

Now let us fire another leaf query. It should return all the employees whose salary is greater than or equal to 500,000. This can be achieved using a range query, as below:

POST employees/_search
{
  "query": {
    "range": {
      "salary": {
        "gte": 500000
      }
    }
  }
}

3. Bool query

Now comes the interesting part. How can we combine the above queries? That is, I need all the employees who are from China and who earn at least 500,000.
This requires a combination of the above two leaf queries. Elasticsearch provides the bool query for exactly this purpose. Let us discuss the general structure of the bool query and then get back to the problem.

Bool query general structure:

POST _search
{
  "query": {
    "bool": {
      "must": [...],
      "filter": [...],
      "must_not": [...],
      "should": [...]
    }
  }
}

must: The clause (query) must appear in matching documents and contributes to the score.

filter: The clause (query) must appear in matching documents. However, unlike must, the clause does not contribute to the score.

should: The clause (query) should appear in the matching document. It is optional when a must or filter clause is present, and a match adds to the score.

must_not: The clause (query) must not appear in the matching documents. It does not contribute to the score.
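
To get a feel for how the four sections fit together, the body can also be assembled programmatically. The following is a minimal Python sketch that only builds the JSON body and does not talk to Elasticsearch. Note that bool_query is a hypothetical helper written for this blog (not part of any client library), and the title clause is purely illustrative.

```python
import json


def bool_query(must=None, filter_=None, must_not=None, should=None):
    """Build a bool query body, leaving out sections that were not given.

    Hypothetical helper for illustration; not part of any Elasticsearch client.
    """
    sections = {
        "must": must,
        "filter": filter_,
        "must_not": must_not,
        "should": should,
    }
    # Keep only the sections that actually contain clauses.
    bool_body = {name: clauses for name, clauses in sections.items() if clauses}
    return {"query": {"bool": bool_body}}


# Illustrative clauses: employees from China, preferably with "Director" in the title.
body = bool_query(
    must=[{"match": {"country": "China"}}],
    should=[{"match": {"title": "Director"}}],
)

# The printed JSON can be pasted under "POST employees/_search" in Kibana.
print(json.dumps(body, indent=2))
```

Here the should clause is optional because a must clause is present: every employee from China matches, and those whose title also matches “Director” pick up extra score.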

Now, coming back to our problem, the bool query that returns all the employees from China earning at least 500,000 looks like this:

POST employees/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "country": "China"
          }
        },
        {
          "range": {
            "salary": {
              "gte": 500000
            }
          }
        }
      ]
    }
  }
}

Now suppose we want to exclude all the male employees from this list. What should we do? Adding a must_not section with the condition gender: “Male” to the above query does the job, as below:

POST employees/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "country": "China"
          }
        },
        {
          "range": {
            "salary": {
              "gte": 500000
            }
          }
        }
      ],
      "must_not": [
        {
          "match": {
            "gender": "Male"
          }
        }
      ]
    }
  }
}

The query context and the filter context

By default, Elasticsearch sorts the search results by their relevance score, which indicates how well each document matches the query. This relevance score is computed and returned in the _score field of the metadata of each result.

By default, it is a positive floating point number.

The _score computation may differ between query types. That is, the score computation for a “match” query might be different from that of a “span” query.

Most importantly, the score computation depends on the context in which the query clause is run: a clause can run in the “query” context or in the “filter” context.

Query context

When a clause is executed in the query context, it answers the question “how well does the document match the query?”. The better the match, the higher the score.

An example would be as given in the screenshot below:

In the above example, I searched for “Director of” in the field “title”. The query returned a few results. In result 1, the title contains both keywords of the query clause (“Director” and “of”), so the score of the first document is higher: 7.363.

Whereas in the 2nd document only one of the keywords matched (only “Director”), and hence its score is lower (5.305) than that of the first document.

So the first document matches better than the second, and that is clearly reflected in the _score metadata of both documents.

This is what happens when the query clauses are given in the query context.
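
The query described above can be sketched roughly as follows. This is a minimal Python sketch under assumptions: the index is taken to be the employees index from the earlier examples, and the body is only built and printed, not sent to a cluster.

```python
import json

# Assumed index: "employees", as in the earlier examples of this blog.
endpoint = "POST employees/_search"

# A match query given directly under "query" runs in the query context,
# so every hit comes back with a relevance _score.
query_context_body = {
    "query": {
        "match": {
            "title": "Director of"
        }
    }
}

print(endpoint)
print(json.dumps(query_context_body, indent=2))
```

The printed request can be pasted into the Kibana dev tool as is.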

Filter context

When a query clause is given in the filter context, it only checks whether the document matches the clause or not. That is effectively a true/false answer. Suppose we query the data in the filter context by asking whether the document field gender matches “Male”: we will get only the matching documents, with no relevance scoring.

Unlike the query context, the filter context spends no time computing scores, and hence returns results faster.

The above example of filtering by gender in the filter context is shown in the picture below:

In the above example, you can see that when the clause is applied in the filter context, the score of every resulting document is 0.
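
A request of the kind described above can be sketched as follows. It is a minimal Python sketch that assumes the employees index from the earlier examples and places the gender clause inside the filter section of a bool query. It only builds and prints the body.

```python
import json

# Assumed index: "employees", as in the earlier examples of this blog.
endpoint = "POST employees/_search"

# Clauses under "filter" run in the filter context:
# they decide whether a document matches, but are not scored.
filter_context_body = {
    "query": {
        "bool": {
            "filter": [
                {"match": {"gender": "Male"}}
            ]
        }
    }
}

print(endpoint)
print(json.dumps(filter_context_body, indent=2))
```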

Revisiting the bool query

Considering the above context, it is time to revisit the bool query.

In the bool query, the must and should sections execute in the query context, meaning the clauses in these sections contribute to the score.

Whereas the must_not and filter sections execute their clauses in the filter context and do not affect scoring.

To demonstrate, let us first run a pair of query clauses inside the must section, then keep one clause in must and move the other to the filter section, and see how the score varies.

Case 1 : Both clauses inside the “must” section

As you can see in the above query, both clauses are in the same must section, and the score of the first result’s document is 2.4333658 (in the right panel).

Case 2 : One clause is moved to the filter section

Let us move the gender clause to the filter section of the bool query, as below, and then run the query.

Now, looking at the scores in the right panel, you can see that the score has dropped to 1.7261622, which means that only the clause in the must section was used for scoring and the clause in the filter section was not.

To confirm this, we can run the above query with only the must section clause and see whether it returns the same score.

As predicted, you can see from the above picture that the score remains the same even with the filter clause removed from the query.
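
The three runs above can be sketched side by side. This is a Python sketch under stated assumptions: the gender clause comes from the text, while the country clause in the must section is a stand-in chosen for illustration, since the second clause is not spelled out in the text. The scoring_clauses function is a hypothetical helper. Actual scores depend on the indexed data, so none are shown.

```python
import json

gender_clause = {"match": {"gender": "Male"}}
# Assumption: stand-in for the second clause used in the original runs.
other_clause = {"match": {"country": "China"}}

# Case 1: both clauses in must, so both contribute to _score.
case_1 = {"query": {"bool": {"must": [other_clause, gender_clause]}}}

# Case 2: gender clause moved to filter; it still restricts hits but is not scored.
case_2 = {"query": {"bool": {"must": [other_clause], "filter": [gender_clause]}}}

# Case 3: only the must clause, used to confirm the score from case 2.
case_3 = {"query": {"bool": {"must": [other_clause]}}}


def scoring_clauses(body):
    """Return the clauses that run in the query context (must and should)."""
    bool_part = body["query"]["bool"]
    return bool_part.get("must", []) + bool_part.get("should", [])


for name, body in [("case 1", case_1), ("case 2", case_2), ("case 3", case_3)]:
    print(name, "scored clauses:", json.dumps(scoring_clauses(body)))
```

Cases 2 and 3 end up with the same scored clauses, which is why a document that appears in both results gets the same score.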

Conclusion

In this blog, we have familiarised ourselves with the classification of Elasticsearch queries, the query and filter contexts, and some of the most basic queries.

From the next blog onwards, we will look in detail at each type of query, with more examples and data sets.