Querying Elasticsearch documents — Part 1

Published in Betacom, Jan 25, 2021

Introduction

In this article you will start learning how to search data in Elasticsearch. We will introduce the basics of querying data and go through some examples to make things easier to understand.

If you are not familiar with the Elasticsearch search engine, we recommend taking a look at our previous articles, which are all available on the Betacom page.

Query DSL

In Elasticsearch it is possible to look for data using the Search API, which returns search hits that match the query defined in the request. The request body accepts queries written in Query DSL, a Domain Specific Language based on JSON. The basic query structure is the following:

GET indexName/_search
{
  "query": {
    "queryName": {
      "fieldName": {…}
    }
  }
}

The response body contains the following properties:

“_scroll_id”: search identifier, returned only when the request uses the scroll parameter;
“took”: milliseconds it took Elasticsearch to execute the request;
“timed_out” (boolean): if true, the request timed out before completion;
“_shards”: count of shards used for the request;
“hits”: object which contains the returned documents and their metadata.
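
As a rough illustration of how these properties fit together, the sketch below reads them from a response that has already been parsed into a Python dictionary. The sample values are invented, the summarize helper is hypothetical, and the shape of "total" (an object with "value" and "relation") assumes an Elasticsearch 7.x response.

```python
# Hypothetical response body, trimmed to the properties described above.
# The values are invented for illustration, not real output.
sample_response = {
    "took": 5,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 2, "relation": "eq"},
        "max_score": 1.3,
        "hits": [
            {"_index": "product", "_id": "1", "_score": 1.3, "_source": {"name": "Coffee cup"}},
            {"_index": "product", "_id": "7", "_score": 0.9, "_source": {"name": "Coffee beans"}},
        ],
    },
}


def summarize(response):
    """Pull the most commonly used pieces out of a search response."""
    hits = response["hits"]
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "total": hits["total"]["value"],
        # Each hit carries metadata (_id, _score) next to the original document (_source).
        "documents": [(h["_id"], h["_score"], h["_source"]) for h in hits["hits"]],
    }


summary = summarize(sample_response)
print(summary["total"], [doc_id for doc_id, _, _ in summary["documents"]])
```

The returned documents live in the inner "hits" array, one level below the outer "hits" object.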

The Query DSL consists of two types of clauses: leaf query clauses and compound query clauses. The former look for a particular value in a given field, whereas the latter are composed of other leaf or compound queries combined in a logical fashion.

Relevance score

By default, Elasticsearch sorts search results according to the relevance score, which measures how well each document matches a query. The algorithm used to compute the score is Okapi BM25. The measures needed to evaluate the score (TF, IDF and field-length norm) are computed and stored in Elasticsearch at document indexing time, i.e. when the document is created or updated.
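
To get a feeling for how TF, IDF and the field-length norm interact, here is a minimal sketch of the textbook Okapi BM25 formula for a single term. It is not Elasticsearch's implementation (that lives in Lucene and may arrange the constants differently); the document statistics are invented, and K1 and B are set to the values Elasticsearch uses by default.

```python
import math

K1 = 1.2   # term-frequency saturation (Elasticsearch default)
B = 0.75   # strength of field-length normalization (Elasticsearch default)


def idf(doc_count, docs_with_term):
    """Inverse document frequency: rarer terms get a higher weight."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


def bm25(term_freq, field_len, avg_field_len, doc_count, docs_with_term):
    """Textbook Okapi BM25 score of one term in one field."""
    norm = 1 - B + B * field_len / avg_field_len
    tf_part = term_freq * (K1 + 1) / (term_freq + K1 * norm)
    return idf(doc_count, docs_with_term) * tf_part


# Hypothetical statistics: 1000 documents, 50 of them contain the term.
short_field = bm25(term_freq=1, field_len=5, avg_field_len=10, doc_count=1000, docs_with_term=50)
long_field = bm25(term_freq=1, field_len=40, avg_field_len=10, doc_count=1000, docs_with_term=50)
many_hits = bm25(term_freq=100, field_len=10, avg_field_len=10, doc_count=1000, docs_with_term=50)

# The same single occurrence counts for more in a short field than in a long one.
print(short_field > long_field)
# Even 100 occurrences cannot push the score past idf * (K1 + 1).
print(many_hits < idf(1000, 50) * (K1 + 1))
```

Both comparisons print True: a match in a short field outweighs the same match in a long field, and repeating a term has a diminishing effect on the score.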

It used to be common practice to remove stop words when analyzing text fields, because they gave few clues for calculating the relevance score. This is no longer the default: the value of stop words is limited, but they still carry some. The standard analyzer keeps them, so the relevance algorithm has to compensate; otherwise the weight of stop words would be boosted artificially in large fields that contain many of them. BM25 does so by letting the contribution of term frequency saturate.

It is possible to change how the score used to rank documents is computed, but this is an advanced practice and you usually won’t need to dive into such details. For further information, check Function score query.

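Please note that, by default, the statistics behind the score are not computed over all the documents of the index, but only over the documents in the shard that executes the query. The same document can therefore receive slightly different scores depending on how the data is distributed across shards.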
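
To see why the shard matters, here is a toy calculation of a BM25-style IDF, assuming term statistics are gathered per shard. The shard names and document counts are invented for illustration.

```python
import math


def idf(doc_count, docs_with_term):
    """BM25-style inverse document frequency."""
    return math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))


# Hypothetical index split over two shards:
# (documents in the shard, documents in the shard containing "coffee").
shards = {"shard_0": (100, 2), "shard_1": (100, 40)}

per_shard = {name: idf(n, df) for name, (n, df) in shards.items()}
whole_index = idf(
    sum(n for n, _ in shards.values()),
    sum(df for _, df in shards.values()),
)

for name, value in per_shard.items():
    print(name, round(value, 3))
print("whole index", round(whole_index, 3))
```

In this made-up case the term looks rare on one shard and common on the other, so the same term gets a different weight on each shard, and neither weight equals the index-wide one. With many documents spread evenly across shards the differences tend to be small.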

To see how the score is evaluated, you can add "explain": true to the body of the request:

GET indexName/_search
{  
  …
  "explain": true
}

The Explain API shows why a specific document does or does not match a given query. We only need to specify the document id in the request path, as follows:

GET indexName/_explain/id
{
 "query": … 
}

Each query type can calculate relevance scores differently, and score calculation also depends on the context in which the query clause runs:

Queries run in query context answer the question “How well does this document match the query?” and produce a relevance score.
Queries run in filter context answer the question “Does this document match the query?”. The answer is a simple yes or no, and no score is calculated.

Term-level queries

Term-level queries find documents based on precise values in structured data. The search term is not analyzed, so it has to match the indexed value exactly, including letter case.

Before digging into some examples, let’s create a “product” index and insert some documents into it. You can do so by running the request available here in the Kibana Dev Tools console. The documents will have the following fields:

“name” : product name,
“price” : product price,
“in_stock” : number of products available in stock,
“sold” : number of products sold,
“tags” : array of tags linked to the product,
“description” : product description,
“is_active” : boolean value saying if the product is active or not,
“created” : product creation day.

Try to run the following example queries on the index we just created!

A term query returns documents that contain an exact term in a provided field. For example, we can look for all the active products in our index:

GET /product/_search
{
  "query": {
    "term": {
      "is_active": true
    }
  }
}

The terms query is the same as the term query, except that we can search for multiple values. An example could be searching for products which have specific tags:

GET /product/_search
{
  "query": {
    "terms": {
      "tags.keyword": [ "Soup", "Cake" ]
    }
  }
}
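
The request above targets “tags.keyword” rather than “tags”. The toy model below hints at why, under the assumption that the index relies on the default dynamic mapping, where a string is indexed both as analyzed text and as an unanalyzed keyword sub-field. The analyze function is a rough stand-in for the standard analyzer, not its real implementation, and term_matches is a hypothetical helper.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase and split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def term_matches(search_term, indexed_terms):
    """A term-level query compares the search term as-is against the indexed terms."""
    return search_term in indexed_terms


tag = "Soup"
text_field_terms = analyze(tag)   # what an analyzed text field would index
keyword_field_terms = [tag]       # what a keyword field would index

print(term_matches("Soup", text_field_terms))     # False: the indexed term is "soup"
print(term_matches("soup", text_field_terms))     # True
print(term_matches("Soup", keyword_field_terms))  # True: stored exactly as written
```

Since the search term is never analyzed, it only matches when it is written exactly as the indexed term, which is what the keyword sub-field preserves.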

An ids query returns documents based on their _id field. The query request looks like the following:

GET /product/_search
{
  "query": {
    "ids": {
      "values": [ 1, 2, 3 ]
    }
  }
}

The range query looks for documents that contain terms within a provided range. We can use it for both numeric and date fields. For example, we can look for documents whose “in_stock” value is between 1 and 5, bounds included:

GET /product/_search
{
  "query": {
    "range": {
      "in_stock": {
        "gte": 1,
        "lte": 5
      }
    }
  }
}
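
The “gte” and “lte” bounds are inclusive; the range query also accepts “gt” and “lt” for exclusive bounds. The sketch below mimics that behavior locally on a handful of invented “in_stock” values; in_range is a hypothetical helper, not part of any Elasticsearch client.

```python
def in_range(value, gte=None, gt=None, lte=None, lt=None):
    """Mimic the bounds of a range query on a single value."""
    if gte is not None and not value >= gte:
        return False
    if gt is not None and not value > gt:
        return False
    if lte is not None and not value <= lte:
        return False
    if lt is not None and not value < lt:
        return False
    return True


# Hypothetical in_stock values, one per product.
stock_levels = [0, 1, 3, 5, 6]

inclusive = [v for v in stock_levels if in_range(v, gte=1, lte=5)]
exclusive = [v for v in stock_levels if in_range(v, gt=1, lt=5)]
print(inclusive, exclusive)
```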

Date fields can be searched using the date format defined for the field in the mapping:

GET /product/_search
{
  "query": {
    "range": {
      "created": {
        "gte": "2010/01/01",
        "lte": "2010/12/31"
      }
    }
  }
}

or by specifying a different one with the “format” parameter:

GET /product/_search
{
  "query": {
    "range": {
      "created": {
        "gte": "01-01-2010",
        "lte": "31-12-2010",
        "format": "dd-MM-yyyy"
      }
    }
  }
}
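
The “format” value is a Java-style date pattern. As a quick sanity check, the sketch below parses the bounds of the last two requests with what we assume to be the equivalent Python strptime patterns, and confirms that both requests describe the same date range.

```python
from datetime import datetime

# Java-style pattern from the request above and its assumed strptime equivalent.
ES_FORMAT = "dd-MM-yyyy"
PY_FORMAT = "%d-%m-%Y"

lower = datetime.strptime("01-01-2010", PY_FORMAT).date()
upper = datetime.strptime("31-12-2010", PY_FORMAT).date()

# The same bounds written in the yyyy/MM/dd style of the previous request.
same_lower = datetime.strptime("2010/01/01", "%Y/%m/%d").date()
same_upper = datetime.strptime("2010/12/31", "%Y/%m/%d").date()

print(ES_FORMAT, "->", PY_FORMAT)
print(lower == same_lower, upper == same_upper)
print(lower.isoformat(), upper.isoformat())
```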

The exists query returns documents that contain an indexed value for a field. An indexed value may not exist for a document’s field due to a variety of reasons:

The field in the source JSON is null or [];
The field has "index": false set in the mapping;
The field value is longer than an ignore_above setting in the mapping;
The field value was malformed and ignore_malformed was defined in the mapping.

For example, we can search for all products whose “tags” field is not empty:

GET /product/_search
{
  "query": {
    "exists": {
      "field": "tags"
    }
  }
}
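
The first reason in the list above (a null or [] value in the source JSON) can be mimicked with a toy check. This is only a sketch: it ignores the mapping-related reasons, has_indexed_value is a hypothetical helper, and the documents are invented. Note that an empty string still counts as a value.

```python
def has_indexed_value(doc, field):
    """Toy version of the exists check: null and [] count as 'no value'."""
    value = doc.get(field)
    if value is None:
        return False
    if isinstance(value, list):
        # An array needs at least one non-null element.
        return any(item is not None for item in value)
    return True


# Hypothetical documents, not taken from the product index.
docs = [
    {"name": "A", "tags": ["Soup"]},
    {"name": "B", "tags": []},
    {"name": "C", "tags": None},
    {"name": "D"},
    {"name": "E", "tags": ""},
]

print([d["name"] for d in docs if has_indexed_value(d, "tags")])
```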

A prefix query returns documents that contain a specific prefix in a provided field. An example could be the following:

GET /product/_search
{
  "query": {
    "prefix": {
      "tags.keyword": "Vege"
    }
  }
}

A wildcard query returns documents containing terms matching a wildcard pattern, i.e. a combination of literal characters and wildcard operators (placeholders that stand for other characters). The wildcard pattern supports the “?” (any single character) and “*” (zero or more characters) operators:

GET /product/_search
{
  "query": {
    "wildcard": {
      "tags.keyword": "Veg*ble"
    }
  }
}

GET /product/_search
{
  "query": {
    "wildcard": {
      "tags.keyword": "Veget?ble"
    }
  }
}
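
The two operators can be tried out locally with Python's fnmatch module, which happens to use the same meaning for “*” and “?”. This is only an approximation of the wildcard query (fnmatch also understands [...] character sets, which are not part of the wildcard syntax described here), and the tag values are invented.

```python
from fnmatch import fnmatchcase

tags = ["Vegetable", "Vegble", "Vegetble", "vegetable", "Vegan"]

# "*" matches zero or more characters, "?" matches exactly one.
star = [t for t in tags if fnmatchcase(t, "Veg*ble")]
question = [t for t in tags if fnmatchcase(t, "Veget?ble")]

print(star)
print(question)
```

The first pattern accepts anything, including nothing, between “Veg” and “ble”, while the second one requires exactly one character after “Veget”. The lowercase “vegetable” matches neither, because the comparison is case sensitive.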

A regexp query looks for documents containing terms matching a regular expression. For a list of operators supported by such a query, see Regular expression syntax. Let’s try again to retrieve all vegetables from the product index:

GET /product/_search
{
  "query": {
    "regexp": {
      "tags.keyword": "Veg[a-zA-Z]+ble"
    }
  }
}
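
One detail worth knowing is that the regexp query uses Lucene's regular expression syntax, in which a pattern always has to match the whole term. The sketch below approximates this with Python's re.fullmatch; Python's regex dialect differs from Lucene's in other respects, and the tag values are invented.

```python
import re

pattern = re.compile(r"Veg[a-zA-Z]+ble")
tags = ["Vegetable", "Vegble", "Vegetables", "Veg3table"]

# fullmatch mimics an anchored pattern: the whole term has to match.
anchored = [t for t in tags if pattern.fullmatch(t)]
# search accepts a match anywhere inside the term.
unanchored = [t for t in tags if pattern.search(t)]

print(anchored)
print(unanchored)
```

With anchoring, “Vegetables” is rejected because of the trailing “s”; to accept it, the pattern itself would have to allow for it, for example by ending in “.*”.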

Full text queries

Full text queries let you search analyzed text fields: the query string is processed with the same analyzer that was applied to the field during indexing. Some examples of such queries follow.

The match query returns documents matching a provided text, number, date or boolean value. The match query is the standard query for performing a full-text search. For instance, we can look for products where the “name” value matches the “coffee” query:

GET /product/_search
{
  "query": {
    "match": {
      "name": "coffee"
    }
  }
}

The match phrase query analyzes the text and creates a phrase query out of the analyzed text. An example could be searching for products whose name matches the “coffee cup” phrase:

GET /product/_search
{
  "query": {
    "match_phrase": {
      "name": "coffee cup"
    }
  }
}
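
The difference between match and match_phrase can be illustrated with a toy model. It assumes the default behavior of both queries (OR operator for match, no slop for match_phrase) and replaces the standard analyzer with a rough lowercase-and-split stand-in; the functions and product names are hypothetical and no scoring is involved.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer: lowercase and split on non-alphanumerics."""
    return [token for token in re.split(r"[^0-9a-z]+", text.lower()) if token]


def match(field_text, query_text):
    """match with the default OR operator: any query term may appear."""
    field_terms = set(analyze(field_text))
    return any(term in field_terms for term in analyze(query_text))


def match_phrase(field_text, query_text):
    """match_phrase: all query terms, adjacent and in the same order."""
    field_terms, query_terms = analyze(field_text), analyze(query_text)
    width = len(query_terms)
    return width > 0 and any(
        field_terms[i:i + width] == query_terms
        for i in range(len(field_terms) - width + 1)
    )


# Hypothetical product names.
names = ["Coffee Cup", "Cup of coffee", "Tea cup", "Espresso"]

print([n for n in names if match(n, "coffee cup")])
print([n for n in names if match_phrase(n, "coffee cup")])
```

In this model, match returns every name containing “coffee” or “cup”, whereas match_phrase only keeps the name where the two terms appear next to each other in that order.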

The multi match query allows multi-field queries. Internally a match query is executed on each field. There is a limit on the number of fields that can be queried at once (default 1024). Let’s look for “coffee” in the “tags” field as well:

GET /product/_search
{
  "query": {
    "multi_match": {
      "query": "coffee",
      "fields": [ "name", "tags" ]
    }
  }
}

Conclusion

You should now be able to query documents using term-level queries and full-text queries. To fully understand these topics, we recommend writing some queries of your own on the product index and exploring the results you get.

In the next article we will take a look at bool queries, joining queries and how to handle relations in Elasticsearch.