How To Master the Elasticsearch Query DSL

Simplifying the Elasticsearch Query DSL

Royi Benita
Nov 7 · 9 min read

Querying Elasticsearch can be very confusing, especially when you’re just starting to work with the engine. In this article, I would like to give you a jump-start and simplify this subject.

Our query is sent to Elasticsearch’s _search API in the body of the request. Usually, we’d use one of the Elasticsearch client SDKs, depending on the language we want to use.

Before we dive in, I’d like to mention a few points about the Elasticsearch indexing and mapping process.

Indexing process

We have two documents:

Doc_1 — “in the summer the quick brown fox jump over the lazy dog”
Doc_2 — “the quick brown fox jump over the lazy dog”

Both documents are indexed by Elasticsearch. The result of the indexing process is an inverted index:

Each token in the text is mapped to the corresponding documents.
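As a minimal sketch of that idea (a toy model, not how Elasticsearch stores its index internally), the two documents above can be turned into an inverted index with a naive whitespace tokenizer. The function name build_inverted_index is made up for this illustration:

```python
from collections import defaultdict

docs = {
    "Doc_1": "in the summer the quick brown fox jump over the lazy dog",
    "Doc_2": "the quick brown fox jump over the lazy dog",
}

def build_inverted_index(documents):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive whitespace split; a real analyzer does much more than this.
        for token in text.split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

inverted_index = build_inverted_index(docs)
print(inverted_index["summer"])  # ['Doc_1']
print(inverted_index["fox"])     # ['Doc_1', 'Doc_2']
```

Looking up a token is then a single dictionary access, which is why searching an inverted index is fast.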
During the indexing process, the text is transformed:

  • Character filters: Zero or more character filters that clean up the text, for example by stripping HTML tags
  • Tokenizer: Exactly one tokenizer that breaks the string into individual tokens (usually words)
  • Token filters: Zero or more token filters that modify the tokens, such as a lowercase filter, a stop-words filter, or a synonym filter
  • Analyzer: Character filters + tokenizer + token filters

Together, these three elements define an analyzer. Each index has a default analyzer, and individual text fields can override it in the mapping. Elasticsearch ships with built-in analyzers, and you can also build your own custom analyzer and attach it to your index.
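The pipeline can be imitated in a few lines of Python. This is only a sketch of the three stages under simplified assumptions: the stop-word list is a tiny made-up one, and all function names are hypothetical rather than Elasticsearch APIs.

```python
import re

STOP_WORDS = {"the", "in", "over"}  # tiny illustrative list, not Elasticsearch's

def strip_html(text):
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    """Tokenizer: split the text into runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    """Token filter: lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stop_words(tokens):
    """Token filter: drop tokens that are in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def analyze(text):
    """Analyzer: character filter, then tokenizer, then token filters in order."""
    tokens = tokenize(strip_html(text))
    for token_filter in (lowercase, remove_stop_words):
        tokens = token_filter(tokens)
    return tokens

print(analyze("<p>The QUICK brown fox</p>"))  # ['quick', 'brown', 'fox']
```

The order of the token filters matters here: the stop-word filter only recognizes "The" because the lowercase filter ran first.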

Mapping

According to Elasticsearch documentation:

Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. For instance, use mappings to define:

  • Which string fields should be treated as full-text fields.
  • Which fields contain numbers, dates, or geolocations.
  • The format of date values.
  • Custom rules to control the mapping for dynamically added fields.

When you create a new index, you have three options:

  1. Define the mapping of each field on your own.
  2. Use dynamic mapping and let Elasticsearch guess the mapping.
  3. Use both — define the important fields, and let the Elasticsearch engine handle the rest of the fields.

Fields and mapping types do not need to be defined before being used. Thanks to dynamic mapping, new field names will be added automatically, just by indexing a document. New fields can be added both to the top-level mapping type, and to inner object and nested fields.

Elasticsearch documentation

Dynamic mapping rules:
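As a rough illustration of the default rules described in the Elasticsearch documentation, the sketch below guesses a field type from a JSON value. It is an approximation: guess_field_type is a hypothetical helper, the date check only recognizes simple ISO-style strings, and numeric detection in strings is ignored (it is off by default).

```python
import re

# Simplified stand-in for the default date detection (ISO-style strings only).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2})?)?$")

def guess_field_type(value):
    """Return the field type dynamic mapping would roughly pick for a JSON value."""
    if value is None:
        return None  # null values do not add a field
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, list):
        # Arrays take their type from the first non-null element.
        first = next((v for v in value if v is not None), None)
        return guess_field_type(first)
    if isinstance(value, str):
        if ISO_DATE.match(value):
            return "date"
        return "text with keyword sub-field"
    raise TypeError(f"not a JSON value: {value!r}")

print(guess_field_type(42))            # long
print(guess_field_type("2015-01-01"))  # date
print(guess_field_type("hello"))       # text with keyword sub-field
```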


String Fields

String fields can be mapped as:

  • Full-text (the text type): If the field is the body of an email or a product description, then the field should be mapped as full text. The text is tokenized by the analyzer, and you can search for each word in the text individually.
  • Keyword (the keyword type): If you need to index structured content, such as email addresses, host names, status codes, or tags, you should most likely use a keyword field. The whole string is indexed as a single term, so it is matched by its exact value rather than word by word.

By default, Elasticsearch’s dynamic mapping maps string fields with both types, as a text field with a keyword sub-field, so you can search either way (full text on name, exact value on name.keyword):

{
  "name": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

Date Fields

JSON doesn’t have a date data type, so dates in Elasticsearch can either be:
1. Strings containing formatted dates, e.g. “2015-01-01” or “2015/01/01 12:10:30”.
2. A long number representing milliseconds-since-the-epoch.
3. An integer representing seconds-since-the-epoch.

Elasticsearch documentation

Internally, dates are converted to UTC (if the time zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
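To make the internal representation concrete, here is a small sketch that converts a formatted date string into milliseconds since the epoch. It only mirrors the idea: to_epoch_millis is a hypothetical helper, it uses Python format codes rather than Elasticsearch date patterns, and it assumes the input is already in UTC.

```python
from datetime import datetime, timezone

def to_epoch_millis(value, fmt="%Y-%m-%d"):
    """Parse a date string (assumed to be UTC) into milliseconds since the epoch."""
    parsed = datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)

print(to_epoch_millis("2015-01-01"))  # 1420070400000
print(to_epoch_millis("2015/01/01 12:10:30", "%Y/%m/%d %H:%M:%S"))  # 1420114230000
```

Because every date ends up as a single long number, range comparisons on date fields are plain numeric comparisons.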

You can define your custom date format:

{
  "mappings": {
    "properties": {
      "date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

Elasticsearch supports many other field types; they are listed in the field data types section of the Elasticsearch documentation.


Build Our Query

Every query starts with a “query” clause.

{
  "query": {}
}

When we query Elasticsearch, we need to take into account two things:

  1. Remember all the queries run against our inverted index. The analyzer (built-in or custom) that we choose for our index will affect our query clause (lowercase, stem words, remove stop words, etc.).
  2. The mapping configuration of each field can affect our query. For example:
  • Text field: Is our field configured to be full-text or keyword?
  • Date: Which date format did we choose for our field?
  • Number: Is our field an integer, a long, or a float?

When we write a query, we can use two types of clauses:

  1. Compound query clause: These are our wrapper clauses. They can combine leaf queries and other (nested) compound queries.
  2. Leaf query clause: A query for a particular value in a particular field (field name and value).

Compound Query Clauses

Before we start to write a compound query, we need to:

  1. Decide whether we need a score for each document. The score tells us the relevance of each document relative to the other results.
  2. Decide which fields we need to query.
  3. Decide which fields control the score of the document.

First, let’s understand the concept of context in Elasticsearch.
In Elasticsearch we have two contexts of search:

Query context

In the query context, a query clause answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the _score meta-field. — Elasticsearch documentation

Query context is in effect whenever a query clause is passed to a query parameter, such as the query parameter of the search API or the must and should clauses of the bool compound query.

The Elasticsearch documentation states, for each clause, whether it contributes to the final score or not.

For example, a must clause of a bool query runs in query context, which means that the leaf queries inside it affect the score of the matching documents.

The theory behind the scoring algorithm is beyond the scope of this article. In practice, the score is most helpful when you want to order your results by relevance.

Filter context

In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No — no scores are calculated. Filter context is mostly used for filtering structured data, e.g.

1. Does this timestamp fall into the range 2015 to 2016?
2. Is the status field set to "published"?

Elasticsearch documentation

Filter context is in effect whenever a query clause is passed to a filter parameter, such as the filter or must_not parameters of the bool compound query.

As with query context, check the documentation of each clause to see whether it affects the score.

For example, a filter clause of a bool query runs in filter context, which means that the leaf queries inside it won’t affect the score of the matching documents.

Inside our search request, we can combine query and filter context with a compound query like bool. In that case, only the clauses that appear in query context affect the score of each document. If we only have filter context, then all the documents will have a score of zero.

The decisions we made before will determine the layout of our query.
For example, this query uses query context and filter context together:
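A minimal sketch of such a request body, written as a Python dictionary that serializes to the JSON sent to the _search API. The field names title, status, and publish_date are hypothetical and only serve the illustration:

```python
import json

search_body = {
    "query": {
        "bool": {
            # Query context: these clauses contribute to the _score.
            "must": [
                {"match": {"title": "search"}},
            ],
            # Filter context: these clauses only include or exclude documents.
            "filter": [
                {"term": {"status": "published"}},
                {"range": {"publish_date": {"gte": "2015-01-01"}}},
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```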

Only the query clauses that appear inside the must and should clauses affect the score of each document (they run in query context). Clauses inside filter and must_not run in filter context and are not scored.

Elasticsearch takes a more-matches-is-better approach, meaning that the scores from each matching must and should clause are added together to produce the final score.

If we don’t need a score at all, we can use only the filter clause. For example, if we search over structured data or search for exact values like yes/no flags or dates, we’ll only use the filter context:
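A filter-only request could look like the following sketch. Again, the field names is_active and created_at are made up for the example:

```python
import json

filter_only_body = {
    "query": {
        "bool": {
            # Only filter context: documents either match or they don't.
            "filter": [
                {"term": {"is_active": True}},
                {"range": {"created_at": {"gte": "2015-01-01", "lt": "2016-01-01"}}},
            ]
        }
    }
}

print(json.dumps(filter_only_body, indent=2))
```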

All the matching documents in the result of the query above will have a score of zero.


Leaf Query Clauses

While building our outer layout, we decided what the building blocks of our query are. We also decided which fields will determine our results score. As you can see, Elasticsearch has a lot of options, and we only covered the basics in this article. Each compound query can wrap other compound queries and so on. My advice to you is to try to keep it as simple as possible.

Now it’s time to write our inner/leaf search query (which will come inside our container clauses).

Here, we also have decisions to make.

For every field that we search on, we need to:

1. Decide if this field is relevant to the score of the documents

  • Yes: Put it inside a query clause.
  • No: It should be under a filter clause (remember that a filter clause has to be nested inside a compound query such as bool).

2. Check the type of the field and how it was mapped

  • Querying text fields, for example, is tricky. If the text field was mapped as a keyword, then we can only search for the value exactly as it was indexed (it is not tokenized, and uppercase and lowercase letters are kept as they are).

Let’s say, for example, we have indexed a document with a notes field, and it contains the text “The quick brown fox.”

  • If the notes field was mapped as a keyword, then the inverted index would contain the whole string “The quick brown fox” mapped to that document. Searching for exactly “The quick brown fox” will match that document.
  • If the notes field was mapped as full text, then the inverted index will contain the tokens the, quick, brown, and fox, each connected to the document. Searching for any of these tokens (or their synonyms, if the analyzer has a synonym filter) will match that document.

3. Decide how our text will be sent to the search engine

When we send a query to the Elasticsearch engine, we have two options:

  1. Send it as is: For that choice, we use the term level queries. For example, if you search for the phrase “Star Trek,” then the query engine will check the inverted index for “Star Trek.”
  2. Send it analyzed: For that choice, we use the full-text queries. The searched text passes through the same analyzer that the indexed text passed through during indexing (a different search-time analyzer can also be configured). It will be tokenized and filtered. For example, if you search for the phrase “Star Trek,” then the query engine will check the inverted index for “star” and “trek” (depending on the analyzer you chose).

Note: If the field was originally mapped as a keyword, then you’ll have to send the exact text as it was indexed to get results.
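The difference between the two options can be imitated with a toy inverted index. This is a sketch only: analyze is a stand-in for a real analyzer (lowercase plus whitespace split), and term_lookup and match_lookup are hypothetical helpers, not Elasticsearch calls.

```python
def analyze(text):
    """Stand-in for an analyzer: lowercase the text and split on whitespace."""
    return text.lower().split()

# Toy inverted indexes for one document whose title is "Star Trek".
text_index = {token: {"doc_1"} for token in analyze("Star Trek")}  # full-text field
keyword_index = {"Star Trek": {"doc_1"}}                            # keyword field

def term_lookup(index, value):
    """Term-level query: look the value up exactly as it was sent."""
    return index.get(value, set())

def match_lookup(index, value):
    """Full-text query: analyze the value, then look up each token (OR semantics)."""
    hits = set()
    for token in analyze(value):
        hits |= index.get(token, set())
    return hits

print(term_lookup(text_index, "Star Trek"))     # set()
print(match_lookup(text_index, "Star Trek"))    # {'doc_1'}
print(term_lookup(keyword_index, "Star Trek"))  # {'doc_1'}
print(term_lookup(keyword_index, "star trek"))  # set()
```

The first lookup finds nothing because the full-text index only holds the tokens star and trek, never the original string.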

Most of the time, we’ll want the searched text to be analyzed before it’s sent to the search engine. It’ll give better results this way. But sometimes, we want to search the exact word or sentence — usually in data like numbers, dates, and enums.

Full-text query example
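A minimal sketch of a full-text query, assuming a hypothetical title field that was mapped as full text. The match query analyzes “Star Trek” before searching:

```python
import json

match_body = {
    "query": {
        # match is a full-text query: the search text is analyzed first.
        "match": {
            "title": "Star Trek"
        }
    }
}

print(json.dumps(match_body, indent=2))
```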

Term query example
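A minimal sketch of a term-level query, assuming a hypothetical status field that was mapped as a keyword. The value is sent to the index exactly as written:

```python
import json

term_body = {
    "query": {
        # term is a term-level query: the value is not analyzed.
        "term": {
            "status": {"value": "published"}
        }
    }
}

print(json.dumps(term_body, indent=2))
```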

Compound query — final detailed example

  • query: the main query container
  • bool: the compound query container
  • must: this is a query context clause. Each leaf query inside it contributes to the score of the matching documents.
  • match: this is a full-text query, meaning the text “Jeff Bridges” will pass through the analyzer and be transformed into “jeff” and “bridges.” Use this option only if the mail_body field was mapped as a full-text field.
  • filter: this is a filter context clause. Each leaf query inside it won’t contribute to the score of the matching documents, and the clauses are considered for caching.
  • term: this is a term-level query. The text “emma@somemail.com” won’t pass through the analyzer and will be sent as is to the search engine.
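Put together, the pieces described in the list could form a request body like the sketch below. The mail_body field and both search values come from the list; the sender field name is an assumption made for this illustration and is presumed to be mapped as a keyword.

```python
import json

search_body = {
    "query": {
        "bool": {
            # Query context: analyzed full-text search that affects the score.
            "must": [
                {"match": {"mail_body": "Jeff Bridges"}},
            ],
            # Filter context: exact, unscored match ("sender" is a hypothetical field name).
            "filter": [
                {"term": {"sender": "emma@somemail.com"}},
            ],
        }
    }
}

print(json.dumps(search_body, indent=2))
```

The printed JSON is what would be sent in the body of a request to the _search API, either directly or through one of the client SDKs.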

Final Thoughts

Elasticsearch Query DSL isn’t the simplest thing to use, but once you know how to use it, it can be a powerful tool.

In this article, I tried to give you guys a jump-start for querying Elasticsearch, and I encourage you to dig deeper into the Elasticsearch documentation.

Once you understand all the concepts we discussed in the article, you’ll find it easier to walk through the Elasticsearch documentation and find all the solutions you need.
