How To Master the Elasticsearch Query DSL
Simplifying the Elasticsearch Query DSL
Querying Elasticsearch can be very confusing, especially when you’re just starting to work with the engine. In this article, I would like to give you a jump-start and simplify this subject.
Before we dive in, I’d like to mention a few points about the Elasticsearch indexing and mapping process.
We have two documents:
Doc_1 — “in the summer the quick brown fox jump over the lazy dog”
Doc_2 — “the quick brown fox jump over the lazy dog”
Both documents are indexed by Elasticsearch. The result of the indexing process is an inverted index:
Each token in the text is mapped to the corresponding documents.
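To make the idea concrete, here is a toy sketch of an inverted index built from the two documents above. It assumes a naive lowercase-and-split-on-whitespace tokenizer and a plain Python dictionary; it only illustrates the token-to-documents mapping and is not how Elasticsearch stores its index internally.

```python
from collections import defaultdict

# The two example documents from the text above.
docs = {
    "Doc_1": "in the summer the quick brown fox jump over the lazy dog",
    "Doc_2": "the quick brown fox jump over the lazy dog",
}


def build_inverted_index(documents):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Assumption: a naive analyzer that lowercases and splits on whitespace.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


inverted_index = build_inverted_index(docs)
for token in sorted(inverted_index):
    print(f"{token:8} -> {sorted(inverted_index[token])}")
```

In this sketch, "summer" points only to Doc_1, while "fox" points to both documents.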
During the indexing process, the text is transformed:
- Character filters: Zero or more character filters that clean up the text and strip unwanted characters, such as HTML tags
- Tokenizer: Exactly one tokenizer that breaks the string down into individual words (tokens)
- Token filters: Zero or more token filters that modify the tokens, for example by lowercasing them, removing stop words, or adding synonyms
- Analyzer: Character filters + tokenizer + token filters
The first three elements together define an analyzer. Each index has a default analyzer, and individual text fields can override it. Elasticsearch has built-in analyzers, and you can also build your own custom analyzer and attach it to your index.
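As a rough sketch, the index settings for a custom analyzer could look like the body below, written as a Python dictionary that serializes to the JSON Elasticsearch expects. The analyzer name my_custom_analyzer is made up for this example; html_strip, standard, lowercase, and stop are built-in components.

```python
import json

# Settings body for creating an index with a custom analyzer.
# Assumption: "my_custom_analyzer" is a hypothetical name chosen for this sketch.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "my_custom_analyzer": {
                    "type": "custom",
                    "char_filter": ["html_strip"],    # character filter: strip HTML tags
                    "tokenizer": "standard",          # exactly one tokenizer
                    "filter": ["lowercase", "stop"],  # token filters, applied in order
                }
            }
        }
    }
}

print(json.dumps(index_settings, indent=2))
```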
According to Elasticsearch documentation:
Mapping is the process of defining how a document, and the fields it contains, are stored and indexed. For instance, use mappings to define:
- Which string fields should be treated as full-text fields.
- Which fields contain numbers, dates, or geolocations.
- The format of date values.
- Custom rules to control the mapping for dynamically added fields.
When you create a new index, you have three options:
- Define the mapping of each field on your own.
- Use dynamic mapping and let Elasticsearch guess the mapping.
- Use both — define the important fields, and let the Elasticsearch engine handle the rest of the fields.
Fields and mapping types do not need to be defined before being used. Thanks to dynamic mapping, new field names will be added automatically, just by indexing a document. New fields can be added both to the top-level mapping type, and to inner object and nested fields. (Elasticsearch documentation)
Dynamic mapping rules:
Text fields can be mapped as:
- Full text: If the field is the body of an email or a product description, then it should be mapped as full text. The text is tokenized by the analyzer, and you can search for each word in the text individually.
- Keyword: If you need to index structured content, such as email addresses, host names, status codes, or tags, you should most likely use a keyword field. The string is treated as a single unit, and the whole string is indexed as one term, so a full-text search can’t match individual words inside it.
Elasticsearch’s dynamic mapping maps string fields with both types, so you can search them either way (exact value or individual words):
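A sketch of what the dynamically generated mapping for a string field typically looks like, written as a Python dictionary. The field name notes is only an example, and the layout assumes a recent Elasticsearch version without mapping types.

```python
import json

# What dynamic mapping typically produces for a string field:
# a full-text "text" field plus a "keyword" sub-field for exact matches.
# Assumption: "notes" is an example field name.
dynamic_string_mapping = {
    "mappings": {
        "properties": {
            "notes": {
                "type": "text",
                "fields": {
                    "keyword": {"type": "keyword", "ignore_above": 256}
                },
            }
        }
    }
}

print(json.dumps(dynamic_string_mapping, indent=2))
```

With a mapping like this, you would query notes for full-text search and notes.keyword for exact-value search.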
JSON doesn’t have a date data type, so dates in Elasticsearch can either be:
1. Strings containing formatted dates, e.g. “2015-01-01” or “2015/01/01 12:10:30”.
2. A long number representing milliseconds-since-the-epoch.
3. An integer representing seconds-since-the-epoch.
Internally, dates are converted to UTC (if the time zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
You can define your custom date format:
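For instance, a mapping along these lines accepts several formats for one date field, separated by ||. This is a sketch: the field name created_at is hypothetical, and the format string is just one possible combination.

```python
import json

# Mapping for a date field that accepts three alternative formats.
# Assumption: "created_at" is a hypothetical field name.
date_mapping = {
    "mappings": {
        "properties": {
            "created_at": {
                "type": "date",
                # Formats are tried in order, separated by "||".
                "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis",
            }
        }
    }
}

print(json.dumps(date_mapping, indent=2))
```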
Elasticsearch supports many other field types; you can find them in the field data types section of the Elasticsearch documentation.
Build Our Query
Every query starts with a top-level query element.
When we query Elasticsearch, we need to take into account two things:
- Remember that all queries run against our inverted index. The analyzer (built-in or custom) that we choose for our index affects how our query clause has to look (lowercased, stemmed, stop words removed, etc.).
- The mapping configuration of each field can affect our query. For example:
- Text field: Is our field configured to be full-text or keyword?
- Date: Which date format did we choose for our field?
- Number: Is our field type integer, long, or float?
When we write a query, we can use two types of clauses:
- Compound query clause: These are our wrapper clauses; they can combine leaf queries and other compound queries.
- Leaf query clause: A query for a particular value in a particular field (field name and value).
Compound Query Clauses
Before we start to write a compound query, we need to decide:
- Do we need a score for each document? The score tells us the relevance of each document relative to the other results.
- Which fields do we need to query?
- Which fields control the score of the document?
First, let’s understand the concept of context in Elasticsearch.
In Elasticsearch, we have two search contexts: query context and filter context.
In the query context, a query clause answers the question “How well does this document match this query clause?” Besides deciding whether or not the document matches, the query clause also calculates a relevance score in the _score meta-field. (Elasticsearch documentation)
Query context is in effect whenever a query clause is passed to a query parameter. This could be the top-level query parameter of a search request or, for example, the must and should clauses of the bool compound query (must_not, in contrast, runs in filter context).
The documentation for each clause states whether or not it contributes to the final score.
For example, when a bool query has a must clause, the leaf queries inside it run in query context, which means they affect the score of the matching documents.
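A minimal sketch of a query that runs entirely in query context, written as a Python dictionary. The field names title and notes and the search texts are assumptions made for illustration.

```python
import json

# Both match queries sit inside "must", so both run in query context
# and contribute to the _score of every matching document.
# Assumption: "title" and "notes" are hypothetical field names.
query_context_example = {
    "query": {
        "bool": {
            "must": [
                {"match": {"title": "quick fox"}},
                {"match": {"notes": "lazy dog"}},
            ]
        }
    }
}

print(json.dumps(query_context_example, indent=2))
```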
This is the theory behind the scoring algorithm. The score is very helpful when you want to order your results by relevance.
In filter context, a query clause answers the question “Does this document match this query clause?” The answer is a simple Yes or No, and no scores are calculated. Filter context is mostly used for filtering structured data, e.g.
1. Does this timestamp fall into the range 2015 to 2016?
2. Is the status field set to “published”?
Filter context is in effect whenever a query clause is passed to a filter parameter, for example the filter or must_not parameters of the bool compound query.
As with the query context, check the documentation to see whether a given clause affects the score.
For example, when a bool query has a filter clause, the leaf queries inside it run in filter context, which means they won’t affect the score of the matching documents.
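A minimal sketch of the two yes-or-no questions from the documentation expressed as filter clauses. The field names timestamp and status come from those questions; the concrete values and date bounds are assumptions.

```python
import json

# Both leaf queries sit inside "filter", so they only include or exclude
# documents and never change the score.
# Assumption: the values and date bounds below are illustrative.
filter_context_example = {
    "query": {
        "bool": {
            "filter": [
                {"range": {"timestamp": {"gte": "2015-01-01", "lte": "2016-12-31"}}},
                {"term": {"status": "published"}},
            ]
        }
    }
}

print(json.dumps(filter_context_example, indent=2))
```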
Inside our search request, we can combine query and filter context with a compound query like bool. In that case, only the search terms that appear in the query context clauses affect the score of each document. If we only have a filter context, then all the documents will have a score of zero.
The decisions we made earlier determine the layout of our query.
For example, this query uses query context and filter context together:
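A sketch of such a combined query, written as a Python dictionary. The field names (title, notes, status) and the search texts are assumptions made for illustration.

```python
import json

# "should" runs in query context (scored); "filter" runs in filter context (not scored).
# Assumption: "title", "notes", and "status" are hypothetical field names.
combined_query = {
    "query": {
        "bool": {
            "should": [
                {"match": {"title": "quick fox"}},
                {"match": {"notes": "lazy dog"}},
            ],
            "filter": [
                {"term": {"status": "published"}},
            ],
        }
    }
}

print(json.dumps(combined_query, indent=2))
```

One design note: when a bool query contains a filter (or must) clause, the should clauses are optional by default, so documents that pass the filter but match no should clause can still be returned with a score of zero. If at least one should clause has to match, the bool query’s minimum_should_match parameter can be set to 1.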
Only the query clauses that appear inside the should clause affect the score of each document (they run in query context).
Elasticsearch takes a more-matches-is-better approach, meaning that the scores from each matching should clause are added together to provide the final score.
If we don’t need a score at all, we can use only the filter clause. For example, if we search over structured data or search for exact values like binary or dates, we’ll only use the filter context:
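A sketch of a filter-only query over structured data. The field names in_stock and created_at, and their values, are hypothetical.

```python
import json

# A bool query with only a "filter" clause: documents either match or they don't.
# Assumption: "in_stock" and "created_at" are hypothetical field names.
filter_only_query = {
    "query": {
        "bool": {
            "filter": [
                {"term": {"in_stock": True}},                      # exact boolean value
                {"range": {"created_at": {"gte": "2020-01-01"}}},  # date range
            ]
        }
    }
}

# json.dumps turns Python's True into JSON's true.
print(json.dumps(filter_only_query, indent=2))
```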
All the matching documents in the result of the query above will have a score of zero.
Leaf Query Clauses
While building our outer layout, we decided what the building blocks of our query are. We also decided which fields will determine our results score. As you can see, Elasticsearch has a lot of options, and we only covered the basics in this article. Each compound query can wrap other compound queries and so on. My advice to you is to try to keep it as simple as possible.
Now it’s time to write our inner/leaf search query (which will come inside our container clauses).
Here, we also have decisions to make.
For every field that we search on, we need to:
1. Decide if this field is relevant to the score of the documents
- Yes: Put it inside a query clause.
- No: It should be under a filter clause (remember that a filter clause has to be wrapped in a compound query such as bool).
2. Check the type of the field and how it was mapped
- Querying text fields, for example, is tricky. If the text field was mapped as a keyword, then we only have the option of searching it in the exact way it was indexed (not tokenized, uppercase/lowercase letters, etc.).
Let’s say, for example, we have indexed a document with a notes field, and it contains the text “The quick brown fox.”
- If the notes field was mapped as a keyword, then the inverted index contains the whole text “The quick brown fox” mapped to that document. Only searching for exactly “The quick brown fox” will match that document.
- If the notes field was mapped as full text, then the inverted index contains tokens such as “quick,” “brown,” and “fox,” each connected to the document separately. Searching for any of these tokens (or their synonyms, if the analyzer defines them) will match that document.
3. Decide how our text will be sent to the search engine
When we send a query to the Elasticsearch engine, we have two options:
- Send it as is: For that choice, we use the term level queries. For example, if you search for the phrase “Star Trek,” then the query engine will check the inverted index for “Star Trek.”
- Send it analyzed: For that choice, we use the full-text queries. The searched text passes through the same analyzer that processed the indexed text during indexing (we can also specify a different analyzer for the search). It will be tokenized and filtered. For example, if you search for the phrase “Star Trek,” then the query engine will check the inverted index for “star” and “trek” (depending on the analyzer you chose).
Note: If the field was originally mapped as a keyword, then you’ll have to send the exact text as it was indexed to get results.
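The difference between the two options can be simulated with a toy model. This sketch assumes a naive analyzer that lowercases and splits on whitespace, and it models each field’s index as a plain set of terms; the real engine is far more involved.

```python
def analyze(text):
    """Toy analyzer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


indexed_text = "Star Trek"
text_field_terms = set(analyze(indexed_text))  # full-text field: one term per token
keyword_field_terms = {indexed_text}           # keyword field: the whole string as one term


def term_lookup(query_text, terms):
    """Term-level query: the text is looked up as is."""
    return query_text in terms


def match_lookup(query_text, terms):
    """Full-text query: the text is analyzed first, then any token may match."""
    return any(token in terms for token in analyze(query_text))


print(term_lookup("Star Trek", text_field_terms))     # as is, against the analyzed field
print(term_lookup("Star Trek", keyword_field_terms))  # as is, against the keyword field
print(match_lookup("Star Trek", text_field_terms))    # analyzed, against the analyzed field
```

In this model, sending “Star Trek” as is finds nothing in the analyzed field (which only holds “star” and “trek”) but does find the keyword field’s single term, while the analyzed search finds the tokens in the analyzed field.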
Most of the time, we’ll want the searched text to be analyzed before it’s sent to the search engine. It’ll give better results this way. But sometimes, we want to search the exact word or sentence — usually in data like numbers, dates, and enums.
Full-text query example
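A minimal sketch of a full-text (match) query as a Python dictionary. It borrows the mail_body field and the text “Jeff Bridges” from the detailed example later in this article, and assumes mail_body is mapped as a full-text field.

```python
import json

# The search text is analyzed before it is looked up in the inverted index.
# Assumption: "mail_body" is mapped as a full-text (text) field.
match_query = {
    "query": {
        "match": {"mail_body": "Jeff Bridges"}
    }
}

# Variant: by default a document matches if ANY token matches;
# "operator": "and" requires ALL tokens to match.
match_all_tokens_query = {
    "query": {
        "match": {"mail_body": {"query": "Jeff Bridges", "operator": "and"}}
    }
}

print(json.dumps(match_query, indent=2))
print(json.dumps(match_all_tokens_query, indent=2))
```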
Term query example
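A minimal sketch of a term-level query as a Python dictionary. The field name status and the value “published” are illustrative, and the field is assumed to be mapped as a keyword.

```python
import json

# The value is NOT analyzed: it has to equal the indexed term exactly,
# including uppercase and lowercase letters.
# Assumption: "status" is a keyword field; the value is illustrative.
term_query = {
    "query": {
        "term": {"status": "published"}
    }
}

print(json.dumps(term_query, indent=2))
```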
Compound query — final detailed example
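A sketch of the kind of query the breakdown below walks through, as a Python dictionary. The mail_body field and both search values come from the breakdown; the field name sender used in the term query is an assumption.

```python
import json

final_query = {
    "query": {                       # main query container
        "bool": {                    # compound query container
            "must": [                # query context: contributes to the score
                {"match": {"mail_body": "Jeff Bridges"}},  # full-text query, analyzed
            ],
            "filter": [              # filter context: no score, considered for caching
                # Assumption: "sender" is a hypothetical keyword field.
                {"term": {"sender": "firstname.lastname@example.org"}},  # sent as is
            ],
        }
    }
}

print(json.dumps(final_query, indent=2))
```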
- query: the main query container
- bool: the compound query container
- must: a query context clause; each leaf query inside it contributes to the score of the matching documents
- match: a full-text query, meaning the text “Jeff Bridges” will pass through the analyzer and be transformed to “jeff” and “bridges.” Make sure you use that option only if the mail_body field was mapped as a full-text field.
- filter: a filter context clause; each leaf query inside it won’t contribute to the score of the matching documents, and the clauses are considered for caching
- term: a term-level query; the text “firstname.lastname@example.org” won’t pass through the analyzer and will be sent as is to the search engine
Elasticsearch Query DSL isn’t the simplest thing to use, but once you know how to use it, it can be a powerful tool.
In this article, I tried to give you guys a jump-start for querying Elasticsearch, and I encourage you to dig deeper into the Elasticsearch documentation.
Once you understand all the concepts we discussed in the article, you’ll find it easier to walk through the Elasticsearch documentation and find all the solutions you need.