Data Dojo
Published in

Data Dojo

Mastering ElasticSearch Queries If You Have Only Worked With SQL Before

Elasticsearch is often the storage engine of choice for storing and querying full text data. But writing an ElasticSearch query is pretty different compared to querying a relational database in SQL. In this blogpost, you will learn some basics you need to understand before working with ElasticSearch. In the second part, you learn how to write queries in ElasticSearch.

ElasticSearch uses many of the same concepts as your SQL Database. The terminology is just a little different. The following table gives an overview

ElasticSearch vs. Relational Database

An index ( “database”) stores documents ( “rows in a tabe”) and has per default 5 shards ( “partitions”). All documents ( “rows”) given a type ( “table”) in an index ( “database”) have the same mapping ( “schema”). Documents ( “rows”) are stored in JSON-Format (you might know this from the complex data type struct in Hive for example). In a process called mapping, you define how a document, and the fields ( "columns") it contains, are stored and indexed. So the mapping is similar to the schema of a relational database. There are two options for string mappings:

  • exact values (keyword): keyword fields are useful for structured content such as city names, ids or email addresses. The whole string is indexed and retrieval is based on exact matches. So "Hamburg" is not the same as "hamburg". There is no option for partial matches. You would not choose a keyword field for full-text data.
  • Full text a.k.a. analyzed text(text): As the name implies, text data in this field will be analyzed. There are various built in analyzers. Most analyzer convert the text into a set of tokens, perform some kind of standardization (lowercase, stemming, lemmatization etc.) and add the field to the inverted index so it is searchable quickly (You can checkout my blogpost. on the inverted index and scoring, if you want to know more). By default, Elasticsearch sorts the results by relevance score. The higher the score the higher the rank of the document. This relevance_score is returned as a metafield. There are different query types that can modify the scores and customize the order of search results. Speaking of queries: Let's see how to write queries in ElasticSearch.

Query Tools

You have multiple options. For example:

  • You can query the index directly with curl.
  • You can use interfaces such as Kibana, which also provide some nice vizualization options. This one might come the most natural to you if you are used to using SQL-Clients.
  • You could query from python using the ElasticSearch Client

Using ElasticSearch Query DSL

Instead of using SELECT FROM WHERE syntax, you will write your query in json format. This is the basic anatomy of an elastic search query.

{ 
"query": { ... }
"sort": { ... }
"from": { ... }
"size": { ... }
}

So what does that mean?

The query clause is mandatory. Sorting, offsetting and limiting your search results is optional. So let’s dive deeper into the queries: There are two different kinds of query clauses.

  • Leaf query clauses Leaf query clauses look for a particular value in a particular field. These queries can be used by themselves. Examples are term and range. We'll see examples in a minute.
  • Compound query clauses Compound query clauses wrap leaf or other compound queries and therefore combine multiple queries (for example with boolean operators ( must, must_not, should) or by boosting specific matches with altering the _score. We'll get into some examples soon.

Leaf query clauses

Filtering with match and term

The term and terms query clauses are used for "strict filters". You apply those queries to keyword (exact value) fields. The field then must exactly match the field value, which includes whitespace and capitalization. It's like a simple WHERE clause in SQL without regexes. There is an analyzer applied. You can filter by single or multiple values. For example, this query finds all documents in field <<field>> with value <value>

while this query retrieves all documents with <value_a> or <value_b>.

The match query clause is the most generic and commonly used query clause. You can use it for text and keyword fields. It automatically figures out what to do: If you apply match on a keyword field, it will work as a filter and return documents with exact matches. If you use match on a text field it will perform an analyzed search, meaning that it applies the same analyzer as the field mapping, if you do not explicitly choose a different analyzer. You can write a match query as follows:

Compound query clauses

The bool query clause is an example of a compound query clause, as it is used to combine multiple query clauses using boolean operators. Just wrap the bool around your operators must ("AND"), must_not ("NOT") and should ("OR"). The scores from must, must_not, and should will be added together to provide the final score.

{ "bool": 
{ "must": { "term": { "tag": "math" }},
"must_not": { "term": { "tag": "probability" }},
"should": [
{ "term": { "favorite": true }},
{ "term": { "unread": true }} ]
}
}

You can also change the score of documents. For example a constant score query sets the relevance score equal to a specified value given by boost. For example this query doubles the default score of 1 for an exact match in <field>.

{ "query": 
{ "constant_score":
{ "filter": {
"term": { "<field>": "<some query string>" }
},
"boost": 2
}
}
}

You can also (de)boost documents without excluding them from the search results with a boosting query or boost matches in some fields more than in others. You can even apply ranking models like XGBoost for personalized search. But this is subject for another blogpost.

1. Get to know the mapping of your index

Check the mappings of the index with

GET <indexname>/_mapping

You have to understand, if the text field was mapped as a keyword or as text. If it is mapped as keyword (or other structured fields) you can only search for exact matches in the exact way the field was indexed. You find an overview for all term-level queries in the ES documentation. If the field is mapped as text it means it was modified by an analyzer. Your query string will be processed by the same analyzer (You have to figure out which analyzer is in place. Most analyzers tokenize and remove stop words. Some analyzers include synonyms, some use stemming or lemmatization techniques). You find an overview of all full text queries in the ES documentation

2. Decide what you need: all documents that match vs. most relevant documents

All documents: You just want to know, if there is a match or there isn’t. Then you can use filter queries, which do not compute a score. You would mostly use them on structured fields like keyword, date or boolean fields. Just wrap a filter clause(s) around your queries.

{ "filter": 
[ { "term": { "<field>": "<value>" }},
{ "range": { "<field>": { "gte": "<date>" }}}
}

Documents ordered by relevance: Put your queries inside a query clause ("Query Context") to automatically compute relevance score and maybe change the scores for reranking for example with boost.

{ "query": 
{ "bool":
{ "must": [
{ "match": { "<field_a>": "<value_a>"}},
{ "match": { "<field_b>": "<value_b>" }}
],
}
}
}

3. Decide how you want to query: querystring as is vs. analyzed querystring

Send query string as is without modification: You have to use a term query or other term-level queries. For example, if you search for "Hello World!" with a term query, the query engine will check the inverted index for an exact match for "Hello World!". "hello world!", "Hello World" or "Hallo World!" would not be retrieved.

Send query string analyzed: For that choice, you have to use the full text queries in the ES documentation. The querystring will process through the same analyzer as the field in the index, i.e. it will be tokenized, standardized, filtered. For example, if you search for the phrase “Hello World” then the query engine might check the inverted index for “hello” and “world” and (“!”) (depending on the analyzer you chose before).

Overview

  • query- main query container
  • bool - compound query container for must ("AND"), must_not ("NOT") and should ("OR")
  • filter - filter container for strict filtering: Each leaf query inside it won't contribute to the score of the matching documents
  • match - this is a full-text query, meaning the text will pass through the analyzer and transformed.
  • term - this is a term level query. The text won't pass through the analyzer and will be sent as is to the search engine.
  • apply expert knowledge by boosting with constant_score or bosting .(by^ operator, e.g. ["<field1>^2", "<field2>^3]").

Originally published at https://datadojo.dev on June 27, 2020.

--

--

--

I transitioned from a business student to Data Analytics to Data Science to Data Engineering. This blog is about sharing things I learned on the way and are useful for anyone moving into the field of Data Science/Data Engineering without a technical background.

Recommended from Medium

Dairy Me

Recursion and Python Strings

Hummingbird Finance has reached a partnership with SHIBARMY

How To: CodeBuild with Docker Image

NET603 Week 9 — Progress Report

Tangram Flex Awarded Contract for Agility Prime

How to paginate queries in SQL

My Django dev setup

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Heike Maria

Heike Maria

PhD, data enthusiast with business background and karate kid. Currently working as a Data Scientist @Xing

More from Medium

Paging by Search After — ElasticSearch

Migrating sub-select SQL query to Presto

Elastic, data aggregation

How to using Mapping and Aliases on Elasticsearch