Elasticsearch: Complex Queries

When I started exploring Elasticsearch for search and filtering, I came across many blog posts that cover how to get started, basic search, and so on. What they have in common is that the authors rarely go deep enough to show how things actually work. Believe me, it took me just 5 minutes to set up Elasticsearch and get started with basic search. It's that simple. But it took me almost a day to understand tokenizers, the difference between term and terms, match vs match_all vs match_phrase, aggregations, range queries, and how to build complex filter queries. So I am going to talk less about getting started and more about how to actually make use of Elasticsearch.

To make the concepts concrete, I'm going to use an eCommerce product filter as the running example in this post: for example, getting all mobile phones from brand X with a price either between P1 and P2 or between P3 and P4.

Product filter. Image credit: Zefo

Follow the installation instructions at https://www.elastic.co/downloads/elasticsearch to set up Elasticsearch. Once the archive is unpacked, run bin/elasticsearch to start the Elasticsearch server. To verify that the server is up and running, use the command below.

curl -X GET http://localhost:9200/

You should see a response like the one below. We're good to go!

{
  "name" : "8KwKH_k",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "7C0FnVVTTlG9Z80q_WjsGA",
  "version" : {
    "number" : "5.5.2",
    "build_hash" : "b2f0c09",
    "build_date" : "2017-08-14T12:33:14.154Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.0"
  },
  "tagline" : "You Know, for Search"
}

Let's start by creating an index (roughly the equivalent of a database) and bulk-inserting some dummy products. I already have some dummy products hosted on GitHub. Use the commands below to import them.

curl -O https://raw.githubusercontent.com/sherlockcodes/elasticsearch/master/zefo/zefo/products.json
curl -s -XPOST 'http://localhost:9200/_bulk?pretty' --data-binary @products.json

To verify that all products were imported, use a match_all query with an empty condition object, as shown below.

curl -XGET 'localhost:9200/mobile/_search?pretty' -d '
{
  "query" : {
    "match_all" : {}
  }
}'

Let's try getting all mobiles from the Micromax brand using the curl command below.

curl -XGET 'localhost:9200/mobile/_search?pretty' -d '
{
  "query" : {
    "match" : { "brand" : "Micromax" }
  }
}'

You should see 2 results from the Micromax brand. Now, let's try to get all mobile phones in "Unboxed Phones" condition.

curl -XGET 'localhost:9200/mobile/_search?pretty' -d '
{
  "query" : {
    "match" : { "condition" : "Unboxed Phones" }
  }
}'

When you run the search query above, you will see mobiles in other conditions in the results as well. Strange, isn't it?

Internally, a match query splits the given input into words. If any one of those words is present in the field value, the document is added to the results. Each matching document is scored by how relevant the input query is to the value in the document, so the most relevant documents appear first in the results. Cool!
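To see why other conditions show up, here is a minimal Python sketch of the idea. It is only a rough model: analyze and match are hypothetical helper names, the tokenizer is a simplification of what Elasticsearch really does, scoring is left out entirely, and "Brand New" is a made-up value added for contrast.

```python
import re

def analyze(text):
    """Lowercase the text and split it on non-alphanumeric characters.

    A rough stand-in for Elasticsearch's default analysis, not the real thing.
    """
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def match(query_text, field_value):
    """Return True if the query and the field share at least one token (OR semantics)."""
    return bool(set(analyze(query_text)) & set(analyze(field_value)))

# "Brand New" is a hypothetical value, used only to show a non-match.
conditions = ["Unboxed Phones", "Gently Used Phones", "Brand New"]

for value in conditions:
    print(value, "->", match("Unboxed Phones", value))
```

"Gently Used Phones" matches because it shares the token "phones" with the query, which is exactly the surprise we ran into.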

But we want only mobiles in Unboxed Phones condition, right? When we want to work with exact values, we use non-scoring, filtering queries. Filtering queries matter because they are fast: they skip the scoring process entirely, and their results can be cached. So we will use a constant_score query to execute the term query in non-scoring mode and apply a uniform score of one, as shown below.

curl -XGET 'localhost:9200/mobile/_search?pretty' -d '
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "term" : {
          "condition" : "Unboxed Phones"
        }
      }
    }
  }
}'

But there is still a little hiccup: we don't get any results back! There is nothing wrong with the filter or the term query. It has to do with the way Elasticsearch analyzes a value when indexing it. You can use the analyze API to see how Elasticsearch splits a value into smaller tokens, using the command below.

curl -XGET 'localhost:9200/mobile/_analyze?pretty' -d '
{
  "field" : "condition",
  "text" : "Unboxed Phones"
}'

The result below shows how the text "Unboxed Phones" is tokenized by Elasticsearch.

{
  "tokens" : [
    {
      "token" : "unboxed",
      "start_offset" : 0,
      "end_offset" : 7,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "phones",
      "start_offset" : 8,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 1
    }
  ]
}

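When a document is inserted, Elasticsearch tokenizes the value into small tokens, removing special characters and lowercasing all letters. The term query looks for the exact term "Unboxed Phones", which does not exist in the index; only "unboxed" and "phones" do. To prevent this from happening, you need to tell Elasticsearch that the field holds an exact value and shouldn't be analyzed into tokens. To do this, we have to delete the index first, because its mapping is incorrect, and create a new one with the commands below. (On Elasticsearch 5.x and later, "type": "keyword" is the current way to declare such a field; the "string" / "not_analyzed" form used here is the older syntax.)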
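The effect of analysis on a term lookup can be sketched in a few lines of Python. This is a toy inverted index, not how Elasticsearch is implemented: build_index and term_lookup are hypothetical names, the document ids are made up, and the tokenizer is the same simplification as before.

```python
import re
from collections import defaultdict

def analyze(text):
    """Simplified analysis: lowercase, then split on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def build_index(docs, analyzed=True):
    """Map each indexed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # An analyzed field stores tokens; a non-analyzed field stores the value as-is.
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return index

def term_lookup(index, term):
    """Exact lookup, like a term query: the input itself is not analyzed."""
    return sorted(index.get(term, set()))

docs = {1: "Unboxed Phones", 2: "Gently Used Phones"}  # made-up ids

analyzed_index = build_index(docs, analyzed=True)
exact_index = build_index(docs, analyzed=False)

print(term_lookup(analyzed_index, "Unboxed Phones"))  # the whole phrase is not a stored term
print(term_lookup(analyzed_index, "phones"))          # individual tokens are
print(term_lookup(exact_index, "Unboxed Phones"))     # exact value stored as one term
```

The first lookup comes back empty for the same reason our term query did; once the value is stored as a single term, the exact lookup finds it.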

DELETE INDEX:

curl -XDELETE 'localhost:9200/mobile?pretty'

CREATE INDEX:

curl -XPUT 'localhost:9200/mobile' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "condition" : {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'

Now you have to bulk-import all the products again and retry the same filtered term query. You should see one matching mobile in the results. So far, we've covered basic search and filtering, how Elasticsearch tokenizes values, and scoring and relevancy.

Now I would also like to explore mobiles in two different conditions, "Gently Used Phones" and "Unboxed Phones". We can achieve this by combining two term queries, or by using a single terms query.

curl -XGET 'localhost:9200/mobile/_search?pretty' -d '
{
  "query" : {
    "constant_score" : {
      "filter" : {
        "terms" : {
          "condition" : ["Unboxed Phones", "Gently Used Phones"]
        }
      }
    }
  }
}'

The results you get back are the same with both approaches. The caching difference that is often quoted (terms filters cached automatically, combined term filters cached only on request) applies to the old 1.x releases; from Elasticsearch 2.0 onward, the engine decides by itself which filters to cache. A single terms query is still shorter and easier to read, so prefer terms over multiple term conditions in this type of use case.
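To compare the two forms side by side, here is a small Python sketch that builds both request bodies as plain dictionaries. The helper names (terms_filter, term_filters_combined, constant_score) are my own and not part of any Elasticsearch client; the combined form assumes the usual way of OR-ing term queries, a bool query with should clauses.

```python
import json

def terms_filter(field, values):
    """One terms query: matches documents whose field equals any of the values."""
    return {"terms": {field: list(values)}}

def term_filters_combined(field, values):
    """The same condition spelled out as several term queries OR-ed in a bool query."""
    return {"bool": {"should": [{"term": {field: value}} for value in values]}}

def constant_score(filter_clause):
    """Wrap a filter in a constant_score query, as in the requests above."""
    return {"query": {"constant_score": {"filter": filter_clause}}}

conditions = ["Unboxed Phones", "Gently Used Phones"]

short_form = constant_score(terms_filter("condition", conditions))
long_form = constant_score(term_filters_combined("condition", conditions))

print(json.dumps(short_form, indent=2))
print(json.dumps(long_form, indent=2))
```

Either body can be sent with curl -d just like the earlier requests; the terms version is simply less to type and read.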

Let's make it more complex by querying on multiple fields, adding a price range to narrow down the filter results. To achieve this, we have to understand how to construct a query that satisfies multiple conditions on different fields: for example, getting mobiles from brand X or Y with a price either between P1 and P2 or between P3 and P4.

To do this in Elasticsearch, we use the bool query, AKA the complex query: a query that matches documents matching boolean combinations of other queries. The bool query syntax looks something like the skeleton below. You can include whichever clauses your query needs.

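{
  "query": {
    "bool": {
      "must": {
        // basically, the AND condition in SQL queries
      },
      "filter": {
        // another form of AND condition, but without scoring
      },
      "should": {
        // OR condition; optional when must or filter is present, unless minimum_should_match is set
      },
      "must_not": {
        // NOT condition; documents matching these clauses are excluded
      }
    }
  }
}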
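How the four clause types combine is easier to see in code than in prose. Below is a rough Python model of bool matching in query context. It ignores scoring, bool_matches and the two predicate helpers are hypothetical names, and the sample phone is made up. The default for minimum_should_match follows the documented rule: 1 when there are only should clauses, 0 as soon as must or filter is present.

```python
def bool_matches(doc, must=(), filters=(), should=(), must_not=(),
                 minimum_should_match=None):
    """Decide whether a document matches a bool query (scoring is ignored)."""
    if minimum_should_match is None:
        # should clauses become optional once must or filter clauses exist.
        minimum_should_match = 0 if (must or filters) else (1 if should else 0)
    if not all(clause(doc) for clause in must):
        return False
    if not all(clause(doc) for clause in filters):
        return False
    if any(clause(doc) for clause in must_not):
        return False
    matched_should = sum(1 for clause in should if clause(doc))
    return matched_should >= minimum_should_match

def brand_in(*brands):
    """Predicate standing in for a terms query on brand."""
    return lambda doc: doc["brand"] in brands

def price_between(low, high):
    """Predicate standing in for a range query on price (gte/lte)."""
    return lambda doc: low <= doc["price"] <= high

phone = {"brand": "samsung", "price": 12000}  # made-up phone, outside both ranges
must = [brand_in("micromax", "samsung")]
should = [price_between(6000, 10000), price_between(16000, 30000)]

# With must present, should is optional, so the phone still matches.
print(bool_matches(phone, must=must, should=should))
# Requiring at least one should clause excludes it.
print(bool_matches(phone, must=must, should=should, minimum_should_match=1))
```

The takeaway is that should only behaves like a strict OR when nothing else is in the bool block, or when minimum_should_match says so.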

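We use the query below to get mobile phones from Micromax or Samsung with a price between 6000 and 10000 or between 16000 and 30000. Setting minimum_should_match to 1 makes at least one of the price ranges mandatory; without it, should clauses that sit next to a must clause only influence scoring.

curl -XGET 'localhost:9200/mobile/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "terms": { "brand": ["micromax", "samsung"] } }
      ],
      "should": [
        { "range": { "price": { "gte": 6000, "lte": 10000 } } },
        { "range": { "price": { "gte": 16000, "lte": 30000 } } }
      ],
      "minimum_should_match": 1
    }
  }
}'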

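So all AND conditions go inside the must block, and all OR conditions go inside the should block. Keep in mind that when a bool query also has a must clause, its should clauses are optional unless minimum_should_match is set. Another way to make the price ranges mandatory is to nest them: the query below puts a second bool block, containing only the should clauses, inside must. It is like constructing a query by putting one conditional block within another.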

curl -XGET 'localhost:9200/mobile/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "terms": { "brand": ["micromax", "samsung"] } },
        {
          "bool": {
            "should": [
              { "range": { "price": { "gte": 6000, "lte": 10000 } } },
              { "range": { "price": { "gte": 16000, "lte": 30000 } } }
            ]
          }
        }
      ]
    }
  }
}'
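In a real product filter the brands and price ranges come from whatever the user ticks in the UI, so the nested query above is usually generated rather than typed. Here is a minimal sketch of such a builder in Python; price_filter_query is a hypothetical helper of mine, and it only produces the JSON body, it does not talk to Elasticsearch.

```python
import json

def price_filter_query(brands, price_ranges):
    """Build the nested bool query: brand must match AND any one price range must match."""
    range_clauses = [
        {"range": {"price": {"gte": low, "lte": high}}}
        for low, high in price_ranges
    ]
    return {
        "query": {
            "bool": {
                "must": [
                    {"terms": {"brand": list(brands)}},
                    # A bool with only should clauses requires at least one of them.
                    {"bool": {"should": range_clauses}},
                ]
            }
        }
    }

query = price_filter_query(
    brands=["micromax", "samsung"],
    price_ranges=[(6000, 10000), (16000, 30000)],
)
print(json.dumps(query, indent=2))
```

The printed body has the same structure as the curl request above and can be passed to it with -d.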

Let's get a summary of the search results using the aggregations framework, as shown below.

curl -XGET 'localhost:9200/mobile/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query": {
    "bool": {
      "must": [
        { "terms": { "brand": ["micromax", "samsung"] } },
        {
          "bool": {
            "should": [
              { "range": { "price": { "gte": 6000, "lte": 10000 } } },
              { "range": { "price": { "gte": 16000, "lte": 30000 } } }
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "by_condition": {
      "terms": {
        "field": "condition",
        "size": 5
      }
    },
    "by_camera": {
      "terms": {
        "field": "camera",
        "size": 5
      }
    }
  }
}'

The size parameter defines how many term buckets are returned out of the overall terms list. When you pass size, each shard returns its own top terms, and those per-shard lists are then merged. Note that aggregation results may not be fully accurate because of this sharding.

In the response, you will see an aggregated view of the search results, like the one below:

"buckets" : [
{
"key" : "Gently Used Phones",
"doc_count" : 2
},
{
"key" : "Unboxed Phones",
"doc_count" : 1
}
]
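The note about sharding and accuracy is worth a tiny experiment. The Python sketch below is a simplified model, not Elasticsearch's actual implementation: each "shard" is just a list of values, the terms A, B and C and their counts are made up, and every shard reports only its own top shard_size terms before the lists are merged.

```python
from collections import Counter

def shard_top_terms(values, shard_size):
    """What a single shard reports: its own top terms with local counts."""
    return Counter(values).most_common(shard_size)

def terms_aggregation(shards, size, shard_size=None):
    """Merge the per-shard top terms and keep the overall top `size` buckets."""
    shard_size = size if shard_size is None else shard_size
    merged = Counter()
    for values in shards:
        for term, count in shard_top_terms(values, shard_size):
            merged[term] += count
    return merged.most_common(size)

# Two made-up shards holding the field values of their documents.
shards = [
    ["A"] * 5 + ["B"] * 4 + ["C"] * 3,
    ["C"] * 6 + ["B"] * 4 + ["A"] * 1,
]

# Exact counts over all documents, for comparison.
exact = Counter(value for values in shards for value in values).most_common(2)

print("exact:       ", exact)
print("size=2:      ", terms_aggregation(shards, size=2))
print("shard_size=3:", terms_aggregation(shards, size=2, shard_size=3))
```

With only the top 2 terms per shard, C is undercounted because the first shard never reports it. Asking each shard for more terms fixes the result, which is the trade-off the shard_size parameter of the terms aggregation controls.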

So, here we are at the end of this post. We have covered:

1) How to get started
2) Tokenizer
3) Term vs Terms query
4) Range query
5) How to build complex query
6) Basic aggregations.

