A Guide to Aggregations in Elasticsearch
Written by Krupa and Sindhuri Kalyanapu
In the previous part, we walked through setting up full-text search in Elasticsearch. In this post, we look at aggregations in Elasticsearch in detail.
Aggregates
Consider an analytics dashboard with filters for narrowing down search results. Each filter attribute has its own set of unique values; for instance, a drill-down by gender. These values are simply the aggregate values of a field.
Aggregate values are easy to retrieve in Elasticsearch. An aggregate query on the gender field returns its unique values, and further searches can then be run on the selected gender, reducing the result set.
One challenge is that terms aggregations do not work on text fields by default; they need a keyword field (or another field type backed by doc values). The schema we defined for the full-text search use case consists mostly of text fields (refer to the previous blog for schema details). Running an aggregation on the gender field results in this exception:
Fielddata is disabled on text fields by default. Set `fielddata=true` on [`your_field_name`] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
But what if one wants to perform both aggregation and full-text search on the same set of fields? Here are some options:
1. Enabling fielddata for the fields:
Fielddata can consume a lot of heap space, especially when loading high-cardinality text fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Loading fielddata is also an expensive process that can cause users to experience latency hits. Hence fielddata is turned off by default.
2. Using multi-fields:
This is the mechanism of mapping the same field with different data types. In our case, we map the field as both text and keyword. Although this approach uses more disk space, it is more practical than the previous one because keyword fields rely on on-disk doc values instead of loading fielddata into the heap.
Weighing the two approaches, multi-fields is the more feasible one. Aggregations are still expensive, so let business needs decide which aggregates are must-haves and which are nice-to-haves.
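To make the comparison concrete, here is a minimal sketch of the two options written as mapping fragments in Python dictionaries. The helper names (`fielddata_mapping`, `multi_field_mapping`, `aggregation_field`) are hypothetical and exist only for illustration; the analyzer name and the `raw` sub-field follow the mapping used in this post.

```python
import json


def fielddata_mapping(analyzer="ngram_analyzer"):
    """Option 1: a text field with fielddata enabled (loads terms into the heap)."""
    return {"type": "text", "analyzer": analyzer, "fielddata": True}


def multi_field_mapping(analyzer="ngram_analyzer", sub_field="raw"):
    """Option 2: a text field with a keyword sub-field used for aggregations."""
    return {
        "type": "text",
        "analyzer": analyzer,
        "fields": {sub_field: {"type": "keyword"}},
    }


def aggregation_field(field, mapping):
    """Return the field path a terms aggregation should target for this mapping."""
    for name, sub in mapping.get("fields", {}).items():
        if sub.get("type") == "keyword":
            return f"{field}.{name}"
    if mapping.get("fielddata"):
        return field
    raise ValueError(f"{field} is not aggregatable with this mapping")


if __name__ == "__main__":
    print(json.dumps({"gender": multi_field_mapping()}, indent=2))
    print(aggregation_field("gender", multi_field_mapping()))
```

With the multi-field option, full-text queries keep targeting `gender`, while aggregations target the keyword sub-field `gender.raw`.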
In our case, let us assume that we require aggregations on the gender field. The users index mapping changes as follows:
curl --location --request PUT 'http://localhost:9200/users' \
--header 'Content-Type: application/json' \
--data-raw '{
  "settings": {
    "index.max_ngram_diff": 10,
    "analysis": {
      "analyzer": {
        "ngram_analyzer": {
          "tokenizer": "ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "dynamic": "true",
    "dynamic_templates": [{
      "anything": {
        "match": "*",
        "mapping": {
          "index": true,
          "type": "text",
          "analyzer": "ngram_analyzer"
        }
      }
    }],
    "properties": {
      "email": {
        "type": "text",
        "index": true,
        "analyzer": "ngram_analyzer"
      },
      "fullName": {
        "type": "text",
        "index": true,
        "analyzer": "ngram_analyzer"
      },
      "gender": {
        "type": "text",
        "index": true,
        "analyzer": "ngram_analyzer",
        "fields": {
          "raw": {
            "type": "keyword",
            "normalizer": "lowercase_normalizer"
          }
        }
      },
      "mongoId": {
        "type": "text",
        "index": false
      },
      "login": {
        "type": "integer",
        "index": false
      }
    }
  }
}'
Case Insensitive Aggregates
Consider the values ['Male', 'male', 'Female', 'female'] for the gender field. Without a normalizer, the aggregation treats each case variant as a distinct value and returns ['Male', 'male', 'Female', 'female'].
In the mapping above, lowercase_normalizer is added to the keyword sub-field so that the aggregation is case-insensitive and returns ['male', 'female'].
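The effect of the normalizer can be illustrated outside Elasticsearch. The sketch below is plain Python, not Elasticsearch code: the hypothetical helper `terms_buckets` counts distinct values roughly the way a terms aggregation buckets them, once on the raw values and once after lowercasing them.

```python
from collections import Counter


def terms_buckets(values, normalize=None):
    """Count distinct values, optionally normalizing each one first."""
    if normalize is not None:
        values = [normalize(v) for v in values]
    return dict(Counter(values))


genders = ["Male", "male", "Female", "female"]

# Without a normalizer every case variant gets its own bucket.
without_normalizer = terms_buckets(genders)

# Lowercasing first (what lowercase_normalizer does) merges the variants.
with_normalizer = terms_buckets(genders, normalize=str.lower)

print(without_normalizer)  # {'Male': 1, 'male': 1, 'Female': 1, 'female': 1}
print(with_normalizer)     # {'male': 2, 'female': 2}
```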
Sample Query to get unique values of the field
curl --location --request POST 'http://localhost:9200/users/_search' \
--header 'Content-Type: application/json' \
--data-raw '{
  "query": {
    "match": {
      "email": "gmail"
    }
  },
  "aggregations": {
    "gender": {
      "terms": {
        "field": "gender.raw",
        "size": 10
      }
    }
  },
  "size": 0
}'
The query above fetches the unique values of the gender field for the documents whose email field contains gmail. It has three parts:
- Query: filters the documents on which the aggregation is performed. It is optional.
- Aggregations: contains an aggregation name and the terms aggregation for that name. The terms aggregation specifies the field to aggregate on and a size parameter that sets the maximum number of unique field values (buckets) to return.
- Size: often confused with the size parameter inside the aggregation. This top-level size specifies the number of matching documents to return in the response. Setting it to 0 returns only the aggregation results; omitting it would also return matching documents (10 by default) alongside the unique field values.
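The three parts can also be assembled programmatically. The following is a small sketch, assuming the request body is built as a Python dictionary before being sent to the `_search` endpoint; the function `build_terms_request` and its parameter names are hypothetical, chosen to keep the two different size settings apart.

```python
import json


def build_terms_request(agg_name, field, bucket_limit=10, query=None, hits=0):
    """Build a _search body with one terms aggregation.

    bucket_limit -> "size" inside the aggregation (max unique values returned)
    hits         -> top-level "size" (number of matching documents returned)
    """
    body = {
        "aggregations": {
            agg_name: {"terms": {"field": field, "size": bucket_limit}}
        },
        "size": hits,
    }
    if query is not None:  # the query part is optional
        body["query"] = query
    return body


body = build_terms_request(
    "gender", "gender.raw", query={"match": {"email": "gmail"}}
)
print(json.dumps(body, indent=2))
```

Naming the two values differently in the helper is a deliberate choice: it avoids mixing up the bucket limit with the number of returned hits.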
The response to the aggregation query above is:
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "gender": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "male",
          "doc_count": 2
        },
        {
          "key": "female",
          "doc_count": 1
        }
      ]
    }
  }
}
From the response, you can think of the documents as being sorted into "buckets", one per gender. Each bucket contains a key (the gender value) and a doc_count, which is the number of documents with that gender.
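On the client side, the buckets are usually flattened into something simpler. Here is a minimal sketch, assuming the response has already been parsed from JSON into a Python dictionary; `bucket_counts` is a hypothetical helper, and the sample data is a trimmed copy of the response shown above.

```python
def bucket_counts(response, agg_name):
    """Map each bucket key to its doc_count for the named terms aggregation."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Trimmed copy of the sample response from this post.
response = {
    "hits": {"total": {"value": 3, "relation": "eq"}, "hits": []},
    "aggregations": {
        "gender": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "male", "doc_count": 2},
                {"key": "female", "doc_count": 1},
            ],
        }
    },
}

counts = bucket_counts(response, "gender")
print(counts)  # {'male': 2, 'female': 1}
```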
Wrapping Up
As we have seen, Elasticsearch has good support for running complex aggregate queries on indexed fields, which is a valuable addition to the full-text search features we discussed before. While we have covered the aspects of full-text search that we have worked with, Elasticsearch offers several other tokenizers that one may adopt depending on the use case.
In the upcoming blog, we will cover a few frequent scenarios you may come across if Elasticsearch is part of your infrastructure.
Also, check our post on ElasticSearch Mappings