Elasticsearch — Documents aggregations

Eleonora Fontana
Feb 22 · 7 min read
Photo by Jessica Johnston on Unsplash

Introduction

In this article we will discuss how to aggregate the documents of an index. If you are not familiar with the Elasticsearch engine, we recommend checking the articles available in our publication.

In the first section we will provide a general introduction to the topic and create an example index to test what we will learn, whereas in the other sections we will go through different types of aggregations and how to perform them.

Metric aggregations

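Elasticsearch organizes aggregations into three categories:

Metric aggregations, which calculate metrics, such as a sum or an average, from field values.
Bucket aggregations, which group documents into buckets based on field values, ranges, or other criteria.
Pipeline aggregations, which take their input from other aggregations instead of documents or fields.

In this article we will only discuss the first two kinds of aggregations, since the pipeline ones are more complex and are needed less often.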

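First of all, we need to create a new index for all the examples we will go through. It will be named “order” and you can define it using the request available here. The examples in this article rely on the following fields of its documents: “salesman.id”, “status”, “total_amount”, “purchased_at” and the nested “lines” field, whose objects contain an “amount” field.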

The next step is to index some documents. You can do so with the request available here.

The basic structure of an aggregation request in Elasticsearch is the following:

GET /indexName/_search
{
"aggs": {
"aggregationName": { # name we will also see in the results
"aggregationType": {
… # specific settings based on the aggregation we choose
}
}
}
}
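To make the structure above more concrete, here is a minimal Python sketch that assembles such a request body as a dictionary. The helper name build_agg_request is hypothetical and not part of any Elasticsearch client; the aggregation name "average_amount" is just an illustration, and the body would still have to be sent to the _search endpoint by a client of your choice.

```python
import json


def build_agg_request(agg_name, agg_type, settings, size=0):
    """Return a search request body containing a single aggregation.

    Hypothetical helper: it only builds the JSON structure, it does not
    talk to Elasticsearch.
    """
    return {
        "size": size,  # 0 means: return no hits, only aggregation results
        "aggs": {
            agg_name: {          # name we will also see in the results
                agg_type: settings  # settings specific to the chosen aggregation
            }
        },
    }


body = build_agg_request("average_amount", "avg", {"field": "total_amount"})
print(json.dumps(body, indent=2))
```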

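As a first example, we would like to use the cardinality aggregation to find the total number of distinct salesmen. Keep in mind that this aggregation returns an approximate count. We can achieve this by running the following request, in which "size": 0 omits the search hits from the response so that only the aggregation results are returned: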

GET /order/_search
{
"size": 0,
"aggs": {
"total_salesmen": {
"cardinality": {
"field": "salesman.id"
}
}
}
}
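To clarify what this request computes, the following Python sketch counts the distinct salesman ids over a few hypothetical documents. It is only an illustration of the idea: the count here is exact, whereas Elasticsearch returns an approximation.

```python
# Hypothetical sample documents; only the field used by the aggregation is shown.
orders = [
    {"salesman": {"id": 1}},
    {"salesman": {"id": 2}},
    {"salesman": {"id": 1}},
    {"salesman": {"id": 3}},
]

# A set keeps each id once, so its size is the number of distinct salesmen.
total_salesmen = len({order["salesman"]["id"] for order in orders})
print(total_salesmen)
```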

Bucket aggregations

Bucket aggregations create buckets of documents based on some criteria. Each resulting bucket is identified by its “key” field. This is useful when we want to look at distributions in our data. It is closely related to the GROUP BY clause in SQL.

For example, the terms aggregation creates one bucket of orders for each distinct value of the “status” field:

GET /order/_search
{
"size": 0,
"aggs": {
"status_terms": {
"terms": {
"field": "status",
"order": {
"_key": "asc"
}
}
}
}
}

Note that if there are documents with a missing or null value for the field used to aggregate, we can set a key name to create a bucket for them: "missing": "missingName".

We can specify a minimum number of documents in order for a bucket to be created. It is equal to 1 by default and can be modified by the min_doc_count parameter.

We can also specify how to order the results, for example by ascending key as in the request above: "order": { "_key": "asc" }.
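The effect of the missing, min_doc_count and order parameters can be sketched in plain Python. The function terms_buckets and the sample documents are hypothetical; the code only imitates the behaviour described above on an in-memory list and is not how Elasticsearch implements it.

```python
from collections import Counter


def terms_buckets(docs, field, missing=None, min_doc_count=1):
    """Group docs by field value and return buckets sorted by key (ascending)."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue  # without "missing", such documents are ignored
            value = missing  # otherwise they land in the bucket named by "missing"
        counts[value] += 1
    return [
        {"key": key, "doc_count": count}
        for key, count in sorted(counts.items())
        if count >= min_doc_count  # drop buckets that are too small
    ]


orders = [
    {"status": "completed"},
    {"status": "cancelled"},
    {"status": "completed"},
    {"status": None},  # null value
    {},                # missing field
]

buckets = terms_buckets(orders, "status", missing="N/A")
big_buckets = terms_buckets(orders, "status", min_doc_count=2)
print(buckets)
print(big_buckets)
```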

This kind of aggregation needs to be handled with care, because the document count might not be accurate: since Elasticsearch is distributed by design, the coordinating node queries all the shards and gets the top results from each of them. In the case of unbalanced document distribution between shards, this could lead to approximate results. To understand this better, suppose we have the following number of documents per product in each shard:

[Image: number of documents per product in each shard]

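Imagine that the search engine only looked at the top 3 results from each shard (by default the terms aggregation returns the top 10 terms and asks each shard for a larger number of candidates, set by the shard_size parameter, which defaults to size * 1.5 + 10). The bucket aggregation response would then contain a mismatch in some cases: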

[Image: document counts returned by the aggregation compared with the actual counts]
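The same effect can be reproduced with a small Python simulation. The per-shard counts below are made up for illustration and are not the ones from the original figures; the function merged_top_terms is a hypothetical stand-in for the merge performed by the coordinating node.

```python
from collections import Counter

# Hypothetical number of documents per product on each of three shards.
shards = [
    Counter({"A": 50, "B": 40, "C": 30, "D": 29}),
    Counter({"A": 45, "D": 35, "B": 20, "C": 5}),
    Counter({"C": 60, "D": 28, "A": 25, "B": 10}),
]


def merged_top_terms(shards, shard_size):
    """Merge only the top `shard_size` terms reported by each shard."""
    merged = Counter()
    for shard in shards:
        merged.update(dict(shard.most_common(shard_size)))
    return merged


approximate = merged_top_terms(shards, shard_size=3)
actual = sum(shards, Counter())  # what we would get by reading every count

for product in sorted(actual):
    print(product, "reported:", approximate[product], "actual:", actual[product])
```

Product D is the clearest case: it is just outside the top 3 on the first shard, so its 29 documents there are never reported and the merged count is lower than the real one.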

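As a consequence of this behaviour, Elasticsearch provides us with two additional keys in the aggregation results:

doc_count_error_upper_bound, an upper bound of the error on the document count of each returned bucket.
sum_other_doc_count, the sum of the document counts of all buckets that are not part of the response.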

Another thing we may need is to define buckets based on a given rule, similarly to what we would obtain in SQL by filtering the result of a GROUP BY query with a WHERE clause. For example, we can place documents into buckets based on whether the order status is “cancelled” or “completed”:

GET /order/_search
{
"size": 0,
"aggs": {
"my_filter": {
"filters": {
"filters": {
"cancelled_orders": {
"match": {
"status": "cancelled"
}
},
"completed_orders": {
"match": {
"status": "completed"
}
}
}
}
}
}
}

It is then possible to add a sub-aggregation at the same level as the outer “filters” key, so that it is computed for each bucket:

GET /order/_search
{
"size": 0,
"aggs": {
"my_filter": {
"filters": {
"filters": {
"cancelled_orders": {
"match": {
"status": "cancelled"
}
},
"completed_orders": {
"match": {
"status": "completed"
}
}
}
},
"aggs": {
"avg_total_amount": {
"avg": {
"field": "total_amount"
}
}
}
}
}
}

Sub-aggregations

In Elasticsearch it is also possible to perform sub-aggregations, simply by nesting them into our request:

GET /order/_search
{
"size": 0,
"aggs": {
"status_terms": {
"terms": {
"field": "status"
},
"aggs": {
"status_stats": {
"stats": {
"field": "total_amount"
}
}
}
}
}
}

What we did was to create buckets using the “status” field and then retrieve statistics for each set of orders via the stats aggregation.
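In Python terms, this is roughly equivalent to grouping the orders by status and then computing the five values reported by the stats aggregation (count, min, max, avg and sum) for each group. The sample orders below are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample orders.
orders = [
    {"status": "cancelled", "total_amount": 40.0},
    {"status": "completed", "total_amount": 120.0},
    {"status": "completed", "total_amount": 80.0},
]

# First level: one bucket per status (the terms aggregation).
amounts_by_status = defaultdict(list)
for order in orders:
    amounts_by_status[order["status"]].append(order["total_amount"])

# Second level: statistics computed inside each bucket (the stats aggregation).
status_stats = {
    status: {
        "count": len(amounts),
        "min": min(amounts),
        "max": max(amounts),
        "avg": sum(amounts) / len(amounts),
        "sum": sum(amounts),
    }
    for status, amounts in amounts_by_status.items()
}

print(status_stats)
```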

Note that we can add any query we need to filter the documents before performing the aggregation. For example, the last request can be executed only on the orders whose “total_amount” value is greater than or equal to 100:

GET /order/_search
{
"size": 0,
"query": {
"range": {
"total_amount": {
"gte": 100
}
}
},
"aggs": {
"status_terms": {
"terms": {
"field": "status"
},
"aggs": {
"status_stats": {
"stats": {
"field": "total_amount"
}
}
}
}
}
}

Range aggregations

There are two types of range aggregation, range and date_range, which are both used to define buckets using range criteria. The date_range is dedicated to the date type and allows date math expressions.

An example of range aggregation could be to aggregate orders based on their “total_amount” value:

GET /order/_search
{
"size": 0,
"aggs": {
"amount_distribution": {
"range": {
"field": "total_amount",
"ranges": [
{
"to": 50
},
{
"from": 50,
"to": 100
},
{
"from": 100
}
]
}
}
}
}

The bucket name is shown in the response as the “key” field of each bucket. You can set the keyed parameter of the range aggregation to true in order to get the buckets as an object whose keys are the bucket names, instead of an array. You can also specify a name for each bucket by adding "key": "bucketName" to the objects contained in the ranges array of the aggregation.

Note that the from value used in the request is included in the bucket, whereas the to value is excluded from it.
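The boundary rule is easy to get wrong, so here is a small Python sketch of it. The function range_bucket_keys is hypothetical, and the generated key strings are only modelled after the default keys Elasticsearch shows for unnamed ranges.

```python
def range_bucket_keys(value, ranges):
    """Return the keys of all ranges that contain value.

    "from" is inclusive, "to" is exclusive; a missing bound means unbounded.
    """
    keys = []
    for r in ranges:
        lower, upper = r.get("from"), r.get("to")
        if (lower is None or value >= lower) and (upper is None or value < upper):
            low = "*" if lower is None else float(lower)
            high = "*" if upper is None else float(upper)
            keys.append(f"{low}-{high}")
    return keys


# Same ranges as in the request above.
ranges = [{"to": 50}, {"from": 50, "to": 100}, {"from": 100}]

for amount in (49.99, 50, 100):
    print(amount, range_bucket_keys(amount, ranges))
```

An order with a “total_amount” of exactly 50 therefore belongs to the second bucket, not the first.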

The date_range aggregation has the same structure as the range one, but allows date math expressions. Let’s divide orders based on the purchase date and set the date format to “yyyy-MM-dd”:

GET /order/_search
{
"size": 0,
"aggs": {
"purchased_ranges": {
"date_range": {
"field": "purchased_at",
"format": "yyyy-MM-dd",
"keyed": true,
"ranges": [
{
"from": "2016-01-01",
"to": "2016-01-01||+6M"
},
{
"from": "2016-01-01||+6M",
"to": "2016-01-01||+1y"
}
]
}
}
}
}
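To see which dates the date math expressions in this request resolve to, we can reproduce the arithmetic in Python. The helper add_months is hypothetical and deliberately simple: it only handles days that exist in every month, such as the 1st used here.

```python
from datetime import date


def add_months(d, months):
    """Shift a date by a number of months (valid for days present in every month)."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, d.day)


anchor = date(2016, 1, 1)
ranges = [
    (anchor, add_months(anchor, 6)),                  # "2016-01-01" to "2016-01-01||+6M"
    (add_months(anchor, 6), add_months(anchor, 12)),  # "||+6M" to "||+1y"
]


def in_range(d, bounds):
    """The lower bound is included, the upper bound is excluded."""
    return bounds[0] <= d < bounds[1]


for start, end in ranges:
    print(start.isoformat(), "to", end.isoformat())
```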

Histogram aggregations

We just learnt how to define buckets based on ranges, but what if we don’t know the minimum or maximum value of the field? Elasticsearch offers the possibility to define buckets based on intervals using the histogram aggregation:

GET /order/_search
{
"size": 0,
"aggs": {
"amount_distribution": {
"histogram": {
"field": "total_amount",
"interval": 25
}
}
}
}

By default Elasticsearch creates a bucket for each interval, even if there are no documents in it. You can change this behavior by setting the min_doc_count parameter to a value greater than zero.
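Each value is assigned to a bucket by rounding it down to the closest multiple of the interval. A minimal Python sketch of this rule follows; the function name histogram_key is ours, and the optional offset argument corresponds to the offset parameter of the aggregation, which is not used in the request above.

```python
import math


def histogram_key(value, interval, offset=0):
    """Return the key of the histogram bucket that contains value."""
    return math.floor((value - offset) / interval) * interval + offset


# With an interval of 25, the buckets start at 0, 25, 50, 75, ...
keys = [histogram_key(v, 25) for v in (10, 25, 137, 99.9)]
print(keys)
```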

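Be aware that if you perform a query before a histogram aggregation, only the documents returned by the query will be aggregated, so the histogram will start and end at the buckets that contain matching documents. You can force it to cover a wider range by specifying min and max values in the extended_bounds parameter, together with "min_doc_count": 0. The additional buckets will be empty, since extended_bounds does not bring back the documents excluded by the query: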

GET /order/_search
{
"size": 0,
"query": {
"range": {
"total_amount": {
"gte": 100
}
}
},
"aggs": {
"amount_distribution": {
"histogram": {
"field": "total_amount",
"interval": 25,
"min_doc_count": 0,
"extended_bounds": {
"min": 0,
"max": 500
}
}
}
}
}
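The following Python sketch imitates this request on a few hypothetical amounts, to show what extended_bounds changes in the output. The function histogram is an assumption made for illustration and expects integer values so that the bucket keys can be enumerated with range.

```python
import math
from collections import Counter


def histogram(values, interval, extended_bounds=None):
    """Count integer values per bucket, emitting empty buckets too (min_doc_count 0)."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    low, high = min(counts), max(counts)
    if extended_bounds is not None:
        # widen the range of emitted buckets, never narrow it
        low = min(low, math.floor(extended_bounds[0] / interval) * interval)
        high = max(high, math.floor(extended_bounds[1] / interval) * interval)
    return {key: counts.get(key, 0) for key in range(low, high + interval, interval)}


amounts = [30, 120, 130, 260]                  # hypothetical total_amount values
matching = [a for a in amounts if a >= 100]    # stands in for the range query

plain = histogram(matching, 25)
extended = histogram(matching, 25, extended_bounds=(0, 500))
print(plain)
print(extended)
```

With the extended bounds the buckets span from 0 to 500, but the order with an amount of 30 is still excluded by the query, so its bucket is reported with a count of 0.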

Similarly to what was explained in the previous section, there is a date_histogram aggregation as well. It supports date expressions in the interval parameter, such as year, quarter, month, etc. (in recent Elasticsearch versions this parameter is replaced by calendar_interval and fixed_interval). As already mentioned, the date format can be modified via the format parameter.

Global aggregations

We already discussed that if there is a query before an aggregation, the latter will only be executed on the query results. The global aggregation is a way to break out of this context and aggregate all documents, regardless of the query in the request.

The structure is very simple and the same as before:

GET /indexName/_search
{
"query": { … },
"size": 0,
"aggs": {
"aggName1": {
"global": { },
"aggs": {
"aggName2": { … }
}
}
}
}
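The difference between the two scopes can be illustrated in Python with a few hypothetical orders: one average is computed on the documents matched by the query, the other on all documents, as a sub-aggregation of global would be.

```python
from statistics import mean

# Hypothetical sample orders.
orders = [
    {"total_amount": 40.0},
    {"total_amount": 120.0},
    {"total_amount": 200.0},
]

# Stands in for the "query" part of the request.
matching = [order for order in orders if order["total_amount"] >= 100]

# A regular aggregation only sees the documents matched by the query...
avg_in_query_scope = mean(order["total_amount"] for order in matching)

# ...while an aggregation nested inside "global" sees every document.
avg_global = mean(order["total_amount"] for order in orders)

print(avg_in_query_scope, avg_global)
```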

Missing field values

The missing aggregation creates a bucket of all documents that have a missing or null field value:

GET /indexName/_search
{
"size": 0,
"aggs": {
"missingAggName": {
"missing": {
"field": "fieldName"
}
}
}
}
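As a quick illustration, the Python sketch below counts the documents that would fall into such a bucket. The sample documents and the function missing_doc_count are hypothetical; both an absent field and an explicit null value are counted.

```python
# Hypothetical sample orders.
orders = [
    {"status": "completed", "purchased_at": "2016-03-01"},
    {"status": "cancelled", "purchased_at": None},  # null value
    {"status": "completed"},                        # field not present
]


def missing_doc_count(docs, field):
    """Count the documents whose field is absent or null."""
    return sum(1 for doc in docs if doc.get(field) is None)


print(missing_doc_count(orders, "purchased_at"))
print(missing_doc_count(orders, "status"))
```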

Nested aggregations

We can aggregate nested objects as well via the nested aggregation. For example, let’s look for the maximum value of the “amount” field which is in the nested objects contained in the “lines” field:

GET /order/_search
{
"size": 0,
"aggs": {
"linesAgg": {
"nested": {
"path": "lines"
},
"aggs": {
"maxAmount": {
"max": {
"field": "lines.amount"
}
}
}
}
}
}
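Conceptually, the nested aggregation first steps into the objects stored under “lines” and then runs the inner aggregation on them. A rough Python equivalent on hypothetical orders looks as follows.

```python
# Hypothetical sample orders, each with a list of nested line objects.
orders = [
    {"lines": [{"amount": 10.0}, {"amount": 35.5}]},
    {"lines": [{"amount": 99.0}]},
    {"lines": []},
]

# Flatten the nested objects of all orders, then take the maximum amount.
max_amount = max(line["amount"] for order in orders for line in order["lines"])
print(max_amount)
```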

Conclusion

You should now be able to perform different aggregations and compute some metrics on your documents. As always, we recommend that you try new examples and explore your data using what you learnt today.

Remember to subscribe to the Betacom publication 👆 and give us some claps 👏 if you enjoyed the article!

Betacom

We do IT — betacom.eu

Thanks to Valerio Poggio
