Top 5 Elastic aggregations to empower your charts!
Elasticsearch is an excellent tool for indexing large amounts of data and running searches and other analytical queries over it.
Whether you are a data scientist, an engineer, or simply someone interested in understanding your data better, Elastic aggregations provide a powerful and flexible toolset for extracting insights and making more informed decisions. They can uncover trends that would be difficult or impossible to detect by other means.
You are probably aware of some basic aggregations like avg, sum, and max, or slightly more advanced ones like bucket_sort and terms. These are common and do not show the full power of Elasticsearch.
So without further ado let’s dive into some non-trivial aggregations provided by Elastic!
Setup
To experiment alongside this article you can either create a free Elastic cluster here or run Elasticsearch with Kibana locally. I suggest using Docker for this.
Create a Docker network for Kibana and Elasticsearch:
docker network create elastic
Launch Elasticsearch with security disabled:
docker run \
--name elasticsearch \
--net elastic \
-p 9200:9200 \
-e discovery.type=single-node \
-e ES_JAVA_OPTS="-Xms1g -Xmx1g"\
-e xpack.security.enabled=false \
-it \
docker.elastic.co/elasticsearch/elasticsearch:8.5.3
Spin up Kibana (pointing it at the Elasticsearch container and matching the Elasticsearch version):
docker run \
--name kibana \
--net elastic \
-p 5601:5601 \
-e ELASTICSEARCH_HOSTS=http://elasticsearch:9200 \
docker.elastic.co/kibana/kibana:8.5.3
Profit! Now go to http://localhost:5601/app/home#/ and upload some sample data: click “Try sample data”. In this article, I’m using the sample e-commerce data set.
Bucket aggregations
In Elasticsearch, a bucket is a type of aggregation that groups documents together based on specific criteria. For example, you could use a bucket aggregation to group all of the documents in an index by a specific field, such as the date on which they were created.
Date histogram
It is used to group documents in time intervals by hour, day, week, etc. Example:
GET /kibana_sample_data_ecommerce/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "chart": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "1d"
      }
    }
  }
}
This query uses the date_histogram aggregation to group documents by the order_date field with a calendar interval of one day. The aggregation creates one bucket per day, and each bucket contains all documents whose order_date falls on that day.
This aggregation can be used to create a chart showing the number of orders placed each day.
It can help you correlate order volume with events or product launches on a particular day.
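To build intuition for what date_histogram does, here is a minimal Python sketch of the same bucketing logic, truncating each timestamp to the start of its day. The order timestamps are made up for illustration:

```python
from collections import Counter
from datetime import datetime

def daily_buckets(timestamps):
    """Group ISO timestamps into per-day buckets, like a one-day date_histogram."""
    counts = Counter(datetime.fromisoformat(ts).date().isoformat() for ts in timestamps)
    return dict(sorted(counts.items()))

orders = [
    "2023-01-01T09:15:00", "2023-01-01T18:40:00",
    "2023-01-02T11:05:00",
]
print(daily_buckets(orders))
# {'2023-01-01': 2, '2023-01-02': 1}
```

Each key plays the role of a bucket, and the count is what you would chart on the y-axis.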
Significant terms
The significant_terms aggregation is a type of bucket aggregation that groups documents by the values of a specified field and then calculates which values are statistically significant relative to the overall set of documents. This can be useful for identifying which values are over-represented or under-represented in your data.
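The underlying idea can be sketched outside of Elasticsearch: a term is significant when its share of a foreground subset is much larger than its share of the background set. This toy ratio is a simplification of Elasticsearch’s actual scoring heuristics, and the data below is invented:

```python
def significance(term, foreground, background):
    """Ratio of a term's frequency in a subset vs. the whole corpus.
    Values well above 1.0 mean the term is over-represented in the subset."""
    fg_share = foreground.count(term) / len(foreground)
    bg_share = background.count(term) / len(background)
    return fg_share / bg_share if bg_share else float("inf")

# Whole period: skis are a rare category overall...
background = ["shoes"] * 50 + ["hats"] * 45 + ["skis"] * 5
# ...but during one sale week they make up 40% of orders.
week_sale = ["skis"] * 4 + ["shoes"] * 6
print(significance("skis", week_sale, background))  # 8.0
```

A plain terms aggregation would still rank shoes first in that week; significant_terms is what surfaces skis as the anomaly.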
Let’s experiment and nest it inside a date_histogram aggregation:
GET /kibana_sample_data_ecommerce/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "chart": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "1w"
      },
      "aggs": {
        "significant_products_categories": {
          "significant_terms": {
            "field": "products.category.keyword"
          }
        }
      }
    }
  }
}
In this aggregation, we see product categories that are anomalously over- or under-represented during a given week compared with the whole period of time.
Here, these are mostly categories that spiked during weekly sales.
Metric aggregation
Metrics are another type of aggregation in Elasticsearch, used to calculate values such as the average or minimum of a field across a group of documents. But there are also some genuinely interesting and rarely used metric aggregations in Elasticsearch that you can benefit from!
Percentiles
Calculates percentiles of a numeric field over a set of documents. This is useful for understanding the distribution of values in a dataset.
The aggregation returns the percentiles as a set of key-value pairs, where each key is a percentile and each value is the field value at that percentile.
For example, the 95th percentile is the value that is greater than 95% of the observed values.
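As a mental model, here is the simple nearest-rank method in Python. Note that Elasticsearch computes percentiles approximately (via the TDigest algorithm), so results on large datasets may differ slightly from the exact values:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest value >= p% of the data."""
    ordered = sorted(values)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

prices = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
print(percentile(prices, 50))  # 50
print(percentile(prices, 95))  # 100
```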
GET /kibana_sample_data_ecommerce/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "countries": {
      "terms": {
        "field": "geoip.continent_name"
      },
      "aggs": {
        "percentile": {
          "percentiles": {
            "field": "taxful_total_price"
          }
        }
      }
    }
  }
}
This query lets you see the distribution of taxful_total_price values for each continent:
Now we can read off the maximum price for seven slices of the distribution: 1, 5, 25, 50, 75, 95, and 99 percent of orders on each continent.
Cardinality
This aggregation estimates the number of unique values in a field. The count is approximate (Elasticsearch uses the HyperLogLog++ algorithm under the hood), which is what makes it cheap at scale. It is useful for finding the number of distinct values, such as the number of unique users or products in a dataset.
The aggregation returns a single value: the approximate number of unique values in the specified field. You can combine the cardinality aggregation with other aggregations:
GET /kibana_sample_data_ecommerce/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "chart": {
      "date_histogram": {
        "field": "order_date",
        "calendar_interval": "1d"
      },
      "aggs": {
        "unique": {
          "cardinality": {
            "field": "customer_id"
          }
        }
      }
    }
  }
}
This query uses a date_histogram aggregation to group documents by the order_date field with a calendar interval of one day. It also has a sub-aggregation called "unique" that uses the cardinality aggregation to estimate the number of unique customer_id values in each date_histogram bucket.
This query can be used to create a chart showing the number of unique customers who placed orders each day:
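Outside of Elasticsearch, the same day-by-day distinct count can be sketched exactly in a few lines of Python (Elasticsearch instead approximates it with HyperLogLog++, trading a small error margin for constant memory). The events below are made up:

```python
from collections import defaultdict

def unique_per_bucket(events):
    """Distinct customers per day: an exact stand-in for
    date_histogram + cardinality over (day, customer_id) pairs."""
    buckets = defaultdict(set)
    for day, customer in events:
        buckets[day].add(customer)
    return {day: len(ids) for day, ids in sorted(buckets.items())}

events = [("2023-01-01", "c1"), ("2023-01-01", "c2"),
          ("2023-01-01", "c1"), ("2023-01-02", "c3")]
print(unique_per_bucket(events))
# {'2023-01-01': 2, '2023-01-02': 1}
```

Note that the repeat order from c1 on January 1st is counted once, which is exactly the distinction between counting orders and counting customers.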
Pipeline aggregation
Pipeline aggregations are used to process the output of other aggregations.
For example, you could use a pipeline aggregation to sort the results of a bucket aggregation by the average price of the products in each group or to calculate the cumulative sum of the total number of products in each group.
Cumulative sum
This aggregation calculates the cumulative sum of a specified metric across the buckets of a parent histogram aggregation: each bucket’s cumulative value is the previous running total plus the current bucket’s value.
The result in each bucket is the running total of the metric up to that point. Now let’s take a look at this aggregation in a slightly more sophisticated query.
Example:
GET /kibana_sample_data_ecommerce/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "country": {
      "terms": {
        "field": "geoip.continent_name"
      },
      "aggs": {
        "chart": {
          "date_histogram": {
            "field": "order_date",
            "calendar_interval": "1w"
          },
          "aggs": {
            "sales": {
              "sum": {
                "field": "taxful_total_price"
              }
            },
            "cumulative": {
              "cumulative_sum": {
                "buckets_path": "sales"
              }
            }
          }
        }
      }
    }
  }
}
An aggregation called “country” uses terms to group the documents by the continent name field in the geoip object. It produces a set of buckets, one for each unique continent name in the index.
The “chart” aggregation, nested within “country”, uses a date_histogram to group the documents within each continent bucket by the order date. It produces a set of sub-buckets, one for each week in the time range of the data.
The “sales” aggregation, nested within “chart”, uses sum to calculate the total sales for each week by summing the values in the taxful_total_price field.
Finally, the “cumulative” aggregation keeps a running total of the sales, adding each subsequent week’s sales to the total as it processes the buckets.
In this example, we can see and compare how much money was earned each week on each continent.
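Conceptually, the pipeline step is just a running total over the weekly sales buckets. In Python, itertools.accumulate does the same thing (the weekly figures below are invented):

```python
from itertools import accumulate

# Per-week sales totals, as the "sales" sub-aggregation would produce them.
weekly_sales = [1200.0, 950.0, 1430.0, 800.0]

# cumulative_sum walks the buckets in order and emits the running total.
cumulative = list(accumulate(weekly_sales))
print(cumulative)  # [1200.0, 2150.0, 3580.0, 4380.0]
```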
BONUS — Derivative
The derivative is a pipeline aggregation that calculates the rate of change of a metric: the difference between each bucket’s value and the previous bucket’s value. This can be useful for identifying trends in a dataset.
Let’s take a look at the same aggregation as the previous one but using a derivative instead of a cumulative sum:
GET /kibana_sample_data_ecommerce/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "country": {
      "terms": {
        "field": "geoip.continent_name"
      },
      "aggs": {
        "chart": {
          "date_histogram": {
            "field": "order_date",
            "calendar_interval": "1w"
          },
          "aggs": {
            "sales": {
              "sum": {
                "field": "taxful_total_price"
              }
            },
            "derivative": {
              "derivative": {
                "buckets_path": "sales"
              }
            }
          }
        }
      }
    }
  }
}
In this case, the derivative aggregation is being applied to the sum of the “taxful_total_price” field, which is the sales metric.
It allows you to track the rate of change for a metric over time. Derivative calculates the difference between the current and previous values. This can be useful for identifying trends and anomalies in your data:
In this case, the derivative aggregation can help you see how the sales numbers are changing from week to week. If the derivative values are consistently increasing, that could indicate that sales are growing. On the other hand, if the derivative values are consistently decreasing, that could indicate that sales are declining.
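The computation itself is tiny: subtract each bucket’s value from the next one. A sketch with invented weekly figures; note that the first bucket produces no derivative, since it has no predecessor:

```python
def derivative(bucket_values):
    """Difference between each bucket and the previous one.
    The first bucket has no predecessor, so it yields no value."""
    return [curr - prev for prev, curr in zip(bucket_values, bucket_values[1:])]

weekly_sales = [1200.0, 950.0, 1430.0, 800.0]
print(derivative(weekly_sales))  # [-250.0, 480.0, -630.0]
```

Positive values mark weeks where sales grew over the previous week, negative values mark declines.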
Wrap up
date_histogram: Groups documents into time intervals (by hour, day, week, etc.).
significant_terms: Identifies unusual occurrences of terms in a given set of documents.
percentiles: Calculates the percentiles of a specified field.
cardinality: Estimates the number of unique values in a field.
cumulative_sum: Calculates the cumulative sum of a specified metric.
derivative: Calculates the difference between the current and previous values of a specified metric.
Thanks for reading, and happy Elasticsearching!
Want to connect?
Follow me on Twitter!