Kirill Goltsman
Feb 21

In a previous tutorial, we discussed the structure of Elasticsearch pipeline aggregations and walked you through setting up several common pipelines such as derivatives, cumulative sums, and avg bucket aggregations.

In this article, we’ll continue with the analysis of Elasticsearch pipeline aggregations, focusing on such pipelines as stats, moving averages and moving functions, percentiles, bucket sorts, and bucket scripts, among others. Some of the pipeline aggregations discussed in the article such as moving averages are supported in Kibana, so we’ll show you how to visualize them as well. Let’s get started!

Tutorial

Examples in this tutorial were tested in the following environment:

  • Elasticsearch 6.4.0
  • Kibana 6.4.0

For this tutorial, we’re going to create an index with traffic data. The index mapping will contain three fields: date, visits, and max_time_spent. Here is the index mapping:

curl -XPUT "http://localhost:9200/traffic_stats/" -H "Content-Type: application/json" -d '{
"mappings":{
"blog":{
"properties":{
"date":{
"type":"date",
"format":"dateOptionalTime"
},
"visits":{
"type":"integer"
},
"max_time_spent":{
"type":"integer"
}
}
}
}
}'

Once the index mapping is created, let’s save some arbitrary data to it:

curl -XPOST "http://localhost:9200/traffic_stats/_bulk" -H "Content-Type: application/json" -d'
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"488", "date":"2018-10-1", "max_time_spent":"900"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"783", "date":"2018-10-6", "max_time_spent":"928"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"789", "date":"2018-10-12", "max_time_spent":"1834"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1299", "date":"2018-11-3", "max_time_spent":"592"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"394", "date":"2018-11-6", "max_time_spent":"1249"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"448", "date":"2018-11-24", "max_time_spent":"874"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"768", "date":"2018-12-18", "max_time_spent":"876"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1194", "date":"2018-12-24", "max_time_spent":"1249"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"987", "date":"2018-12-28", "max_time_spent":"1599"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"872", "date":"2019-01-1", "max_time_spent":"828"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"972", "date":"2019-01-5", "max_time_spent":"723"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"827", "date":"2019-02-5", "max_time_spent":"1300"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1584", "date":"2019-02-15", "max_time_spent":"1500"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1604", "date":"2019-03-2", "max_time_spent":"1488"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1499", "date":"2019-03-27", "max_time_spent":"1399"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1392", "date":"2019-04-8", "max_time_spent":"1294"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1247", "date":"2019-04-15", "max_time_spent":"1194"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"984", "date":"2019-05-15", "max_time_spent":"1184"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1228", "date":"2019-05-18", "max_time_spent":"1485"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1423", "date":"2019-06-14", "max_time_spent":"1452"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1238", "date":"2019-06-24", "max_time_spent":"1329"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1388", "date":"2019-07-14", "max_time_spent":"1542"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1499", "date":"2019-07-24", "max_time_spent":"1742"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1523", "date":"2019-08-13", "max_time_spent":"1552"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1443", "date":"2019-08-19", "max_time_spent":"1511"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1587", "date":"2019-09-14", "max_time_spent":"1497"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1534", "date":"2019-09-27", "max_time_spent":"1434"}
'

Great! Now, it’s all set to illustrate some pipeline aggregations. Let’s start with the stats bucket aggregation.

Stats Bucket Aggregation

As we discussed in the metrics aggregation series, stats aggregation calculates a set of statistical measures such as min, max, avg, and sum for some numeric field in your index.
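
As a quick refresher, a plain stats metric aggregation runs directly on a numeric field. A minimal sketch against our index might look like this (the visits_stats name is just illustrative):

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_stats":{
"stats":{
"field":"visits"
}
}
}
}'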

In Elasticsearch, it’s also possible to calculate stats for buckets generated by some other aggregation. This is very useful when the values required by the stats aggregation must be first computed per bucket using some other aggregation.

To understand the idea, let’s look at the following example:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
}
}
},
"stats_monthly_visits":{
"stats_bucket":{
"buckets_path":"visits_per_month>total_visits"
}
}
}
}'

In this query, we first generate a date histogram on the “date” field and compute total monthly visits for each generated bucket. Next, we use the stats bucket pipeline to calculate the min, max, avg, and sum across all of the produced buckets. The response should look something like this:

"aggregations":{
"visits_per_month":{
"buckets":[
.... {
"key_as_string":"2019-07-01T00:00:00.000Z",
"key":1561939200000,
"doc_count":2,
"total_visits":{
"value":2887.0
}
},
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
}
}
]
},
"stats_monthly_visits":{
"count":12,
"min":1844.0,
"max":3121.0,
"avg":2582.8333333333335,
"sum":30994.0
}
}

Under the hood, the stats bucket aggregation computes the min, max, avg, and sum of the total_visits values across the buckets generated by the date histogram and appends the results to the end of the response.

The extended stats bucket aggregation has the same logic except that it returns additional statistics like variance, sum of squares, standard deviation, and standard deviation bounds. Let’s slightly adjust the query above to use the extended stats pipeline:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
}
}
},
"stats_monthly_visits":{
"extended_stats_bucket":{
"buckets_path":"visits_per_month>total_visits"
}
}
}
}'

The response will contain more statistical information than a simple stats bucket aggregation:

"stats_monthly_visits":{
"count":12,
"min":1844.0,
"max":3121.0,
"avg":2582.8333333333335,
"sum":30994.0,
"sum_of_squares":8.21767E7,
"variance":177030.30555555597,
"std_deviation":420.7496946588981,
"std_deviation_bounds":{
"upper":3424.3327226511296,
"lower":1741.3339440155373
}
}

Kibana supports visualization of the standard deviation bounds that you see in the above response. In Kibana’s chart, the upper (blue) and lower (green) standard deviation bounds can be plotted for each bucket generated by the date histogram.

Percentiles Bucket Aggregation

As you remember, a percentile is a statistical measure that indicates the value below which a given percentage of observations in a group fall. For example, the 65th percentile is the value below which 65% of observations may be found.

A simple percentiles metric aggregation calculates percentiles for values directly available in your index data. However, there are situations when you want to generate buckets using a date histogram and calculate values for these buckets before applying percentiles.

This is exactly the case where we can use the percentiles bucket pipeline, which works on the buckets generated by a sibling aggregation and the metrics computed within them. Take a look at the example below:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
}
}
},
"percentiles_monthly_visits":{
"percentiles_bucket":{
"buckets_path":"visits_per_month>total_visits",
"percents":[
15.0,
50.0,
75.0,
99.0
]
}
}
}
}'

Here, we calculate percentiles for the output of the total_visits metric in each bucket generated by the date histogram. As with the regular percentiles aggregation, the percentiles bucket pipeline lets you specify the set of percentiles to return. In this case, we chose the 15th, 50th, 75th, and 99th percentiles, which are reflected in the percents field of the aggregation.

The query above should return the following response:

"percentiles_monthly_visits":{
"values":{
"15.0":2141.0,
"50.0":2661.0,
"75.0":2949.0,
"99.0":3121.0
}
}

For example, this data shows that 99% of all monthly visits in our buckets are below 3121.

Moving Average Aggregation

A moving average or rolling average is a calculation technique that constructs a series of averages over different subsets of the full data set. A subset is often termed a window of a certain size; the window size is the number of data points covered on each iteration. On each iteration, the algorithm calculates the average of all data points that fit into the window and then slides forward, dropping the oldest data point and picking up the next one from the series. That’s why we call this average a moving average.

For example, given the data [1, 5, 8, 23, 34, 28, 7, 23, 20, 19], we can calculate a simple moving average with a window size of 5 as follows:

  • (1 + 5 + 8 + 23 + 34) / 5 = 14.2
  • (5 + 8 + 23 + 34 + 28) / 5 = 19.6
  • (8 + 23 + 34 + 28 + 7) / 5 = 20
  • and so on

A moving average is often used with time series data (e.g., stock market charts) to smooth out short-term fluctuations and highlight longer-term trends or cycles. Smoothing eliminates high-frequency fluctuations and random noise, making lower-frequency trends more visible.

Supported Moving Average Models

The moving_avg aggregation supports five moving average “models”: simple, linear, exponentially weighted (ewma), holt-linear, and holt-winters. These models differ in how the values in the window are weighted.

As data points become “older” (i.e., the window slides away from them), they may be weighted differently. You can specify a model of your choice by setting the model parameter of the aggregation.

In what follows, we will discuss simple, linear, and exponentially weighted models that are good for most use cases. For more information about the available models, please consult the official Elasticsearch documentation.

Simple Model

The simple model first calculates the sum of all data points in the window, and then divides that sum by the size of the window. In other words, a simple model calculates a simple arithmetic mean for each window in your data set.

In the example below, we use a simple model with a window size of 30. The aggregation will compute the moving average for all buckets generated by the date histogram:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"the_movavg":{
"moving_avg":{
"buckets_path":"total_visits",
"window":30,
"model":"simple"
}
}
}
}
}
}'

The response should look something like this:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
},
"the_movavg":{
"value":2490.7
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
},
"the_movavg":{
"value":2533.909090909091
}
}
]
}
}

Please note that the window size can change the behavior of your moving average. A small window size ("window": 10) will closely follow the data and only smooth out small-scale fluctuations. In contrast, a simple moving average with a larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long-term trends. It also tends to "lag" behind the actual data by a substantial amount.

Linear Model

This model assigns different linear weights to data points in a sequence, so that “older” data points (i.e., those closer to the beginning of the window) contribute less to the final average. This approach reduces the “lag” behind the data’s mean because older data points have less influence on the final result.
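
For instance, taking the first window [1, 5, 8, 23, 34] from the earlier example and assuming the default linear weighting of 1 through n (oldest to newest), the linear model would compute:

  • (1 × 1 + 2 × 5 + 3 × 8 + 4 × 23 + 5 × 34) / (1 + 2 + 3 + 4 + 5) = 297 / 15 = 19.8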

As in the simple model, smaller windows smooth out only small-scale fluctuations, whereas larger windows smooth out higher-frequency fluctuations as well.

Also, like the simple model, the linear model tends to “lag” behind the actual data, although to a lesser extent.

In the example below, we use a linear model with a window size of 30:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"the_movavg":{
"moving_avg":{
"buckets_path":"total_visits",
"window":30,
"model":"linear"
}
}
}
}
}
}'

The response should be:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
},
"the_movavg":{
"value":2539.75
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
},
"the_movavg":{
"value":2609.731343283582
}
}
]
}
}

Exponentially Weighted Moving Average (EWMA)

This model has the same behavior as the linear model except that “older” data points decrease in importance exponentially — not linearly. The speed at which the importance of “older” points decreases can be controlled with an alpha setting. With small values of alpha, the weights decay slowly, which provides better smoothing. In contrast, larger values of alpha make the "older" data points decrease in importance very quickly, reducing their impact on the moving average. The default value of alpha is 0.3, and the setting accepts any float from 0 to 1 inclusive.
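
Formally, this is the standard EWMA recurrence: each smoothed value is computed as EWMA(t) = α · x(t) + (1 − α) · EWMA(t − 1), so the weight of a data point that is k steps old decays roughly as (1 − α)^k.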

As you can see in the query below, the EWMA aggregation has an additional “settings” object where the alpha can be defined:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"the_movavg":{
"moving_avg":{
"buckets_path":"total_visits",
"window":30,
"model":"ewma",
"settings":{
"alpha":0.5
}
}
}
}
}
}
}'

The pipeline should produce the following response:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
},
"the_movavg":{
"value":2718.958984375
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
},
"the_movavg":{
"value":2842.4794921875
}
}
]
}
}

Extrapolation/Prediction

Sometimes, you may want to extrapolate the behavior of your data based on the current trends. All the moving average models support a “prediction” mode which attempts to predict the movement of data given the current smoothed, moving average. Depending on the model and parameters, these predictions may or may not be accurate. The simple, linear and ewma models all produce "flat" predictions converging on the mean of the last value in a set.

You can use a predict parameter to specify how many predictions you would like to append to the end of the series. These predictions will be spaced out at the same interval as your buckets. For example, for the linear model:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"the_movavg":{
"moving_avg":{
"buckets_path":"total_visits",
"window":30,
"model":"linear",
"predict":3
}
}
}
}
}
}'

This query will add 3 predictions to the end of the buckets list:

"aggregations":{
"visits_per_month":{
"buckets":[
... {
"key_as_string":"2019-10-01T00:00:00.000Z",
"key":1569888000000,
"doc_count":0,
"the_movavg":{
"value":2687.3924050632913
}
},
{
"key_as_string":"2019-11-01T00:00:00.000Z",
"key":1572566400000,
"doc_count":0,
"the_movavg":{
"value":2687.3924050632913
}
},
{
"key_as_string":"2019-12-01T00:00:00.000Z",
"key":1575158400000,
"doc_count":0,
"the_movavg":{
"value":2687.3924050632913
}
}
]
}
}

Awesome! We have predicted the site traffic for three months ahead. As you see, the predictions are “flat.” That is, they return the same value for all 3 months. If you want to extrapolate based on local or global constant trends, you should opt for the holt model.
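
As a rough sketch of what that might look like (the alpha and beta values here are illustrative defaults, not tuned for this data set), the holt model adds a beta setting on top of alpha:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"the_movavg":{
"moving_avg":{
"buckets_path":"total_visits",
"window":30,
"model":"holt",
"predict":3,
"settings":{
"alpha":0.5,
"beta":0.5
}
}
}
}
}
}
}'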

Moving Function Aggregation (Available Since Elasticsearch 6.4.0)

Like the moving average aggregation, the moving function aggregation works on subsets of data points, gradually sliding a window across your data set. However, the moving function additionally allows you to specify a custom script that is executed on each window of data. Elasticsearch ships with a set of predefined functions (min, max, moving averages, etc.) for this purpose.

Let’s look at the standard definition for the moving function aggregation:

{
"moving_fn":{
"buckets_path":"the_sum",
"window":10,
"script":"MovingFunctions.min(values)"
}
}

As you can see, we defined a moving function aggregation that uses the built-in min function on a window of 10 data points. Note that moving_fn aggregations must be embedded inside a histogram or date_histogram aggregation:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"the_movfn":{
"moving_fn":{
"buckets_path":"total_visits",
"window":10,
"script":"MovingFunctions.unweightedAvg(values)"
}
}
}
}
}
}'

As you can see, the moving function is a sibling of the total_visits aggregation and a child of the visits_per_month date histogram.

The moving function aggregation supports the following built-in functions:

  • max()
  • min()
  • sum()
  • stdDev()
  • unweightedAvg()
  • linearWeightedAvg()
  • ewma()
  • holt()
  • holtWinters()

All of them are available from the MovingFunctions namespace (MovingFunctions.min()). We will review these functions in subsequent tutorials. For now, you can find more information in the official Elasticsearch documentation.
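
Besides the built-in functions, moving_fn accepts any custom Painless script that receives the values array of the current window and returns a number. As a minimal sketch, here is a hand-rolled sum that should be equivalent to MovingFunctions.sum(values):

{
"the_movfn":{
"moving_fn":{
"buckets_path":"total_visits",
"window":10,
"script":"double total = 0; for (double v : values) { total += v; } return total;"
}
}
}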

Bucket Script Aggregation

This parent pipeline aggregation allows executing a script that performs per-bucket computations on metrics in the parent multi-bucket aggregation. The specified metric must be numeric, and the script must return a numeric value. The script can be inline or stored.

For example, here we first use min and max metrics aggregations on the buckets generated by the date histogram. The resultant min and max values are then divided by the bucket script aggregation to calculate the min/max ratio for each bucket:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"min_visits":{
"min":{
"field":"visits"
}
},
"max_visits":{
"max":{
"field":"visits"
}
},
"min_max_ratio":{
"bucket_script":{
"buckets_path":{
"min_visits":"min_visits",
"max_visits":"max_visits"
},
"script":"params.min_visits / params.max_visits"
}
}
}
}
}
}'

The aggregation computes the min_max_ratio for each bucket and appends the result to the end of the bucket:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2018-10-01T00:00:00.000Z",
"key":1538352000000,
"doc_count":3,
"min_visits":{
"value":488.0
},
"max_visits":{
"value":789.0
},
"min_max_ratio":{
"value":0.6185044359949303
}
},
{
"key_as_string":"2018-11-01T00:00:00.000Z",
"key":1541030400000,
"doc_count":3,
"min_visits":{
"value":394.0
},
"max_visits":{
"value":1299.0
},
"min_max_ratio":{
"value":0.30331023864511164
}
},
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"min_visits":{
"value":768.0
},
"max_visits":{
"value":1194.0
},
"min_max_ratio":{
"value":0.6432160804020101
}
},
{
"key_as_string":"2019-01-01T00:00:00.000Z",
"key":1546300800000,
"doc_count":2,
"min_visits":{
"value":872.0
},
"max_visits":{
"value":972.0
},
"min_max_ratio":{
"value":0.897119341563786
}
},
{
"key_as_string":"2019-02-01T00:00:00.000Z",
"key":1548979200000,
"doc_count":2,
"min_visits":{
"value":827.0
},
"max_visits":{
"value":1584.0
},
"min_max_ratio":{
"value":0.5220959595959596
}
},
{
"key_as_string":"2019-03-01T00:00:00.000Z",
"key":1551398400000,
"doc_count":2,
"min_visits":{
"value":1499.0
},
"max_visits":{
"value":1604.0
},
"min_max_ratio":{
"value":0.9345386533665836
}
},
{
"key_as_string":"2019-04-01T00:00:00.000Z",
"key":1554076800000,
"doc_count":2,
"min_visits":{
"value":1247.0
},
"max_visits":{
"value":1392.0
},
"min_max_ratio":{
"value":0.8958333333333334
}
},
{
"key_as_string":"2019-05-01T00:00:00.000Z",
"key":1556668800000,
"doc_count":2,
"min_visits":{
"value":984.0
},
"max_visits":{
"value":1228.0
},
"min_max_ratio":{
"value":0.8013029315960912
}
},
{
"key_as_string":"2019-06-01T00:00:00.000Z",
"key":1559347200000,
"doc_count":2,
"min_visits":{
"value":1238.0
},
"max_visits":{
"value":1423.0
},
"min_max_ratio":{
"value":0.8699929725931131
}
},
{
"key_as_string":"2019-07-01T00:00:00.000Z",
"key":1561939200000,
"doc_count":2,
"min_visits":{
"value":1388.0
},
"max_visits":{
"value":1499.0
},
"min_max_ratio":{
"value":0.9259506337558372
}
},
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"min_visits":{
"value":1443.0
},
"max_visits":{
"value":1523.0
},
"min_max_ratio":{
"value":0.9474720945502298
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"min_visits":{
"value":1534.0
},
"max_visits":{
"value":1587.0
},
"min_max_ratio":{
"value":0.9666036546943919
}
}
]
}
}

Bucket Selector Aggregation

It’s sometimes useful to filter the buckets returned by your date histogram or other aggregation based on some criteria. In this case, you can use a bucket selector aggregation that contains a script to determine whether the current bucket should be retained in the output of the parent multi-bucket aggregation.

The specified metric must be numeric, and the script must return a boolean value. If the script language is expression, a numeric return value is permitted; in that case, 0.0 is evaluated as false, and all other values evaluate to true.

In the example below, we first calculate the sum of monthly visits and then check whether this sum is greater than 3000. If it is, the bucket is retained in the bucket list; otherwise, it's removed from the final output:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"visits_bucket_filter":{
"bucket_selector":{
"buckets_path":{
"total_visits":"total_visits"
},
"script":"params.total_visits > 3000"
}
}
}
}
}
}'

As you see in the response below, the aggregation left only two buckets that matched the rule.

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2019-03-01T00:00:00.000Z",
"key":1551398400000,
"doc_count":2,
"total_visits":{
"value":3103.0
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
}
}
]
}
}

Bucket Sort Aggregation

A bucket sort is a parent pipeline aggregation that sorts the buckets returned by its parent multi-bucket aggregation (e.g., a date histogram). You can specify several sort fields together with the corresponding sort order. Each bucket may be sorted based on its _key, _count, or its sub-aggregations. You can also truncate the result buckets by setting the from and size parameters.

In the example below, we sort the buckets of the parent date histogram aggregation based on the computed total_visits values. The buckets are sorted in descending order so that the buckets with the highest total_visits values are returned first.

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"visits_bucket_sort":{
"bucket_sort":{
"sort":[
{
"total_visits":{
"order":"desc"
}
}
],
"size":5
}
}
}
}
}
}'

As you see, the sort order is specified in the sort field of the aggregation. We also set the size parameter to 5 to return only the top 5 buckets in the response:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
}
},
{
"key_as_string":"2019-03-01T00:00:00.000Z",
"key":1551398400000,
"doc_count":2,
"total_visits":{
"value":3103.0
}
},
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
}
},
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"total_visits":{
"value":2949.0
}
},
{
"key_as_string":"2019-07-01T00:00:00.000Z",
"key":1561939200000,
"doc_count":2,
"total_visits":{
"value":2887.0
}
}
]
}
}

We can also use this aggregation to truncate the result buckets without doing any sorting. To do so, just use the from and/or size parameters without sort.

The following example simply truncates the result so that only the second and the third buckets are returned:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
"aggs":{
"visits_per_month":{
"date_histogram":{
"field":"date",
"interval":"month"
},
"aggs":{
"total_visits":{
"sum":{
"field":"visits"
}
},
"visits_bucket_sort":{
"bucket_sort":{
"from":2,
"size":2
}
}
}
}
}
}'

And the response should look something like this:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"total_visits":{
"value":2949.0
}
},
{
"key_as_string":"2019-01-01T00:00:00.000Z",
"key":1546300800000,
"doc_count":2,
"total_visits":{
"value":1844.0
}
}
]
}
}

Conclusion

That’s it! At this point, we’ve covered almost all the pipeline aggregations supported by Elasticsearch. As we saw, pipeline aggregations help implement complex computations involving intermediary values and buckets produced by other aggregations. You can also leverage the power of Elasticsearch scripting to perform programmatic operations on the returned metrics: for example, evaluating whether buckets match certain rules, or computing custom metrics not available by default (e.g., a min/max ratio).

In future tutorials, we’ll focus on some complex pipeline aggregations (such as serial differencing) that were not addressed in this article. Stay tuned to our blog to find out more!


Originally published at qbox.io.
