Kirill Goltsman
Feb 19

As you might already know from the previous articles in our Elasticsearch aggregation series, both metrics and bucket aggregations work directly on the numeric fields in the document set. In contrast, the pipeline aggregations we discuss in this article work on the output produced by other aggregations, transforming values those aggregations have already computed. A pipeline aggregation thus operates on intermediary values that are not present in the original document set, which makes it very useful for calculating complex statistical and mathematical measures such as cumulative sums, derivatives, and moving averages.

In the first part of this series, we’ll discuss the two basic types of pipeline aggregations and show examples of such common Elasticsearch pipelines as the sum and cumulative sum, min and max bucket, avg bucket, and derivative aggregations. Let’s get started!

Types of Pipeline Aggregations

There are two broad types of pipeline aggregations in Elasticsearch: parent and sibling pipeline aggregations.

A parent pipeline aggregation works with the output of its parent aggregation. It takes the values produced by that aggregation and computes new buckets or aggregations, adding them to the buckets that already exist. The derivative and cumulative sum aggregations are two common examples of parent pipeline aggregations in Elasticsearch.

In contrast to parent pipelines, sibling aggregations work on the output of a sibling aggregation. They take this output and compute a new aggregation which will be at the same level as the sibling aggregation.

Pipeline aggregations need a way to access the parent or sibling aggregation they operate on. They reference the aggregations they need by using the buckets_path parameter, which indicates the path to the required metric. This parameter has its own peculiar syntax that you need to understand:

AGG_SEPARATOR = '>' ; 
METRIC_SEPARATOR = '.' ;
AGG_NAME = <the name of the aggregation> ;
METRIC = <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH = <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;

For example, the path "my_bucket>my_stats.sum" will target the sum value in the "my_stats" metric, which is included in the "my_bucket" bucket aggregation.

It should be noted that paths are relative to the position of the pipeline aggregation; a path cannot go back “up” the aggregation tree. For example, this derivative pipeline aggregation is embedded into a date_histogram and refers to a "sibling" metric, the_sum:

curl -X POST "localhost:9200/traffic_stats/_search" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "total_monthly_visits":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "the_sum":{
          "sum":{
            "field":"visits"
          }
        },
        "the_derivative":{
          "derivative":{
            "buckets_path":"the_sum"
          }
        }
      }
    }
  }
}'

Sibling pipeline aggregations can also be placed “next to” a series of buckets instead of being embedded “inside” them. In this case, to access the needed metric we need to specify its full path, including the parent aggregation:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        }
      }
    },
    "avg_monthly_visits":{
      "avg_bucket":{
        "buckets_path":"visits_per_month>total_visits"
      }
    }
  }
}'

In the example above, we referenced the sibling aggregation named total_visits through its parent date histogram named visits_per_month. The full path to the target aggregation will thus be visits_per_month>total_visits.

Also, it’s important to remember that pipeline aggregations cannot have sub-aggregations. Some pipeline aggregations, such as the derivative, can however reference other pipeline aggregations in their buckets_path, which allows chaining multiple pipeline aggregations. For example, we can chain together two first-order derivatives to calculate the second derivative (a derivative of a derivative).

As you remember, metrics and bucket aggregations deal with gaps in the data using the “missing” parameter. Pipeline aggregations instead use the gap_policy parameter to handle cases where documents lack the required field, where no documents match the query for one or more buckets, and so on (see the sketch after this list for where the parameter goes). gap_policy supports the following policies:

  • skip — treats missing data as if the bucket did not exist: the aggregation skips the empty bucket and continues the calculation with the next available value.
  • insert_zeros — replaces all missing values with zero, and the pipeline calculation proceeds as normal.
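
Here is a minimal sketch of how the parameter is set, reusing the derivative example from above (skip is the default policy, so this only changes the behavior for buckets with missing data):

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "total_monthly_visits":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "the_sum":{
          "sum":{
            "field":"visits"
          }
        },
        "the_derivative":{
          "derivative":{
            "buckets_path":"the_sum",
            "gap_policy":"insert_zeros"
          }
        }
      }
    }
  }
}'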

Tutorial

Examples in this tutorial were tested in the following environment:

  • Elasticsearch 6.4.0
  • Kibana 6.4.0

For this tutorial, we’re going to create an index holding data about blog visits. The index mapping will contain three fields: date, visits, and max_time_spent.

curl -XPUT "http://localhost:9200/traffic_stats/" -H "Content-Type: application/json" -d '{
  "mappings":{
    "blog":{
      "properties":{
        "date":{
          "type":"date",
          "format":"dateOptionalTime"
        },
        "visits":{
          "type":"integer"
        },
        "max_time_spent":{
          "type":"integer"
        }
      }
    }
  }
}'

Once the index mapping is created, let’s add some random data to the index:

curl -XPOST "http://localhost:9200/traffic_stats/_bulk" -H "Content-Type: application/json" -d'
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"488", "date":"2018-10-1", "max_time_spent":"900"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"783", "date":"2018-10-6", "max_time_spent":"928"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"789", "date":"2018-10-12", "max_time_spent":"1834"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1299", "date":"2018-11-3", "max_time_spent":"592"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"394", "date":"2018-11-6", "max_time_spent":"1249"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"448", "date":"2018-11-24", "max_time_spent":"874"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"768", "date":"2018-12-18", "max_time_spent":"876"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1194", "date":"2018-12-24", "max_time_spent":"1249"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"987", "date":"2018-12-28", "max_time_spent":"1599"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"872", "date":"2019-01-1", "max_time_spent":"828"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"972", "date":"2019-01-5", "max_time_spent":"723"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"827", "date":"2019-02-5", "max_time_spent":"1300"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1584", "date":"2019-02-15", "max_time_spent":"1500"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1604", "date":"2019-03-2", "max_time_spent":"1488"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1499", "date":"2019-03-27", "max_time_spent":"1399"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1392", "date":"2019-04-8", "max_time_spent":"1294"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1247", "date":"2019-04-15", "max_time_spent":"1194"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"984", "date":"2019-05-15", "max_time_spent":"1184"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1228", "date":"2019-05-18", "max_time_spent":"1485"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1423", "date":"2019-06-14", "max_time_spent":"1452"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1238", "date":"2019-06-24", "max_time_spent":"1329"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1388", "date":"2019-07-14", "max_time_spent":"1542"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1499", "date":"2019-07-24", "max_time_spent":"1742"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1523", "date":"2019-08-13", "max_time_spent":"1552"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1443", "date":"2019-08-19", "max_time_spent":"1511"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1587", "date":"2019-09-14", "max_time_spent":"1497"}
{"index":{"_index":"traffic_stats","_type":"blog"}}
{"visits":"1534", "date":"2019-09-27", "max_time_spent":"1434"}
'

Great! Now, it’s all set to illustrate examples of pipeline aggregations. Let’s start with the avg bucket aggregation.

Avg Bucket Aggregation

The avg bucket pipeline is a typical example of a sibling pipeline aggregation. It works on the numeric values calculated by a sibling aggregation and computes the average over all of its buckets. There are two requirements: the sibling aggregation must be a multi-bucket aggregation, and the specified metric must be numeric.

To understand how pipeline aggregations work, it’s reasonable to divide the computation into several stages. Take a look at the query below; it proceeds in three steps. First, Elasticsearch creates a date histogram with a one-month interval and applies it to the “visits” field of the index, producing one bucket per month with the matching documents inside. Next, the sum sub-aggregation calculates the sum of all visits for each monthly bucket. Finally, the avg bucket pipeline references the sibling sum aggregation and uses the per-bucket sums to calculate the average monthly blog visits across all buckets. Thus, we end up with the average of the sums of blog visits per month.

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        }
      }
    },
    "avg_monthly_visits":{
      "avg_bucket":{
        "buckets_path":"visits_per_month>total_visits"
      }
    }
  }
}'

And we should get the following response:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2018-10-01T00:00:00.000Z",
"key":1538352000000,
"doc_count":3,
"total_visits":{
"value":2060.0
}
},
{
"key_as_string":"2018-11-01T00:00:00.000Z",
"key":1541030400000,
"doc_count":3,
"total_visits":{
"value":2141.0
}
},
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"total_visits":{
"value":2949.0
}
},
{
"key_as_string":"2019-01-01T00:00:00.000Z",
"key":1546300800000,
"doc_count":2,
"total_visits":{
"value":1844.0
}
},
{
"key_as_string":"2019-02-01T00:00:00.000Z",
"key":1548979200000,
"doc_count":2,
"total_visits":{
"value":2411.0
}
},
{
"key_as_string":"2019-03-01T00:00:00.000Z",
"key":1551398400000,
"doc_count":2,
"total_visits":{
"value":3103.0
}
},
{
"key_as_string":"2019-04-01T00:00:00.000Z",
"key":1554076800000,
"doc_count":2,
"total_visits":{
"value":2639.0
}
},
{
"key_as_string":"2019-05-01T00:00:00.000Z",
"key":1556668800000,
"doc_count":2,
"total_visits":{
"value":2212.0
}
},
{
"key_as_string":"2019-06-01T00:00:00.000Z",
"key":1559347200000,
"doc_count":2,
"total_visits":{
"value":2661.0
}
},
{
"key_as_string":"2019-07-01T00:00:00.000Z",
"key":1561939200000,
"doc_count":2,
"total_visits":{
"value":2887.0
}
},
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
}
}
]
},
"avg_monthly_visits":{
"value":2582.8333333333335
}
}

So, the average of monthly blog visits is 2582.83 (the twelve monthly sums add up to 30994, and 30994 / 12 ≈ 2582.83). Looking closely at the steps described above, you can get an idea of how pipeline aggregations work: they take the intermediary results of metrics and/or bucket aggregations and run additional computations on them. This approach is very useful when your data does not contain these intermediary results and they have to be derived during the aggregation process.
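
As a side note, when only the pipeline’s result is of interest, the response can be trimmed with Elasticsearch’s standard filter_path request parameter; a minimal sketch of the same query:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&filter_path=aggregations.avg_monthly_visits&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        }
      }
    },
    "avg_monthly_visits":{
      "avg_bucket":{
        "buckets_path":"visits_per_month>total_visits"
      }
    }
  }
}'

This returns only the avg_monthly_visits object instead of all the monthly buckets.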

Derivative Aggregation

This is a parent pipeline aggregation that calculates a derivative of a specified metric in a parent histogram or date histogram aggregation. There are two requirements for this aggregation:

  • The metric must be numeric, otherwise finding a derivative will be impossible.
  • The enclosing histogram must have min_doc_count set to 0 (this is the default for histogram aggregations). If min_doc_count is greater than 0, some buckets will be omitted, which may lead to confusing or erroneous derivative values.

In mathematics, the derivative of a function measures the sensitivity of the function value (output) to a change in its argument (input). In other words, a derivative evaluates how fast a function changes with respect to its variables. Applying this concept to our data, we can say that the derivative aggregation calculates the rate of change of our numeric data compared to the previous period. Let’s look at a real example to get a better understanding of what we are talking about.

First, we will calculate the first-order derivative. The first derivative tells us whether a function is increasing or decreasing, and by how much. Take a look at the example below:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        },
        "visits_deriv":{
          "derivative":{
            "buckets_path":"total_visits"
          }
        }
      }
    }
  }
}'

The buckets_path instructs the derivative aggregation to use the output of the total_visits metric for its calculation. Note that the derivative is embedded inside the parent date histogram rather than placed next to it, because derivatives are parent pipeline aggregations.

The response to the above query should look something like this:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2018-10-01T00:00:00.000Z",
"key":1538352000000,
"doc_count":3,
"total_visits":{
"value":2060.0
}
},
{
"key_as_string":"2018-11-01T00:00:00.000Z",
"key":1541030400000,
"doc_count":3,
"total_visits":{
"value":2141.0
},
"visits_deriv":{
"value":81.0
}
},
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"total_visits":{
"value":2949.0
},
"visits_deriv":{
"value":808.0
}
},
{
"key_as_string":"2019-01-01T00:00:00.000Z",
"key":1546300800000,
"doc_count":2,
"total_visits":{
"value":1844.0
},
"visits_deriv":{
"value":-1105.0
}
},
{
"key_as_string":"2019-02-01T00:00:00.000Z",
"key":1548979200000,
"doc_count":2,
"total_visits":{
"value":2411.0
},
"visits_deriv":{
"value":567.0
}
},
{
"key_as_string":"2019-03-01T00:00:00.000Z",
"key":1551398400000,
"doc_count":2,
"total_visits":{
"value":3103.0
},
"visits_deriv":{
"value":692.0
}
},
{
"key_as_string":"2019-04-01T00:00:00.000Z",
"key":1554076800000,
"doc_count":2,
"total_visits":{
"value":2639.0
},
"visits_deriv":{
"value":-464.0
}
},
{
"key_as_string":"2019-05-01T00:00:00.000Z",
"key":1556668800000,
"doc_count":2,
"total_visits":{
"value":2212.0
},
"visits_deriv":{
"value":-427.0
}
},
{
"key_as_string":"2019-06-01T00:00:00.000Z",
"key":1559347200000,
"doc_count":2,
"total_visits":{
"value":2661.0
},
"visits_deriv":{
"value":449.0
}
},
{
"key_as_string":"2019-07-01T00:00:00.000Z",
"key":1561939200000,
"doc_count":2,
"total_visits":{
"value":2887.0
},
"visits_deriv":{
"value":226.0
}
},
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
},
"visits_deriv":{
"value":79.0
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
},
"visits_deriv":{
"value":155.0
}
}
]
}
}

If you compare two adjacent buckets, you’ll see that the first derivative is simply the difference between the total visits in the current and the previous bucket. For example:

{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
},
"visits_deriv":{
"value":79.0
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
},
"visits_deriv":{
"value":155.0
}
}

As you see, the total number of visits in August 2019 was 2966 compared to 3121 in September 2019. If we subtract 2966 from 3121, we get the first derivative value of 155.0. It’s as simple as that!
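
A related detail: because a derivative is just the difference between adjacent buckets, its value depends on the bucket interval. The derivative aggregation also accepts an optional unit parameter that additionally reports the rate of change per unit of time in a normalized_value field. A minimal sketch, changing only the visits_deriv part of the query above:

"visits_deriv":{
  "derivative":{
    "buckets_path":"total_visits",
    "unit":"day"
  }
}

With this setting, each bucket would report both the raw month-over-month difference (value) and the per-day rate of change (normalized_value).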

Let’s visualize the first derivative in Kibana:

To visualize the derivative, select the derivative pipeline aggregation for the Y-axis and, as its custom metric, the sum aggregation on the “visits” field. On the X-axis, define a date histogram aggregation on the “date” field with a “month” interval. After you run the visualization, Kibana will draw a vertical bar for each derivative: positive derivatives extend toward the top of the chart and negative derivatives toward the bottom.

Second-Order Derivative

The second derivative is the double derivative or the derivative of the derivative. It measures how the rate of change of a quantity is itself changing.

In Elasticsearch, we can calculate the second derivative by chaining the derivative pipeline aggregation onto the output of another derivative pipeline aggregation. In this way, we first calculate the first derivative and then the second based on the first. Let’s see an example below:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        },
        "visits_deriv":{
          "derivative":{
            "buckets_path":"total_visits"
          }
        },
        "visits_2nd_deriv":{
          "derivative":{
            "buckets_path":"visits_deriv"
          }
        }
      }
    }
  }
}'

As you see, the first derivative uses the path to total_visits calculated by the sum aggregation, while the second derivative uses the path to visits_deriv, which is the first derivative pipeline. In this way, we can think of the second-derivative calculation as a chained, two-stage pipeline aggregation. The above query should return the following response:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2018-10-01T00:00:00.000Z",
"key":1538352000000,
"doc_count":3,
"total_visits":{
"value":2060.0
}
},
{
"key_as_string":"2018-11-01T00:00:00.000Z",
"key":1541030400000,
"doc_count":3,
"total_visits":{
"value":2141.0
},
"visits_deriv":{
"value":81.0
}
},
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"total_visits":{
"value":2949.0
},
"visits_deriv":{
"value":808.0
},
"visits_2nd_deriv":{
"value":727.0
}
},
{
"key_as_string":"2019-01-01T00:00:00.000Z",
"key":1546300800000,
"doc_count":2,
"total_visits":{
"value":1844.0
},
"visits_deriv":{
"value":-1105.0
},
"visits_2nd_deriv":{
"value":-1913.0
}
},
{
"key_as_string":"2019-02-01T00:00:00.000Z",
"key":1548979200000,
"doc_count":2,
"total_visits":{
"value":2411.0
},
"visits_deriv":{
"value":567.0
},
"visits_2nd_deriv":{
"value":1672.0
}
}
]
}
}

Let’s look closely at two adjacent buckets to see what the second derivative really indicates:

{
"key_as_string":"2018-11-01T00:00:00.000Z",
"key":1541030400000,
"doc_count":3,
"total_visits":{
"value":2141.0
},
"visits_deriv":{
"value":81.0
}
},
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"total_visits":{
"value":2949.0
},
"visits_deriv":{
"value":808.0
},
"visits_2nd_deriv":{
"value":727.0
}
}

So, as you see, the first derivative is just the difference between the total visits in the current bucket (e.g., the 2018-12-01 bucket) and the previous bucket (2018-11-01). That's what we know from the previous example. In our case, this difference is 808 (2949 - 2141).

What is the second derivative? It’s just the difference between the first derivatives of two adjacent buckets. For example, the first derivative of the “2018-11-01” bucket is 81.0, and the first derivative of the “2018-12-01” bucket is 808.0. Thus, the second derivative of the “2018-12-01” bucket is 727.0 (808 - 81). As simple as that!

Hypothetically, we could chain three derivative pipeline aggregations to calculate the third-order derivative, and so on for even higher orders, as sketched below. That would, however, provide little to no value for most data.
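
For illustration only, such a chain would add one more derivative to the aggs section of the previous query, each stage referencing the previous one by name (visits_3rd_deriv is a hypothetical name):

"visits_deriv":{
  "derivative":{
    "buckets_path":"total_visits"
  }
},
"visits_2nd_deriv":{
  "derivative":{
    "buckets_path":"visits_deriv"
  }
},
"visits_3rd_deriv":{
  "derivative":{
    "buckets_path":"visits_2nd_deriv"
  }
}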

Note: there are no second-derivative values for the first two buckets, because calculating the second derivative requires at least two first-derivative data points, and the first derivative itself only starts at the second bucket.

Min and Max Bucket Aggregation

Max bucket aggregation is a sibling pipeline aggregation that searches for the bucket(s) with the maximum value of some metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The metric must be numeric, and the sibling aggregation must be a multi-bucket aggregation.

In the example below, max bucket aggregation calculates the maximum number of total monthly visits across all buckets generated by the date histogram aggregation. In this case, the max bucket aggregation targets the result of the total_visits sum aggregation, which is its sibling aggregation.

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        }
      }
    },
    "max_monthly_visits":{
      "max_bucket":{
        "buckets_path":"visits_per_month>total_visits"
      }
    }
  }
}'

The query above should return the following result.

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2018-10-01T00:00:00.000Z",
"key":1538352000000,
"doc_count":3,
"total_visits":{
"value":2060.0
}
},
{
"key_as_string":"2018-11-01T00:00:00.000Z",
"key":1541030400000,
"doc_count":3,
"total_visits":{
"value":2141.0
}
},
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"total_visits":{
"value":2949.0
}
},
{
"key_as_string":"2019-01-01T00:00:00.000Z",
"key":1546300800000,
"doc_count":2,
"total_visits":{
"value":1844.0
}
},
{
"key_as_string":"2019-02-01T00:00:00.000Z",
"key":1548979200000,
"doc_count":2,
"total_visits":{
"value":2411.0
}
},
{
"key_as_string":"2019-03-01T00:00:00.000Z",
"key":1551398400000,
"doc_count":2,
"total_visits":{
"value":3103.0
}
},
{
"key_as_string":"2019-04-01T00:00:00.000Z",
"key":1554076800000,
"doc_count":2,
"total_visits":{
"value":2639.0
}
},
{
"key_as_string":"2019-05-01T00:00:00.000Z",
"key":1556668800000,
"doc_count":2,
"total_visits":{
"value":2212.0
}
},
{
"key_as_string":"2019-06-01T00:00:00.000Z",
"key":1559347200000,
"doc_count":2,
"total_visits":{
"value":2661.0
}
},
{
"key_as_string":"2019-07-01T00:00:00.000Z",
"key":1561939200000,
"doc_count":2,
"total_visits":{
"value":2887.0
}
},
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
}
}
]
},
"max_monthly_visits":{
"value":3121.0,
"keys":[
"2019-09-01T00:00:00.000Z"
]
}
}

As you see, the sum aggregation calculated the sum of all visits for each monthly bucket. Then our max bucket pipeline aggregation evaluated these results and identified the bucket with the maximum value of visits: 3121, the value of the “2019-09-01” bucket.

Min bucket aggregation follows the same logic. To make it work, we only need to replace max_bucket with min_bucket in the query:

"min_monthly_visits":{
"min_bucket":{
"buckets_path":"visits_per_month>total_visits"
}
}

It will return the minimum number of total monthly visits:

"avg_monthly_visits":{
"value":1844.0,
"keys":[
"2019-01-01T00:00:00.000Z"
]
}

This is the value of the “2019-01-01” bucket.
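
Since max_bucket and min_bucket are both sibling pipelines over the same metric, nothing prevents us from requesting the maximum and the minimum in a single query; a minimal sketch combining the two aggregations shown above:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        }
      }
    },
    "max_monthly_visits":{
      "max_bucket":{
        "buckets_path":"visits_per_month>total_visits"
      }
    },
    "min_monthly_visits":{
      "min_bucket":{
        "buckets_path":"visits_per_month>total_visits"
      }
    }
  }
}'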

Sum and Cumulative Sum Bucket Aggregations

There are situations when you need to calculate the sum of all bucket values calculated by some other aggregation. In this case, you can use a sum bucket aggregation, which is a sibling pipeline aggregation.

Let’s calculate the sum of monthly visits across all buckets:

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        }
      }
    },
    "sum_monthly_visits":{
      "sum_bucket":{
        "buckets_path":"visits_per_month>total_visits"
      }
    }
  }
}'

As you see, this pipeline aggregation targets the sibling total_visits aggregation that represents total monthly visits. The response should look something like this:

"aggregations":{
"visits_per_month":{
"buckets":[
...
{
"key_as_string":"2019-06-01T00:00:00.000Z",
"key":1559347200000,
"doc_count":2,
"total_visits":{
"value":2661.0
}
},
{
"key_as_string":"2019-07-01T00:00:00.000Z",
"key":1561939200000,
"doc_count":2,
"total_visits":{
"value":2887.0
}
},
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
}
}
]
},
"sum_monthly_visits":{
"value":30994.0
}
}

So, our sum bucket pipeline aggregation simply added up the monthly totals across all buckets, where each monthly total is itself the sum of all visits in that month calculated by the sibling sum aggregation.

Cumulative sum aggregation takes a different approach. In general, a cumulative sum is the sequence of partial sums of a given sequence. For example, the cumulative sums of the sequence {a, b, c, …} are a, a+b, a+b+c, and so on.

The cumulative sum aggregation is a parent pipeline aggregation that calculates the cumulative sum of a specified metric in a parent histogram (or date_histogram) aggregation. As with other parent pipeline aggregations, the specified metric must be numeric, and the enclosing histogram must have min_doc_count set to 0 (the default for histogram aggregations).

curl -X POST "localhost:9200/traffic_stats/_search?size=0&pretty" -H 'Content-Type: application/json' -d '{
  "aggs":{
    "visits_per_month":{
      "date_histogram":{
        "field":"date",
        "interval":"month"
      },
      "aggs":{
        "total_visits":{
          "sum":{
            "field":"visits"
          }
        },
        "cumulative_visits":{
          "cumulative_sum":{
            "buckets_path":"total_visits"
          }
        }
      }
    }
  }
}'

The response will look something like this:

"aggregations":{
"visits_per_month":{
"buckets":[
{
"key_as_string":"2018-10-01T00:00:00.000Z",
"key":1538352000000,
"doc_count":3,
"total_visits":{
"value":2060.0
},
"cumulative_visits":{
"value":2060.0
}
},
{
"key_as_string":"2018-11-01T00:00:00.000Z",
"key":1541030400000,
"doc_count":3,
"total_visits":{
"value":2141.0
},
"cumulative_visits":{
"value":4201.0
}
},
{
"key_as_string":"2018-12-01T00:00:00.000Z",
"key":1543622400000,
"doc_count":3,
"total_visits":{
"value":2949.0
},
"cumulative_visits":{
"value":7150.0
}
},
{
"key_as_string":"2019-01-01T00:00:00.000Z",
"key":1546300800000,
"doc_count":2,
"total_visits":{
"value":1844.0
},
"cumulative_visits":{
"value":8994.0
}
},
{
"key_as_string":"2019-02-01T00:00:00.000Z",
"key":1548979200000,
"doc_count":2,
"total_visits":{
"value":2411.0
},
"cumulative_visits":{
"value":11405.0
}
},
{
"key_as_string":"2019-03-01T00:00:00.000Z",
"key":1551398400000,
"doc_count":2,
"total_visits":{
"value":3103.0
},
"cumulative_visits":{
"value":14508.0
}
},
{
"key_as_string":"2019-04-01T00:00:00.000Z",
"key":1554076800000,
"doc_count":2,
"total_visits":{
"value":2639.0
},
"cumulative_visits":{
"value":17147.0
}
},
{
"key_as_string":"2019-05-01T00:00:00.000Z",
"key":1556668800000,
"doc_count":2,
"total_visits":{
"value":2212.0
},
"cumulative_visits":{
"value":19359.0
}
},
{
"key_as_string":"2019-06-01T00:00:00.000Z",
"key":1559347200000,
"doc_count":2,
"total_visits":{
"value":2661.0
},
"cumulative_visits":{
"value":22020.0
}
},
{
"key_as_string":"2019-07-01T00:00:00.000Z",
"key":1561939200000,
"doc_count":2,
"total_visits":{
"value":2887.0
},
"cumulative_visits":{
"value":24907.0
}
},
{
"key_as_string":"2019-08-01T00:00:00.000Z",
"key":1564617600000,
"doc_count":2,
"total_visits":{
"value":2966.0
},
"cumulative_visits":{
"value":27873.0
}
},
{
"key_as_string":"2019-09-01T00:00:00.000Z",
"key":1567296000000,
"doc_count":2,
"total_visits":{
"value":3121.0
},
"cumulative_visits":{
"value":30994.0
}
}
]
}
}

As you see, the aggregation starts with the value of the first bucket and then adds each subsequent bucket’s value to the running total, accumulating the sums of all buckets in the sequence. As a sanity check, the cumulative value of the last bucket (30994.0) matches the result of the sum bucket aggregation above, as it should: both end up adding all the monthly totals together.

Conclusion

That’s it! As we saw, pipeline aggregations help implement complex computations involving intermediary values and buckets produced by other aggregations. This makes it possible to extract complex measures such as derivatives, moving averages, and second-order derivatives that are not directly available in the data and require several intermediary computation steps.

In the next part of the pipeline aggregation series, we’ll continue with the analysis of pipeline aggregations focusing on such aggregations as the moving average, percentiles, moving function, serial differences, bucket sort, and other common pipeline aggregations.


Originally published at qbox.io.
