# Comprehensive Guide to Elasticsearch Metrics Aggregations: Part I

With this blog post, we begin a comprehensive overview of Elasticsearch metrics aggregations, starting with numeric metrics aggregations — a subset of metrics aggregations that produces numeric values. There are two types of these aggregations in Elasticsearch: single-value aggregations, which output a single value, and multi-value aggregations, which generate multiple metrics.

In the first part of our metrics aggregations series, we’ll discuss such single-value metrics aggregations as average and weighted average, min, max, and cardinality. The only multi-value aggregation type discussed in this article is extended stats aggregation. To help you understand how these aggregations work, we’ll accompany each description with the corresponding visualization in the Kibana dashboard. Let’s get started!

# Tutorial

Examples in this tutorial were tested in the following environment:

- Elasticsearch 6.4.0
- Kibana 6.4.0

## Creating a New Index

To illustrate various metrics aggregations mentioned in the intro, we first need to create a new “sports” index that stores a collection of “athlete” documents. The index mapping will contain such fields as athlete’s location, name, rating, sport, age, number of scored goals, and field position (e.g., midfielder). Let’s create the index mapping:

```bash
curl -XPUT "http://localhost:9200/sports/" -H "Content-Type: application/json" -d'
{
  "mappings": {
    "athlete": {
      "properties": {
        "birthdate":    { "type": "date", "format": "dateOptionalTime" },
        "location":     { "type": "geo_point" },
        "name":         { "type": "keyword" },
        "rating":       { "type": "integer" },
        "sport":        { "type": "keyword" },
        "age":          { "type": "integer" },
        "goals":        { "type": "integer" },
        "role":         { "type": "keyword" },
        "score_weight": { "type": "float" }
      }
    }
  }
}'
```

Once the index mapping is created, let’s use the Elasticsearch Bulk API to save some data to our sports index. Bulk indexing allows us to send multiple documents to the index in a single call:
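The actual documents used throughout this post are omitted here, but the shape of a bulk request is easy to sketch. The snippet below builds a newline-delimited bulk payload in Python for two made-up athletes (the names and numbers are illustrative, not the tutorial’s real 22-document dataset):

```python
import json

# Hypothetical sample documents -- the real dataset in this tutorial
# contains 22 athletes across four sports.
athletes = [
    {"name": "John Doe", "sport": "Football", "age": 26, "goals": 54,
     "role": "forward", "score_weight": 2.0},
    {"name": "Jane Roe", "sport": "Hockey", "age": 31, "goals": 30,
     "role": "defender", "score_weight": 4.0},
]

# The Bulk API expects newline-delimited JSON: an action line followed
# by the document source, one pair per document, with a trailing newline.
lines = []
for doc in athletes:
    lines.append(json.dumps({"index": {"_index": "sports", "_type": "athlete"}}))
    lines.append(json.dumps(doc))
payload = "\n".join(lines) + "\n"

# The payload can then be POSTed to http://localhost:9200/_bulk
# with the header "Content-Type: application/x-ndjson".
print(payload)
```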

Awesome! We are now prepared to demonstrate some of the most common single-value aggregations in Elasticsearch. Let’s start with the avg aggregation.

# Avg Aggregation

The `avg` metrics aggregation computes the arithmetic mean of numeric values extracted from a numeric field in your documents. As with any other metrics aggregation, the `avg` aggregation requires numeric values retrieved from a specific field or generated by a provided script. In this tutorial, we’ll focus only on the first scenario, but if you want to learn more about using Elasticsearch aggregations with scripting, you can read this excellent article.

Back to our avg aggregation. The simplest thing we can do with it is compute the average age of all athletes in our sports index. Take a look at this request:

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "avg_age": {
      "avg": { "field": "age" }
    }
  }
}'
```

As you see, we specified the field from which the `avg` aggregation will extract numeric values. Elasticsearch will run through our index and look at the `age` field in each document. As a result, the above request will produce the following output:

```json
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 44,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "avg_age": {
      "value": 27.318181818181817
    }
  }
}
```

The aggregation results are returned in the `aggregations` object at the end of the response. This object includes our `avg` aggregation with a value of 27.318, which is the average age of athletes in our index.

Let’s complicate things a little bit by calculating the average age of athletes playing a specific game: football, basketball, hockey, or handball. To do this, we can combine the `avg` aggregation with a bucket aggregation. Bucket aggregations create buckets/sets of documents based on certain criteria and determine whether a given document falls into a particular bucket. In addition, bucket aggregations return the number of documents that fell into each bucket.

To calculate the average age of athletes in each sport, we need to generate buckets using the terms aggregation, which is a multi-bucket aggregation that builds one bucket for each unique value found in the field. Because we have four unique sports in our collection, we can expect four buckets for our data set. After computing the terms bucket aggregation, we apply the `avg` aggregation to each generated bucket. Let’s take a look at the code:

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sports": {
      "terms": { "field": "sport" },
      "aggs": {
        "avg_age": {
          "avg": { "field": "age" }
        }
      }
    }
  }
}'
```

As you see, we specified the `sport` field as the source for the terms bucket aggregation and the `age` field as the source for the average aggregation of each bucket. The above request should produce the following output:

```json
"aggregations": {
  "sports": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Football",
        "doc_count": 18,
        "avg_age": { "value": 26.444444444444443 }
      },
      {
        "key": "Basketball",
        "doc_count": 10,
        "avg_age": { "value": 28.6 }
      },
      {
        "key": "Hockey",
        "doc_count": 10,
        "avg_age": { "value": 27.4 }
      },
      {
        "key": "Handball",
        "doc_count": 6,
        "avg_age": { "value": 27.666666666666668 }
      }
    ]
  }
}
```

As we expected, Elasticsearch generated four buckets that represent all sports types in our index. Each bucket is an object that contains the name of the bucket (`key`), the number of documents that fell into the bucket (`doc_count`), and the average age of athletes in that bucket. As the results suggest, the highest average age of athletes is in the basketball bucket (28.6).

Let’s now learn how to visualize this aggregation in Kibana. As you see in the image below, on the y-axis we use the `avg` aggregation on the age field. Correspondingly, on the x-axis, we create a terms bucket aggregation on the sport field. The total size of buckets is five, and they are ordered by the `Avg Age` metric used in the y-axis. If you have more than five categories in your data, you should consider setting a greater bucket size.

**Note**: Elasticsearch limits the number of buckets allowed in a single response via the dynamic cluster setting `search.max_buckets`. This setting is disabled by default (-1); however, requests that try to return more than 10,000 buckets will trigger a deprecation warning.

# Missing Values

Sometimes a target field in the document might be empty. The default behavior of the majority of metrics aggregations is to simply ignore such documents. However, optionally, you can treat documents with a missing value as if they had a value. For example:

```bash
curl -X POST "localhost:9200/grades/student/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "avg_grade": {
      "avg": {
        "field": "grade",
        "missing": 20
      }
    }
  }
}'
```

In this example, documents with no value in the grade field will fall into the same bucket as documents that have the value of 20 in that field.
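A quick sketch of what `missing` changes, using illustrative grades:

```python
# Grades with gaps: None marks a document with no value in the field.
grades = [50, None, 70, None]

# Default behavior: documents without a value are simply ignored.
present = [g for g in grades if g is not None]
print(sum(present) / len(present))  # 60.0

# With "missing": 20, absent values are treated as if they were 20.
filled = [g if g is not None else 20 for g in grades]
print(sum(filled) / len(filled))    # 40.0
```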

# Weighted Avg Aggregation

The weighted avg aggregation was added in Elasticsearch 6.4.0. In earlier Elasticsearch versions, using this aggregation will throw an `"Unknown BaseAggregationBuilder [weighted_avg]"` exception.

To use this aggregation type, you must first understand the difference between a regular average and a weighted average. When computing a regular average, each data point has an equal weight (1), so each data point contributes equally to the final value. In a weighted average, on the other hand, each data point may have a different weight, depending on how much it should contribute to the final value. Thus, the formula for the weighted average is `∑(value * weight) / ∑(weight)`.
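A tiny sketch verifies the formula and shows that equal weights reduce the weighted average to the regular mean (the numbers are made up):

```python
def weighted_avg(values, weights):
    # ∑(value * weight) / ∑(weight)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

goals = [10, 20, 30]  # hypothetical goal counts

regular = sum(goals) / len(goals)       # every value has implicit weight 1
print(regular)                          # 20.0
print(weighted_avg(goals, [1, 1, 1]))   # 20.0 -- equal weights reproduce the mean
print(weighted_avg(goals, [4, 3, 2]))   # (40 + 60 + 60) / 9, roughly 17.78
```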

Let’s look at our sports index to illustrate why we may need a weighted average instead of a regular average. First, the total number of goals scored by the best scorers differs significantly across sports. For example, on average, hockey players score more goals than footballers, and basketball players score more than hockey players.

Second, scoring frequency usually depends on the field position of a player. Forwards tend to score more than midfielders, and midfielders score more than defenders. If we calculate a score average that does not take these differences into account, the results might be skewed in favor of the high-frequency scoring sports and higher scoring positions like forwards.

We can solve the first problem by calculating the average goals for each sport. The second problem can be solved by assigning different weights to different field positions. The highest weight could be assigned to defenders, because they are expected to score less (hence, more weight when they do score), and the lowest weight to forwards, because scoring goals is what they are expected to do. We implement this idea in a `score_weight` field that has a weight of 2 for forwards, 3 for midfielders, and 4 for defenders. These weights guarantee that the final result properly reflects the average scoring pattern in each sport, accounting for the different roles in the game. By contrast, a regular average can be thought of as a weighted average in which every value has an implicit weight of `1`.

**Note**: these values are arbitrary and do not represent the scoring frequency at each field position. However, assigning those weights helps illustrate how the weighted average aggregation works.

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "scoring_weighted_average": {
      "terms": { "field": "sport" },
      "aggs": {
        "weighted_goals_in_sport": {
          "weighted_avg": {
            "value": { "field": "goals" },
            "weight": { "field": "score_weight" }
          }
        }
      }
    }
  }
}'
```

The above request should produce the following output:

```json
{
  "aggregations": {
    "scoring_weighted_average": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Football",
          "doc_count": 9,
          "weighted_goals_in_sport": { "value": 53.214285714285715 }
        },
        {
          "key": "Basketball",
          "doc_count": 5,
          "weighted_goals_in_sport": { "value": 1147.090909090909 }
        },
        {
          "key": "Hockey",
          "doc_count": 5,
          "weighted_goals_in_sport": { "value": 134.30769230769232 }
        },
        {
          "key": "Handball",
          "doc_count": 3,
          "weighted_goals_in_sport": { "value": 212.77777777777777 }
        }
      ]
    }
  }
}
```

Let’s compare these results with the regular average goals in each sport to see the difference:

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sports": {
      "terms": { "field": "sport" },
      "aggs": {
        "avg_goals": {
          "avg": { "field": "goals" }
        }
      }
    }
  }
}'
```

Response:

```json
{
  "aggregations": {
    "sports": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Football",
          "doc_count": 9,
          "avg_goals": { "value": 54.888888888888886 }
        },
        {
          "key": "Basketball",
          "doc_count": 5,
          "avg_goals": { "value": 1177.0 }
        },
        {
          "key": "Hockey",
          "doc_count": 5,
          "avg_goals": { "value": 139.2 }
        },
        {
          "key": "Handball",
          "doc_count": 3,
          "avg_goals": { "value": 245.33333333333334 }
        }
      ]
    }
  }
}
```

You can see that without weights, the regular average is significantly higher for handball (245.3 vs. 212.7), basketball (1177 vs. 1147), hockey (139.2 vs. 134.3), and almost equal for football (54.8 vs. 53.2). If weights represent real patterns in the computed values, then the weighted average tends to be a more accurate value than the regular average.

# Cardinality Aggregation

The cardinality aggregation calculates an approximate count of distinct values for a specific field. For example, by using the cardinality aggregation we can calculate the total number of unique sport categories in our sports index:

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sports_count": {
      "cardinality": { "field": "sport" }
    }
  }
}'
```

The above query should return the following response:

```json
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 22,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "sports_count": {
      "value": 4
    }
  }
}
```

The calculation of sports cardinality did not take much time and was not very memory-intensive because we have only four sports in our index. However, computing the cardinality aggregation becomes more resource-intensive when a field has many unique values. For example, calculating the cardinality of the age field in our 22-document index already involves many more distinct values:

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sports_count": {
      "cardinality": { "field": "age" }
    }
  }
}'
```

Take a look at the response:

```json
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 22,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "sports_count": {
      "value": 16
    }
  }
}
```

Cardinality calculation would be even more resource-intensive if we had thousands of athletes in our collection. An exact cardinality calculation would require loading all values into a hash set and returning its size. This approach does not scale well on high-cardinality sets and/or large values, because it requires more memory and causes high latency in distributed cluster environments.

How does Elasticsearch solve this problem? Under the hood, Elasticsearch calculates cardinality aggregation based on the HyperLogLog++ algorithm, which has the following features:

- configurable precision, which determines how to trade memory for accuracy,
- excellent accuracy on low-cardinality sets,
- fixed memory usage: memory usage will depend on the configured precision no matter how many documents are there in your index.

In other words, if you have a low-cardinality set, as in our example above, the algorithm’s calculation will be fully accurate. If you have a high-cardinality dataset, you can set the `precision_threshold` parameter to trade memory for accuracy. This setting defines the unique count below which counts are expected to be close to accurate; above this value, counts might become slightly less accurate.
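For contrast, the exact hash-set approach described above can be sketched in a few lines of Python; it is accurate, but its memory grows with the number of distinct values, which is precisely what the fixed memory usage of HyperLogLog++ avoids (the values are illustrative):

```python
# Hypothetical field values streamed from documents.
ages = [18, 23, 23, 27, 31, 31, 31, 41]

# Exact distinct count: memory grows with the number of unique values.
seen = set()
for age in ages:
    seen.add(age)
print(len(seen))  # 5
```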

# Min and Max Aggregations

Min and max aggregations are simple single-value metrics aggregations that compute the minimum/maximum of the numeric values extracted from the aggregated documents.

For example, let’s find the maximum age among all athletes in our index:

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "max_age": {
      "max": { "field": "age" }
    }
  }
}'
```

The response indicates that the maximum age among all athletes is `41.0` years:

```json
{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 22,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "max_age": {
      "value": 41.0
    }
  }
}
```

Min aggregation looks very similar:

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "min": {
      "min": { "field": "age" }
    }
  }
}'
```

Response:

```json
"hits": {
  "total": 22,
  "max_score": 0.0,
  "hits": []
},
"aggregations": {
  "min": {
    "value": 18.0
  }
}
```

As in the previous example, we can obtain a more fine-grained min/max aggregation of age for each sports category:

```bash
curl -X POST "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "sports": {
      "terms": { "field": "sport" },
      "aggs": {
        "max_age": {
          "max": { "field": "age" }
        }
      }
    }
  }
}'
```

Now, we can see the max ages of players in each sport featured in our index:

```json
"aggregations": {
  "sports": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "Football",
        "doc_count": 9,
        "max_age": { "value": 35.0 }
      },
      {
        "key": "Basketball",
        "doc_count": 5,
        "max_age": { "value": 36.0 }
      },
      {
        "key": "Hockey",
        "doc_count": 5,
        "max_age": { "value": 41.0 }
      },
      {
        "key": "Handball",
        "doc_count": 3,
        "max_age": { "value": 29.0 }
      }
    ]
  }
}
```

Min/max aggregations like this can be easily visualized in Kibana by setting the max aggregation in the y-axis and creating buckets for each sport in the x-axis.

# Extended Stats Aggregation

So far, we have discussed only examples of single-value aggregations. However, Elasticsearch also offers a multi-value metrics aggregation called extended stats, which computes stats over numeric values extracted from the aggregated documents. These stats include such metrics as min, max, avg, sum, sum of squares, standard deviation, etc. that are returned in a single object. The extended stats aggregation is a very convenient way to derive all metric aggregations at once.

We cannot, however, use extended stats for visualization in Kibana, although we can use its individual parts for visualization.

In the example below, we use the extended stats aggregation to extract age stats:

```bash
curl -X GET "localhost:9200/sports/athlete/_search?size=0&pretty" -H 'Content-Type: application/json' -d'
{
  "aggs": {
    "age_stats": {
      "extended_stats": { "field": "age" }
    }
  }
}'
```

The above aggregation computes the age statistics over all documents. The aggregation type is `extended_stats`, and the `field` setting defines the numeric field of the documents on which the stats will be computed. The above request will return the following output:

```json
"aggregations": {
  "age_stats": {
    "count": 22,
    "min": 18.0,
    "max": 41.0,
    "avg": 27.318181818181817,
    "sum": 601.0,
    "sum_of_squares": 17181.0,
    "variance": 34.67148760330581,
    "std_deviation": 5.888249961007584,
    "std_deviation_bounds": {
      "upper": 39.09468174019698,
      "lower": 15.541681896166649
    }
  }
}
```

One of the most important statistical properties computed by the extended stats aggregation is standard deviation. This is one of the main statistical measures that quantifies the amount of variation of a set of data values. A low standard deviation shows that data points are close to the mean, while a high deviation indicates that the data is spread out over a wider range of values.

Along with the regular standard deviation, the extended stats aggregation returns an object called `std_deviation_bounds`, which provides an interval of plus/minus two standard deviations from the mean. This metric is very useful for visualizing the variance of your data. If you want a different boundary, for example three standard deviations, you can set `sigma` in the request:

```bash
curl -X GET "localhost:9200/sports/athlete/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "age_stats": {
      "extended_stats": {
        "field": "age",
        "sigma": 3
      }
    }
  }
}'
```

The `sigma` parameter controls how many standard deviations above and below the mean should be displayed. Please note that for the standard deviation bounds to be meaningful, your data should be normally distributed.
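As a sanity check, the bounds can be reproduced from the other fields of the extended stats output shown earlier; `sigma` simply scales the deviation term:

```python
import math

# Values taken from the extended_stats response above.
count, total, sum_of_squares = 22, 601.0, 17181.0

mean = total / count
# Population variance: E[x^2] - (E[x])^2
variance = sum_of_squares / count - mean ** 2
std_dev = math.sqrt(variance)

sigma = 2  # the default; sigma=3 would widen the interval
upper = mean + sigma * std_dev
lower = mean - sigma * std_dev
print(round(upper, 6), round(lower, 6))  # 39.094682 15.541682
```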

Let’s display the lower and upper bounds of standard deviation of age for each sport category in Kibana. We need to select standard deviation aggregation in the y-axis and generate terms buckets in the x-axis.

As you see, the standard deviation upper bound is highest for hockey, while the lower bound is highest for handball.

# Conclusion

That’s it! In this article, we discussed several Elasticsearch metrics aggregations and demonstrated how to visualize them in Kibana.

When considered only for themselves, metrics aggregations offer a number of useful insights about your data. However, if you combine metrics aggregations with buckets aggregations, you can get even more insights into various categories and types of data found in your indices.

In the next part of the metrics aggregation series, we’ll discuss other metrics aggregation types such as geo bounds, geo centroid, percentiles, percentiles ranks, sum, top hits, and value count aggregation. Stay tuned to our blog to find out more!

*Originally published at **qbox.io**.*