How to start using metrics with Grafana and not fail. Part 2/3
This is the second part of the article. The first part is here.
Aggregating
Before we start discussing what we should measure, it is important to understand how this data will be stored. Let’s imagine that we created a new messenger — a WhatsUp killer — and we want to count how many users log in per second. Let’s assume that we are monitoring logins and got the following data:
+--------+-------------------+
| Second | Logins per second |
+--------+-------------------+
|      1 |                 5 |
|      2 |                 8 |
|      3 |                10 |
|      4 |                15 |
|      5 |                 3 |
|      6 |                18 |
|      7 |                12 |
|      8 |                11 |
|      9 |                 8 |
|     10 |                 7 |
+--------+-------------------+
We could store these numbers in the database as is. But an application commonly produces hundreds of metrics, and storing all this data accurately, as well as sending every metric every second, can be really costly. The solution is to aggregate metrics over some number of seconds and send them all at once. It is typical to send metrics once per 10 seconds or once per 30 seconds. For our messenger, instead of sending the metrics shown above, we can aggregate them. It depends on the implementation, but usually we would calculate the average, minimum, and maximum values, plus the count. Eventually, we will send something like this:
{"min": 3, "max": 18, "avg": 9.7, "count": 97}
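As a sketch of what a metrics client does before sending, here is a minimal Python example (the `aggregate` function name is just an illustration; real metrics libraries do this for you):

```python
# Minimal sketch of client-side aggregation: collapse 10 per-second
# login counts into one point with min, max, avg, and count.
def aggregate(samples):
    return {
        "min": min(samples),
        "max": max(samples),
        "avg": sum(samples) / len(samples),
        "count": sum(samples),
    }

# Per-second login counts from the table above (seconds 1..10).
logins = [5, 8, 10, 15, 3, 18, 12, 11, 8, 7]
print(aggregate(logins))  # {'min': 3, 'max': 18, 'avg': 9.7, 'count': 97}
```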
Of course, only these 4 numbers will describe 10 seconds of the application’s life. Depending on our purpose, we should pick the right one. For example, an Influx query for the average login rate per second should look like this
SELECT mean("avg") FROM "logins.count"
But what does “mean” mean?
When you look at a graph over a relatively small time range, say 5 minutes, Grafana can, by default, draw every single point on the graph. In other words, Grafana can show you all the data from the database as is. But when you want to look at logins for 10 days, Grafana can’t draw all the points from the database, so it will aggregate them in some way. In other words, it will merge several points into one. And “mean” in the query
SELECT mean("avg") FROM "logins.count"
means that Grafana will apply the “average” aggregation function.
If you would like to count how many users log in per day, the query should be:
SELECT sum("count") FROM "logins.count"
In this case, we should use another aggregation function — “sum” — because we want the total number of logins.
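To make the difference between the two queries concrete, here is a small Python sketch. It assumes two stored 10-second points (the second point’s numbers are made up for illustration): taking the mean of the `avg` field gives the per-second login rate, while summing the `count` field gives the total number of logins.

```python
# Two 10-second points as stored in the database
# (the second point's values are invented for illustration).
points = [
    {"min": 3, "max": 18, "avg": 9.7, "count": 97},
    {"min": 5, "max": 21, "avg": 12.3, "count": 123},
]

# mean("avg")  -> average login rate per second over both windows
rate = sum(p["avg"] for p in points) / len(points)

# sum("count") -> total number of logins over both windows
total = sum(p["count"] for p in points)

print(rate)   # ≈ 11.0
print(total)  # 220
```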
With 2+ servers, aggregation becomes much more interesting.
If you have only one instance of an app, you can expect only one aggregated point in the database every 10/30/60 seconds. But if you have several instances of an application, every instance aggregates metrics and sends them to the database. It means that if 3 application instances send metrics every 10 seconds, you will have 3 data points for every 10-second interval. Be careful: applications do not send metrics at the same moment; each has its own opinion on when to send them.
And when Grafana shows it, the database should aggregate these 3 points into one. Again, the aggregation function depends on your purpose. If you would like to know how many users logged in across all the servers, your query should be
SELECT sum("count") FROM "logins.count"….
We are using the “sum” function because we should collect the logins from all the servers.
But if you would like to know how many users log in to one server on average, you should apply the “mean” or “median” function:
SELECT mean("count") FROM "logins.count"….
And finally, if you would like to know how many users hit the most heavily loaded server, you should use the “max” function.
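The three choices above can be sketched in Python. Assume three instances reported the following made-up `count` values for the same 10-second bucket:

```python
# Three instances each sent a "count" for the same 10-second bucket
# (the numbers are invented for illustration).
counts = {"app-1": 97, "app-2": 120, "app-3": 83}

total_logins = sum(counts.values())              # sum("count")  -> total across servers
per_server = sum(counts.values()) / len(counts)  # mean("count") -> average per server
busiest = max(counts.values())                   # max("count")  -> busiest server

print(total_logins, per_server, busiest)  # 300 100.0 120
```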
Aggregation Summary:
There are 2 levels of aggregation:
1) On the application level: before sending metrics, an app aggregates them over some period of time and then sends them. It is common for an application to apply several aggregation functions to a metric: min, max, avg, and count.
2) On the database (or Grafana) level: when Grafana picks data from the database, it has to aggregate several points. It uses the aggregation functions of the database. The most commonly used are: mean, median, max, count, sum, and last. Other functions are perfectly described here.
What to measure? Or measuring patterns
There are plenty of things that we can measure. But here, I will describe the most commonly used patterns. I will use Grafana and Influx in my examples, but you can easily adapt them to your tools.
Amount.
It shows how many things happen during a period of time. Typical examples are “packets lost in 30 minutes”, “errors occurred last week”, “logins per day”, and so on.
Typical values are “count” or “value”.
Aggregation function: “sum”.
It is very important to understand what period of time is represented by one dot or bar. Depending on the scale, Grafana aggregates several measurements into one point, summing all the numbers within the period. And, of course, you should know how long that period is: there is a big difference between the number of packets received in an hour and the number received in 10 minutes.
It is very convenient to add the $__interval variable to the alias.
Take a look at the alias: the interval is used there.
Throughput.
It is the rate of production or the rate at which something is processed.
Typical units are “per second” and “per minute”. Examples: requests per second, Ethernet throughput in MBytes/s.
Aggregation functions: avg, median.
Values should be normalized to a basic unit, like seconds. Most probably, an application sends metrics not every second but every 10 or every 30 seconds. It means that the database contains information with a granularity of n seconds.
For example, we would like to show the HTTP request rate per second for our endpoint. Our application sends metrics every 10 seconds. It means that the query `SELECT mean(value) FROM requests` shows us the aggregated rate with a resolution of 10 seconds. That is inconvenient, and we should normalize it to one second. In this case, it means that the value should be divided by 10.
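Here is that normalization as a Python sketch (the function name and the 10-second interval are illustrative assumptions):

```python
# Normalize a count aggregated over a 10-second send interval
# to a per-second rate.
SEND_INTERVAL_SECONDS = 10  # how often the app sends metrics (assumed)

def per_second_rate(count_in_window, interval=SEND_INTERVAL_SECONDS):
    return count_in_window / interval

# 97 requests counted during one 10-second window:
print(per_second_rate(97))  # 9.7
```

In InfluxQL, you can also perform this division directly in the SELECT expression instead of doing it on the client side.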
Extremes (Max/Min Throughput)
Sometimes, it is very important not to cross a limit: when you call some service, you should take care not to overload it. In that case, you will most likely want to measure maximum throughput.
In this case, you should pick the “max” value and use the “max” aggregation function.
Current State
It shows the current state of an application, like free memory at the moment, CPU load, and so on.
Aggregation function: last
Graphs are awesome, but sometimes we are not interested in historical data; we want to know what is going on right now. For this purpose, we can use Grafana’s gauge panel.
Summary information
Sometimes, it is very important to show historical data, but if we overload a dashboard, it will be hard to understand what’s going on. Do not neglect gauges when you want to show summary information, like the count of logins per day or the average throughput per month.
How to implement this with Influx and Grafana? Please wait for the final part.