Precise Rate Charts Using Graphite with Client-Side Aggregation

Petr Hájek
Omio Engineering
May 27, 2022 · 7 min read

Let’s talk about how to create rate charts for low-traffic services with the use of:

  • Graphite as the metric storage
  • Graphite metrics reported directly by the application (e.g. using Micrometer) with client-side aggregation (not server-side aggregation as with StatsD)
  • Grafana as the visualisation platform

Rates are easy, but not in cases of low volume

Graphite is a widely used tool for storing metrics. The simplest model of Graphite use is that every application instance (e.g. a Kubernetes Pod) reports metrics directly to the Graphite server. Graphite acts as metric storage but is not a metrics aggregator. That means that Graphite reporters like Micrometer need to do the "client-side" aggregation: every application instance aggregates its own metrics in memory.

For Counters, Graphite Reporters usually perform 2 kinds of aggregations:

  • mX_rate (e.g. m1_rate, m15_rate…) — which represents requests per second calculated over an X-minute interval (to be precise, an Exponentially Weighted Moving Average (EWMA) over an X-minute period)
  • An incremental counter, which keeps an ever-increasing value from application startup until shutdown (see the sketch below)
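
To make this concrete, here is a minimal Micrometer sketch (assuming the micrometer-registry-graphite module is on the classpath and its default connection settings point at your Graphite host; the metric name my-metric is hypothetical). The registered counter is what produces both the mX_rate series and the incremental count described above:

    import io.micrometer.core.instrument.Clock;
    import io.micrometer.core.instrument.Counter;
    import io.micrometer.graphite.GraphiteConfig;
    import io.micrometer.graphite.GraphiteMeterRegistry;

    public class MetricsSetup {
        public static void main(String[] args) {
            // Returning null for every key keeps Micrometer's built-in defaults
            // (Graphite host/port, reporting step, etc.).
            GraphiteConfig config = key -> null;
            GraphiteMeterRegistry registry = new GraphiteMeterRegistry(config, Clock.SYSTEM);

            // Each increment is aggregated in memory by the client and published
            // periodically as both an ever-increasing count and EWMA-based rates.
            Counter requests = Counter.builder("my-metric").register(registry);
            requests.increment();
        }
    }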

Let’s say that we have an application running in 2 instances (pod-…2mcrf and pod-…5fdxr) and the events look like this:

Now we want to calculate requests per 10 minutes. One would expect mX_rate to be the best solution for calculating any rate: to get requests per 10 minutes, just do m15_rate * 60 * 10. But let’s see what the m15_rate metric looks like.
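
In Graphite query form that would be something like the following (the metric path is hypothetical and assumes the reporter publishes an m15_rate sub-metric per pod; scale() multiplies each datapoint by 600 and sumSeries() combines the per-pod series):

scale(sumSeries(my-app.*.my-metric.m15_rate), 600)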

Why doesn’t it follow what’s happening in reality? Because the client-side aggregation is optimized for a higher volume of events. It’s useful for applications with a lot of events, but useless for those with only a few.

In that case, let’s check the second out-of-the-box aggregation from Micrometer: Incremental counter.

This one seems to align with reality more, but it can’t be used out-of-the-box. This is where Graphite Functions come into play.

“nonNegativeDerivative” is the critical magic

To get the Rate, we can use the function nonNegativeDerivative which represents positive changes in the counter. Now the chart looks better, but there are still a few issues.
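
For reference, the plain query here is just nonNegativeDerivative applied to the counter path, using the same hypothetical metric path as in the later examples:

nonNegativeDerivative(my-app.*.my-metric)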

Notice the scale of the Y axis. It looks like we have fractions of events — the first bar shows 0.333 events. That doesn’t make a lot of sense.

Graphite reports fractions because it aggregates over a given period of time. In our case, it aggregates into 1.5-minute buckets and, by default, uses the average. Our Graphite reporter reports the counter metric every 30 seconds (default behavior), so nonNegativeDerivative is calculated for every metric entry. Since there are 3 metric entries in each 1.5-minute bucket and only one of them carries the increment, Graphite returns the average value: (1 + 0 + 0) / 3 = 0.333.

There are 2 ways to compensate for it:

  • Either summarize the series over a period — summarize(nonNegativeDerivative(my-app.*.my-metric),"10min","sum")
  • Or change the aggregation over the time period from "average" to "sum" using consolidateBy(nonNegativeDerivative(my-app.*.my-metric),"sum")

For the sake of this post, we’ll go with summarize.

Compensating network issues

It might happen that, due to network issues, not all metric entries are delivered to the Graphite server. So when you zoom in, the counter aggregation might look like this:

In such a case you might miss a few events. Imagine the following scenario:

  • Time 00:00:00 — counter value is 1
  • Time 00:00:30 — the counter increments to 2, but due to a network issue the metric entry doesn’t reach Graphite; in queries, Graphite then treats the entry for 00:00:30 as "null"
  • Time 00:01:00 — counter is still 2

When we call nonNegativeDerivative, it’ll ignore the bump from 1 -> 2, because it sees the change null -> 2.

To compensate for missing entries, we wrap the series in keepLastValue (the innermost part of the expression), which carries the last known value forward over the gaps.
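
The query from the previous section then becomes something like:

summarize(nonNegativeDerivative(keepLastValue(my-app.*.my-metric)),"10min","sum")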

Now the chart picks up the increments that were previously dropped, so the number of reported events increases.

More precision comes with a complexity trade-off

If you want to be really precise and have only a few events per metric, things will start to get even more complicated. If you have more than 50 events per minute, I’d suggest stopping optimizations here.

Find the initial event

If you need to be accurate, you also have to ensure that the very first event of the metric is counted. Out of the box, it isn’t: counter metrics are created in the app on their first increment, so Graphite sees the counter go from null -> 1 and, as before, ignores this bump.

To include the first event, you can use transformNull(0).
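
Note the order: transformNull is applied after keepLastValue, so only the nulls with no previous value to carry forward (i.e. before the counter ever reported) become 0. The query grows to something like:

summarize(nonNegativeDerivative(transformNull(keepLastValue(my-app.*.my-metric),0)),"10min","sum")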

You might also solve the issue in your application by initialising all counters to 0 or 1 on startup; then the first event will be counted. But that’s not always easy to do, so we’re leaving this option aside in this post.
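
For completeness, here is a minimal sketch of what that could look like with Micrometer (the metric names and the place you call it from are assumptions):

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;

    public class CounterInitializer {
        // Call once on application startup: registering a counter without
        // incrementing it makes the reporter publish it with a value of 0 from
        // the first push, so the later 0 -> 1 transition is visible to
        // nonNegativeDerivative.
        public static void initializeCounters(MeterRegistry registry) {
            Counter.builder("my-metric").register(registry);
            Counter.builder("my-other-metric").register(registry);
        }
    }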

Network issues strike again

Having transformNull(0) in place will still cause issues when there are network-related nulls at the beginning of the chart window.

Consider the following case: the chart window is from 12:00 -> 13:00 and the counter values are as follows:

  • 11:59:30 — count=100
  • 12:00:00 — count=null (the service wasn’t able to reach Graphite)
  • 12:00:30 — count=null (the service wasn’t able to reach Graphite)
  • 12:01:00 — count=100

Since null is transformed to 0, the transition between 12:00:30 and 12:01:00 (0 -> 100) will be calculated as 100 by nonNegativeDerivative.

This spike will look like this:

Compensate by ignoring the start

The spike at the beginning is very nasty and randomly causes major discrepancies. One way to get rid of it is to ignore the first minutes of the chart period.

Ignoring the beginning is possible with the timeSlice function. If you want the chart to respect the from/to picker, you can try to use Grafana global variables like this:

timeSlice('${__from:date:HH}:${__from:date:mm_YYYYMMDD} -110min','now')
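
One possible way to combine it with the query we’ve built so far is to slice the series before taking the derivative, so the zeroed-out points at the start of the window are dropped and can’t produce the spike (the exact nesting is just one option, shown here as an illustration):

summarize(nonNegativeDerivative(timeSlice(transformNull(keepLastValue(my-app.*.my-metric),0),'${__from:date:HH}:${__from:date:mm_YYYYMMDD} -110min','now')),"10min","sum")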

Unfortunately, this approach has a few drawbacks:

  • You have to handle the timezone offset manually (during DST (Summer Time) use -110min, during non-DST (Winter Time) use -50min for CET/CEST)
  • When you hardcode a relative time in the Query Options of a Grafana chart, timeSlice will be incorrect, as $__from is taken from the Time Picker

Graphite and rollup

To get solid performance on long-term data, Graphite can perform a Rollup Aggregation. It’s important to have Rollup properly set up, otherwise you’ll get very imprecise results when querying for long-term data.

How Rollup works

After data is pushed to the Graphite server, it creates multiple aggregations (e.g. aggregates per 30s, 5m and 1hr). Then, if you query Graphite for long-term data, it’ll use data from a longer Rollup Aggregation (e.g. when you query for the past 3 months of data, Graphite will decide to use the 1hr aggregation).
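
The rollup intervals come from Carbon’s storage-schemas.conf. As an illustration (the pattern and retention values here are examples, not a recommendation):

    [my-app]
    pattern = ^my-app\.
    # keep 30s points for 7 days, 5m points for 30 days, 1h points for 2 years
    retentions = 30s:7d,5m:30d,1h:2y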

Use Max over Average for Counters

The default Rollup Aggregation Strategy is to average values within the Rollup Window. This works well for mX_rate metrics, but not for counter metrics.

Consider the following example:

  • At 00:00:00 — counter is 10
  • At 00:00:30 — counter is 18
  • At 00:01:00 — counter is 20
  • At 00:01:30 — counter is 22

When Graphite uses Rollup per 1 minute, the Average Strategy will transform data to:

  • 00:00:00 — counter is 14
  • 00:01:00 — counter is 21

If we then run nonNegativeDerivative, it’ll say that at 00:01 there were 7 events (21 - 14). In fact, there were 4.

We need to set the Rollup Strategy to aggregate using "max", so the data will be transformed to:

  • 00:00:00 — counter is 18
  • 00:01:00 — counter is 22

So nonNegativeDerivative will correctly show 4 events (22 - 18) at 00:01.
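
In Carbon, the rollup strategy is configured per metric pattern in storage-aggregation.conf. A sketch, assuming the incremental counters can be matched by a .count suffix (adjust the pattern to your reporter’s naming):

    [counter_metrics]
    pattern = \.count$
    xFilesFactor = 0
    # keep the highest counter value seen in the rollup window instead of the average
    aggregationMethod = max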

Summary

Graphite is a great tool, but using Graphite with Client-Side Aggregation can be tricky.

The main issue arises when you have a low volume of events. In such a case, if there is a discrepancy of 2 events out of 10, that’s a 20% discrepancy, so you need to get rid of even the small ones.

If you have a high volume of events, things are simpler, because a 2-event discrepancy, for example, will not be significant. For high-volume metrics, even the mX_rate aggregations work quite well.

Want to learn more? Come join us! We’re hiring for a number of roles, and would love to talk to you. In the meantime, check out Petr’s Q&A on how he thinks and how he got here.
