Latency graphs for OSD for influx/grafana
There isn’t much information available on Ceph performance metrics. Most publicly available Nagios checks are stuck at ‘ceph health’ output, and most articles I have seen do not go beyond that level, listing the 10 ‘most important Ceph metrics’.
I spent about two weeks building a custom dashboard for Ceph, which digs a bit deeper into the specifics of different Ceph subsystems.
During this process I ran into an extremely frustrating problem: the sum/avgcount pairs used for most latency metrics in Ceph perf data are really hard to handle.
Setup
We run Telegraf with Influx, with Grafana on top. I’ll show examples for the trivial op_r_latency (read latency) metric of OSDs, but the approach is applicable to any sum/avgcount pair. Our op_r_latency is stored in database ‘ceph’, with the tags ‘collection’, ‘host’ and ‘id’.
Avgcount/sum
All latencies in Ceph are returned as a pair of numbers. Telegraf translates them into two metrics. For op_r_latency they become op_r_latency.avgcount and op_r_latency.sum. Sum contains the total time of all (read) operations, and avgcount contains the number of operations. To get the average latency, one needs to divide sum by avgcount:
avg_latency = op_r_latency.sum / op_r_latency.avgcount
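As a minimal sketch of that division, assuming the counter pair arrives as a dict with the two fields named above (the helper name average_latency and the sample values are made up for illustration):

```python
def average_latency(counter):
    """Return sum/avgcount for one latency counter, or None if no ops were counted."""
    count = counter["avgcount"]
    if count == 0:
        return None  # no operations yet, so the average is undefined
    return counter["sum"] / count


# Hypothetical sample: 2000 reads that took 5.0 seconds in total.
op_r_latency = {"avgcount": 2000, "sum": 5.0}
print(average_latency(op_r_latency))
```

The zero check matters for a freshly started or idle OSD, where avgcount can still be 0.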
For a time series (Influx), we can calculate the ‘current average’:
last_avg_latency = last(op_r_latency.sum) / last(op_r_latency.avgcount) from "ceph" where collection='osd' group by 'id'
Problems with this approach:
- The latency is an “average” over an unspecified time. In reality it is calculated since the OSD daemon started and may not represent any meaningful value. Different OSDs start at different times, so the number may mean different things for different OSDs. People even reset those counters to keep the numbers reasonable.
- Any attempt to draw it as a graph will show too few changes. We can add a time filter to the Influx query and even ditch the ‘last()’ function to get a ‘time-dependent’ sequence for visualization, but it has almost no meaning, as cumulative numbers hide local changes. Locality itself depends on time, so the graph is useless.
Introducing calculus
We need to calculate dLat/dTime for each point on the graph.
dLat(t1, t2) = (Sum(t2) - Sum(t1)) / (count(t2) - count(t1))
Sounds simple.
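To see what the formula buys us, here is a small sketch that applies it to consecutive samples and prints the result next to the cumulative average. The function name interval_latency and the sample numbers are invented for illustration; they are not real OSD data.

```python
def interval_latency(prev, curr):
    """Average latency of the operations that happened between two samples.

    Each sample is a (sum, avgcount) tuple of cumulative counters.
    Returns None when the interval has no usable data.
    """
    d_sum = curr[0] - prev[0]
    d_count = curr[1] - prev[1]
    if d_count <= 0 or d_sum < 0:
        # No operations in the interval, or the counters went backwards
        # (for example after an OSD restart).
        return None
    return d_sum / d_count


# Made-up cumulative samples: (sum in seconds, avgcount).
samples = [(100.0, 100000), (100.1, 100100), (105.1, 100200), (105.2, 100300)]

for prev, curr in zip(samples, samples[1:]):
    cumulative = curr[0] / curr[1]
    local = interval_latency(prev, curr)
    print(f"cumulative={cumulative:.5f}s  interval={local:.5f}s")
```

In this made-up data the second interval contains slow reads (about 50 ms each), which the interval value shows clearly, while the cumulative average only moves by a few percent.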
Influx complications
The problem stops being simple when you move from the simple calculus of smooth, everywhere-defined functions to multidimensional time series with discrete steps between points.
Complications:
- We need GROUP BY time(x) to get a time series.
- We can’t perform math on arbitrary intervals, as we need at least one sum/avgcount sample in each interval, or we’ll get a real mess in the deltas.
- Influx rejects math on derivative function results if that math involves series values (read: we cannot divide one derivative by another derivative).
I found no solution to this problem with a simple query. The solution I present below involves nested queries, which have been available since Influx version 1.2.
The solution
SELECT dsum/dcount FROM (
    SELECT non_negative_derivative(max("op_r_latency.avgcount"), 1s) AS dcount,
           non_negative_derivative(max("op_r_latency.sum"), 1s) AS dsum
    FROM "$policy"."ceph"
    WHERE "collection" = 'osd' AND "id" = '$osd' AND $timeFilter
    GROUP BY time($__interval) fill(previous)
)
I put the outer query a bit apart from the inner query to make it easier to read. The inner query returns two derivatives, one for sum and one for avgcount, and the outer query just divides them (as series).
Notes:
- fill(previous) helps us avoid sudden jerks if a sample was lost or an OSD rebooted.
- max is not for displaying a ‘max’ value; it just takes the largest of all (cumulative) values in the interval. Maybe last() will work too. I do not recommend using mean(), as it is sensitive to the number of lost samples.
- $__interval should be greater than or equal to the collection interval.