Latency graphs for OSD for influx/grafana

Published in OpsOps, Jul 31, 2017

There isn’t much information on Ceph performance metrics. Most publicly available Nagios checks stop at the ‘ceph health’ output, and most articles I have seen do not go beyond that level, listing the 10 ‘most important Ceph metrics’.

I spent about two weeks building a custom dashboard for Ceph, which digs a bit deeper into the specifics of different Ceph subsystems.

During this process I ran into an extremely frustrating problem: the sum/avgcount pair used for most latency metrics in Ceph perf data is really hard to handle.

Setup

We run Telegraf with Influx, with Grafana on top. I’ll show examples for the trivial op_r_latency (read latency) of OSDs, but the approach is applicable to any sum/avgcount pair. Our op_r_latency is stored in database ‘ceph’, with tags ‘collection’, ‘host’ and ‘id’.

Avgcount/sum

All latencies in Ceph are returned as a pair of numbers. Telegraf translates them into two metrics. For op_r_latency they become op_r_latency.avgcount and op_r_latency.sum. Sum contains the total time of all (read) operations, and avgcount contains the number of operations. To get the average latency one needs to divide the sum by avgcount:

avg_latency = op_r_latency.sum/op_r_latency.avgcount

With a time series database (Influx), we can calculate the ‘current average’:

SELECT last("op_r_latency.sum") / last("op_r_latency.avgcount") AS last_avg_latency FROM "ceph" WHERE "collection" = 'osd' GROUP BY "id"

Problems with this approach:

The latency is an ‘average’ over an unspecified period. In reality it is calculated since the OSD daemon started and may not represent any meaningful value. Different OSDs start at different times, so the same number may mean different things for different OSDs. Some people even reset those counters to keep the numbers reasonable.
Any attempt to draw it as a graph will show too little change. We can add a time filter to the Influx query and even drop the last() function to get a ‘time-dependent’ sequence for visualization, but it has almost no meaning, because cumulative numbers smooth out local changes. The degree of smoothing itself depends on how long the OSD has been running, so the graph is distorted.
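To make the second problem concrete, here is a minimal Python sketch with made-up numbers (the samples are hypothetical, not real Ceph output). It computes sum/avgcount at every scrape for an OSD that has already served 100,000 reads and then goes through a short 50x latency spike.

```python
# Hypothetical cumulative counters from one OSD, one tuple per scrape:
# (total seconds spent in reads, total number of reads) since daemon start.
samples = [
    (1000.0, 100_000),
    (1001.0, 100_100),  # 100 reads at 0.01 s each
    (1051.0, 100_200),  # 100 reads at 0.5 s each: a 50x latency spike
    (1052.0, 100_300),  # 100 reads at 0.01 s each
]


def cumulative_average(samples):
    """Average latency since daemon start at each scrape: sum / avgcount."""
    return [total / count for total, count in samples]


averages = cumulative_average(samples)
spread = max(averages) - min(averages)
print(averages)  # every value stays close to 0.01 s
print(spread)    # the spike moves the average by less than a millisecond
```

The longer the OSD has been running, the larger both counters are and the less any spike shows up on such a graph.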

Introducing calculus

For each point on a graph we need the average latency over the preceding interval only, which is a ratio of two deltas (dSum/dCount):

dLat(t1, t2) = (sum(t2) - sum(t1)) / (avgcount(t2) - avgcount(t1))

Sounds simple.
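As an illustration, the formula can be sketched in Python for two consecutive scrapes. The function name and the choice to return None for unusable intervals (no operations, or counters that went backwards after an OSD restart) are my assumptions for this sketch, not something Ceph or Telegraf provides.

```python
from typing import Optional


def interval_latency(sum1: float, count1: int,
                     sum2: float, count2: int) -> Optional[float]:
    """Average latency of the operations completed between two scrapes.

    Returns None when the interval is unusable: no operations happened,
    or a counter decreased (for example, the OSD daemon restarted).
    """
    dsum = sum2 - sum1
    dcount = count2 - count1
    if dcount <= 0 or dsum < 0:
        return None
    return dsum / dcount


# Hypothetical numbers: 100 reads took 50 seconds in total.
print(interval_latency(1001.0, 100_100, 1051.0, 100_200))
```

With two clean samples this is all there is to it; the guards are what becomes awkward to express once the data lives in a time series database.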

Influx complications

The problem stops being simple when you move from the calculus of smooth, everywhere-defined functions to multidimensional time series with discrete steps between points.

Complications:

We need GROUP BY time(x) to get a time series.
We can’t perform math on arbitrary intervals, as we need at least one sum/avgcount sample in each interval, or the deltas turn into a real mess.
Influx rejects math on the results of the derivative function if that math involves series values (read: we cannot divide one derivative by another).

I found no way to solve this with a single simple query. The solution I present below uses nested queries (subqueries), which have been available since Influx version 1.2.

The solution

SELECT dsum/dcount FROM (
  SELECT non_negative_derivative(max("op_r_latency.avgcount"), 1s) AS dcount, non_negative_derivative(max("op_r_latency.sum"), 1s) AS dsum FROM "$policy"."ceph" WHERE "collection" = 'osd' AND "id" = '$osd' AND $timeFilter GROUP BY time($__interval) fill(previous)
)

I put the outer query a bit apart from the inner query to make it easier to read. The inner query returns two derivatives, one for sum and one for avgcount, and the outer query just divides them (as series).
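For readers who think better in code than in InfluxQL, here is a rough Python emulation of what the nested query computes. It is a sketch under assumptions: the function name and the input format are hypothetical, and returning None for intervals with no new operations or with decreasing counters is my choice here, not a statement about what Influx returns in those cases.

```python
def latency_series(points, interval):
    """Emulate the nested query for one OSD.

    points: list of (timestamp_seconds, cumulative_sum, cumulative_count).
    interval: bucket width in seconds (the role of $__interval).
    Returns a list of (bucket_start, latency_or_None).
    """
    if not points:
        return []

    # GROUP BY time(interval) with max(): largest cumulative value per bucket.
    buckets = {}
    for ts, total, count in points:
        key = int(ts // interval)
        if key in buckets:
            old_total, old_count = buckets[key]
            buckets[key] = (max(old_total, total), max(old_count, count))
        else:
            buckets[key] = (total, count)

    # fill(previous): an empty bucket repeats the last known values.
    filled = []
    current = None
    for key in range(min(buckets), max(buckets) + 1):
        current = buckets.get(key, current)
        filled.append((key * interval, current))

    # Inner query: derivatives of sum and count between neighbouring buckets.
    # Outer query: dsum / dcount. The per-second scaling of both derivatives
    # cancels out in the division, so plain differences are enough here.
    result = []
    for (_, (s1, c1)), (t2, (s2, c2)) in zip(filled, filled[1:]):
        dsum, dcount = s2 - s1, c2 - c1
        if dsum < 0 or dcount <= 0:
            result.append((t2, None))  # counter reset or no new operations
        else:
            result.append((t2, dsum / dcount))
    return result


# Hypothetical scrapes: one bucket is missing (t=20) and the OSD restarts at t=40.
points = [(0, 10.0, 1000), (10, 11.0, 1100), (30, 13.0, 1200), (40, 1.0, 50)]
print(latency_series(points, 10))
```

The missing bucket and the restart both end up as gaps instead of spikes, which is the behaviour the query is after.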

Notes:

fill(previous) helps to avoid sudden jumps if a sample is lost or an OSD restarts.
max() is not there to display a ‘max’ value; it just takes the largest of all (cumulative) values in the interval. last() may work too. I do not recommend mean(), as it is sensitive to the number of lost samples.
$__interval should be greater than or equal to the collection (gathering) interval.
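The remark about mean() can be illustrated with a few made-up avgcount samples in Python. The bucket contents are hypothetical; the only point is how the growth between two buckets changes once some scrapes are lost.

```python
from statistics import mean

# Hypothetical cumulative avgcount samples, three scrapes per GROUP BY bucket.
bucket_a = [100, 110, 120]
bucket_b_full = [130, 140, 150]
bucket_b_lossy = [150]  # the first two scrapes of this bucket were lost


def growth(aggregate, previous_bucket, current_bucket):
    """Counter growth between two buckets under a given aggregate function."""
    return aggregate(current_bucket) - aggregate(previous_bucket)


# max() sees the same growth whether or not scrapes were lost.
print(growth(max, bucket_a, bucket_b_full), growth(max, bucket_a, bucket_b_lossy))
# mean() shifts when scrapes are lost, which distorts the derivative.
print(growth(mean, bucket_a, bucket_b_full), growth(mean, bucket_a, bucket_b_lossy))
```

Here max() reports a growth of 30 in both cases, while mean() reports 30 with all samples and 40 with the lossy bucket.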