Before writing down a formula for ‘average cluster latency’, you need to clarify what you want to find. What is ‘overall cluster latency’? Is it an average of the latencies of all OSDs? Or do you want an average latency over all Ceph IO operations in a given time window? Or do you want to see only client-related latencies, without background scrubbing and replication? These are completely different metrics.
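To make the difference concrete, here is a minimal sketch comparing two of those definitions. The OSD names and numbers are made up for illustration and are not taken from a real cluster; the helper names are hypothetical too.

```python
# Hypothetical per-OSD figures for one time window (made up for illustration):
# (operations completed, total time spent on those operations in seconds)
osd_stats = {
    "osd.0": (1000, 2.0),  # busy OSD, about 2 ms per op
    "osd.1": (10, 1.0),    # nearly idle OSD, about 100 ms per op
}

def mean_of_osd_means(stats):
    """Average of per-OSD average latencies: every OSD counts equally."""
    per_osd = [total / ops for ops, total in stats.values()]
    return sum(per_osd) / len(per_osd)

def mean_over_all_ops(stats):
    """Average latency over all operations: every operation counts equally."""
    total_ops = sum(ops for ops, _ in stats.values())
    total_latency = sum(total for _, total in stats.values())
    return total_latency / total_ops

print(f"mean of OSD means: {mean_of_osd_means(osd_stats) * 1000:.1f} ms")
print(f"mean over all ops: {mean_over_all_ops(osd_stats) * 1000:.1f} ms")
```

With these numbers the first definition gives about 51 ms and the second about 3 ms, so the choice of definition matters a lot when the load is uneven across OSDs.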
If you only care about a ‘gross indicator’ for the Ceph cluster, you may wish to look at the `op_r` and `op_w` metrics.
Here is my query for the “Client’s read_iops” metric from our Grafana dashboard. If you want to use it, you need to set up template parameters for your dashboard (for example `$data_source`). I hope the names are self-explanatory:
SELECT sum("read_iops") FROM (SELECT non_negative_derivative(max("op_r"), 1s) AS "read_iops" FROM "$data_source"."ceph" WHERE $timeFilter AND collection='osd' GROUP BY time(60s), "id" fill(none)) WHERE $timeFilter GROUP BY time(60s)
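In case the nested query is hard to read, here is a rough Python sketch of what it computes. It assumes `op_r` is a cumulative per-OSD counter (which the derivative in the query implies) and that each OSD has one sample per 60s bucket. The sample values and function names are hypothetical, and this only approximates what InfluxDB does.

```python
from collections import defaultdict

def non_negative_rate(samples):
    """samples: list of (timestamp_seconds, counter_value), sorted by time.

    Returns (timestamp, per-second rate) pairs and skips negative rates,
    which is what happens when a counter resets.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        rate = (v1 - v0) / (t1 - t0)
        if rate >= 0:
            rates.append((t1, rate))
    return rates

def cluster_read_iops(per_osd_samples):
    """Inner step: rate per OSD. Outer step: sum over OSDs per timestamp."""
    totals = defaultdict(float)
    for samples in per_osd_samples.values():
        for t, rate in non_negative_rate(samples):
            totals[t] += rate
    return dict(sorted(totals.items()))

# Hypothetical op_r samples, one per 60s bucket, keyed by OSD id.
per_osd = {
    "0": [(0, 100), (60, 700), (120, 1300)],
    "1": [(0, 50), (60, 20), (120, 320)],  # counter went backwards at t=60
}

result = cluster_read_iops(per_osd)
print(result)
```

The inner per-OSD step matters: the counters have to be differentiated per `id` first and summed afterwards, otherwise a reset on a single OSD would distort the total for the whole cluster.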