Extracting useful duration metrics from HAProxy — Prometheus & Fluentd

Tom Fawcett
Mar 5, 2018

Background

One of HAProxy’s killer features is its rich metric output.

HAProxy outputs over 80 different metrics in CSV format, which can be parsed by numerous tools; my personal favourite is the Prometheus HAProxy Exporter. Once the metrics are in Prometheus it is easy to create monitoring for Rate and Error, but the third letter of RED, Duration, is somewhat absent.

HAProxy exposes the following duration metrics:

58. qtime [..BS]: the average queue time in ms over the 1024 last requests
59. ctime [..BS]: the average connect time in ms over the 1024 last requests
60. rtime [..BS]: the average response time in ms over the 1024 last requests (0 for TCP)
61. ttime [..BS]: the average total session time in ms over the 1024 last requests

On the surface these sound potentially useful, but in practice I have found them to be a poor representation of the system’s true duration performance. This shouldn’t be surprising: each one is an average over the last 1024 requests, which smooths away slow outliers and says nothing about how the durations are distributed.
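To make the problem with averages concrete, here is a small Python sketch. The numbers are invented for illustration (they are not measurements from HAProxy): a window of 1024 requests in which roughly 2% are very slow.

```python
import math

# Hypothetical window of 1024 request durations in ms:
# 1004 fast requests and 20 very slow ones.
durations_ms = [20] * 1004 + [5000] * 20

# What an "average over the last 1024 requests" would report.
average = sum(durations_ms) / len(durations_ms)


def percentile(values, q):
    """Nearest-rank percentile: the smallest value such that at least
    a fraction q of the samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


print(f"average: {average:.1f} ms")
print(f"p50:     {percentile(durations_ms, 0.50)} ms")
print(f"p99:     {percentile(durations_ms, 0.99)} ms")
```

The average lands a little above 117 ms, which describes neither the typical request (20 ms) nor the slow tail (5000 ms). Percentiles show both.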

Fortunately HAProxy has another killer feature — rich logging.

Metrics from logs

HAProxy logs via syslog. This provides a great deal of flexibility and potential for log shipping and processing. The HAProxy docs have plenty of detail about the field format and configuration, but the key field for this challenge is:

Ta: total active time for the HTTP request, between the moment the proxy received the first byte of the request header and the emission of the last byte of the response body.

Now that sounds useful.

A Prometheus histogram of Ta would complement the Rate and Error metrics extracted via HAProxy Exporter nicely. As I already happen to be processing these logs via Fluentd, I can use the Fluentd Prometheus plugin to create such a histogram.
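As a rough illustration of the idea (not the Fluentd pipeline itself), the Python sketch below pulls Ta out of a log line in HAProxy’s default HTTP log format, where the timers appear as TR/Tw/Tc/Tr/Ta, and counts it into cumulative histogram buckets. The sample line, the bucket bounds and the function names are all assumptions made for the example.

```python
import re

# Illustrative line in the default HTTP log format (option httplog).
SAMPLE_LINE = (
    'haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] http-in '
    'static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 {} {} '
    '"GET /index.html HTTP/1.1"'
)

# backend/server, then the five timers TR/Tw/Tc/Tr/Ta, then the status code.
# Timers can be -1, and Ta can carry a leading "+" when logging early.
LOG_PATTERN = re.compile(
    r" (?P<backend>[^/\s]+)/(?P<server>\S+)"
    r" -?\d+/-?\d+/-?\d+/-?\d+/\+?(?P<ta>-?\d+)"
    r" (?P<status>\d{3}) "
)

# Hypothetical bucket upper bounds in seconds; +Inf catches everything.
BUCKETS_S = (0.05, 0.1, 0.25, 0.5, 1.0, float("inf"))


def parse_line(line):
    """Return (backend, Ta in seconds), or None if the line does not match."""
    match = LOG_PATTERN.search(line)
    if match is None or int(match["ta"]) < 0:
        return None
    return match["backend"], int(match["ta"]) / 1000.0  # Ta is logged in ms


def observe(counts, seconds):
    """Prometheus-style cumulative buckets: increment every bucket whose
    upper bound is greater than or equal to the observed value."""
    for bound in BUCKETS_S:
        if seconds <= bound:
            counts[bound] += 1


counts = {bound: 0 for bound in BUCKETS_S}
backend, seconds = parse_line(SAMPLE_LINE)
observe(counts, seconds)
print(backend, seconds, counts)
```

Converting milliseconds to seconds follows the Prometheus convention of using base units; the bucket bounds would need tuning to the latencies of the service in question.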

Config samples

Now for some config samples showing how to create this histogram.

HAProxy config

(Showing only log config)

Fluentd config

With three plugins installed: fluent-plugin-rewrite-tag-filter, fluent-plugin-prometheus, and fluent-plugin-record-modifier.

Scraping Fluentd on port 24231 then shows:

Perfect!

Using the metrics

There are a number of good examples in the Prometheus histogram docs of histogram based queries, including an approximation of Apdex score. With that in mind I will include only one example:

That returns the 50th percentile (median) of request durations over the last 5 minutes for backend foobar.
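It helps to know how a percentile can be estimated from bucket counts at all. The Python sketch below approximates the linear interpolation that Prometheus’s histogram_quantile function performs within a bucket. It is a simplified model, not the Prometheus implementation, and the bucket data is invented.

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, ending with the +Inf bucket. Assumes at least one
    finite bucket and 0 < q <= 1.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return float("nan")

    rank = q * total
    lower_bound, lower_count = 0.0, 0  # the first bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # No upper edge to interpolate towards, so fall back to
                # the highest finite bound.
                return buckets[-2][0]
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


# Hypothetical cumulative counts for one backend.
example = [(0.05, 10), (0.1, 40), (0.25, 90), (0.5, 98), (float("inf"), 100)]
print(estimate_quantile(0.5, example))
```

With these counts the 50th request falls in the 0.1 to 0.25 second bucket, so the estimate comes out at roughly 0.13 seconds. The result is only as precise as the bucket boundaries, which is worth remembering when choosing them.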

Conclusion

So to summarise — Fluentd combined with its Prometheus plugin allows you to create Prometheus duration histograms out of HAProxy logs. From these you can easily calculate percentiles and means, allowing effective monitoring of duration performance.

I was only interested in a single HAProxy HTTP log field, Ta, but the same approach should be applicable to TCP logs and other duration fields. Similarly, you could configure additional labels to gain further insights; for example, labelling by status_code would allow a more accurate Apdex score.
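To illustrate why a status_code label would help, here is a hedged Python sketch of an Apdex calculation over cumulative buckets, following the approximation described in the Prometheus histogram docs. The data, the thresholds and the decision to treat 5xx responses as never satisfied are assumptions made for the example.

```python
INF = float("inf")


def apdex(buckets_by_status, target_s, tolerated_s):
    """Approximate Apdex from cumulative buckets keyed by status code.

    target_s and tolerated_s must be existing bucket bounds. The tolerated
    bucket is cumulative, so it already includes the satisfied requests;
    hence (satisfied + tolerated) / 2 / total.
    """
    satisfied = tolerated = total = 0
    for status, buckets in buckets_by_status.items():
        total += buckets[INF]
        if status.startswith("5"):
            continue  # assumption: a fast server error is not a satisfied user
        satisfied += buckets[target_s]
        tolerated += buckets[tolerated_s]
    return (satisfied + tolerated) / 2 / total


# Hypothetical cumulative counts per status code.
by_status = {
    "200": {0.25: 80, 1.0: 90, INF: 95},
    "503": {0.25: 5, 1.0: 5, INF: 5},
}
# The same data with the label dropped: errors are indistinguishable.
unlabelled = {"all": {0.25: 85, 1.0: 95, INF: 100}}

print(apdex(by_status, 0.25, 1.0))
print(apdex(unlabelled, 0.25, 1.0))
```

Without the label, the five fast 503 responses count as satisfied requests and the score is about 0.90; with it, the score drops to about 0.85.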

If Fluentd doesn’t take your fancy, you could have a go with mtail, though at the time of writing it is file orientated and lacks first-class support for histograms.