Extracting useful duration metrics from HAProxy — Prometheus & Fluentd

Tom Fawcett
Mar 5, 2018 · 3 min read

Background

One of HAProxy’s killer features is its rich metric output.

HAProxy outputs over 80 different metrics in a CSV format, which can be parsed by numerous tools. My personal favourite being Prometheus HAProxy Exporter. Once the metrics are in Prometheus it is easy to create monitoring for Rate and Error, but the third letter of RED — Duration — is somewhat absent.

HAProxy exposes the following duration metrics:

58. qtime [..BS]: the average queue time in ms over the 1024 last requests
59. ctime [..BS]: the average connect time in ms over the 1024 last requests
60. rtime [..BS]: the average response time in ms over the 1024 last requests (0 for TCP)
61. ttime [..BS]: the average total session time in ms over the 1024 last requests

On the surface these sound potentially useful, but in practice I have found them to be an incredibly vague representation of the true duration performance of the system. This shouldn’t be surprising considering they are all averages over the last 1024 requests.

Fortunately HAProxy has another killer feature — rich logging.

Metrics from logs

HAProxy logs via syslog. This provides a great deal of flexibility and potential for log shipping and processing. The HAProxy docs have plenty of detail about the field format and configuration, but the key field for this challenge is:

Ta: total active time for the HTTP request, between the moment the proxy received the first byte of the request header and the emission of the last byte of the response body.

Now that sounds useful.

A Prometheus histogram of Ta would complement the Rate and Error metrics extracted via HAProxy Exporter nicely. As I already happen to be processing these logs via Fluentd, I can use the Fluentd Prometheus plugin to create such a histogram.

Config samples

Now for some config samples showing how to create this histogram.

(Showing only log config)

global
log localhost:5140 format rfc5424 local0
...
defaults
mode http
log global
option httplog
...

With three plugins installed: fluent-plugin-rewrite-tag-filter, fluent-plugin-prometheus, and fluent-plugin-record-modifier.

# Input and parse HAProxy syslog.
<source>
@type syslog
tag haproxy
port 5140
bind 127.0.0.1
<parse>
@type regexp
# Matches HAProxy 1.8 rfc5424 HTTP logs, with a catch-all for other log types.
expression /^\d{1,2} (?<syslog_time>[^ ]+) (?<syslog_host>[^ ]+) (?<ps>\w+) (?<pid>\d+) ([^ ]+) ((\[(.*)\]|[^ ])) ((?<client_ip>[\w\.]+):(?<client_port>\d+) \[(?<req_date>.+)\] (?<frontend>[\w\~-]+) (?<backend>[\w-]+)\/(?<backend_server>[^ ]+) (?<tr_receive_time>[^\/]+)\/(?<tw_queue_time>[^\/]+)\/(?<tc_connect_time>[^\/]+)\/(?<tr_resp_time>[^\/]+)\/(?<ta_active_time>[^\/]+) (?<status_code>\d+) (?<bytes_read>\d+) (?<req_cookie>[^ ]+) (?<res_cookie>[^ ]+) (?<term_state>[\w-]+) (?<actconn>\d+)\/(?<feconn>\d+)\/(?<beconn>\d+)\/(?<srv_conn>\d+)\/(?<retries>\d+) (?<srv_queue>\d+)\/(?<backend_queue>\d+) \{?(?<req_headers>[^}]*)\}? ?\{?(?<res_headers>[^}]*)\}? ?"(?<request>[^"]*)"|(?<message>.+))/
types actconn:integer,backend:string,backend_queue:integer,beconn:integer,backend_server:string,bytes_read:integer,client_ip:string,client_port:integer,feconn:integer,frontend:string,message:string,pid:integer,ps:string,req_date:string,req_headers:string,req_cookie:string,request:string,res_headers:string,retries:string,res_cookie:string,srv_conn:integer,srv_queue:integer,status_code:integer,syslog_time:string,term_state:string,tc_connect_time:integer,tr_resp_time:integer,tr_receive_time:integer,ta_active_time:integer,tw_queue_time:integer
</parse>
</source>

# Tag our log types.
<match haproxy>
@type rewrite_tag_filter
<rule>
key frontend
pattern .
tag haproxy.http
</rule>
<rule>
key message
pattern .
tag haproxy.general
</rule>
</match>

# Setup the Prometheus plugin.
<source>
@type prometheus
bind 0.0.0.0
port 24231
metrics_path /metrics
</source>

# Following Prometheus best practice by converting to base unit.
<filter haproxy.http>
@type record_modifier
<record>
ta_active_time_seconds ${record["ta_active_time"].to_f / 1000}
</record>
</filter>

# Create the histogram.
<filter haproxy.http>
@type prometheus
<metric>
name http_request_duration_seconds
type histogram
desc A histogram of the request duration.
key ta_active_time_seconds
# Buckets chosen in mind of relevant SLO.
buckets 0.1, 0.5, 1, 5, 10, 60
<labels>
tag ${tag}
# Could also label by backend server if desired.
backend ${backend}
</labels>
</metric>
</filter>

Scraping Fluentd on port 24231 then shows:

# TYPE http_request_duration_seconds histogram
# HELP http_request_duration_seconds A histogram of the request duration.
http_request_duration_seconds_bucket{tag="haproxy.http",backend="foobar",le="0.1"} 362.0
http_request_duration_seconds_bucket{tag="haproxy.http",backend="foobar",le="0.5"} 362.0
http_request_duration_seconds_bucket{tag="haproxy.http",backend="foobar",le="1"} 812.0
http_request_duration_seconds_bucket{tag="haproxy.http",backend="foobar",le="5"} 950.0
http_request_duration_seconds_bucket{tag="haproxy.http",backend="foobar",le="10"} 986.0
http_request_duration_seconds_bucket{tag="haproxy.http",backend="foobar",le="60"} 986.0
http_request_duration_seconds_bucket{tag="haproxy.http",backend="foobar",le="+Inf"} 986.0
http_request_duration_seconds_sum{tag="haproxy.http",backend="foobar"} 1074.0
http_request_duration_seconds_count{tag="haproxy.http",backend="foobar"} 986.0

Perfect!

Using the metrics

There are a number of good examples in the Prometheus histogram docs of histogram based queries, including an approximation of Apdex score. With that in mind I will include only one example:

histogram_quantile(0.50,sum(rate(http_request_duration_seconds_bucket{job="fluentd",backend="foobar"}[5m])) by (le))

That returns the 50th percentile (median) of request durations over the last 5 minutes for backend foobar.

Conclusion

So to summarise — Fluentd combined with its Prometheus plugin allows you to create Prometheus duration histograms out of HAProxy logs. From these you can easily calculate percentiles and means, allowing effective monitoring of duration performance.

I was only interested in a single HAProxy HTTP log field — Ta, but the theory should be applicable to TCP logs and other duration fields. Similarly you could configure additional labels to gain further insights — e.g. labelling by status_code would allow a more accurate Apdex score.

If Fluentd doesn’t take your fancy, you could have a go with mtail; though that is file orientated and lacks first class support for histograms.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store