Load Balancers’ SRE Golden Signals

Part of the How to Monitor the SRE Golden Signals Series

Load Balancers are key components of most modern systems, usually in front of an application, but increasingly inside systems, too, supporting Containers, socket services, databases, and more.

There are several popular LBs in use, so we’ll cover the three most common:

  • HAProxy — Everyone’s Favorite non-cloud LB
  • AWS ELB — Elastic Load Balancer
  • AWS ALB — Application Load Balancer

For Nginx, see the Web Server Section.

Some General Load Balancer Issues

First, some general things to think about for Load Balancers, which are much more complex than web servers to monitor.

Combined or Separate Frontends

There are two ways to look at Load Balancer signals — from the frontend or the backend. And for the frontend, we may have several of them, for different parts of the site, for APIs, etc.

We are usually interested in the OVERALL signals for the LB, and thus all frontends, though we might also want to track each frontend separately if there are actually separate systems for Web, App, API, Search, etc. This means we’ll have separate signals for all of these.

Note some systems have separate Listener/Backends for HTTP and HTTPS, but they serve mostly the same URLs, so it usually makes sense to combine them into a single unified view if you can.

Monitoring Backend Servers in the LB vs. Web Servers

LBs usually have several backend servers, and we can get signals for EACH backend server, too. In a perfect world, we wouldn’t need this, as we could get better data directly from the backend web/app servers themselves.

However, as we’ll see in the Web and App Server sections, this is not really true, because web server monitoring sucks. This means it’s usually easier and better to get per-server signals from the LB than from the web servers themselves.

We’ve included notes on how to do this in the relevant sections below.

HAProxy

HAProxy is the most popular non-cloud LB: a powerful, flexible, and very high-performance tool in heavy use across the industry. HAProxy also has powerful logging methods and a nice UI, though there are some tricks to getting useful data from it. In some cases the data is so rich that we have to pick and choose what to get.

Caution — Single vs Multi-Process HAProxy

Most, if not all, versions of HAProxy report statistics for a single process, which is okay for 99% of use cases. Some very high-performance systems use multi-process mode, but this is hard to monitor, as the stats are pulled randomly from one process. This can be a mess, so avoid it if possible.

Caution — Complexity

All of the useful HAProxy stats are PER Listener, Backend, or Server, which is useful, but complicates getting a full picture. Simple websites or apps usually have a single (www.) listener and backend, but more complex systems usually have quite a few. It’s easy to gather hundreds of metrics and get confused.

You can decide whether to track each Signal per Listener/Frontend or sum them up to get a total view; this depends on how unified your system is. As noted above, you usually want to combine HTTP & HTTPS if they serve the same URLs. However, if you have separate Web, App, API, Search, etc., then you probably want to separate the signals, too.
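One way to express the combine-or-separate choice is a small mapping from frontend names to logical groups, with counters summed per group. The sketch below is only an illustration: the frontend names (`www_http`, `www_https`, `api_https`) and the group names are hypothetical, and the input is assumed to be a dict of counters you have already pulled from HAProxy.

```python
from collections import defaultdict

# Hypothetical mapping from HAProxy frontend names to logical systems.
# HTTP and HTTPS listeners that serve the same URLs share one group.
FRONTEND_GROUPS = {
    "www_http": "web",
    "www_https": "web",
    "api_https": "api",
}


def sum_by_group(counters, groups=FRONTEND_GROUPS):
    """Sum a per-frontend counter (e.g. req_tot) into per-group totals.

    `counters` maps frontend name -> counter value. Frontends missing
    from `groups` are reported under their own name, so nothing is
    silently dropped.
    """
    totals = defaultdict(int)
    for frontend, value in counters.items():
        totals[groups.get(frontend, frontend)] += value
    return dict(totals)


if __name__ == "__main__":
    sample = {"www_http": 120, "www_https": 880, "api_https": 300}
    print(sum_by_group(sample))
```

Keeping the mapping in one place makes it easy to change your mind later about which frontends belong together.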

HAProxy, how shall I monitor thee?

There are three ways to monitor HAProxy, all of which use the same format.

  • Pulling CSV data from the built-in web page
  • Using the CLI tool
  • Using the Unix Socket

See the HAProxy documentation for details on how to access each of these, as this will greatly depend on your monitoring system and tools.
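Since all three methods return the same CSV format, a single parser can serve whichever one you choose. The following is a minimal sketch, not a complete collector: the socket path is an assumption (use whatever your `stats socket` line in haproxy.cfg says), and `SAMPLE` is a heavily shortened, made-up excerpt, as real output has many more columns.

```python
import csv
import io
import socket


def parse_haproxy_csv(text):
    """Parse HAProxy CSV stats into a list of dicts, one per row.

    The header line begins with "# " and every line ends with a trailing
    comma, so both quirks are handled here.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    lines[0] = lines[0].lstrip("# ")
    rows = []
    for row in csv.DictReader(io.StringIO("\n".join(lines))):
        row.pop("", None)  # empty column created by the trailing comma
        rows.append(row)
    return rows


def fetch_stats_from_socket(path="/var/run/haproxy.sock"):
    """Send "show stat" to the HAProxy stats socket and return CSV text.

    The default path is an assumption, not an HAProxy default.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall(b"show stat\n")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # HAProxy closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


# Shortened, invented sample: one frontend, one server, one backend.
SAMPLE = (
    "# pxname,svname,qcur,scur,slim,\n"
    "www,FRONTEND,,12,2000,\n"
    "app,web1,0,3,,\n"
    "app,BACKEND,1,5,200,\n"
)

if __name__ == "__main__":
    for row in parse_haproxy_csv(SAMPLE):
        print(row["pxname"], row["svname"], row["scur"])
```

Each row is identified by `pxname` (the proxy) and `svname`, where `FRONTEND` and `BACKEND` mark the aggregate rows and any other value is an individual server.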

Mapping our signals to HAProxy, we see a bit of complexity:

  • Request Rate — Requests per second, which we can get two ways:

    1. Request Count REQ_TOT is best, since as a counter it won’t miss spikes, but you must do delta processing to get the rate. Not available per server, so instead use RATE for servers (though this is only over the last second).

    2. You can use REQ_RATE (req/sec), but this is only over the last second, so you can lose data spikes this way, especially if your monitoring only gets this every minute or less often.
  • Error Rate — Response Errors, ERESP, which is for errors coming from the backend. This is a counter, so you must delta it. But be careful as the docs say this includes “write error on the client socket”, thus it’s not clear what kind of client error (such as on mobile phones) this might include. You can also get this per backend server.

    For more detail and just HTTP errors, you can get the 4xx and 5xx response counts (hrsp_4xx and hrsp_5xx in the CSV), as these will be most sensitive to what users see. 4xx errors are usually not customer issues, but if they suddenly rise, it’s typically due to bad code or an attack of some type. Monitoring 5xx errors is critical for any system.

    You may also want to watch Request Errors, EREQ, though realize this includes Client Closes which can create a lot of noise, especially on slow or mobile networks. Front-end only.
  • Latency — Use the Response Time RTIME (per backend) which does an average over the last 1024 requests (so it will miss spikes in busy systems, and be noisy at startup). There is no counter data for these items. This is also available per server.
  • Saturation — Use the number of queued requests, QCUR. This is available for both the Backend (for requests not yet assigned to a server) and for each Server (not yet sent to the server).

    You probably want the sum of these for overall Saturation, and per server if you are tracking server saturation at this level (see web server section). If you use this per server, you can track each backend server’s saturation (though realize the server itself is probably queuing also, so any queue at the LB indicates a serious problem).
  • Utilization — HAProxy generally does not run out of capacity unless it truly runs out of CPU, but you can monitor actual Sessions SCUR / SLIM.
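The mapping above can be sketched as a function over two stats snapshots taken a known number of seconds apart. This is an illustration under several assumptions: rows are dicts keyed by the CSV column names (`svname`, `req_tot`, `eresp`, `rtime`, `qcur`, `scur`, `slim`), overall latency is taken as the worst backend's `rtime` (a choice made here, not something HAProxy prescribes), and a counter that goes backwards is treated as a restart and yields no rate.

```python
def to_int(value):
    """HAProxy leaves fields empty when they do not apply; treat as 0."""
    return int(value) if value not in ("", None) else 0


def counter_rate(prev, curr, seconds):
    """Per-second rate from two samples of a counter (delta processing).

    If the counter went backwards (HAProxy restarted or stats were
    cleared), the delta is unknown, so return None instead of guessing.
    """
    if seconds <= 0 or curr < prev:
        return None
    return (curr - prev) / seconds


def total(rows, svname, field):
    """Sum one field over all rows of a given kind (FRONTEND, BACKEND)."""
    return sum(to_int(r.get(field)) for r in rows if r["svname"] == svname)


def haproxy_signals(prev_rows, curr_rows, seconds):
    """Overall golden signals from two snapshots `seconds` apart."""
    frontends = [r for r in curr_rows if r["svname"] == "FRONTEND"]
    backends = [r for r in curr_rows if r["svname"] == "BACKEND"]
    scur = sum(to_int(r.get("scur")) for r in frontends)
    slim = sum(to_int(r.get("slim")) for r in frontends)
    return {
        # Request Rate: delta of the req_tot counter on the frontends.
        "request_rate": counter_rate(
            total(prev_rows, "FRONTEND", "req_tot"),
            total(curr_rows, "FRONTEND", "req_tot"),
            seconds,
        ),
        # Error Rate: delta of the eresp counter on the backends.
        "error_rate": counter_rate(
            total(prev_rows, "BACKEND", "eresp"),
            total(curr_rows, "BACKEND", "eresp"),
            seconds,
        ),
        # Latency: rtime is already an average, in milliseconds.
        "latency_ms": max((to_int(r.get("rtime")) for r in backends), default=0),
        # Saturation: queued requests on backends plus their servers.
        "saturation": sum(
            to_int(r.get("qcur")) for r in curr_rows if r["svname"] != "FRONTEND"
        ),
        # Utilization: current sessions against the session limit.
        "utilization": scur / slim if slim else None,
    }
```

Only the counters need two snapshots; latency, saturation, and utilization are read from the current snapshot alone.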

AWS ELB & ALB

The AWS ELB/ALB Load Balancers are extremely popular for any AWS-based systems. They started out with simple ELBs and have evolved into full-fledged and powerful balancers, especially with the introduction of the new ALBs.

Like most AWS services, metrics are extracted via a combination of CloudWatch and logs pushed to S3. The former is pretty easy to deal with, but dealing with S3-located logs is always a bit challenging, so we try to avoid those (in part because we can’t really do real-time processing or alerting on them).

Note the below are for HTTP, but the ELB & ALB have additional metrics for TCP-based connections that you can use in similar ways.

Details are available in the ELB CloudWatch Documentation.

Classic ELB

ELB metrics are available for the ELB as a whole, but not by backend group or server, unfortunately. Note if you only have one backend-server per AZ, then you could use the AZ Dimensional Filter.

Mapping our signals to the ELB, we get all of these from CloudWatch. Note that the sum(), average(), and max() wrappers around the metric names below are CloudWatch statistics.

  • Request Rate — Requests per second, which we get from the sum(RequestCount) metric divided by the length of the configured CloudWatch period in seconds (60 or 300 for the usual 1- or 5-minute periods). This will include errors.
  • Error Rate — You should add two metrics: sum(HTTPCode_Backend_5XX) and sum(HTTPCode_ELB_5XX), which capture server-generated and LB-generated errors (the latter is important for counting backend unavailability and rejections due to a full queue). You may also want to add sum(BackendConnectionErrors).
  • Latency — The average(Latency). Easy.
  • Saturation — The max(SurgeQueueLength) which gets Requests in the backend queues. Note this is focused solely on backend saturation, not on the LB itself which can get saturated on CPU (before it auto-scales), but there appears to be no way to monitor this.

    You can also monitor and alert on sum(SpilloverCount) which will be > 0 when the LB is saturated and rejecting requests because the Surge Queue is full. As with 5xx errors, this is a very serious situation.
  • Utilization — There is no good way to get utilization data on ELBs, as they auto-scale so their internal capacity is hard to get (though would be nice to get before they scale, such as when things surge).
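The ELB mapping above reduces to a little arithmetic once the CloudWatch values for one period are in hand. The sketch below assumes you have already fetched them (for example with boto3's `get_metric_statistics` against the `AWS/ELB` namespace) into a plain dict keyed by metric name; the `elb_signals` function and its output keys are illustrative names, not part of any AWS API.

```python
# Which CloudWatch statistic to request for each Classic ELB metric.
ELB_METRICS = {
    "RequestCount": "Sum",
    "HTTPCode_Backend_5XX": "Sum",
    "HTTPCode_ELB_5XX": "Sum",
    "Latency": "Average",
    "SurgeQueueLength": "Maximum",
    "SpilloverCount": "Sum",
}


def elb_signals(stats, period_seconds):
    """Turn one period of ELB statistics into golden signals.

    `stats` maps metric name -> value of the statistic listed in
    ELB_METRICS. A missing metric is treated as 0, on the assumption
    that CloudWatch returned no datapoint because nothing happened.
    """
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    errors = stats.get("HTTPCode_Backend_5XX", 0) + stats.get("HTTPCode_ELB_5XX", 0)
    return {
        # sum(RequestCount) divided by the period length in seconds.
        "request_rate": stats.get("RequestCount", 0) / period_seconds,
        # Backend plus LB-generated 5xx responses, per second.
        "error_rate": errors / period_seconds,
        # average(Latency), reported by CloudWatch in seconds.
        "latency_s": stats.get("Latency", 0),
        # max(SurgeQueueLength): requests waiting for a backend.
        "saturation": stats.get("SurgeQueueLength", 0),
        # Any spillover means requests were rejected: alert on this.
        "spilling": stats.get("SpilloverCount", 0) > 0,
    }


if __name__ == "__main__":
    one_minute = {
        "RequestCount": 6000,
        "HTTPCode_Backend_5XX": 12,
        "HTTPCode_ELB_5XX": 3,
        "Latency": 0.25,
        "SurgeQueueLength": 4,
    }
    print(elb_signals(one_minute, 60))
```

There is deliberately no utilization entry, since the ELB exposes nothing to compute it from.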

Caution for ELB Percentiling Persons

If you will do percentiles and statistics on these signals, be sure to read the cautions and issues in the “Statistics for Classic Load Balancer Metrics” section of the CloudWatch docs.

New ALB

The ALB data is very similar to the ELB, with more available data and a few differences in metric names.

ALB metrics are available for the ALB as a whole, and by Target Group (via Dimension Filtering), which is how you can get the data for a given set of backend servers instead of monitoring the Web/App servers directly. Per-server data is not available from the ALB (though you can filter by AZ, which would be per-server if you have only one target backend server per AZ).
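Dimension filtering comes down to which dimensions you attach to the CloudWatch request. The sketch below builds the keyword arguments for CloudWatch's GetMetricStatistics call, either for the whole ALB or for a single Target Group. The helper name `alb_query` and the identifier strings in the example are hypothetical placeholders; the real dimension values come from the ARNs of your own load balancer and target group.

```python
from datetime import datetime, timedelta, timezone


def alb_query(metric, statistic, load_balancer, target_group=None,
              period_seconds=60, minutes=5, now=None):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    With only `load_balancer` set, the metric covers the whole ALB.
    Adding `target_group` narrows it to one set of backend servers.
    """
    end = now or datetime.now(timezone.utc)
    dimensions = [{"Name": "LoadBalancer", "Value": load_balancer}]
    if target_group:
        dimensions.append({"Name": "TargetGroup", "Value": target_group})
    return {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": metric,
        "Dimensions": dimensions,
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": [statistic],
    }


if __name__ == "__main__":
    # Placeholder identifiers, not real resources.
    query = alb_query(
        "TargetResponseTime",
        "Average",
        load_balancer="app/my-alb/0123456789abcdef",
        target_group="targetgroup/my-targets/0123456789abcdef",
    )
    print(query["Dimensions"])
```

With boto3 installed and credentials configured, the result can be passed along as `boto3.client("cloudwatch").get_metric_statistics(**query)`.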

Mapping our signals to the ALB, we get all of these from CloudWatch. Note that the sum() and average() wrappers around the metric names below are CloudWatch statistics.

  • Request Rate — Requests per second, which we get from the sum(RequestCount) metric divided by the length of the configured CloudWatch period in seconds (60 or 300 for the usual 1- or 5-minute periods). This will include errors.
  • Error Rate — You should add two metrics: sum(HTTPCode_Target_5XX_Count) and sum(HTTPCode_ELB_5XX_Count), which capture server-generated and LB-generated errors (the latter is important for counting backend unavailability and rejected requests). You may also want to add sum(TargetConnectionErrorCount).
  • Latency — The average(TargetResponseTime). Easy.
  • Saturation — There appears to be no way to get any queue data from the ALB, so we are left with sum(RejectedConnectionCount) which counts rejects when the ALB reached its max connection count.
  • Utilization — There is no good way to get utilization data on ALBs, as they auto-scale so their internal capacity is hard to get (though it would be nice to get before they scale, such as when things surge). Note you can monitor sum(ActiveConnectionCount) vs. the maximum connection count, which you must get manually or from AWS Config.

Caution for ALB Percentiling Persons

If you will do percentiles and statistics on these signals, be sure to read the cautions and issues in the “Statistics for Application Load Balancer Metrics” section of the CloudWatch docs.

Next Service: Web Servers

Written by

CEO of ChinaNetCloud & Siglos.io — Global Entrepreneur in Shanghai & Silicon Valley