Load Balancers’ SRE Golden Signals

Steve Mushero
Nov 10, 2017

Load Balancers are key components of most modern systems, usually in front of an application, but increasingly inside systems, too, supporting Containers, socket services, databases, and more.

There are several popular LBs in use, so we’ll cover the three most common: HAProxy, AWS’s Classic ELB, and the newer ALB.

For Nginx, see the Web Server Section.

Some General Load Balancer Issues

First, some general things to think about for Load Balancers, which are much more complex to monitor than web servers.

Combined or Separate Frontends

There are two ways to look at Load Balancer signals — from the frontend or the backend. And for the frontend, we may have several of them, for different parts of the site, for APIs, etc.

We are usually interested in the OVERALL signals for the LB, and thus all frontends, though we might also want to track each frontend separately if there are actually separate systems for Web, App, API, Search, etc. This means we’ll have separate signals for all of these.

Note that some systems have separate Listeners/Backends for HTTP and HTTPS, but they serve mostly the same URLs, so it usually makes sense to combine them into a single unified view if you can.

Monitoring Backend Servers in the LB vs. Web Servers

LBs usually have several backend servers, and we can get signals for EACH backend server, too. In a perfect world, we wouldn’t need this, as we could get better data directly from the backend web/app servers themselves.

However, as we’ll see in the Web and App Server sections, this is not really true, because web server monitoring sucks. This means it’s usually easier and better to get per-backend-server signals from the LB rather than from the web servers themselves.

We’ve included notes on how to do this in the relevant sections below.

HAProxy

HAProxy is the most popular non-cloud LB: a powerful, flexible, and very high-performance tool in heavy use everywhere. HAProxy also has powerful logging methods and a nice UI, though there are some tricks to getting useful data from it — in some cases the data is so rich that we have to pick and choose what to get.

Caution — Single vs Multi-Process HAProxy

Most, if not all, versions of HAProxy report statistics for a single process, which is okay for 99% of use cases. Some very high-performance systems use multi-process mode, but this is hard to monitor, as the stats you get come from whichever single process happens to answer. This can be a mess, so avoid it if possible.

Caution — Complexity

All of the useful HAProxy stats are PER Listener, Backend, or Server, which is powerful but complicates getting a full picture. Simple websites or apps usually have a single (www.) listener and backend, but more complex systems usually have quite a few. It’s easy to gather hundreds of metrics and get confused.

You can decide if you want to track the Signal per Listener/Frontend or sum them up to get a total view — this depends on how unified your system is. As noted above, you usually want to combine HTTP & HTTPS if they serve the same URLs. However, if you have separate Web, App, API, Search, etc., then you probably want to separate the signals, too.

HAProxy, how shall I monitor thee?

There are three ways to monitor HAProxy: the built-in status page, its CSV export, and the stats socket, all of which expose the same data in the same format.

See the HAProxy documentation for details on how to access each of these, as this will greatly depend on your monitoring system and tools.

Mapping our signals to HAProxy involves a bit of complexity, as the relevant counters are spread across the Frontend, Backend, and per-Server stats.
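
As a rough illustration, here is a minimal sketch of reading the stats socket and reducing it to the four signals. The socket path, the specific CSV fields used (req_rate, hrsp_5xx, rtime, qcur), and the choice to sum across all frontends and backends are our assumptions; adapt them to your own layout and to whether you want combined or per-listener signals.

    # A minimal sketch: read HAProxy's stats socket ("show stat" returns the same
    # CSV as the web UI's ;csv export) and map a few fields to the Golden Signals.
    # The socket path, field choices, and frontend/backend summing are assumptions.
    import csv
    import socket

    HAPROXY_SOCKET = "/var/run/haproxy.sock"  # assumed path; match your haproxy.cfg

    def show_stat(path=HAPROXY_SOCKET):
        """Return HAProxy stats as a list of dicts, one row per frontend/backend/server."""
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
            s.connect(path)
            s.sendall(b"show stat\n")
            raw = b""
            while True:
                chunk = s.recv(4096)
                if not chunk:
                    break
                raw += chunk
        lines = raw.decode().lstrip("# ").splitlines()
        return list(csv.DictReader(lines))

    def golden_signals(rows):
        """One possible mapping of HAProxy CSV fields to the four signals."""
        signals = {"request_rate": 0, "errors_5xx": 0, "latency_ms": [], "queued": 0}
        for r in rows:
            if r["svname"] == "FRONTEND":
                # Rate: HTTP requests/sec seen by each frontend, summed here.
                signals["request_rate"] += int(r.get("req_rate") or 0)
            elif r["svname"] == "BACKEND":
                # Errors: 5xx responses (a cumulative counter; diff between scrapes for a rate).
                signals["errors_5xx"] += int(r.get("hrsp_5xx") or 0)
                # Latency: average response time in ms over the last 1024 requests.
                if r.get("rtime"):
                    signals["latency_ms"].append(int(r["rtime"]))
                # Saturation: requests currently queued waiting for a server.
                signals["queued"] += int(r.get("qcur") or 0)
        return signals

    if __name__ == "__main__":
        print(golden_signals(show_stat()))

The same CSV also has one row per backend Server (svname set to the server name), which is what gives you the per-backend-server view discussed earlier.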

AWS ELB & ALB

The AWS ELB/ALB Load Balancers are extremely popular for AWS-based systems. AWS started out with simple ELBs and has evolved them into full-fledged and powerful balancers, especially with the introduction of the new ALBs.

Like most AWS services, metrics are extracted via a combination of CloudWatch and logs pushed to S3. The former is pretty easy to deal with, but S3-based logs are always a bit challenging, so we try to avoid them (in part because we can’t do real-time processing or alerting on them).

Note that the metrics below are for HTTP; the ELB & ALB have additional metrics for TCP-based connections that you can use in similar ways.

Details are available in the ELB CloudWatch Documentation.

Classic ELB

ELB metrics are available for the ELB as a whole, but unfortunately not by backend group or server. Note that if you have only one backend server per AZ, you can use the AZ dimension filter to get per-server data.

Mapping our signals to the ELB, we get all of these from CloudWatch. Note that the sum() parts of the metrics refer to the CloudWatch statistical functions.
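
As a concrete sketch, here is one way to pull a reasonable set of Classic ELB signals with boto3; the load balancer name, region, and statistic choices are our assumptions, not the only valid ones.

    # A minimal boto3 sketch of pulling Classic ELB Golden Signals from CloudWatch.
    # The ELB name, region, and metric/statistic choices below are assumptions.
    from datetime import datetime, timedelta, timezone

    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=15)

    def elb_stat(metric, stat, elb_name="my-elb"):
        """Fetch one AWS/ELB metric for a Classic ELB in 5-minute periods."""
        resp = cw.get_metric_statistics(
            Namespace="AWS/ELB",
            MetricName=metric,
            Dimensions=[{"Name": "LoadBalancerName", "Value": elb_name}],
            StartTime=start,
            EndTime=end,
            Period=300,
            Statistics=[stat],
        )
        points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
        return [(p["Timestamp"], p[stat]) for p in points]

    # Rate, Errors, Latency, Saturation -- one reasonable set of choices:
    print("Requests:", elb_stat("RequestCount", "Sum"))
    print("Backend 5xx:", elb_stat("HTTPCode_Backend_5XX", "Sum"))
    print("Latency (avg, sec):", elb_stat("Latency", "Average"))
    print("Surge queue:", elb_stat("SurgeQueueLength", "Maximum"))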

Caution for ELB Percentiling Persons

If you will do percentiles and statistics on these signals, be sure to read the cautions and issues in the “Statistics for Classic Load Balancer Metrics” section of the CloudWatch docs.

New ALB

The ALB data is very similar to the ELB’s, with more metrics available and a few differences in metric names.

ALB metrics are available for the ALB as a whole, and by Target Group (via Dimension Filtering), which is how you can get the data for a given set of backend servers instead of monitoring the Web/App servers directly. Per-server data is not available from the ALB (though you can filter by AZ, which would be per-server if you have only one target backend server per AZ).

Mapping our signals to the ALB, we get all of these from CloudWatch. Note that the sum() parts of the metrics refer to the CloudWatch statistical functions.
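
A similar sketch for the ALB, filtered down to one Target Group to get the per-backend-group view; the LoadBalancer and TargetGroup dimension values are placeholders in AWS’s app/… and targetgroup/… formats, and the metric and statistic choices are again our assumptions.

    # A minimal boto3 sketch of pulling ALB signals for a single Target Group.
    # The dimension values below are placeholders; the metric choices are assumptions.
    from datetime import datetime, timedelta, timezone

    import boto3

    cw = boto3.client("cloudwatch", region_name="us-east-1")  # assumed region
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=15)

    DIMENSIONS = [
        {"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"},
        {"Name": "TargetGroup", "Value": "targetgroup/my-targets/abcdef1234567890"},
    ]

    def alb_stat(metric, stat):
        """Fetch one AWS/ApplicationELB metric, scoped to the Target Group above."""
        kwargs = dict(
            Namespace="AWS/ApplicationELB",
            MetricName=metric,
            Dimensions=DIMENSIONS,
            StartTime=start,
            EndTime=end,
            Period=300,
        )
        if stat.startswith("p"):              # percentile, e.g. "p99"
            kwargs["ExtendedStatistics"] = [stat]
        else:
            kwargs["Statistics"] = [stat]
        return cw.get_metric_statistics(**kwargs)["Datapoints"]

    print("Requests:", alb_stat("RequestCount", "Sum"))
    print("Target 5xx:", alb_stat("HTTPCode_Target_5XX_Count", "Sum"))
    print("Latency p99 (sec):", alb_stat("TargetResponseTime", "p99"))
    print("Rejected connections:", alb_stat("RejectedConnectionCount", "Sum"))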

Caution for ALB Percentiling Persons

If you will do percentiles and statistics on these signals, be sure to read the cautions and issues in the “Statistics for Application Load Balancer Metrics” section of the CloudWatch docs.

Next Service: Web Servers
