Web Servers’ SRE Golden Signals

Part of the How to Monitor the SRE Golden Signals Series

Where would we be without web servers? Probably with no web …

These days, the lowly web server has two basic functions: serving simple static content like HTML, JS, CSS, images, etc., and passing dynamic requests to an app backend written in PHP, Java, Perl, C, COBOL, etc.

Thus the web server frontends most of our critical services, including dynamic web content, APIs, and all manner of specialized services, even databases, as everything seems to speak HTTP these days.

So it’s critical to get good signals from the web servers.

Unfortunately, web servers neither measure nor report this data very well, and not at all in aggregate (at least for free). Thus we are left with three choices:

  1. Use the very limited built-in status reports/pages
  2. Collect & aggregate the web server’s HTTP logs for this important data
  3. Utilize the upstream Load Balancer per-server metrics, if we can

The last choice, to use the LB’s per-backend server metrics, is usually the best way, and there are details in the above Load Balancer section on how to do it.

However, not all systems have the right LB type, and not all monitoring systems support getting this type of backend data. For example, this is quite hard in Zabbix, as it struggles to get data about one host (the web server) from another host (the LB). The AWS ELB/ALB also provide only limited data.

So if we must go to the web server status pages & HTTP logs, the following are a few painful but worthwhile ways to do it.

Preparing to Gather Web Server Metrics

There are a few things we need to prepare to get the logs & status info:

Enable Status Monitoring

You need to enable status monitoring. For both Apache and Nginx, the best practice is to serve it on a different port (e.g. :81) or lock it down to your local network, monitoring systems, etc. so bad guys can’t access it.

  • Apache: Enable mod_status, including ExtendedStatus. See DataDog’s nice Apache guide.
  • Nginx: Enable the stub_status module (ngx_http_stub_status_module). See DataDog’s nice Nginx guide.
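
Once the Nginx status page is enabled, its plain-text body is easy to turn into numbers. The following is a minimal sketch, assuming the usual four-line stub_status output; the function name `parse_stub_status` and the dictionary keys are made up for illustration, and fetching the page (URL, port, access rules) is left to your own setup.

```python
import re


def parse_stub_status(text: str) -> dict:
    """Parse the plain-text body of an Nginx stub_status page into integers."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    # The three lifetime counters sit alone on one line: accepts handled requests
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unexpected stub_status format")
    accepts, handled, requests = (int(n) for n in counters.groups())
    reading, writing, waiting = (int(n) for n in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,  # lifetime counter; needs delta processing
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


# Example body, in the shape stub_status normally returns
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""

if __name__ == "__main__":
    print(parse_stub_status(SAMPLE))
```

The `requests` value is a counter since the server started, so a monitoring agent would sample it periodically and work with the difference between samples.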

Enable Logging

We need the logs, which means you need to configure them, put them in a nice place, and ideally separate them by vhost.

We need to get response time into the logs, which unfortunately is not part of any standard or default format. Edit your web configs to get this:

  • Apache: Add the “%D” field to the log definition (usually at the end), which records the response time in microseconds (use %T on Apache V1.x, but it only records seconds).
     
    This is the basic equivalent of the Nginx $request_time, below. Note there is an unsupported mod-log-firstbyte module that gets closer to the Nginx $upstream_response_time, which really measures the backend response time. See Apache Log docs.
  • Nginx: Add the “$upstream_response_time” field for backend response time, usually at the end of the log line. Using this “backend time” avoids elongated times for systems that send large responses to slower clients.
  • Nginx also supports a “$request_time” field in the log definition. This time runs until the last byte is sent to the client, so it includes the effect of large responses and/or slow clients. Both Nginx times are logged in seconds with millisecond resolution.
     
    This can be a more accurate picture of the user experience (not always), but may be too noisy if most of your issues are inside the system vs. the client. See Nginx Log docs.

Note that many people also add the X-Forwarded-For header to the log. If that or other fields are present, you may need to adjust your parsing tools to get the right field.
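
To make the parsing concern concrete, here is a minimal sketch of reading one access log line, assuming a combined-style format with the response time added as the very last field. The names `Hit`, `LINE_RE`, and `parse_line` are hypothetical, and the `unit_s` parameter is an assumption for converting the logged value to seconds (1e-6 for Apache %D, 1.0 for the Nginx times). If you append X-Forwarded-For or other fields after the time, the "last field" logic has to change.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    when: datetime
    status: int
    latency_s: Optional[float]


# Matches: [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 ...
# Sketch only: does not handle escaped quotes inside the request string.
LINE_RE = re.compile(r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) ')


def parse_line(line: str, unit_s: float = 1e-6) -> Optional[Hit]:
    """Return timestamp, status, and latency in seconds, or None if unparseable."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    last_field = line.rsplit(None, 1)[-1]  # assumed to be the response time
    try:
        latency = float(last_field) * unit_s
    except ValueError:
        latency = None  # e.g. "-" when Nginx answered without an upstream
    when = datetime.strptime(m["time"], "%d/%b/%Y:%H:%M:%S %z")
    return Hit(when, int(m["status"]), latency)


# Example Apache line with %D (microseconds) appended at the end
SAMPLE_LINE = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0" 1543'
)

if __name__ == "__main__":
    print(parse_line(SAMPLE_LINE))
```

For an Nginx log with $request_time or $upstream_response_time at the end, the same function would be called with `unit_s=1.0`.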

Log Processing for Metrics

As you’ll see below, the most important web server signals, especially Latency, can only be obtained from the logs, which are hard to read, parse, and summarize. But do so we must.

There are quite a number of tools to read HTTP logs, though most of these focus on generating website data, user traffic, URL analyses, etc. Our focus is different: getting the Golden Signals. So first, we need a set of tools that can reliably read, parse, and summarize the logs in a way we can use in our monitoring systems.

You may very well have your favorite tools in this area, such as the increasingly common ELK stack, which can do this with a bit of work (as can Splunk, Sumologic, Logs, etc.). In addition, most SaaS monitoring tools such as DataDog can extract these metrics via their agents or supported 3rd-party tools.

For more traditional monitoring systems such as Zabbix, this is more challenging, as they are neither good native log readers nor good log processors. Plus, we are not very interested in the log lines themselves. Instead, we need aggregations of latency and error rate across time in the logs, which runs counter to a lot of other ‘logging’ goals.

So if your monitoring system natively supports web server log metrics, you are all set; see below. If not, there may be 3rd party or open source tools that do this for your system, in which case you are also set.

If you don’t have an existing monitoring system or a 3rd party / open source tool for this, you are somewhat out of luck, as we can’t find an out-of-the-box solution, especially one that delivers this data in the 1 or 5 minute blocks most useful to us.

The reality is that log parsing and aggregation is harder than it appears, and there are very few tools that can do this on the server to feed agent-based monitoring. It seems the GoAccess tool can do some of this, with CSV output you can parse. Otherwise, there are lots of good awk and Perl scripts around, but few that support a rolling time-window or even log rollover.
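
To show the kind of aggregation meant here, the following is a minimal batch sketch that groups already-parsed requests into fixed time windows and reports request rate, 5xx rate, and median latency per window. The function name `summarize` and the input shape (epoch seconds, status, latency in seconds or None) are assumptions for illustration. It deliberately does not solve the hard parts mentioned above, such as a rolling window or log rollover.

```python
from collections import defaultdict
from statistics import median


def summarize(hits, window_s=60):
    """Aggregate (epoch_seconds, status, latency_s_or_None) tuples per window.

    Returns {window_start_epoch: {...}} with per-second rates and the median
    latency of the requests that carried a latency value.
    """
    buckets = defaultdict(lambda: {"requests": 0, "errors_5xx": 0, "latencies": []})
    for ts, status, latency in hits:
        window_start = int(ts // window_s) * window_s
        bucket = buckets[window_start]
        bucket["requests"] += 1
        if 500 <= status <= 599:
            bucket["errors_5xx"] += 1
        if latency is not None:
            bucket["latencies"].append(latency)

    summary = {}
    for window_start, bucket in sorted(buckets.items()):
        latencies = bucket["latencies"]
        summary[window_start] = {
            "requests_per_s": bucket["requests"] / window_s,
            "errors_5xx_per_s": bucket["errors_5xx"] / window_s,
            "median_latency_s": median(latencies) if latencies else None,
        }
    return summary


if __name__ == "__main__":
    # Made-up sample data: three requests in the first minute, one in the second
    sample = [(0, 200, 0.1), (10, 500, 0.3), (59, 200, 0.2), (60, 503, None)]
    for start, stats in summarize(sample).items():
        print(start, stats)
```

A 5 minute block would use `window_s=300`. The median is used rather than the mean because a handful of very slow requests would otherwise dominate the number.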

You need to find a system, tool, or service to extract metrics from web logs.

Mapping our signals to Web Servers, we have:

  • Request Rate: Requests per second. You can get this the hard way, by reading the access logs, counting lines to get the total requests, and taking the delta to get requests per second. Or the easy way, by using the server’s status info:
     
    Apache: Use Total Accesses, a counter of the total number of requests over the process lifetime. Do delta processing to get requests/sec. Do NOT use “Requests per Second”, which is averaged over the life of the server and thus useless.
     
    Nginx: Use Requests, a counter of the total number of requests. Do delta processing to get requests/sec.
  • Error Rate: This has to come from the logs, where your tools should count the 5xx errors per second to get a rate. You can also count 4xx errors, but they can be noisy and usually don’t cause user-facing errors. You really need to be sensitive to 5xx errors, which directly affect the user and should be zero all the time.
  • Latency: This has to come from the logs, where your tools should aggregate the request or response time you added to the logs, above. Generally you want the average (or better yet, the median) of the response times over your sampling period, such as 1 or 5 minutes.
     
    As mentioned above, you need the right tools for this, as there is no useful script or tool to do this in a generic way. You usually have to send the logs to an ELK-like service (ELK, Sumo, Logs, etc.), monitoring system (DataDog, etc.), or APM system like New Relic, App Dynamics, or DynaTrace.
  • Saturation: This is a challenging area that differs by web server:
     
    Nginx: It’s nearly impossible to saturate the Nginx server itself, as long as your workers & max connections are set high enough (default is 1x1K, you should usually set much higher, we use 4x2048K).
     
    Apache: Most people run in pre-fork mode, so there is one Apache process per connection, with a practical limit of 500 or so. Thus with a slow backend, it’s very easy to saturate Apache itself. Thus:
     
    Monitor BusyWorkers vs. the smallest of your configured MaxRequestWorkers/MaxClients/ServerLimit. When Busy = Max, this server is saturated and cannot serve new connections right away (they will be queued).
     
    You can also count HTTP 503 errors from the logs, which usually happen when the backend App Server is overloaded, though ideally you can get that data directly from the App Server.
     
    For many Apache systems, it’s critical to measure memory as a resource, because running out of RAM is by far the easiest way to kill an Apache-based system, especially if you are running mod_php or another mod-based app server.
  • Utilization: For Nginx this is rarely relevant, but for Apache it’s the same as for Saturation (ratio of BusyWorkers vs. the smallest of configured MaxRequestWorkers/MaxClients/ServerLimit).
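
The status-page parts of this mapping can be sketched as follows, assuming Apache’s machine-readable status output (the mod_status page requested with ?auto), which returns “Key: value” lines such as Total Accesses and BusyWorkers. The function names are hypothetical, and `max_workers` is a number you supply from your own config (the smallest of MaxRequestWorkers/MaxClients/ServerLimit). The same delta logic would apply to the Nginx requests counter.

```python
def parse_auto_status(text: str) -> dict:
    """Parse 'Key: value' lines from Apache's machine-readable status output."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def requests_per_second(prev_total, prev_time, cur_total, cur_time):
    """Delta processing on a lifetime request counter.

    Returns None when no time has passed or the counter went backwards,
    which typically means the server restarted between samples.
    """
    elapsed = cur_time - prev_time
    if elapsed <= 0 or cur_total < prev_total:
        return None
    return (cur_total - prev_total) / elapsed


def worker_utilization(fields: dict, max_workers: int) -> float:
    """BusyWorkers as a fraction of the configured worker limit."""
    return int(fields["BusyWorkers"]) / max_workers


# Two made-up samples taken 60 seconds apart
SAMPLE_T0 = "Total Accesses: 1000\nBusyWorkers: 50\nIdleWorkers: 150\n"
SAMPLE_T1 = "Total Accesses: 1600\nBusyWorkers: 120\nIdleWorkers: 80\n"

if __name__ == "__main__":
    before = parse_auto_status(SAMPLE_T0)
    after = parse_auto_status(SAMPLE_T1)
    rate = requests_per_second(
        int(before["Total Accesses"]), 0, int(after["Total Accesses"]), 60
    )
    print("requests/sec:", rate)
    print("utilization:", worker_utilization(after, max_workers=200))
```

A utilization close to 1.0 is the Busy = Max condition described above, so an alert threshold somewhat below that gives you warning before connections start to queue.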

It’d be great to see someone write an Apache module or Nginx Lua/C module to report these signals in the same way that Load Balancers do. This doesn’t seem that hard and the community would love them for it.

Overall, useful web server monitoring of these signals is not easy, but you should find a way to do it, ideally upstream via your LBs, or using the methods outlined above.

Next Service: App Servers