Keeping an Eye on Your Systems

Aidan Feldman
Published in NYC Planning Tech
5 min read · Nov 20, 2018

For any team that’s deploying things on the web, there are a handful of things you want to monitor to ensure your applications remain healthy.

Uptime

If there’s a single monitor to put in place for web services, it’s uptime. Uptime monitors make requests to one or more provided URLs on a schedule and keep track of whether each request succeeds and how long the response takes. The examples below are from the Planning Labs Vector Tile Server status page.

Breakdown of how many requests took different amounts of time

A certain number of consecutive failed requests constitutes “downtime.” The overall “uptime” of the service is the fraction of a given period during which requests to it were successful.

Overall uptime report
Uptime by day
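
To make the downtime definition concrete, here is a minimal sketch in Python. It assumes each check is recorded as a simple pass/fail value and that three consecutive failures count as downtime; both the function names and the threshold are illustrative, not what any particular monitoring service does.

```python
from typing import List


def count_downtime_checks(results: List[bool], threshold: int = 3) -> int:
    """Count checks that fall inside a downtime period.

    Downtime starts once `threshold` consecutive checks have failed, and it
    includes the failures that led up to that point. Shorter blips are ignored.
    """
    down = 0
    streak = 0
    for ok in results:
        if ok:
            streak = 0
            continue
        streak += 1
        if streak == threshold:
            down += threshold  # retroactively count the whole streak
        elif streak > threshold:
            down += 1
    return down


def uptime_fraction(results: List[bool], threshold: int = 3) -> float:
    """Fraction of checks in the period that were not part of downtime."""
    if not results:
        return 1.0
    return 1 - count_downtime_checks(results, threshold) / len(results)


# One isolated failure (ignored), then four failures in a row (downtime).
checks = [True, True, False, True, False, False, False, False, True, True]
print(count_downtime_checks(checks))      # 4
print(f"{uptime_fraction(checks):.0%}")   # 60%
```

Requiring several failures in a row is one way to keep a single slow or dropped request from being reported as an outage.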

Basic monitors check things like “does the homepage return OK,” and maybe that the page contains certain text. Uptime monitors can also make more in-depth requests, like checking that a search query returns a certain record, or that an API call returns a minimum number of records. This tests not only that the page is rendering, but that the database and everything else behind it are intact.
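
As a rough illustration of an in-depth check, the sketch below validates the body of an API response rather than just its status code. It assumes the API returns JSON with its results in a top-level list; the `records` key and the `check_min_records` name are hypothetical and would need to match your own API.

```python
import json
from typing import Tuple


def check_min_records(body: str, min_records: int, key: str = "records") -> Tuple[bool, str]:
    """Return (ok, reason) for a JSON response body.

    `key` is a hypothetical field name holding the list of results.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False, "response is not valid JSON"
    if not isinstance(payload, dict):
        return False, "response is not a JSON object"
    records = payload.get(key)
    if not isinstance(records, list):
        return False, f"response has no '{key}' list"
    if len(records) < min_records:
        return False, f"expected at least {min_records} records, got {len(records)}"
    return True, "ok"


print(check_min_records('{"records": [{"id": 1}, {"id": 2}]}', min_records=2))
print(check_min_records('{"records": []}', min_records=1))
print(check_min_records("<html>Bad gateway</html>", min_records=1))
```

A check like this fails when the page renders but the database behind it is empty or unreachable, which a plain “does it return OK” monitor would miss.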

With alerts for uptime monitors configured, the team gets notified as soon as a service is down, potentially before its users notice. Uptime monitors also help the team providing a service understand how well they’re keeping it available. Services that need to be “highly available” usually have a minimum uptime they are required to meet.
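
To get a feel for what a minimum uptime means in practice, it helps to turn the percentage into minutes. The arithmetic below is a small sketch; the 99.9% and 99% targets and the 30-day period are illustrative assumptions, not requirements taken from any Planning Labs service.

```python
def allowed_downtime_minutes(uptime_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a period while still meeting the target."""
    total_minutes = days * 24 * 60
    return (100 - uptime_percent) / 100 * total_minutes


print(f"{allowed_downtime_minutes(99.9):.1f}")  # 43.2
print(f"{allowed_downtime_minutes(99.0):.1f}")  # 432.0
```

So a 99.9% target over 30 days leaves a little over 43 minutes of total downtime.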

See the uptime for all Planning Labs services on the status dashboard.

Analytics

Analytics can be useful for understanding user behavior — see the previous Planning Labs post for more information. From a site reliability perspective, analytics can also help warn if (part of) an application isn’t working. For example, if there’s a user flow that involves filling out a form, site maintainers can track the number of users that reach various stages. With enough volume of traffic, seeing a sudden dip in completion rates would indicate that something changed for the worse, and possibly that the form is broken.

See how to set up custom alerts in Google Analytics, or whatever tool you’re using.
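
If your analytics tool lets you export daily counts, the same idea can be sketched in a few lines of Python. Everything here is an assumption for illustration: the input is a list of (started, completed) counts per day, a “sudden dip” means today’s completion rate is less than half the recent baseline, and days with fewer than 100 form starts are skipped as too noisy.

```python
from typing import List, Optional, Tuple

# (users who started the form, users who completed it) for one day
DayCounts = Tuple[int, int]


def completion_rate(days: List[DayCounts]) -> Optional[float]:
    """Overall completion rate across the given days, or None with no traffic."""
    started = sum(s for s, _ in days)
    completed = sum(c for _, c in days)
    return completed / started if started else None


def looks_broken(
    history: List[DayCounts],
    today: DayCounts,
    max_relative_drop: float = 0.5,
    min_started: int = 100,
) -> bool:
    """True when today's completion rate dips far below the recent baseline."""
    baseline = completion_rate(history)
    current = completion_rate([today])
    if baseline is None or current is None or today[0] < min_started:
        return False  # not enough traffic to say anything
    return current < baseline * (1 - max_relative_drop)


history = [(400, 240), (380, 228), (420, 252)]  # baseline rate: 60%
print(looks_broken(history, (410, 82)))   # True: 20% is under half of 60%
print(looks_broken(history, (410, 230)))  # False: roughly in line with baseline
print(looks_broken(history, (20, 0)))     # False: too little traffic to judge
```

The volume check reflects the caveat above: a dip only means something when there is enough traffic behind it.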

Errors

Error (a.k.a. exception) monitoring is a particularly useful type of monitoring, because it alerts the team to bugs out in the real world. Error-tracking services capture the stack trace and other contextual information (like the URL) to give the developers an idea of what might have triggered the error, so that they can hopefully reproduce and fix it. Error monitoring can be set up on the backend and/or the frontend, though the latter is harder.
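
Here is a minimal sketch of what an error tracker does on the backend, using only the Python standard library. The `build_error_report` and `handle_request` names, the report fields, and the in-memory `sink` list are hypothetical; a real error-tracking service would send the report to its own API through its own client library.

```python
import traceback
from datetime import datetime, timezone
from typing import Callable, List


def build_error_report(exc: BaseException, context: dict) -> dict:
    """Bundle the stack trace with contextual information about the request."""
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        "stack_trace": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
        "context": context,
        "captured_at": datetime.now(timezone.utc).isoformat(),
    }


def handle_request(url: str, handler: Callable[[str], str], sink: List[dict]) -> str:
    """Run a request handler, recording any exception before re-raising it."""
    try:
        return handler(url)
    except Exception as exc:
        sink.append(build_error_report(exc, {"url": url}))
        raise  # the error still surfaces; it is just recorded first


def buggy_handler(url: str) -> str:
    return str(1 / 0)  # simulated bug


reports: List[dict] = []
try:
    handle_request("/search?q=parks", buggy_handler, reports)
except ZeroDivisionError:
    pass

print(reports[0]["type"], reports[0]["context"])
```

The URL stored alongside the trace is the kind of context that helps a developer reproduce the bug later.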

Operating system

If you’re running your own server(s), you’ll want to understand their vital signs. Operating system monitoring consists of tracking operating system statistics over time. DigitalOcean’s monitoring, for example, provides:

  • CPU usage
  • Memory usage
  • Disk I/O (input/output) — how much is the disk being read/written
  • Disk usage — how full is it
  • Bandwidth usage — how much data is coming in/out
  • Top processes — what’s taking up the most CPU/memory

Here’s what those look like graphed all together:

Charts of monitored server statistics

With the data collected in one place, most monitoring tools will allow you to (as you can see above) hover over the charts to see what the values were for all the different statistics at any given time. This helps to show if, for example, the bandwidth dropped at the same time the memory spiked.

With the monitoring in place, we could then set up a handful of alerts to notify the team when any of the server statistics were particularly high.

List of server alert policies

Monitoring is useful for diagnosing a problem once it’s happened; alerting is useful for notifying you about a problem before it gets worse. What constitutes “normal” or “abnormal” will vary by service, so it will take some tweaking to get the thresholds right.
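
An alert policy boils down to a metric, a threshold, and how long the metric has to stay over it. The sketch below is one hypothetical way to express that, not DigitalOcean’s implementation: metric names, thresholds, and the `window` of consecutive samples (at least 1) are all assumptions you would tune per service.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AlertPolicy:
    metric: str       # hypothetical metric name, e.g. "cpu" or "disk"
    threshold: float  # percent; alert when usage stays above this
    window: int       # consecutive samples (>= 1) that must exceed it


def triggered(policies: List[AlertPolicy], samples: Dict[str, List[float]]) -> List[str]:
    """Return a message for each policy whose recent samples all exceed its threshold."""
    alerts = []
    for policy in policies:
        recent = samples.get(policy.metric, [])[-policy.window:]
        if len(recent) == policy.window and all(v > policy.threshold for v in recent):
            alerts.append(
                f"{policy.metric} above {policy.threshold:g}% for {policy.window} samples"
            )
    return alerts


policies = [AlertPolicy("cpu", 90, 3), AlertPolicy("disk", 80, 1)]
samples = {"cpu": [40, 95, 97, 99], "disk": [71, 72]}
print(triggered(policies, samples))  # ['cpu above 90% for 3 samples']
```

Raising `window` is one of the knobs for the tweaking mentioned above: a brief spike gets ignored, while a sustained one still notifies the team.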

Notifications

Planning Labs uses Slack for communication amongst the team, so it made sense to also leverage it for notifications. GitHub (which the team uses for hosting code) posts when pull requests are created or merged, and Dokku (which the team uses for hosting) posts whenever there’s a deploy. These messages are interleaved, giving a real-time view of what’s happening across various services.

Slack notifications for creation of a pull request, followed by a merge, followed by a deploy

The channel can get a bit noisy, so many teams that have a channel like this end up ignoring or muting it. Even so, the contents are useful for troubleshooting, since they show the order in which events happened.
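
GitHub and Dokku come with their own Slack integrations, but you can post from your own scripts too. The sketch below assumes a Slack incoming webhook, which accepts a JSON body with a `text` field; the webhook URL shown is a placeholder and the function names are made up for this example.

```python
import json
import urllib.request


def build_slack_request(webhook_url: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url: str, text: str) -> bool:
    """Send the message; returns True when the webhook responds with HTTP 200."""
    request = build_slack_request(webhook_url, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status == 200


# Placeholder URL: substitute the webhook URL generated for your own channel.
request = build_slack_request(
    "https://hooks.example.com/services/PLACEHOLDER",
    "Deploy finished for the tile server",
)
print(request.get_method(), request.data.decode("utf-8"))
```

Keeping request construction separate from sending makes the message format easy to test without posting to a real channel.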

Takeaways

Monitoring is great, but it is most useful when it triggers alerts proactively. That said, you don’t want alerts to fire too eagerly, or they will annoy the team and get ignored.

Similarly, frequent and detailed logs and notifications can be useful for troubleshooting, but can lead to information overload if they need to be read all the time.

In short: you will need to be careful and creative about separating the signal from the noise. Good luck!
