The Mon-ifesto Part 2: Alerting and Graphing
A 3-Part Guide to Better Application Monitoring
Part 2 of a 3-part series on Monitoring. You can read Part 1 here.
Once your metrics have been identified, your baseline established, and your data flowing, then what? The next step is setting up alerting.
There are four types of alerts that your metrics will generate: low priority, medium priority, high priority, and informational. Figuring out which alarms generate which alerts and routing them to the proper channels will go a long way to reducing “pager fatigue” and making your operations teams happy (and less sleep deprived). Let’s take a look at each alert type and some suggestions on how to route them.
Please note that SLAs on these alert types are general guidelines; your app SLAs will probably be different but these are a good starting point. In this context, an SLA is the time from alert generation to acknowledgement by an engineer, not a time-to-resolution.
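To make the SLA definition concrete, here is a minimal sketch of the four alert levels with an acknowledgement window attached to each. The specific durations (15 minutes, 4 hours, 3 days) and the names `Priority`, `ACK_SLA`, and `ack_within_sla` are my own placeholders, not recommendations; substitute whatever your app's SLAs actually are.

```python
from datetime import datetime, timedelta
from enum import Enum


class Priority(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"
    INFORMATIONAL = "informational"


# Hypothetical acknowledgement targets; tune these to your own app's SLAs.
ACK_SLA = {
    Priority.HIGH: timedelta(minutes=15),
    Priority.MEDIUM: timedelta(hours=4),
    Priority.LOW: timedelta(days=3),
    Priority.INFORMATIONAL: None,  # no acknowledgement expected
}


def ack_within_sla(priority: Priority, raised_at: datetime, acked_at: datetime) -> bool:
    """Return True if the alert was acknowledged inside its SLA window.

    The clock runs from alert generation to acknowledgement,
    not to resolution.
    """
    target = ACK_SLA[priority]
    if target is None:
        return True  # informational alerts carry no SLA
    return acked_at - raised_at <= target
```

A check like this could feed a weekly report of missed acknowledgements, which is one way to find out whether the windows you picked are realistic.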
1. High Priority Alerts (SLA: Minutes)
High priority alerts are the most severe level of alerts your application will generate. These alerts will often be for large, sudden spikes in one of your four key metrics (calls per minute, error rate, response time, bandwidth saturation). Notice I said “large, sudden spikes”: you don’t want every spike to generate a high priority alert, just the ones that indicate a possible failure condition for your app. High priority alerts also cover general app availability; if your app is unreachable, then you have a major problem.
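One way to express “large, sudden spike, but only if it threatens the app” is to require two conditions at once: the reading is far outside the baseline, and it has reached a level you consider a possible failure condition. The sketch below assumes a three-standard-deviation rule and a caller-supplied `failure_threshold`; both are illustrative choices of mine, and your monitoring tool may offer its own anomaly detection.

```python
from statistics import mean, stdev
from typing import Sequence


def is_large_sudden_spike(
    baseline: Sequence[float],
    current: float,
    failure_threshold: float,
    sigmas: float = 3.0,
) -> bool:
    """Flag a reading only if it is both unusual and near a failure condition.

    `baseline` needs at least two samples so a standard deviation exists.
    """
    mu = mean(baseline)
    sd = stdev(baseline)
    unusual = current > mu + sigmas * sd      # far outside normal variation
    dangerous = current >= failure_threshold  # high enough to threaten the app
    return unusual and dangerous
```

With a baseline of roughly 100 calls per minute and a failure threshold of 500, a jump to 120 is unusual but not dangerous, so it would not page anyone; a jump to 600 would.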
Routing high priority alerts is fairly straightforward: if you have integration with your company’s NOC, send the alerts there. If not, an operations management tool is invaluable. A tool like this will allow you to call your engineers’ phones to alert them to the issue. High priority alerts are the kind of alerts you wake people up for, so don’t be shy about setting up loud alarms. You need your engineers to respond to these the same way firefighters respond to a five-alarm fire.
In addition to paging your engineers, you will also want to display these alerts somewhere. Having a running board of what’s going on is vital to making sure the entire team is on the same page and understands what the priorities are. A browser-based event board is a good solution for this, but you can also use TVs or projectors if your team is in the same office.
2. Medium Priority Alerts (SLA: Hours)
Let’s start with how to route medium priority alerts. You don’t want these alerts to sound alarms, but you do want them to be visible to your engineers. For this I would recommend a two-pronged approach: route your alerts to both a chat tool such as Slack, HipChat, or Teams and your alert board. A real-time chat application gives your on-call engineers a live alert channel while letting your off-duty engineers ignore the alerts.
As for what kind of alerts should be considered medium priority, you typically want to look at sustained elevated metrics or sudden spikes in metrics that do not present your app with a potential failure condition. If you wish, you can also include host-down notifications in the medium priority category. If you have architected your application service correctly, a non-critical mass of down hosts should not present a crisis situation. However, when hosts are down and future traffic patterns are unknown, you cannot tell whether a flood of incoming traffic is about to arrive, so you should plan for the worst.
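A “sustained elevated metric” can be checked by asking whether the last several readings all sit above some multiple of the baseline. This is only a sketch: the 1.5x factor and the five-reading window are assumptions I picked for illustration, and `is_sustained_elevation` is a hypothetical helper name.

```python
from typing import Sequence


def is_sustained_elevation(
    readings: Sequence[float],
    baseline: float,
    factor: float = 1.5,
    min_consecutive: int = 5,
) -> bool:
    """True if the most recent `min_consecutive` readings all exceed baseline * factor."""
    if len(readings) < min_consecutive:
        return False  # not enough data to call it sustained
    recent = readings[-min_consecutive:]
    return all(r > baseline * factor for r in recent)
```

Requiring every reading in the window to be elevated keeps a single noisy sample from raising a medium priority alert, at the cost of reacting a few intervals later.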
3. Low Priority Alerts (SLA: Days)
Low priority alerts are the least urgent alerts that your engineers will need to act on. These alerts can be for anything that needs some kind of human intervention: low disk space, CPU spikes, dropped network packets, and the like. The reason we treat these alarms as low priority is that, as discussed in Part 1, any serious issue with the underlying app infrastructure or environment will trickle up to your four key metrics. On their own, things like CPU usage don’t mean much, but an engineer should look into them when time allows to see if there’s a deeper issue at play.
Routing these alerts is simple: send them to JIRA or some other task-tracking app. Since the SLA is measured in days, these alerts can safely sit in a JIRA queue for your engineering teams to pick off one by one when they are not handling higher priority alerts. Also, don’t be surprised if most of the low priority issues are quickly closed as “could not reproduce”; these alerts are often transient issues caused by factors outside the control of the engineering team (a file backup caused a network I/O bottleneck, etc.).
If you have the development capacity, low priority alarms are very good targets for automating your alarm response. Often an issue will have a repeatable resolution (restart the machine, clear the /tmp folder, etc.) and automating these tasks is a good way to free up cycles for your engineers to work on more important tasks.
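An automated response can start as a simple lookup from alarm name to a known fix, falling back to a ticket when no fix is registered. In the sketch below the alarm names, host name, and handler functions are all hypothetical, and the handlers only return a description of what they would do; a real version would call into whatever orchestration tooling you already use.

```python
from typing import Callable, Dict, Optional


def restart_host(host: str) -> str:
    # Placeholder: a real handler would call your orchestration tooling here.
    return f"restart requested for {host}"


def clear_tmp(host: str) -> str:
    # Placeholder: a real handler would run a cleanup job on the host.
    return f"/tmp cleanup requested for {host}"


# Hypothetical alarm names mapped to their known, repeatable fix.
REMEDIATIONS: Dict[str, Callable[[str], str]] = {
    "host_unresponsive": restart_host,
    "low_disk_space": clear_tmp,
}


def auto_remediate(alarm: str, host: str) -> Optional[str]:
    """Run the known fix for an alarm, or return None so a ticket is filed instead."""
    handler = REMEDIATIONS.get(alarm)
    if handler is None:
        return None  # no automated fix known; leave it for a human
    return handler(host)
```

Keeping the mapping explicit means an engineer decides which alarms are safe to automate, and anything unlisted still lands in the task queue.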
4. Informational Alerts (SLA: None)
Up until now, we have been talking about alerts that require human intervention in some capacity. Informational alerts do not; they can hardly be called “alerts” since they are more akin to “notices” or “posts”. This is where a system like Splunk or an ELK stack can be useful; you can shunt these notices to some kind of text aggregation system for later analysis or correlation if you wish. These notices can be useful for building metric dashboards that track statistics like app deployments or JVM restarts.
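Pulling the four levels together, the routing described in this post can be summarized as a small table. The channel names here are illustrative labels I made up rather than real integrations; each would map to your NOC or paging tool, chat application, alert board, task tracker, or log aggregation system.

```python
from typing import Dict, List

# Channel names are illustrative labels, not real integrations.
ROUTES: Dict[str, List[str]] = {
    "high": ["pager", "alert_board"],
    "medium": ["chat", "alert_board"],
    "low": ["ticket_queue"],
    "informational": ["log_aggregator"],
}


def route(priority: str) -> List[str]:
    """Return the destinations for an alert; unknown priorities fail loudly."""
    try:
        return ROUTES[priority]
    except KeyError:
        raise ValueError(f"unknown priority: {priority!r}") from None
```

Raising on an unknown priority is a deliberate choice in this sketch: an alert that silently goes nowhere is worse than one that causes a visible error.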
Now that we have defined our alert levels and how we’re going to notify our engineers of issues, we need to figure out exactly how we’re going to respond to these alarms.
DISCLOSURE STATEMENT: These opinions are those of the author. Unless noted otherwise in this post, Capital One is not affiliated with, nor is it endorsed by, any of the companies mentioned. All trademarks and other intellectual property used or displayed are the ownership of their respective owners. This article is © 2018 Capital One.