[ALERTING] When are critical alerts needed?
You build a great product. You offer it as a service. You define quality and performance Service Level Agreements (SLAs) for your clients. You deploy monitoring.
And now you want to define alerting. To choose relevant metrics. To establish thresholds. Simply put, to decide when your system needs manual intervention and you must alert the on-call team.
What metrics to choose for alerting? How many alerts to add? When to alert?
Strategy 1: Alert on everything. We love alerts.
- Treat all your system’s components/services as pets: Each container or VM instance is important and needs love and gentle care. Alert on each VM restart. Alert when CPU usage on one container out of 1000 is at 100%. Wonder at your alerting volume.
- Alert that no work has been done: Alert that service throughput is zero. Alert that a service has not produced any logs in the last 15 minutes. Wonder why these alerts fire mostly during nights, weekends, or software deployments.
- Do not correlate alerting to client impact: You alert on what you think is bad for your system, not your clients. Can a CPU usage of 100% negatively impact your clients? Maybe a 10% chance. And it’s bad for your VM. So you alert on it. Can an accumulation of data negatively impact your clients? Maybe a 30% chance. And it’s bad for your storage. So you alert on it. Wonder why alerts don’t seem correlated with client impact.
- Alert on things you do not know how to fix: Do you think a situation might cause problems for your customers? Alert on it. Do you or your team know how to react to the alert? No, the alert is too generic. No, the system is too complex. Alert anyway. Let the on-call figure out what to do when the alert fires. Wonder at your on-call team’s frustration with non-actionable alerts.
Strategy 2: Alert on client experience degradation. Automate everything else.
- Alert when client experience is impacted: The on-call team is called only when client experience is impacted. Or when impact is imminent if no action is taken immediately. If a critical alert fires and notifies the on-call, there should be a 90% chance that client impact is imminent. A container restarting should not be critical. High memory usage, not critical. Backup failure, not critical. Losing data, that is critical. Your service not being reachable by your clients, that is critical. Exceeding your latency SLA, also critical. Wonder how much your team trusts each alert.
- Automate everything else: All other alerts you think you need should be handled by automation. You want to alert that a disk is filling up? Implement automation for storage scaling or data rebalancing. You want to alert that a service is stuck and needs to be restarted? Implement automation to restart or unblock it. Do not automate with alerts and people. Automate with automation. Wonder how resilient your system is.
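One way to picture Strategy 2 is as a small routing rule. The sketch below is only an illustration, not a real alerting API: `Signal`, `handle`, the remediation registry, and the "logged" fallback for signals with no automation yet are all hypothetical names and assumptions of mine.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Signal:
    name: str
    # True for things like data loss, an unreachable service,
    # or a latency SLA breach. Everything else is False.
    client_impacting: bool


# Hypothetical registry: signal name -> remediation returning True on success.
Remediation = Callable[[], bool]


def handle(signal: Signal,
           remediations: Dict[str, Remediation],
           pages: List[str]) -> str:
    """Route a signal: page a human only for client impact or failed automation."""
    if signal.client_impacting:
        pages.append(signal.name)  # the on-call is notified
        return "paged"

    fix = remediations.get(signal.name)
    if fix is None:
        # Assumption: no automation exists yet, so record it for follow-up
        # instead of waking anyone up.
        return "logged"

    if fix():
        return "automated"

    # Automation exists but could not handle it: now a human is needed.
    pages.append(signal.name)
    return "paged"
```

With a registry where the disk remediation succeeds and the restart remediation fails, a disk filling up ends as "automated", a stuck service that cannot be restarted ends as "paged", and a container restart with no client impact never reaches the on-call.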
Checklist: Do I need a critical alert on this metric?
Alerts that call people imply manual intervention. Manual intervention is slow and expensive. It disrupts life for the on-call team. It is hard to scale, as you can only scale it with people. And it is slow in recovering client experience after a failure. So, manual intervention should be the exception, not the norm. Automation should be the norm.
To this end, there are a few questions I find helpful in determining if I need to define a critical alert on a metric:
- Is the metric a Service Level Indicator tracking SLA fulfillment? Simply put, is the metric tracking client experience? Like latency, data accuracy, data loss/corruption, or service availability.
- Is the metric indicating widespread system failure? For example, is the network down? Is the computing cluster down?
- Is the metric indicating failure in third-party systems? Does the metric track if the third-party services I am using fulfill their SLAs? Do I need to engage their technical support when the alert fires?
- Will there be a 90% chance of imminent client impact when the alert fires? What combination of metrics and thresholds achieves that?
- How often will it fire? Does the metric need smoothing? Will it fire too often? Will I desensitize the on-call team with too many alerts if I add it?
- Can we automate the response to this alert? If the answer is yes, then don’t alert on it. Give it a go and automate it. Alert only when automation cannot handle it.
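The questions about the 90% chance and about smoothing can be checked against historical data before an alert goes live. Here is a minimal sketch, assuming you have a series of metric samples and a parallel series of booleans marking when clients were actually impacted. The function names, the "sustained for N samples" smoothing rule, and the lookahead window are my assumptions, not a standard method.

```python
from typing import List


def firings(samples: List[float], threshold: float, sustain: int) -> List[int]:
    """Indices at which the alert would fire.

    The alert fires once per episode, at the sample where the metric has
    been above `threshold` for `sustain` consecutive samples.
    """
    fired = []
    run = 0  # length of the current run of samples above the threshold
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run == sustain:
            fired.append(i)
    return fired


def precision(fired: List[int], impact: List[bool], lookahead: int) -> float:
    """Fraction of firings followed by client impact within `lookahead` samples."""
    if not fired:
        return 0.0
    hits = sum(1 for i in fired if any(impact[i:i + lookahead + 1]))
    return hits / len(fired)
```

Run it with different thresholds and `sustain` values on past data. If precision stays well below 0.9, the alert will mostly wake people up for nothing; if requiring a longer sustained breach raises precision while reducing the number of firings, the metric needed smoothing.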