I work on the nascent SRE team at my company, and one of the first things we try to do as we embed…
Will Milton

Monitoring Your Monitoring’s Monitoring

These opinions are my own, and I am not speaking on behalf of my employer.

Will Milton asks a question, and this is the key quote from it:

Any time anything [the monitoring platform] doesn’t work, it should negatively affect some SLO if that is going to be the mechanism for prioritizing work on metrics collection.

Your system’s SLO should be an accurate approximation of the health of your system as observed by your users. When the monitoring platform is unable to gauge that health, it doesn’t imply that your customers saw an outage.

My opinion is: An outage in your monitoring platform does not indicate customers are experiencing pain, so an outage of monitoring data does not subtract from your error budget.

When you have a hard dependency (i.e. a dependency that, when it is unavailable, makes you unavailable), the reliability of your system is the product of the reliability of that dependency and the reliability of your own system. If you run at 99.9% and you add a hard dependency that runs at 99.5%, you can expect to get 99.5% * 99.9% ≈ 99.4% availability.
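As a quick check of that arithmetic, here is a minimal sketch in Python. The function name `serial_availability` is made up for illustration, and the calculation assumes the two systems fail independently of each other.

```python
def serial_availability(*availabilities: float) -> float:
    """Availability of a system that needs every hard dependency up at once.

    Assumes failures are independent, so the availabilities simply multiply.
    """
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


own_system = 0.999        # 99.9%
hard_dependency = 0.995   # 99.5%

combined = serial_availability(own_system, hard_dependency)
print(f"{combined:.4%}")  # 99.4005%
```

Every extra hard dependency adds another factor below 1, so the combined number can only go down.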

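But a monitoring system is not a hard dependency. If your monitoring platform runs at 99.5% availability and you add it to your 99.9% system, your system will continue to run at 99.9%; the risk is not noticing outages. An unnoticed outage is a coincidence: your system and your monitoring have to be down at the same time. Naively, assuming the two fail independently, that happens (1 - 99.9%) * (1 - 99.5%) = 0.0005% of the time, so 1 - 0.0005% = 99.9995% of the time you are not in an outage that your monitoring cannot see. Put another way, roughly 0.5% of your outages will go undetected because of bad monitoring.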
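The naive coincidence calculation can be sketched in a few lines of Python. The variable names are illustrative, and the whole thing rests on the assumption that your system and the monitoring platform fail independently, which real outages (shared cloud provider, shared network) may violate.

```python
system_availability = 0.999       # 99.9%
monitoring_availability = 0.995   # 99.5%

system_down = 1 - system_availability
monitoring_down = 1 - monitoring_availability

# Fraction of all time in which both are down at once (independence assumed).
both_down = system_down * monitoring_down

# Of the time your system is down, the share during which monitoring is also down.
share_of_outages_unseen = both_down / system_down

print(f"{both_down:.4%}")                # 0.0005%
print(f"{share_of_outages_unseen:.1%}")  # 0.5%
```

The second number is just the monitoring platform's own unavailability: under independence, the monitoring is as likely to be down during your outage as at any other moment.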

Will asks (paraphrased): How do you monitor your monitoring? That’s a difficult question. Will’s situation involves using a SaaS platform (which, for those playing at home, is something like Datadog, Stackdriver, New Relic, or Pingdom).

The approach I would recommend is answering a slightly different question:
Use two independent monitoring systems that share no dependencies. For instance: use Pingdom to check that your site is up with coarse-grained probers, and New Relic to check that individual components of your site are working as expected and to feed into your SLOs and Error Budgets.

Having two independent systems will allow you to take suitable operational actions when a monitoring SaaS outage coincides with an outage in your own systems.
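One way to picture how the two signals combine is the small Python sketch below. Everything in it is hypothetical: the `DetailedSignal` states, the `interpret` function, and the suggested reactions are illustrative assumptions, not the API of Pingdom, New Relic, or any other product.

```python
from enum import Enum


class DetailedSignal(Enum):
    """What the detailed (component-level) monitoring currently reports."""
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    NO_DATA = "no data"  # the monitoring platform itself is unavailable


def interpret(coarse_probe_up: bool, detailed: DetailedSignal) -> str:
    """Combine a coarse uptime probe with detailed monitoring into one reading."""
    if detailed is DetailedSignal.NO_DATA:
        if coarse_probe_up:
            # Monitoring is blind, but users can still reach the site.
            return "monitoring outage only: fix monitoring, error budget untouched"
        # The coincidence case: both the site and its detailed monitoring are down.
        return "site down while detailed monitoring is blind: treat as an incident"
    if detailed is DetailedSignal.UNHEALTHY or not coarse_probe_up:
        return "system problem: respond as usual"
    return "all clear"


print(interpret(True, DetailedSignal.NO_DATA))
print(interpret(False, DetailedSignal.NO_DATA))
```

The point of the second system is the `NO_DATA` branch: without the coarse probe, those two cases would be indistinguishable.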
