DBRE: The Monitoring of Monitoring

Spokey Wheeler
2 min read · Oct 17, 2019


Quis custodiet ipsos custodes?

Probably the most important and difficult part of a DBRE’s life is building a useful monitoring stack. Our company has gone through multiple mergers and acquisitions, and our various monitoring stacks all present some interesting challenges.

Because the stack I’m responsible for has a relatively low polling frequency and a very low transaction throughput, we aren’t as prone to metrics tsunamis as other parts of the business, which deal with staggering data volumes.

My team uses Prometheus and Grafana for detailed monitoring, plus a high-level status board built with Goss and Grafana that is replicated as part of the standard build for every customer. On top of that, a consolidated Prometheus, Alertmanager and Grafana stack scrapes each customer’s metrics and acts as a central repository; we are building trend analysis on this aggregated data, and we use the same repository for alerting, mostly via Slack.
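For the curious, here is a minimal sketch of what a consolidated setup like that can look like. This is not our actual configuration: the job name, customer targets and Slack webhook are placeholders, and it assumes the per-customer Prometheus instances are exposed for federation.

```
# prometheus.yml on the consolidated instance — federate metrics from each customer Prometheus
scrape_configs:
  - job_name: "customer-federation"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~".+"}'                       # pull every series the customer instance holds
    static_configs:
      - targets:
          - customer-a.example.internal:9090  # placeholder targets
          - customer-b.example.internal:9090

# alertmanager.yml — route everything to Slack
route:
  receiver: slack-dbre
receivers:
  - name: slack-dbre
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder webhook
        channel: "#dbre-alerts"
        send_resolved: true
```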

Other parts of the business use the Nagios/Zabbix model, so I have the opportunity to assess how the two models stack up against each other.

Unsurprisingly, there isn’t really a clear winner at this stage, with the caveat that “our” model is less than a year old and is showing all the normal behaviours of a new baby: occasionally spraying faeces everywhere, waking us up at random intervals and being super cute and adorable the rest of the time. The other model (Nagios / Zabbix) is a teen / tween.

Before we actually went live, I was toying with the idea of using something else (like Nagios) to monitor our Prometheus environments, on the basis that a Prometheus issue could take out our monitoring of our monitoring. We’ve noticed some transient network issues that cause us to lose scrapes from time to time. We also experience more outages than I’d like to see, although these generally stem from things like engineer A provisioning space for 15 days of data and engineer B configuring Prometheus to keep data for 90 days.
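That particular failure mode is easy enough to catch before it bites. Here is a sketch of the kind of self-monitoring rules that would have helped; the mountpoint, thresholds and durations are assumptions, not our real rules:

```
# Meta-monitoring rules for the Prometheus host itself (sketch)
groups:
  - name: prometheus-self
    rules:
      # Fires if the data volume is projected to fill within a week at the current growth rate.
      - alert: PrometheusDiskWillFill
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/prometheus"}[6h], 7 * 24 * 3600) < 0
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Prometheus data volume on {{ $labels.instance }} is on course to fill within 7 days"
      # Fires if any scrape target has been unreachable, e.g. during transient network blips.
      - alert: ScrapeTargetDown
        expr: up == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} has not been scraped for 10 minutes"
```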

On the other side of the business, though, I’ve noticed that outages happen in their model too. Monitoring software is still software. As a result, we’re probably going to “contribute” our monitoring of similar databases to them for use as a backup, and we might also take their monitoring as a backup. Because of the security model for our product, the latter is a bit more difficult, though.

The more reliant you are on your monitoring stack (i.e., the more mature your operation is), the more disruptive an outage is. It therefore makes perfect sense to monitor your monitoring stack as carefully as you monitor anything else.
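One common pattern here is a “dead man’s switch”: an alert that is designed to fire constantly, routed to a heartbeat service outside the stack, so that silence rather than noise tells you the monitoring itself is down. A sketch, assuming a hypothetical external heartbeat endpoint:

```
# Rule file: an always-firing heartbeat alert
groups:
  - name: meta
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: "Heartbeat alert; if this stops arriving, the monitoring stack itself is broken"

# Alertmanager fragment: route the heartbeat to an external receiver (URL is a placeholder)
route:
  receiver: slack-dbre                # default receiver, defined elsewhere
  routes:
    - match:
        alertname: Watchdog
      receiver: deadmans-switch
      repeat_interval: 5m
receivers:
  - name: deadmans-switch
    webhook_configs:
      - url: https://deadman.example.com/ping
```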

Who spies on the spies?

Key takeaways:

  • Monitor your monitoring: it’s just as susceptible to problems as any other piece of software.
