I work on the nascent SRE team at my company, and one of the first things we try to do as we embed on different development teams is establish SLOs. That said, some of the responsibilities of various teams aren’t “web scale services” per se, which is the default context of this book, and this chapter.
So, the team I’m on owns metrics collection across the platform, and it’s just an agent that talks to a SAAS product that gives us time series storage, graphing, and alerting. Any time anything in there doesn’t work, it should negatively affect some SLO if that is going to be the mechanism for prioritizing work on metrics collection. However, that raises a couple questions:
- How do you measure SLO for the system you use for measuring everything? How do you prioritize work to enable that metamonitoring without being able to measure the SLO?
- Does a platform primitive like this respond well to the SLO methodology, or is it just something where correctness should be part of code quality and non-negotiable? Should the SLO methodology be restricted to proper Web Services?
Another angle to this is that if metrics collection is a service that should have an SLO, then the “customer” is internal: it’s the people doing Incident Response, Capacity Planning, and so on. If this were a case where there were alternatives, then those “customers” might find something else if we didn’t meet an SLO they liked. In our case, the strategy of paying for the SAAS requires that everyone use it, and that requires that they depend on the agent we configure.
So, I guess, what would Google do?