Two learnings from SRECon 2022

Dave Owczarek
Apr 5, 2022


MTT* metrics suck and we are still learning how to SRE

Any questions?

You gotta love a conference that opens with a seriously tasty teaser. The first speaker tells you that a subsequent speaker is going to prove that mean-time-to metrics (MTT*) are mathematically unsound. I was particularly excited by this, because I had already concluded the same for mean time between failures (MTBF) and I wanted to hear the argument for mean time to detect (MTTD) and mean time to repair (MTTR). And that was my welcome to SRECon 2022, the conference for site reliability engineers (SREs)!

1. MTT* is useless

Ok, the actual argument was that it’s misleading, so forgive my hyperbole. That argument was made rather convincingly by Courtney Nash from Verica. Verica has been building an open-access database of incident reports, the Verica Open Incident Database (VOID), to create a source of knowledge (and data) that we can all learn from. The incidents come from all over — published ones from Amazon, ones sent in by helpful third parties, and so forth. There are roughly 2,000 reports from almost 600 companies in the database. Duration, which can be a tricky thing to normalize, is identified in many of these incidents. Informed by Štěpán Davidovič’s excellent book on incident metrics, Verica used the VOID data to examine those durations and reached the same conclusion he did: they are not normally distributed.

The normal distribution, a.k.a. the bell curve, is notable because the mean value occurs exactly in the middle of the distribution: 50% of the values are above the mean and 50% are below it. This also aligns with our conceptual interpretation of average — it represents a value “in the middle”. But the distribution they saw in the VOID data was not normal. It was positively skewed, and the mean definitely does not sit at the 50th percentile. They also looked at subsets of the data to see whether the shape changed, but however they sliced it (large companies versus small ones, different industries, whatevs), they saw a similarly skewed distribution.
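
To make the skew concrete, here is a minimal sketch, entirely my own; the lognormal shape and its parameters are assumptions for illustration, not fit to the VOID data. It shows how far the mean drifts from the middle of a positively skewed set of incident durations:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical incident durations in minutes, drawn from a lognormal
# distribution purely to illustrate positive skew -- not VOID data.
durations = rng.lognormal(mean=3.0, sigma=1.0, size=2000)

mean = durations.mean()
median = np.median(durations)
pct_below_mean = (durations < mean).mean() * 100

print(f"mean:   {mean:6.1f} min")
print(f"median: {median:6.1f} min")
print(f"{pct_below_mean:.0f}% of incidents are shorter than the mean")
```

With these made-up parameters, roughly 70% of the simulated incidents come in under the mean, which is exactly why bell-curve intuition about “average time to repair” falls apart on skewed data.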

So while you can calculate the MTTR, you can’t interpret it the same way you would for a normal distribution. Imagine that you find a way to cut 5 minutes from every incident. Will you see that in the MTTR? It’s possible, but Štěpán’s Monte Carlo simulations show it won’t happen reliably enough to make MTTR useful as a service level objective (SLO) or key performance indicator (KPI).
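
Here is a rough sketch of that kind of Monte Carlo experiment (my own simplified version, not Štěpán’s actual code; the lognormal durations, the 20-incidents-per-month volume, and the five-minute improvement are all assumptions): simulate many pairs of months, shave five minutes off every incident in the second month, and count how often the monthly MTTR actually drops.

```python
import numpy as np

rng = np.random.default_rng(7)

INCIDENTS_PER_MONTH = 20   # assumed incident volume
IMPROVEMENT_MIN = 5        # minutes saved on every single incident
TRIALS = 10_000

def month_of_incidents():
    # Skewed (lognormal) incident durations in minutes -- an assumption,
    # chosen only because the VOID data is positively skewed.
    return rng.lognormal(mean=3.0, sigma=1.0, size=INCIDENTS_PER_MONTH)

improved = 0
for _ in range(TRIALS):
    before = month_of_incidents()
    # The "after" month genuinely improves: every incident is 5 minutes
    # shorter (floored at 1 minute so durations stay positive).
    after = np.maximum(month_of_incidents() - IMPROVEMENT_MIN, 1.0)
    if after.mean() < before.mean():
        improved += 1

print(f"MTTR went down in {improved / TRIALS:.0%} of simulated month pairs")
```

With these assumed numbers, a real five-minute improvement shows up as a lower monthly MTTR only about two-thirds of the time; the rest of the time the skew and the small sample swamp it, which is the heart of the argument against leaning on MTTR.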

It is also worth noting that for many of us, a list of incident durations for the last year might actually be a sparse data set, with outliers that are either very short (the infamous network blip) or very long (that regression that went undetected for waaaay too long after the last release).

You could certainly use the median to mitigate some of these effects. But I would argue that this is reason enough to stop using MTT* metrics altogether and focus on well-crafted SLOs, based on synthetic transactions, that use percentiles for thresholds.
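
To illustrate what I mean by a percentile threshold (a hypothetical check of my own, with made-up latencies and a made-up 500 ms target, not anything from the conference): watch the 95th percentile of synthetic-transaction latency rather than the mean, because the mean can look healthy while a slow tail is hurting a chunk of your users.

```python
import numpy as np

# Hypothetical latencies (ms) from synthetic transactions in one window.
# The values and the 500 ms p95 target are assumptions for illustration.
latencies_ms = np.array([120, 135, 128, 142, 2400, 131, 138, 125, 129, 3100,
                         133, 127, 140, 122, 136, 130, 126, 139, 124, 132])

mean = latencies_ms.mean()
p95 = np.percentile(latencies_ms, 95)

print(f"mean latency: {mean:.0f} ms")  # ~393 ms: looks fine at a glance
print(f"p95 latency:  {p95:.0f} ms")   # ~2435 ms: exposes the slow tail

SLO_THRESHOLD_MS = 500
print("SLO met" if p95 <= SLO_THRESHOLD_MS else "SLO violated")
```

Here the mean sneaks in under the (assumed) 500 ms target even though one request in ten is taking seconds; the percentile is what catches it.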

Further Resources

There are a couple of great resources to investigate this further. These are well-organized, well-written, quantitative studies based on real data.

Incident Metrics in SRE — Štěpán Davidovič’s excellent book on this theme, where he uses Monte Carlo simulations to show that MTTR can be extremely misleading.

The VOID Report — You have to submit your information to get a copy, but it’s well worth the read. Alternatively, the information was presented in Courtney’s talk, which will be available online. This report also talks about root cause analysis and near misses, and the near miss stuff in particular is worth reading through and thinking about.

2. SRE is not perfect

During talks, in the hallways, in Slack, there were plenty of discussions that revealed opportunities to refine how we do this. How we SRE. I heard a comment about how SRE is really just good system administration, which takes us all the way back to the late 90s, I think. There is a point there, so let me tell you a brief story.

Back in the early 1990s, I worked for Bolt Beranek and Newman (BBN) as a network system administrator. This was before the Internet was a household thing, by the way, which didn’t really happen until around 1994. But BBN was well networked, having been involved in developing the Arpanet with UCLA back in 1969. Fun story — my first office at BBN had a knife switch on the wall labelled “Arpanet”.

Anyway, I worked for a speech and natural language research group, deploying high-end workstations from Sun, Silicon Graphics, and others and tuning servers for high-performance, real-time speech recognition tasks. It was not uncommon to get a large amount of hardware at once and have to provision multiple servers. I remember the first time there was a platform refresh — I had something like 15 brand new Sun 4 workstations to deploy. I did what any decent system administrator would do. I automated as much of it as possible.

Back in those days, that meant writing a bootstrap shell script to download a parameter file using TFTP, configure the server appropriately, and reboot it. Once that worked, I created a script to clone all the drives. Thankfully, they all had identical geometry, so I could literally use dd from drive to drive for the clone. Then, just turn them on and the magic happens. The point is, this was 30 years ago and the same devops instincts were kicking in then. Engineers will almost always try to automate things. That’s a huge part of what we do. So whether we are talking about system administration, technical operations, devops, SRE, production engineering, or whatever, there are common capabilities that we bring to those practices as engineers. That is a thread of consistency throughout the evolution of production support and related disciplines.

In contrast, one of the major themes of my talk was that certain software techniques don’t actually translate well into SRE work on operational issues. I heard that echoed from others as well. SRE is not the solution; it is a practice that helps bring structured and positive outcomes to a chaotic, high-stress environment. And while the Google SRE practice is the leading inspiration for how to tackle these kinds of issues, it’s clear it can be implemented quite differently depending on the organization, and it breaks down in some cases. It is going to evolve — it needs to evolve.

There were plenty of times during the conference when I felt uncomfortable and self-conscious, wondering if I really know how to do SRE. And if I do, why is it so messy? A huge part of the value of attending is realizing that a lot of people feel that way. It’s OK to feel that way. It’s very validating — even empowering — to know that.

We don’t have all the answers for how to do this, but events like SRECon allow our community of SRE practitioners to come together to collaborate, ask questions, and seek inspiration in the works of others. And also to get a little peace of mind that no, you are not crazy — lots of people feel just like you do.

Mea culpa — I published a piece earlier this year entitled Availability, MTTR, and MTBF for SaaS Defined. I guess it’s time for an update. I also wrote a piece entitled MTBF in SaaS is useless. That one is still on message.

