How To Measure MTTx Values
Hat tip to fellow Facebook Production Engineering alum Amin Astaneh for reviewing this article and providing his feedback.
For any investment that an organization makes, there must be viable ways to measure the success of that investment. We call this the ROI, or Return on Investment, and making it clear to leadership is essential for communicating why and how SRE activities deliver value to the organization. The problem is: what should we measure, and how do we display it in a meaningful way? MTTx values are one of the ways that SREs can demonstrate their value to the enterprise.
MTTx
The concept of “Mean Time to” something is captured in what we call the MTTx values: a family of measurements of the time it takes for particular events in an incident’s lifecycle to occur. There are several that are core to what we do, including:
- MTTF: Mean Time to Fail. Some teams use Mean Time Between Failures (MTBF), but that doesn’t fit with the MTTx acronym, so I tend to prefer MTTF. In either case, this is a measurement of the time between incidents, where higher is better.
- MTTD: Mean Time to Detect. This represents the time between when an incident occurs and the moment it is detected by the first level of support, or by an automated process designated to repair it. In this case, a lower value is better. It is also important to distinguish between incidents that are detected by the team itself through alerts or monitoring and those that are detected by a customer or user first, who then has to notify the team (a suboptimal situation that you want to avoid).
- MTTR: Mean Time to Repair/Restore. This represents the time between when an incident occurs and when the system is restored to nominal operating status. It does not mean that the bug that caused the incident was fixed, which would be remediation or resolution versus repair. Remediation/resolution involves a prioritization exercise with product owners about whether it is appropriate to fix the root cause of an incident now or later, depending on the severity, impact to the business, or existing priorities that must be handled first. (This post by Atlassian also has good descriptions.)
- CTTB: Cost to the Business. This isn’t an MTTx per se, but it does represent a core business metric: how much money an individual incident cost, and how much all incidents of a specific root cause have cost the business over time. It should absolutely be tracked, and as you’d expect, lower is better. SREs aren’t going to know the cost per minute of outage themselves, so a business analyst should be engaged to calculate the value on our behalf. This should be done every 3 or 6 months (depending on the scale of your business), as that value will oscillate.
There are secondary MTTx values that teams sometimes track. They represent less important measurements that can still demonstrate operational improvement of a team over time. These include:
- MTTRC: Mean Time to Root Cause. This measures the time between when an incident of a new category or type occurs and when its root cause is determined. This is useful to know, because it tells you where your team has technical strengths and weaknesses. It also exposes process issues that can hamper incident management.
- MTTRem: Mean Time to Remediation/Resolution. A measurement of the time between when an incident occurs and when a true fix for the root cause is implemented and deployed. This might be useful, but it comes with caveats (as mentioned above) about team priorities.
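To make these definitions concrete, here is a minimal Python sketch of how the per-incident timings behind these metrics might be derived from raw incident records. The field names (started_at, detected_at, restored_at, and so on) are hypothetical; map them to whatever your incident-tracking tool actually exposes. Later sketches in this post reuse the same fields.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    # Hypothetical fields; map these to whatever your incident tracker records.
    severity: str             # e.g. "P0", "P1"
    started_at: datetime      # when impact began
    detected_at: datetime     # when an alert fired or a human noticed
    restored_at: datetime     # when service returned to nominal operation
    detected_by_customer: bool = False  # True if a user reported it first

def ttd_minutes(i: Incident) -> float:
    """Time to detect for a single incident, in minutes."""
    return (i.detected_at - i.started_at).total_seconds() / 60

def ttr_minutes(i: Incident) -> float:
    """Time to repair/restore for a single incident, in minutes."""
    return (i.restored_at - i.started_at).total_seconds() / 60

def ttf_hours(incidents: list[Incident]) -> list[float]:
    """Gaps between consecutive incidents (the 'F' in MTTF), in hours."""
    ordered = sorted(incidents, key=lambda i: i.started_at)
    return [(b.started_at - a.restored_at).total_seconds() / 3600
            for a, b in zip(ordered, ordered[1:])]

def mttd(incidents: list[Incident]) -> float:
    """Mean time to detect across a set of incidents, in minutes."""
    return mean(ttd_minutes(i) for i in incidents)

def mttr(incidents: list[Incident]) -> float:
    """Mean time to repair/restore across a set of incidents, in minutes."""
    return mean(ttr_minutes(i) for i in incidents)
```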
I have sat through incident reviews where a senior executive sees an incident occur for a second time and reasonably asks why the root cause was not remediated. The team explains how they prioritized the remediation effort, and why it ranked lower than work already being delivered. The senior leader will typically make a call about whether or not that priority is appropriate, and whether the team should adjust its priorities. This is a blameless and productive conversation for any organization to have. It is worth noting that you can push the limits of patience with senior leadership if such an incident happens a third time due to such prioritization, especially if the senior leader’s decision about that priority was not addressed. That conversation shifts into accountability, which is a positive thing: as leaders we need to understand why decisions are being made and the impact they have on the business, especially for incidents that have a high CTTB. If the remediation of a high-CTTB incident is continually de-prioritized for feature delivery or other efforts, you can expect to have an unpleasant conversation with a senior leader soon.
Measuring Latencies
What is interesting about most of these MTTx values is that they represent latencies. If you have not seen a presentation in the past 10+ years by Gil Tene (founder and CTO of Azul Systems) about Coordinated Omission, I highly recommend that you search for one of his many talks on the subject. Gil argues that latencies are frequently misrepresented in data as averages and means, and that showing latencies in this way masks a lot of bad experiences.
For example, imagine you are tracking the latency between when a user submits a web page request and when they get an HTTP response. You might have an SLO where 99% of responses must be received within 200ms, and when you look at your metrics and dashboards, you see that the average is staying comfortably below that threshold. The problem is that you could have a lot of bad experiences for your users in that final 1%, and you aren’t tracking how bad those experiences can be. The 99.9 or 99.99 percentiles might be 500ms, or 2 seconds, or 20 days. You don’t know, and that terrible experience is masked by the average that you’re tracking.
Gil created an open source library called HDRHistogram, allowing you to collect the full spectrum of latencies for any request/response experience. In this way, you collect and measure all of the data, and have visibility into the experiences of all of your users. If you are not doing this to track latency-based SLOs, I highly recommend that you start as soon as possible.
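As a sketch, here is what recording latencies and reading percentiles can look like with the Python port of HDRHistogram (the hdrhistogram package on PyPI); treat the construction parameters and units here as assumptions and check the package documentation for your version.

```python
# pip install hdrhistogram  (a Python port of Gil Tene's HdrHistogram)
from hdrh.histogram import HdrHistogram

# Track values from 1 microsecond to 1 hour, with 3 significant digits.
histogram = HdrHistogram(1, 60 * 60 * 1_000_000, 3)

def record_request_latency(latency_us: int) -> None:
    """Call this once per request with the measured latency in microseconds."""
    histogram.record_value(latency_us)

def latency_report() -> dict:
    """Percentile summary for a dashboard; an average alone would hide the tail."""
    return {
        "p50_us": histogram.get_value_at_percentile(50),
        "p99_us": histogram.get_value_at_percentile(99),
        "p99.9_us": histogram.get_value_at_percentile(99.9),
        "p99.99_us": histogram.get_value_at_percentile(99.99),
        "max_us": histogram.get_max_value(),
    }
```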
Choose the percentiles that are relevant to a business of your size. Remember that 99.9% means 999 out of 1,000 requests, and 99.99% means 9,999 out of 10,000. If you have a high-traffic service, those can be relevant. But if the service doesn’t have a lot of traffic, those percentiles might not be worth the time.
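A quick back-of-the-envelope way to sanity-check this, assuming you want at least a handful of real samples beyond a percentile before you trust it:

```python
def samples_beyond_percentile(requests_per_day: int, percentile: float) -> float:
    """How many requests per day land beyond the given percentile."""
    return requests_per_day * (1 - percentile / 100)

# At 2,000 requests/day, p99.99 is defined by ~0.2 requests per day: mostly noise.
# At 50,000,000 requests/day, it still covers ~5,000 real user experiences.
for volume in (2_000, 50_000_000):
    print(volume, samples_beyond_percentile(volume, 99.99))
```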
MTTx Are Latencies
What is important to recognize about tracking MTTx is that these are latencies as well. While we can easily view MTTF or MTTR through tools such as ServiceNow, we need more fidelity in the data we observe to understand whether we are delivering a return on the organization’s investment in SRE: have they gotten their money’s worth from hiring us to make things better, and has that saved the business significant money that was previously lost to outages? Some tools will show you an average MTTF or MTTR, and while that’s somewhat useful, it lacks the clarity we need.
MTTx values have to be viewed across multiple windows of time, with percentiles, and they have to be broken down by severity. Your organization wants to know more than the MTTR and CTTB values that reflect totals for all incidents of all severities. They’ll want to know what the MTTR is in the aggregate for the last 6 months or 1 year for all P0s and P1s, which are typically defined as incidents that are revenue-impacting. They’ll want to know the average, but they’ll also want to know the 99%, the 99.9%, the 99.99%, and the MAX MTTR latencies. That means we have to provide the data in dashboards that make this easily understandable.
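As a sketch of what feeds such a dashboard, here is one way to summarize MTTR by severity over a trailing window with NumPy. It reuses the hypothetical Incident fields from the earlier sketch; the window lengths, severity labels, and units are illustrative.

```python
from datetime import datetime, timedelta, timezone
import numpy as np

def mttr_summary(incidents, severities=("P0", "P1"), window_days=365):
    """Average, upper percentiles, and MAX of time-to-restore, per severity."""
    # Assumes incident timestamps are timezone-aware UTC.
    cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
    summary = {}
    for sev in severities:
        minutes = [
            (i.restored_at - i.started_at).total_seconds() / 60
            for i in incidents
            if i.severity == sev and i.started_at >= cutoff
        ]
        if not minutes:
            continue
        summary[sev] = {
            "count": len(minutes),
            "mean_min": float(np.mean(minutes)),
            "p99_min": float(np.percentile(minutes, 99)),
            "p99.9_min": float(np.percentile(minutes, 99.9)),
            "p99.99_min": float(np.percentile(minutes, 99.99)),
            "max_min": float(np.max(minutes)),
        }
    return summary
```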
In my SRE Governance post, I mentioned that every organization should have a Blameless MTTF/MTTR review every three months. It is useful to show how those metrics performed across severities and percentiles in that timeframe, but that alone masks the data we need for long-term trend analysis. While we may have improved over the past fiscal quarter, how have we improved over the past 6 months, 1 year, or 3 years? We want to show that as well, so that we can understand whether we are getting better over longer time windows. It is perfectly okay if an organization is not seeing improvement, because that should lead to a conversation about why improvement is not occurring, which should, in time, lead to accountability discussions. Remember, as long as we remain blameless, this is a healthy conversation for every organization to have.
How to Measure MTTx
MTTF: Since this is a measurement where longer is better, we have to invert our percentiles, so that 99%, 99.9%, and 99.99% map to the lower end of the latency spectrum, and MAX becomes MIN. These percentiles should be tracked by severity, and displayed for 3-month, 6-month, 12-month, and 36-month windows, as well as for all historical time since tracking began.
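A minimal sketch of that inversion, again reusing the hypothetical incident fields and NumPy: because longer gaps between failures are better, the interesting tail is the short end, so you report the low percentiles and the MIN.

```python
import numpy as np

def mttf_summary(incidents, severity="P1"):
    """Time-between-failures summary, with the tail inverted because longer is better."""
    ordered = sorted((i for i in incidents if i.severity == severity),
                     key=lambda i: i.started_at)
    gaps_hours = [(b.started_at - a.restored_at).total_seconds() / 3600
                  for a, b in zip(ordered, ordered[1:])]
    if not gaps_hours:
        return {}
    return {
        "mean_h": float(np.mean(gaps_hours)),
        # Inverted tail: the shortest gaps between failures are the bad outliers,
        # so the 1st/0.1st/0.01st percentiles and the MIN take the place of
        # p99/p99.9/p99.99 and MAX.
        "p1_h": float(np.percentile(gaps_hours, 1)),
        "p0.1_h": float(np.percentile(gaps_hours, 0.1)),
        "p0.01_h": float(np.percentile(gaps_hours, 0.01)),
        "min_h": float(np.min(gaps_hours)),
    }
```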
MTTD and MTTR: These are more traditional latencies, where shorter is better. Track the average, 95%, 99%, 99.9%, 99.99%, and the MAX. These should also be tracked by severity, and displayed for 3-month, 6-month, 12-month, and 36-month windows, as well as for all historical time since tracking began. Also, track the count of incidents reported by users/customers prior to detection by the on-call team, and use that to tune your alerts.
CTTB: This is not a latency, so track the cumulative dollars lost instead. These should also be broken down by severity for P0s and P1s; anything at P2 or below is likely not costing the business much in terms of lost revenue. It is worth tracking the cost of remediation (measured in time and effort) for all severities as well.
Measure MTTx for Key Business Experiences
Most people think about capturing this data at a service or application level, and that works reasonably well. I recommend taking this a step further and measuring the key experiences your business is trying to drive as well. This will give you a cross-functional view of the MTTx metrics that are core to how your organization is trying to serve its users and customers.
It also exposes situations where all of your services are available, but the experience is not, due to some external dependency. I’ve seen examples where a critical business function in a retail business was impacted by in-store servers, while all of the back-end systems were up and available. Another case is when a network carrier has an outage or incident: your services are available, but the customer experience you’re trying to drive is impacted. Your business needs to understand the impact of incidents at a macro level, not just for individual services.
Do Not Use MTTx For Evil
You may encounter organizational resistance to tracking these numbers, because it is entirely possible that someone in leadership will use them as evidence of a team’s poor performance. This represents blameful behavior — someone claims that things are not improving or high severity incidents are happening too often because of the team, not because of the processes or systems. In an organization where engineers do not feel safe from blame, they may push back against these numbers being tracked because it will impact their performance reviews.
In cases like this, you have to establish trust. Do not implement the MTTx numbers behind their backs, because they will find out one way or another. Instead, work with the team to establish the organizational trust and culture that supports using MTTx to make their lives better over time. This is part of a leader’s function in positively influencing change, and building this trust is integral to a positive culture about failure.
Summing Up
The MTTx values represent one key way that SREs can demonstrate their value to the organization, but we must ensure that we present them in ways that make clear what they’re telling us. Capture the values and display them in dashboards, using percentiles sized appropriately for your service’s usage rates, so that mean values do not mask important outlier information. At the same time, build the culture within the organization that makes it okay to discuss these numbers without fear of repercussions.