Optimizing for MTTR over MTBF is good for your product
Fail fast and recover
This is an endless discussion when creating a new team or product: should you prioritise stability and availability over release frequency? In more technical terms, are you a mean time to recovery (MTTR) team or a mean time between failures (MTBF) team?
MTBF and MTTR are metrics that together determine the availability of your system, so they have a direct impact on how your team is organised.
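To make that relationship concrete, here is a minimal sketch using the usual steady-state formula, availability = MTBF / (MTBF + MTTR). The two failure profiles are hypothetical numbers picked for illustration, not measurements from a real system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical profiles, for illustration only.
# One failure a month that takes 4 hours to recover from.
rare_but_slow = availability(mtbf_hours=720, mttr_hours=4)
# One failure a week that takes 15 minutes to recover from.
frequent_but_fast = availability(mtbf_hours=168, mttr_hours=0.25)

print(f"rare but slow:     {rare_but_slow:.4%}")
print(f"frequent but fast: {frequent_but_fast:.4%}")
```

With these assumed numbers, the system that fails four times as often still ends up with the higher availability, because each failure is so much shorter.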
This blog post is in favour of MTTR because it reduces the lead time of a feature (the time between its initiation and its completion); this is what we experienced at Adevinta.
Optimising for MTTR means regularly deploying small changes, testing in production and having the tooling required to detect and react quickly to failures.
More deployments increase quality
Engineers can be a bit lost with the concept of MTTR at first because it’s stressful when things go wrong.
The concept behind MTBF is well known by engineers: “you produce quality code that can run forever.” All the code goes live in one big release, which is tested by your QA and deployed, say, every two weeks.
When something goes wrong, most of the time it’s a tricky problem that takes a lot of time to understand. Moreover, rolling back two weeks of features at once often ends in a debate with your stakeholders.
With MTTR the goal is different: you have a constant stream of deployments. With more deployments come more problems, but you expect to fail only with small issues and fix them as soon as they happen.
Keep in mind here that your team should be optimised to respond to failures quickly and iterate on the code as soon as something happens.
This also means that you should test each new feature independently, with a small changeset for each version, so your team delivers less code in each release. This automatically pushes people towards smaller pull requests (and reviews are often better with less code). You also create a strong dynamic inside the team because it becomes everyone’s responsibility when something goes wrong.
Accept failure to improve your availability
A recent Information Technology Intelligence Consulting survey reports that hourly downtime can cost up to $1 million. This means that companies of all sizes across all vertical markets should have little or no tolerance for downtime. Since every system fails sooner or later, the conclusion is that you should be ready when yours does.
MTBF minimises changes while MTTR encourages them. It can feel counterintuitive that an MTTR approach gives you more stability, but it gives you more control over your availability through what we call an “error budget”: the maximum amount of time that a technical system can fail without contractual consequences.
For example, if your Service Level Agreement (SLA) specifies that systems will be available 99.99% of the time, your error budget (the time your systems can be down without consequences) is 52 minutes and 35 seconds per year.
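The arithmetic behind that figure can be checked with a few lines of Python. This is a small sketch that assumes a year of 365.25 days; the function name `error_budget` is ours, not part of any library.

```python
from datetime import timedelta


def error_budget(availability_target: float, period: timedelta) -> timedelta:
    """Allowed downtime over a period for a given availability target."""
    return timedelta(seconds=(1 - availability_target) * period.total_seconds())


# Assumption: one year counted as 365.25 days.
year = timedelta(days=365.25)
budget = error_budget(0.9999, year)

minutes, seconds = divmod(int(budget.total_seconds()), 60)
print(f"{minutes} min {seconds} s")  # prints "52 min 35 s"
```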
When optimising for MTBF, your uptime looks good most of the time and you comfortably meet your SLO (service level objective). But as soon as you have a problem you burn your error budget extremely quickly because problems are complex and rollbacks are complicated. I’m sure you’ve already heard someone say “except during incidents our SLI is great!”
With MTTR, the team is prepared for incidents and used to carrying out small rollbacks and fast fixes. It’s therefore easier to manage your error budget. In other words, you can deploy less often when you are about to exhaust your budget, and when you have budget to spare you can spend it on testing things in production directly (A/B tests, new features, etc).
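One way to picture this is a simple release policy driven by the remaining error budget. The sketch below is only an illustration: the thresholds (50% and 10%) and the function names are hypothetical, and a real team would pick its own rules.

```python
def remaining_budget(budget_minutes: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent, between 0.0 and 1.0."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return max(0.0, 1.0 - downtime_minutes / budget_minutes)


def release_policy(remaining: float) -> str:
    """Map the remaining budget to a release posture (thresholds are made up)."""
    if remaining > 0.5:
        return "ship and experiment"
    if remaining > 0.1:
        return "ship with care"
    return "freeze risky changes"


# Downtime already consumed this year, against a budget of about 52.6 minutes.
for downtime in (5, 40, 50):
    left = remaining_budget(budget_minutes=52.6, downtime_minutes=downtime)
    print(f"{downtime} min down -> {release_policy(left)}")
```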
Prepare your team for success
MTTR will also help improve team performance because dealing with recurrent problems trains the team to react when something goes wrong. Since incidents can happen at any time, everyone has to be prepared to fix the system, guided by metrics that the whole team has defined in your monitoring/observability system. Those metrics should focus on the impact on users and the business rather than on technical details because, in the end, this is the core of your business.
A basic routine would be to review your indicators at each daily meeting because they will drive what you can do or not do during the day.
You should also set some rules for when an incident happens that your alerts did not detect. You should first write a post-mortem with details on the incident, then correct the problem and add new alerts to detect it automatically the next time it happens. The goal is to use each event as a way to learn and improve how the team works.
Tooling to optimise for MTTR
Here is a list of things you should consider in order to have an efficient MTTR team:
Enable a DevOps culture/collaboration in the team
It’s really important that your team understands how the whole system works, from the code to the servers and the data. When something goes wrong, it should never be someone else’s problem.
- Have a predictable and automated way of pushing your changes to users (keep everything in git).
- Reduce your time to deploy to have the shortest feedback loop possible (e.g. if your deployment takes 20 minutes and you can decrease it to 5 minutes, you get 4 attempts at fixing a production issue in the time you used to get 1, so 3 more tries).
- Build a great CI/CD pipeline and a platform where you can ship faster (Kubernetes, serverless, etc).
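The deployment-time arithmetic from the list above can be written down as a tiny sketch. The function name is ours and the 20-minute window simply reuses the example figures.

```python
def deploy_attempts(window_minutes: int, deploy_minutes: int) -> int:
    """How many full deployments fit in a given incident window."""
    return window_minutes // deploy_minutes


slow = deploy_attempts(window_minutes=20, deploy_minutes=20)
fast = deploy_attempts(window_minutes=20, deploy_minutes=5)
print(f"{fast - slow} more tries")  # prints "3 more tries"
```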
Up your observability and monitoring game
Observability and monitoring are what tell you that something is going wrong, so you have to spend some time configuring them correctly.
Don’t forget to put alerts on your business metrics too, because they are even more important for understanding how a release affects your users’ behaviour.
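As an illustration of an alert on a business metric, here is a minimal sketch that compares a metric after a release with its baseline before the release. The metric (orders per minute), the 20% threshold and the function name are assumptions made for the example. In practice this rule would live in your monitoring system rather than in application code.

```python
from statistics import mean


def should_alert(baseline: list[float], recent: list[float],
                 max_drop: float = 0.2) -> bool:
    """True when the recent average falls more than max_drop below the baseline."""
    base = mean(baseline)
    if base == 0:
        return False  # nothing to compare against
    return (base - mean(recent)) / base > max_drop


# Hypothetical orders per minute, before and after a release.
before_release = [100, 110, 90]
after_release = [70, 75, 65]

if should_alert(before_release, after_release):
    print("Orders dropped after the release: investigate or roll back")
```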
Minimise the risk
Now that your team is set for MTTR, here are things you can do to minimise risks: feature flags, canary releases, A/B testing, etc. All these techniques let you find out that something is going wrong while exposing only a small share of your users to the problem.
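To show the idea behind a percentage-based feature flag or canary, here is a minimal sketch that assigns each user to a stable bucket by hashing. The flag name, the user IDs and the `in_rollout` helper are hypothetical. A real setup would more likely rely on a dedicated feature flag service.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int) -> bool:
    """Deterministically decide whether a user sees a flag at a rollout percentage."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


# Start with a small share of users, then widen if the metrics stay healthy.
users = [f"user-{i}" for i in range(1000)]
exposed = sum(in_rollout("new-checkout", user, percent=5) for user in users)
print(f"{exposed} of {len(users)} users see the new feature")
```

Because the bucket depends only on the flag and the user, a user who is in the rollout at 5% stays in it at 50%, and turning the flag down to 0% acts as an immediate rollback.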
Most teams are MTBF-focused because it feels right, but when you decide to change your way of working, you’ll see that accepting the fact that distributed systems fail all the time benefits the business. The only thing you need to do is be prepared when it happens, and you can do so by setting up your team for MTTR.
This article was originally published on the Adevinta Tech Blog.