metrics for high availability

Samrat Kar
building high performance software systems
2 min read · Mar 12, 2023
  1. mttf: mean time to failure. the average time the system runs before it fails again.
  2. mtbf: mean time between failures. the average time from one failure to the next, covering the uptime plus the time spent diagnosing and repairing.
  3. mttr: mean time to recover / repair / resolve.
  4. mttd: mean time to diagnose.
  5. mtbf = mttd + mttr + mttf
  6. availability = mttf / mtbf = mttf / (mttf + mttd + mttr)
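
the two formulas above can be sketched as a pair of small Python helpers. the function names and the sample numbers are illustrative assumptions, not part of any library, and all durations are taken to be in the same unit (hours here).

```python
def mtbf(mttf: float, mttd: float, mttr: float) -> float:
    """Mean time between failures: uptime plus diagnosis time plus repair time."""
    return mttf + mttd + mttr


def availability(mttf: float, mttd: float, mttr: float) -> float:
    """Fraction of each failure-to-failure cycle in which the system is usable."""
    return mttf / mtbf(mttf, mttd, mttr)


# illustrative values only: 99 hrs up, 0.5 hr to diagnose, 0.5 hr to repair
print(mtbf(99, 0.5, 0.5))          # 100.0
print(availability(99, 0.5, 0.5))  # 0.99
```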

a high performance distributed software system will fail. the idea is to measure how available it is for use and, more importantly, once it fails (which is inevitable), how soon it recovers and becomes available again. as shown in the illustration below, the availability metric is the fraction mttf / mtbf, i.e. the share of each failure-to-failure cycle (the mean time between failures) that the system spends up and running rather than being diagnosed and repaired.
so the point is not to have an absolutely high mttf, where the system is so robust that it never fails. rather, it is about how quickly we can recover compared to the average lifetime the system has before it fails again.

this metric can be increased in two ways. one is to make the software design so robust that the mean time to failure itself increases, i.e. the system does not fail quickly. the other is to reduce the time taken to diagnose and repair.
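
a minimal Python sketch of these two levers, using made-up numbers (50 hrs of uptime and 5 hrs of combined diagnose-and-repair time are assumptions for illustration only). because availability depends only on the ratio of uptime to downtime, doubling mttf and halving the recovery time land on the same value.

```python
def availability(mttf_hrs: float, downtime_hrs: float) -> float:
    """downtime_hrs is the combined diagnose + repair time per failure."""
    return mttf_hrs / (mttf_hrs + downtime_hrs)


baseline = availability(50, 5)           # hypothetical starting point
more_robust = availability(100, 5)       # lever 1: double the mttf
faster_recovery = availability(50, 2.5)  # lever 2: halve the recovery time

print(f"baseline:        {baseline:.4f}")
print(f"more robust:     {more_robust:.4f}")
print(f"faster recovery: {faster_recovery:.4f}")
```

which lever is cheaper in practice depends on the system. the arithmetic only says that they are interchangeable in their effect on the ratio.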

an example is worth a hundred words!

imagine an eCommerce retail software system that fails every 100 hrs, i.e. the volume and performance metrics stop complying after a little over 4 days. right after that, the team has to recover the system by running their AnP (archive and purge) job to remove the accumulated transactional data.
and say it takes 4 hours to root cause the issue, 2 hours to restart the components and 2 hours to run the AnP job. so, the metrics are as follows.
1. mttf = mean time to failure = 100 hrs
2. mttr = 4 + 2 + 2 = 8 hrs
3. availability = mttf / mtbf (assuming a separate mttd is negligible, since the root-cause time is already counted in mttr) = 100/108 = 0.92592593 = 92.592593 %
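
the same arithmetic as a short Python sketch. the variable names are illustrative, and the root-cause time is counted inside mttr, as in the numbers above.

```python
mttf_hrs = 100

# recovery steps from the example, in hours
recovery_steps_hrs = {
    "root cause the issue": 4,
    "restart the components": 2,
    "run the AnP job": 2,
}

mttr_hrs = sum(recovery_steps_hrs.values())  # 8
mtbf_hrs = mttf_hrs + mttr_hrs               # 108, with no separate mttd
availability = mttf_hrs / mtbf_hrs

print(f"mttr = {mttr_hrs} hrs, mtbf = {mtbf_hrs} hrs")
print(f"availability = {availability:.4%}")  # 92.5926%
```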

in other words, if this software system wants to improve to 99.99% availability with the same 100 hrs mttf, the time required to repair the system and bring it back up should be about 36 secs!
100/(100+x) = 0.9999 => x = 0.01/0.9999 hrs => x ≈ 0.6 mins = 36 secs.
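
solving the same equation for x in general gives x = mttf * (1 - target) / target. a small Python sketch of that rearrangement follows. the function name is an assumption for illustration, and x stands for the total downtime per failure (diagnose plus repair).

```python
def max_recovery_hrs(mttf_hrs: float, target: float) -> float:
    """Largest downtime per failure (in hours) that still meets the target.

    Rearranged from target = mttf / (mttf + x).
    """
    if not 0 < target < 1:
        raise ValueError("target availability must be strictly between 0 and 1")
    return mttf_hrs * (1 - target) / target


x_hrs = max_recovery_hrs(100, 0.9999)
print(f"{x_hrs * 60:.2f} mins")     # 0.60 mins
print(f"{x_hrs * 3600:.0f} secs")   # 36 secs
```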

as the example above shows, to reach 99.99% availability it might not be that important to ensure that the system works faultlessly for, say, 2 weeks or more. rather, it is about how quickly the system can be recovered after a failure. in reality, it is about striking the right balance between mttf and mttr!
