Thanks for your feedback, Ben.
I like the idea of putting a number on incidents. However, I think it's hard to calculate that number solely from the availability of individual system components, in part because the impact really depends on your architecture and on how tightly the components are coupled.
Instead, I think it's better to focus on clear business metrics like orders or ad impressions per second, which are things companies should track anyway. With those in place, you have a pretty good idea whether and how much incidents affect them, e.g. "we lost X sales over Y minutes during last week's outage."
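To make the "X sales over Y minutes" idea concrete, here is a rough sketch of the arithmetic. The function name, the baseline rate, and the sample values are all made up for illustration; in practice the baseline would come from whatever you already use to track orders.

```python
def estimate_lost_orders(baseline_per_second, observed_per_second, sample_interval_seconds):
    """Estimate orders lost during an incident.

    baseline_per_second: the order rate you would normally expect (assumed known).
    observed_per_second: order rates sampled at regular intervals during the incident.
    sample_interval_seconds: time covered by each sample.
    """
    lost = 0.0
    for observed in observed_per_second:
        # Only count the shortfall; a sample above baseline does not offset losses.
        shortfall = max(baseline_per_second - observed, 0.0)
        lost += shortfall * sample_interval_seconds
    return lost


# Hypothetical outage: normally 10 orders/s, sampled once a minute for 5 minutes.
lost = estimate_lost_orders(10.0, [2.0, 0.0, 0.0, 4.0, 9.0], 60)
print(f"Estimated lost orders: {lost:.0f}")
```

This ignores things like customers who come back and order later, so treat the result as an upper-bound estimate rather than an exact figure.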
Of course, with Chaos Engineering, you don't need to wait for incidents to learn how failures affect those metrics. But you still need to define the system's normal behavior (its steady state) before running any experiments. We're going to write more about choosing the right (business) metrics for this as part of our series on Chaos Engineering.
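As one possible way to pin down a steady state, you could derive a tolerance band from a metric's recent baseline and check experiment readings against it. This is only a sketch under my own assumptions: the three-standard-deviation band, the function names, and the sample numbers are illustrative, not a recommendation for any particular system.

```python
from statistics import mean, stdev


def steady_state_band(baseline_samples, tolerance_sigmas=3.0):
    """Return (low, high) bounds around the baseline mean.

    The band width (tolerance_sigmas sample standard deviations) is an
    assumption made for this sketch; pick whatever fits your metric.
    """
    center = mean(baseline_samples)
    spread = stdev(baseline_samples)
    return center - tolerance_sigmas * spread, center + tolerance_sigmas * spread


def within_steady_state(value, band):
    """True if a reading taken during an experiment falls inside the band."""
    low, high = band
    return low <= value <= high


# Hypothetical orders-per-second readings collected before the experiment.
band = steady_state_band([100, 102, 98, 101, 99])
print(within_steady_state(97, band))   # reading close to the baseline
print(within_steady_state(60, band))   # reading far below the baseline
```

A reading outside the band during an experiment would be the signal to stop and investigate.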