Perhaps not the most important innovation, but the simplest with the “biggest bang for the buck” that I made in my last job working on a highly available service, was to make sure all our critical availability dashboards display availability data on a logarithmic scale based on the number of ‘9’s in the availability value. I made this change to solve a very practical data visualization problem: dealing with data values that vary widely during a major outage. It also addressed a psychological issue: fresh on-calls not internalizing that a 99% (two nines) availability number is a horrible number for a 99.9% (three nines) SLA, and that continued action is required to bring the service back into SLA. For most college hires, 99% is an “A grade”; the challenge is to constantly remind them that a 99% availability number is a failing grade for real services. So this simple change solved a basic visualization issue and encouraged my team to think in terms of the number of ‘9’s rather than raw percentages.
Consider the graph below, which uses a linear y-axis ranging from 0 to 1.0; the x-axis represents time. The data behind the graph is typical of what is observed during a major outage of a system.
We can always look at the raw numbers behind the graph, but it is often easier to eyeball a graph to answer basic questions like when things started to go wrong and when the system recovered. Most people would say the incident started at roughly time point 4 and recovered at time point 10 or 11. Below is the exact same data graphed with a y-axis that counts the number of 9s in the availability number.
Eyeballing this graph, you can clearly see that things go wrong at time point 2 and don’t recover until time point 14 or 15. Since the y-axis is the number of nines, at time points 5–6 availability is below one nine (90%), but it may be hard to tell exactly what that value is, which is why there is still an option to see the normal availability view and report it externally to customers in root cause reports. It is also clear that at time points 11–13 we are near two nines (99%), which is an improvement but still worse than the four nines (99.99%) we started out with. The only oddity of the approach is that even a perfect measurement of 1.0 is assigned a value of only four nines (99.99%). This cap is needed to avoid a discontinuity when there are no errors (the logarithm of zero is undefined). The scaling function is the simple formula below, which is easy to work into any system.
availability_nines = -LOG10(1 - MIN(availability, 0.9999))
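As a sketch, the same formula is a one-liner in Python; the 0.9999 cap is the four-nines ceiling described above:

```python
import math

def availability_nines(availability: float) -> float:
    """Convert a raw availability fraction (0.0-1.0) to a count of nines.

    Availability is capped at 0.9999 (four nines) so that a perfect 1.0
    does not hit log10(0), which would be a discontinuity.
    """
    return -math.log10(1 - min(availability, 0.9999))

# Sanity checks: 0.99 (two nines) maps to ~2.0, 0.999 to ~3.0,
# and a perfect 1.0 is capped at ~4.0.
print(availability_nines(0.99))
print(availability_nines(0.999))
print(availability_nines(1.0))
```

The `min` is the only subtlety; everything else is the textbook logarithmic transform.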
I have seen the same anti-pattern of comparing values with a large range in executive review slides as well. Consider the graph below, which attempts to compare monthly availability trends. Service A looks like it had a major issue in the last month, and the other services look good.
Displaying the same data using availability nines clearly shows that Service A and Service B are both below a three-nines target, and that Service C is the only one trending upwards.
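For illustration, a minimal sketch of preparing such a comparison (the service names and monthly values here are hypothetical, not the real data behind the slides): convert each monthly availability to nines before plotting, and patterns that the raw percentages hide become obvious.

```python
import math

def to_nines(availability: float) -> float:
    # Cap at four nines so a perfect month does not hit log10(0).
    return -math.log10(1 - min(availability, 0.9999))

# Hypothetical monthly availability per service (three months each).
monthly = {
    "Service A": [0.9990, 0.9985, 0.9700],
    "Service B": [0.9980, 0.9975, 0.9982],
    "Service C": [0.9950, 0.9985, 0.9992],
}

# Transform to the nines scale; this is the series you would actually plot.
nines = {svc: [round(to_nines(a), 2) for a in vals] for svc, vals in monthly.items()}
for svc, vals in nines.items():
    print(svc, vals)
```

On the raw percentage scale all three services sit in a narrow band near 1.0; on the nines scale the gap to a three-nines target and each service’s trend direction are immediately visible.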
Having noticed this gap in the slides, I reached out to those who owned the reporting infrastructure to adopt the same improvement. Unfortunately, I’m not sure whether they ever made the adjustment.
I’ve been told this use of a logarithmic scale is not so novel in mature service organizations, and logarithmic scaling is a well-known data visualization technique for highly variable data. I’m not sure how widespread the use of my specific nines scale is, but thinking in terms of availability nines also makes us implicitly set goals that require exponentially more perfect systems. Getting a team and its management to internalize that mindset shift, that you can only asymptotically approach perfection by attempting 10X improvements in quality, is a hard challenge. The availability nines view is a small mathematical as well as psychological trick that forces you to think only in terms of exponential improvement!