Know thy path availability


In e-commerce, you aren’t selling anything while your site is down. Accordingly, your operations team will generally set availability goals, like 99.9% availability. Or even higher if you’re awesome.

Say that you have a tier 1 sales path (shopping + checkout), and you decide that you want it to be “99.9% available”. Broadly, there are a couple of ways you might interpret this:

  • Each critical service on the path needs to be 99.9% available on an individual basis.
  • Alternatively, the overall path itself needs to be 99.9% available: 99.9% of the time, the path as a whole is working well enough for customers to buy stuff.

The second approach is far superior for availability management. I’ll argue the case below.

The case for prioritizing path availability

From an availability management perspective, there are multiple reasons for prioritizing path availability over service availability.

Path availability is a more direct reflection of the actual business concern. Theo Schlossnagle shares an instructive story in his book Scalable Internet Architectures. Early in his sysadmin career, his client’s system had an unplanned outage, and Schlossnagle traced the cause to a stuck cron job. I’ll let him take it from here:

I explained to the client that we monitor disk space, disk performance metrics, Oracle availability and performance, and a billion other things, but we weren’t monitoring cron. He told me something extremely valuable that I have never forgotten: “I don’t care if cron is running. I don’t care if the disks are full or if Oracle has crashed. I don’t care if the @#$% machine is on fire. All that matters is that my business is running.” (p. 27)

Though I first read this passage ten years ago, I’ve never forgotten it either. Always know whether the business is running.

Managing to path availability makes it easier to understand where to make resiliency investments. Nowadays it’s fairly well understood that we can use architecture and design tricks (usually redundancy) to create highly available services on top of commodity infrastructure with commodity-level reliability. We can similarly create a highly available sales path on top of services that are in some cases less highly available, using caching, retries, and other techniques. It’s more cost-effective to monitor path availability and make service-level resiliency investments when they’re actually necessary, as opposed to requiring that every service on the path meet some common availability threshold.
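To see why redundancy lets a highly available path sit on top of less available services, consider the simplest model: a request succeeds if at least one of several independent copies of a service is up. A minimal sketch of that arithmetic, assuming independent failures (a simplification; real failures often correlate):

```python
def redundant_availability(single: float, replicas: int) -> float:
    """Probability that at least one of `replicas` copies is up,
    given each copy is independently up with probability `single`."""
    return 1 - (1 - single) ** replicas

# A 99% service fronted by two independent replicas (or one retry
# against a fresh replica) behaves like a ~99.99% service:
print(f"{redundant_availability(0.99, 2):.4%}")  # prints 99.9900%
```

This is why it can be cheaper to add a replica or a retry at the path level than to push every individual service to an extra nine.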

It’s easier to manage to path availability. You can monitor path availability without having to chase down all the individual services. There are different ways to do it, but they all amount to measuring the expected business result directly, rather than trying to infer it from the bottom up.
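One common way to measure the business result directly is a synthetic probe that exercises the sales path end to end on a schedule. A minimal sketch, with a hypothetical probe URL; a real probe would walk browse → cart → checkout with a test SKU and report each result to your monitoring system:

```python
# Sketch of a synthetic path probe. The URL and the "HTTP 200 means
# the path works" criterion are illustrative assumptions; a real
# probe would complete an actual test purchase.
import urllib.request
import urllib.error

def path_is_up(probe_url: str, timeout: float = 5.0) -> bool:
    """Return True if the end-to-end probe request succeeds."""
    try:
        with urllib.request.urlopen(probe_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Path availability is then simply the fraction of probe runs that return True over the measurement window.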

Indeed, in general the inference isn’t even possible. If you have a bunch of services, each of which has 99.9% availability, they won’t usually all fail simultaneously, so there’s a good chance your path availability number is lower. Or it might not be, as I explained above in the discussion of highly available paths built on top of somewhat less resilient services.
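To make that concrete: under the naive assumption that service failures are independent, a chain of ten services at 99.9% each yields only about 99.0% for the path, roughly 87 hours of downtime a year instead of under nine. And since the independence assumption rarely holds, the real number could land elsewhere, which is exactly why you can't derive path availability from service numbers:

```python
def serial_availability(per_service: float, n_services: int) -> float:
    """Path availability when the path is up only if every service
    is up, assuming independent failures (rarely true in practice)."""
    return per_service ** n_services

print(f"{serial_availability(0.999, 10):.3%}")  # prints 99.004%
```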

The bottom line: care about path availability, and measure it directly.

But care about service availability too

To clarify, I’m in no way dismissing the importance of monitoring individual service availability. Of course you want to do that:

  • If the site is down, you need to know why.
  • If the path isn’t hitting its availability target, you need visibility into where to make the fix.
  • Individual services generally have their own availability-related service level objectives that allow them to do their part in ensuring the overall path availability.
  • Service unavailability, especially early in the path, can be a leading indicator for overall path unavailability.

So keep monitoring individual services. But do so with the larger context in mind.