The recent Google Calendar outage got me reflecting on SLAs of online services and platforms. We are all using or providing SaaS and PaaS. We expect high SLAs from the services we use, and are expected to provide high SLAs to our users in return. The question is — what levels of uptime and availability are practically achievable, and what does it take? When do we travel from the land of reality to the realm of hopes, dreams, and unrealistic expectations? In other words, is very high availability (4 or 5 nines) a myth for a modern cloud-based service?
As background, I recommend this excellent paper by Jeffrey Mogul and John Wilkes of Google (Nines are Not Enough — Meaningful Metrics for Clouds). They discuss why cloud providers resist guaranteeing very high SLOs: it is technically extremely hard, and it must take into account customer behavior (which can affect reliability greatly). They also point out that SLOs obsess about the worst cases and ignore ‘expected behavior’. They propose ‘sharing risk’ between providers and customers — where service providers set Service Level Expectations (SLE) that are partially based on Customer Behavior Expectations (CBE). In short, “if you don’t misuse our service, then this is how reliable you should expect the service to be typically”. This article, on the other hand, explores things from the customer’s side — the reality, challenges, and thoughts/strategies for achieving very high reliability for services running in modern cloud environments.
“I don’t drink unless I’m thirsty” aka “The service is up except when it’s down” syndrome
We often tend to delude ourselves into imagining our services to be more robust than they are. You’ll hear things like “our service is really solid, except when we had this one outage…”. Unfortunately, uptime means uptime. It doesn’t mean uptime on weekdays, or uptime when the cloud provider isn’t having an issue. If we’re really serious about uptime, we need to look at the long term trends — e.g. “We’ve been down for a total of X minutes in the last N years” or “In the last N months, we’ve failed to meet our SLA M times”.
4 9’s of uptime means your service cannot be down for more than about 4.3 minutes (259.2 seconds, assuming a 30-day month). 5 9’s of uptime equates to just 26 seconds of downtime per month! Can you really guarantee or be confident in such availability numbers — not just for a month but long term? Let us look into why that is hard.
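The arithmetic behind those budgets is simple enough to sketch. This is a minimal illustration (assuming a 30-day month, as the figures above do):

```python
def monthly_downtime_budget(nines: int, days_in_month: int = 30) -> float:
    """Allowed downtime in seconds per month for a given number of nines."""
    availability = 1 - 10 ** (-nines)          # e.g. 4 nines -> 0.9999
    month_seconds = days_in_month * 24 * 60 * 60
    return month_seconds * (1 - availability)  # the unavailable fraction

# 4 nines allows ~259.2 seconds (~4.3 minutes) per month,
# 5 nines allows only ~25.9 seconds per month.
print(monthly_downtime_budget(4))
print(monthly_downtime_budget(5))
```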
Environments are inherently prone to fail
An instance of a service cannot be more available than the underlying substrate (i.e. platform and infrastructure) it is running on. Today, the best examples of such substrates are the public clouds or private clouds of large enterprises like Google, Facebook, Salesforce etc. Unless you are willing to spend inordinate amounts of money and effort, your own clouds are probably not going to run as well as theirs, because these companies have huge budgets, resources and accumulated experience in doing this well.
In spite of this, failures happen, as the list below shows. That is not a criticism, but a testament to how hard a job it is, and how complex and unpredictable these environments are by nature. As the Google paper points out — reliability depends not only on the cloud providers themselves, but also on how the customers of their services behave. These are multi-tenant services, and it is extremely hard to predict and protect against misuse by users.
Here are some recent failures — not comprehensive by any means but you’ll get the idea
- June 2019: Google calendar was down for 3 hours
* 35 years of downtime budget at 5 9’s
- June 2019: Major GCP outage related to network congestion — lasted almost 4 hours
* 45 years of downtime budget at 5 9’s
- May 2019: Azure DNS outage affected multiple customers — lasted 2+ hours
* 33 years of downtime budget at 5 9’s
- March 2019: Facebook, Instagram, Whatsapp outage — lasted more than 24 hours
* 278 years of downtime budget at 5 9’s
- Sep 2018: Azure South Central US outage with global impact — lasted 21 hours
* 243 years of downtime budget at 5 9’s
- May 2018: Salesforce service disruption followed by degradation — 4 days disruption, 8 days degradation
* A millennium! (1181 years) of downtime budget at 5 9’s
- Feb 2017: AWS S3 outage in US-EAST-1 lasted 4+ hours
* 50 years of downtime budget at 5 9’s
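The “years of budget” figures above come from a simple conversion: divide the outage duration by the downtime allowed per unit of time at a given number of nines. A minimal sketch (the exact result depends slightly on whether you count 30-day months or calendar years, which is why the figures above may differ by a year or so):

```python
def budget_years_consumed(outage_seconds: float, nines: int = 5) -> float:
    """How many years' worth of downtime budget one outage consumes."""
    # Seconds of downtime allowed per calendar year at this availability level.
    budget_per_year = 365 * 24 * 60 * 60 * 10 ** (-nines)
    return outage_seconds / budget_per_year

# A 3-hour outage at 5 nines consumes roughly 34 years of budget.
print(budget_years_consumed(3 * 3600))
```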
It’s not ‘partial’ or ‘minor’ if it affects you
The cloud providers do a great job limiting the scope of these outages. In general you’ll never see a whole cloud go down, and rarely (if at all) see multiple regions fail at the same time. Often, an outage will affect a small subset of cloud services among many. In that sense, the cloud providers can claim very high availability in aggregate — and they’re right!
However, none of us are using all the (hundreds or so) cloud services available, nor are we deployed in every region in the world. What might be a ‘glitch’ in one service may be severe and business-affecting for you! We must then consider the availability of the services we depend on rather than an aggregate availability number. This may seem obvious, but we frequently don’t consider things that way, and choices are made based on convenience or cost. In most of the discussions I’ve been involved in, buyers of SaaS services will leave uptime and availability as a last-minute contract negotiation item and expect the service to be equally available regardless of which cloud or which cloud services it is running on.
Wait — it gets worse! “Beware of the serial killer!”
Cloud and microservices adoption bring with them another problem — services depend on far more components than they used to. The more underlying components your service depends on, the less reliable it will be. In fact, it can be statistically shown that your service’s reliability will be worse than that of even the least reliable component in its ‘critical path’. Probability theory indicates that the expected failure fraction of a microservice-based application will be close to the sum of the failure fractions of all its dependencies. If service X depends on 10 components that all fail independently and individually have 4 9’s of uptime, then X’s expected downtime fraction would be about 0.001 (≈ 10 × 0.0001), which equates to only 3 9’s of availability.
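To make the serial-dependency math concrete, here is a minimal sketch. With independent failures, the availability of a service that needs every component in its critical path is the product of the component availabilities, which is always below the weakest one:

```python
from functools import reduce

def serial_availability(component_availabilities):
    """Availability of a service that needs ALL listed components to be up,
    assuming the components fail independently."""
    return reduce(lambda a, b: a * b, component_availabilities, 1.0)

# Ten independent dependencies, each at 4 nines (99.99%):
a = serial_availability([0.9999] * 10)
# The composed service lands near 99.9%, i.e. roughly 3 nines.
print(a)
```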
So, is 4 or 5 nines of uptime an unachievable dream?
Absolutely not! There is hope despite all the doom and gloom. Reliability is a vast topic with many accomplished experts providing much good advice. Here are some of my personal suggestions when it comes to very high SLAs.
Observe observe observe. Measure measure measure … at high resolution
You can’t guarantee what you can’t measure. It is important that you be able to observe your service’s health in detail and identify when it is healthy, degraded, or down. What sort of alerts do you have in place? Are you observing all the key aspects that might affect the service’s user experience?
A special challenge with high SLA situations is resolution of measurements. When you have 260 seconds or 26 seconds (depending on your 9’s level) of downtime budget in a month, you cannot measure health every few minutes and act like you’re serious about very high availability. If seconds of downtime matter to you, then by golly you need to measure at second resolution. In real life, this translates into having a high frequency high resolution monitoring system in place, which may be harder than you’d expect — probably the only practical way to do this right is with a high-resolution streaming architecture which I blogged about recently. At SignalFx, it took us almost 3 years to develop and perfect that technology.
Series to Parallel — reduce criticality of component outages. Aim for degradations, not failures
I mentioned how two different components A and B, both in the critical path, make your service roughly twice as unreliable. Well, the reverse is also true. If you just needed either A or B to function properly, then the reliability of your service is vastly greater than that of either A or B alone. This is another way of stating the obvious — that redundancy is good for health.
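The parallel case is the mirror image of the serial math above: the service is only down when every redundant component is down at once. A minimal sketch, again assuming independent failures:

```python
def parallel_availability(component_availabilities):
    """Availability when ANY one component being up keeps the service up."""
    p_all_down = 1.0
    for a in component_availabilities:
        p_all_down *= (1 - a)      # all components down simultaneously
    return 1 - p_all_down

# Two redundant components, each at only 2 nines (99%),
# combine to roughly 4 nines (99.99%).
print(parallel_availability([0.99, 0.99]))
```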
However, redundancy doesn’t necessarily mean having two of the same (which is twice as costly). You can implement graceful degradation strategies where your service degrades but continues to be available when one of its components dies. Database caching is a great example. If you cache the recent values in a database’s dataset in memcache, then your cache could still service requests when the database goes down, and vice versa. Sure, the service will be affected (maybe writes won’t be allowed if the database fails, or reads will become slower if the cache fails), but it won’t go completely down. At SignalFx, we employ a similar strategy that has worked really well for us — a 3-level tiered storage for timeseries data that spans RAM, SSDs, and cloud storage. If the cloud storage goes down, we still have the recent data available and usable, and that is good for 90%+ of typical use cases.
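The cache-plus-database idea can be sketched in a few lines. This is a hypothetical illustration, not SignalFx’s actual implementation: `cache` and `db` are placeholder objects assumed to expose `get`/`set` and raise `ConnectionError` when unreachable.

```python
class DegradableStore:
    """Illustrative sketch: serve reads from the cache when the database
    is down, so a db outage degrades the service instead of killing it."""

    def __init__(self, cache, db):
        self.cache, self.db = cache, db

    def read(self, key):
        try:
            value = self.db.get(key)
            self.cache.set(key, value)   # keep the cache warm on the happy path
            return value
        except ConnectionError:
            return self.cache.get(key)   # degraded: possibly stale, but available

    def write(self, key, value):
        self.db.set(key, value)          # if the db is down, writes fail...
        try:
            self.cache.set(key, value)
        except ConnectionError:
            pass                         # ...but a cache outage only slows reads
```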
Database caching (e.g. Facebook does this at huge scale) is just one example. The Bulkhead design pattern partitions a microservice so that failures only affect a subset of resources (e.g. customers). It is far better to have a few angry customers than to have every customer be angry, and the Bulkhead pattern has found its way into the SignalFx architecture just like the Circuit Breaker has. Circuit breaking exploits the principle of higher availability through parallelism, and works really well for scale-out microservices that are redundant or stateless. In fact, modern service meshes are starting to have built-in support for such operational design patterns. If you’re interested, I covered this and more in my talk at AWS re:Invent 2018 on harnessing the power of service meshes for microservices. If you’re into design patterns for reliability, here is a great blog summarizing many of them.
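To show the core of the circuit breaker idea, here is a minimal sketch (not production code, and not how any particular service mesh implements it): after a few consecutive failures the circuit “opens” and calls fail fast, giving the backend time to recover before a retry is attempted.

```python
import time

class CircuitBreaker:
    """Minimal illustrative circuit breaker."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to stay open before retrying
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = fn(*args)
            self.failures = 0            # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
```

The fail-fast behavior is what lets callers immediately route around a sick backend instead of piling up slow, doomed requests against it.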
Sorry humans, but very high availability cannot be achieved without timely automation
(Video gamers and fighter pilots excluded) When was the last time you reacted to an imminent problem with sub-second speed? Going beyond 3 9’s, when we enter the realm of very high availability, human reaction is simply not fast enough. When there are seconds to spare, automation is our only hope. Automation can react within seconds to take corrective or remedial action.
The important thing to remember is this — not only do automated actions need to execute quickly, they need to be triggered within seconds in order to achieve very high reliability. That circuit breaker we talked about earlier? Pushing a new configuration to bypass a down host might take a second, but how long was it before we detected the host was down? If detection took minute(s), then it’s no good for very high availability purposes. What this means is that our high-resolution high-frequency monitoring system must also notice problems and trigger automated actions in seconds. Again, implementing/adopting a real-time streaming monitoring system is your best bet.
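The detect-then-trigger loop can be illustrated with a toy heartbeat watcher. This is a hypothetical sketch: `last_seen` and `on_down` are made-up names, and a real system would run a check like this every second (or stream it continuously) rather than poll a dictionary.

```python
import time

def watch_heartbeats(last_seen, timeout, on_down, now=None):
    """Scan a host -> last-heartbeat-timestamp map and invoke an automated
    action for any host that has been silent longer than `timeout` seconds.
    `on_down` is the automation hook, e.g. push a config that bypasses
    the host."""
    now = time.monotonic() if now is None else now
    for host, ts in last_seen.items():
        if now - ts > timeout:
            on_down(host)
```

The point is the end-to-end latency: with a tight `timeout` and a sub-second scan interval, detection plus action fits inside a 5-nines budget, whereas a minute-granularity poller never can.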
True multi-cloud redundancy may be the only answer. It will cost you
We can make our service as robust as possible and implement all kinds of schemes to reduce the scope of issues. However, as I discussed earlier, the substrate on which they’re running (the public clouds for example) will have outages, and our services will never be more reliable than the foundation they’re built on. Cloud services can fail, data centers can fail, cloud regions can fail. The only reliable way then is to parallelize your whole service environment, and have a redundant copy (or copies) running in different regions, or better still, in different clouds! This way you’ll be protected from a vast number of failure modes.
There is obviously a big caveat with this — cost and complexity. It costs money to run redundant copies of your service, but I never said very high availability is cheap. You might try to cheat by having a token minimal version of your service running in cloud/region B in the hope that you’ll ‘quickly spin up’ that version if your primary instance in cloud/region A goes down. This may not be such a good idea — are you so sure that you’ll be able to reserve all that extra capacity when you need it? Remember, you’ll be doing this in response to a major cloud outage that also affected hundreds or thousands of others — they might all be competing with you to make similar reservations at the same time.
4 and 5 nines uptime guarantees are hard — harder to achieve than we might think. Clouds and cloud services do suffer periodic outages. Complex microservice architectures with more components enable us to innovate faster but can be more unreliable. You can address these by measuring your health at high resolution and by employing reliability-oriented design patterns like graceful degradation, circuit breakers, etc. Automation is an absolute must when it comes to rapid resolution/remedy of issues, and for very high reliability that automation must be triggered in near real time. Finally, ponying up the dollars and investing in true multi-cloud redundancy is probably one of the most effective strategies.
- Jeffrey Mogul, John Wilkes: Nines are not enough: meaningful metrics for clouds
- Adrian Coyler: Nines are not enough: meaningful metrics for clouds
- Arijit Mukherji: What does “streaming” mean as it relates to monitoring? Why is it better?
- Sathiya Shunmugasundaram: Architecting for Reliability Part 2 — Resiliency and Availability Design Patterns for the Cloud
- Efficientblog: How Facebook scaled Memcache
- Microsoft Azure docs: Bulkhead pattern
- Microsoft Azure docs: Circuit Breaker pattern
- Arijit Mukherji: AWS re:Invent 2018: Fully Realizing the Microservices Vision with Service Mesh (DEV312-S)